Tag Archives: webgraph

Dataset of 53.5 billion clicks available

IU Click Collection System

To foster the study of the structure and dynamics of Web traffic networks, we are making available to the research community a large Click Dataset of 53.5 billion HTTP requests collected at Indiana University. Between 2006 and 2010, our system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. We hope that this data will help develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

 

Paper on Web traffic modeling presented at WAW 2010

Dr. Mark Meiss

On December 16, Mark Meiss presented our paper “Modeling Traffic on the Web Graph” (with Bruno, José, Sandro, and Fil) at the 7th Workshop on Algorithms and Models for the Web Graph (WAW 2010), at Stanford. In this paper we introduce an agent-based model that explains many statistical features of aggregate and individual Web traffic data through realistic elements such as bookmarks, tabbed browsing, and topical interests.

WebGraph++

Introduction

This software is a translation into C++ of the excellent Webgraph library by P. Boldi and S. Vigna. The original library, written in Java, is easy to use but hampered by some requirements of the Java virtual machine. This C++ translation attempts to preserve much of the ease of use (through integration with the Boost Graph Library) while bypassing the requirements imposed by a virtual machine.

Like the original Webgraph library, this work is available under the GNU General Public License.

This software is still considered to be in the alpha stage. There are probably bugs. If you find any, please let Jacob know (unless you feel like fixing them!).

To build the library, run make from the main directory (Boost must already be installed). This builds a file called libwebgraph.ar that must be linked into any project that uses WebGraph++.

Download

You can get the WebGraph++ library on Github. It should compile on any system on which Boost can be compiled, and it requires no libraries other than Boost. I’ve only tested it on Linux and Mac OS X; if you run into trouble getting it to work anywhere else, please let me (Jacob) know. Note that this is basically alpha software, and since I’m not actively using it at the moment, I probably won’t be able to respond to requests for complicated help. However, if you make major improvements or fixes, it would be great if you could let me know so that I can incorporate them. Thanks!

Examples

The bv_graph::graph class (which is the main class in this library) models the Boost Graph concepts VertexListGraph and EdgeListGraph, and can thus be used with any algorithm that can work with these concepts. Code is provided that can be used to convert a graph in AsciiGraph format to a format that can be used with Webgraph. The following are some simple examples:

  • compress_webgraph.cpp: source code for the compression example, which takes an AsciiGraph and compresses it.
  • print_graph.cpp: prints a webgraph’s vertices and edges, showing how to get and use vertex iterators and edge iterators.
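
To make the role of the two concepts more concrete, here is a minimal sketch of code written only against the free functions that VertexListGraph and EdgeListGraph require: vertices(g), edges(g), source(e, g) and target(e, g). ToyGraph, toy_add_arc, count_vertices and edge_list are hypothetical names invented for this illustration; they are not part of WebGraph++ or Boost, and the sketch deliberately avoids including any Boost header so that it compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in graph type (NOT part of WebGraph++ or Boost).
struct ToyGraph {
    typedef std::pair<int, int> Arc;   // (source, target)
    std::vector<int> nodes;            // vertex ids 0 .. n-1
    std::vector<Arc> arcs;
    explicit ToyGraph(int n) { for (int v = 0; v < n; ++v) nodes.push_back(v); }
};

// Helper for building the toy graph; checks that both endpoints exist.
void toy_add_arc(ToyGraph& g, int u, int v) {
    const int n = static_cast<int>(g.nodes.size());
    assert(u >= 0 && u < n && v >= 0 && v < n);
    g.arcs.push_back(ToyGraph::Arc(u, v));
}

typedef std::vector<int>::const_iterator ToyVertexIter;
typedef std::vector<ToyGraph::Arc>::const_iterator ToyArcIter;

// Free functions in the style the two concepts expect: each returns a
// (begin, end) iterator pair, or an endpoint of an edge.
std::pair<ToyVertexIter, ToyVertexIter> vertices(const ToyGraph& g) {
    return std::make_pair(g.nodes.begin(), g.nodes.end());
}
std::pair<ToyArcIter, ToyArcIter> edges(const ToyGraph& g) {
    return std::make_pair(g.arcs.begin(), g.arcs.end());
}
int source(const ToyGraph::Arc& e, const ToyGraph&) { return e.first; }
int target(const ToyGraph::Arc& e, const ToyGraph&) { return e.second; }

// Generic code: uses only unqualified calls to vertices(), edges(),
// source() and target(), so it does not depend on ToyGraph itself.
template <typename Graph>
std::size_t count_vertices(const Graph& g) {
    std::size_t n = 0;
    auto range = vertices(g);
    for (auto it = range.first; it != range.second; ++it) ++n;
    return n;
}

template <typename Graph>
std::vector<std::pair<long, long> > edge_list(const Graph& g) {
    std::vector<std::pair<long, long> > out;
    auto range = edges(g);
    for (auto it = range.first; it != range.second; ++it) {
        auto e = *it;  // edge descriptor
        out.push_back(std::make_pair(static_cast<long>(source(e, g)),
                                     static_cast<long>(target(e, g))));
    }
    return out;
}
```

Because the two function templates rely only on unqualified calls, they should in principle also accept a Boost graph type, and presumably a bv_graph::graph, once the corresponding headers are included. That has not been verified here; print_graph.cpp remains the authoritative example of iterating over a WebGraph++ graph.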

About

This translation of the original Webgraph library is by Jacob Ratkiewicz, who received his Ph.D. in Computer Science at Indiana University. His advisor was Filippo Menczer, and Jacob was a member of Menczer’s NaN research group.

This material is based upon work supported by the National Science Foundation under award No. IIS-0348940. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Web Traffic Analysis & Modeling

Structure of a logical Web session

We study the structure and dynamics of Web traffic networks based on data from HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows us to examine large volumes of traffic data while minimizing biases associated with other data sources. It also gives us valuable referrer information that we can use to reconstruct the subset of the Web graph actually traversed by users.

Our Web traffic (click) dataset is available!

Our goal is to develop a better understanding of user behavior online and to create more realistic models of Web traffic. The potential applications of this analysis include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

Among our more intriguing findings are that server traffic (as measured by number of clicks) and site popularity (as measured by distinct users) both follow distributions so broad that they lack any well-defined mean. Actual Web traffic turns out to violate three assumptions of the random surfer model: users don’t start from a page chosen at random, they don’t follow outgoing links with equal probability, and their probability of jumping depends on their current location. Search engines appear to be directly responsible for a smaller share of Web traffic than often supposed. These results were presented at WSDM 2008 (paper | talk).
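
For readers who want to see where those three assumptions enter, the following is a minimal sketch of the classic random surfer computed by power iteration. It is a textbook-style illustration, not code from our papers: the function name random_surfer_ranks, the adjacency-list input, and the treatment of dead-end pages as always jumping are assumptions made for this example. Each assumption that real traffic violates is marked in a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stationary visit probabilities of the classic random surfer.
// out_links[u] lists the pages that page u links to.
std::vector<double> random_surfer_ranks(
        const std::vector<std::vector<int> >& out_links,
        double jump_prob,   // assumption 3: the same jump probability on every page
        int iterations) {
    assert(!out_links.empty());
    assert(jump_prob >= 0.0 && jump_prob <= 1.0);
    const std::size_t n = out_links.size();
    std::vector<double> rank(n, 1.0 / n);  // uniform starting distribution
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> next(n, 0.0);
        double dangling = 0.0;  // probability mass sitting on dead-end pages
        for (std::size_t u = 0; u < n; ++u) {
            if (out_links[u].empty()) { dangling += rank[u]; continue; }
            // assumption 2: every outgoing link is followed with equal probability
            const double share =
                (1.0 - jump_prob) * rank[u] / out_links[u].size();
            for (std::size_t k = 0; k < out_links[u].size(); ++k)
                next[out_links[u][k]] += share;
        }
        // assumption 1: a jump (or a fresh start) lands on any page uniformly
        // at random; dead-end pages are treated here as always jumping
        const double uniform = (jump_prob + (1.0 - jump_prob) * dangling) / n;
        for (std::size_t v = 0; v < n; ++v) next[v] += uniform;
        rank.swap(next);
    }
    return rank;
}
```

A model that relaxes these assumptions would replace the uniform jump target, the equal link shares, and the constant jump_prob with quantities that depend on the page, which is the direction the findings above point to.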

Another paper (presented at Hypertext 2009) examined the conventional notion of a Web session as a sequence of requests terminated by an inactivity timeout. Such a definition turns out to yield statistics that depend primarily on the timeout value selected, and we find any choice of that value to be arbitrary. For that reason, we have proposed logical sessions defined by the target and referrer URLs present in a user’s Web requests.
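
The contrast between the two definitions can be sketched in a few lines of C++. This is a simplified reading of the idea, not the algorithm from the paper: the Request struct, the function names, and the rule of attaching a request to the session that most recently requested its referrer URL are all assumptions made for illustration. The input is taken to be one user's requests, already sorted by time.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One HTTP request by a single user (hypothetical record layout).
struct Request {
    double time;           // seconds since an arbitrary origin
    std::string referrer;  // empty when the request carries no referrer
    std::string target;    // URL that was requested
};

// Conventional definition: a pause longer than `timeout` ends a session.
std::size_t count_timeout_sessions(const std::vector<Request>& reqs,
                                   double timeout) {
    assert(timeout > 0.0);
    if (reqs.empty()) return 0;
    std::size_t sessions = 1;
    for (std::size_t i = 1; i < reqs.size(); ++i)
        if (reqs[i].time - reqs[i - 1].time > timeout) ++sessions;
    return sessions;
}

// Logical sessions: returns one session id per request. A request joins the
// session that most recently requested its referrer URL; a request with no
// referrer, or with a referrer not seen before, starts a new session.
std::vector<int> assign_logical_sessions(const std::vector<Request>& reqs) {
    std::unordered_map<std::string, int> session_of_url;
    std::vector<int> ids;
    int next_id = 0;
    for (std::size_t i = 0; i < reqs.size(); ++i) {
        const Request& r = reqs[i];
        int id;
        std::unordered_map<std::string, int>::iterator it =
            r.referrer.empty() ? session_of_url.end()
                               : session_of_url.find(r.referrer);
        if (it == session_of_url.end()) id = next_id++;  // new session
        else id = it->second;                            // continue a session
        session_of_url[r.target] = id;
        ids.push_back(id);
    }
    return ids;
}
```

Changing the timeout argument changes the number of sessions that count_timeout_sessions reports for the same trace, whereas assign_logical_sessions has no such parameter: interleaved browsing in two tabs, or a long pause in the middle of a browsing path, does not split or merge its sessions.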

Inspired by these findings, we designed a model of Web surfing able to recreate not only the broad distribution of traffic, but also the basic statistics of logical sessions. Late-breaking results were presented at WSDM 2009. Our final report on the ABC model was presented at WAW 2010.

Project Participants

Mark Meiss
Bruno Gonçalves
Fil Menczer, PI
Sandro Flammini
Jose Ramasco
Alex Vespignani
Santo Fortunato

Support

Mark Meiss was supported by the Advanced Network Management Laboratory, one of the Pervasive Technology Labs established at Indiana University with the assistance of the Lilly Endowment.
This research was also supported in part by the National Science Foundation under awards 0348940, 0513650, and 0705676.
This research was also supported in part by the Institute for Information Infrastructure Protection (I3P) research program. The I3P is managed by Dartmouth College and supported under Award Number 2003-TK-TX-0003 from the U.S. DHS, Science and Technology Directorate.

Opinions, findings, conclusions, recommendations or points of view of this group are those of the authors and do not necessarily represent the official position of the U.S. Department of Homeland Security, Science and Technology Directorate, I3P, National Science Foundation, or Indiana University.