IU Click Collection System

To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of about 53.5 billion HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests. If a packet did contain a request, the collection system logged a record with the following fields:

  • a timestamp
  • the requested URL
  • the referring URL
  • a boolean classification of the user agent (browser or bot)
  • a boolean flag for whether the request was generated inside or outside IU.
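
The request-detection step described above can be sketched as follows. This is only an illustration: the regular expressions actually used by the collection system are not given on this page, so the patterns and the function name `parse_get_request` are assumptions.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns; the collection system's real expressions are not documented here.
REQUEST_LINE = re.compile(rb"^GET (\S+) HTTP/1\.[01]\r\n")
HEADER = re.compile(rb"^([A-Za-z-]+):[ \t]*(.*?)\r$", re.MULTILINE)


def parse_get_request(payload: bytes) -> Optional[Dict[str, bytes]]:
    """Return the fields of interest if the payload starts with an HTTP GET request."""
    match = REQUEST_LINE.match(payload)
    if match is None:
        return None  # not a GET request (or the request line is split across packets)
    # Collect header lines into a dict keyed by lower-cased header name.
    headers = {name.lower(): value for name, value in HEADER.findall(payload)}
    return {
        "path": match.group(1),
        "host": headers.get(b"host", b""),
        "referrer": headers.get(b"referer", b""),  # HTTP spells the header "Referer"
        "user_agent": headers.get(b"user-agent", b""),
    }
```

Because there is no stream reassembly, a sketch like this only sees requests whose request line and headers fit in the packet being inspected.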

Some important notes:

  1. Traffic generated outside IU only includes requests from outside IU for pages inside IU. Traffic generated inside IU only includes requests from people at IU (about 100,000 users) for resources outside IU. These two sets of requests have very different sampling biases.
  2. No distinguishing information about the client system was retained: no MAC or IP addresses nor any unique index were ever recorded.
  3. There was no attempt at stream reassembly, and server responses were not analyzed.

During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:

  1. raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
  2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.
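
As a rough sanity check on the figures above, the average size per request can be worked out directly. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), which this page does not specify. It gives roughly 500 bytes of raw data per request, and about 34 and 52 compressed bytes per request for the raw and raw-url collections respectively.

```python
# Figures quoted on this page; decimal units are an assumption.
REQUESTS_PER_DAY = 60e6
RAW_BYTES_PER_DAY = 30e9

COLLECTIONS = {
    "raw": {"requests": 25e9, "compressed_bytes": 0.85e12},
    "raw-url": {"requests": 28.6e9, "compressed_bytes": 1.5e12},
}


def bytes_per_request(total_bytes: float, requests: float) -> float:
    """Average number of bytes per request."""
    return total_bytes / requests


raw_capture = bytes_per_request(RAW_BYTES_PER_DAY, REQUESTS_PER_DAY)
compressed = {
    name: bytes_per_request(info["compressed_bytes"], info["requests"])
    for name, info in COLLECTIONS.items()
}

print(f"raw data during collection: {raw_capture:.0f} bytes/request")
for name, size in compressed.items():
    print(f"{name}: {size:.1f} compressed bytes/request")
```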

The dataset is broken into hourly files. The initial line of each file has a set of flags that can be ignored. Each record looks like this:

 XXXXADreferrer
 host
 path

where XXXX is the timestamp (32-bit Unix epoch in seconds, in little endian order), A is the user-agent flag (“B” for browser or “?” for other, including bots), D is the direction flag (“I” for external traffic to IU, “O” for internal traffic to outside IU), referrer is the referrer hostname or URL (terminated by newline), host is the target hostname (terminated by newline), and path is the target path (terminated by newline). For further details, please refer to the paper below.
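
A minimal reader for this record layout might look like the sketch below. It assumes the hourly file has already been decompressed (the compression format is not stated here), that the initial flags line ends with a newline, and that the timestamp is unsigned. The names `ClickRecord` and `read_records` are illustrative, not part of any official tooling.

```python
import io
import struct
from datetime import datetime, timezone
from typing import BinaryIO, Iterator, NamedTuple


class ClickRecord(NamedTuple):
    timestamp: datetime  # converted to UTC
    is_browser: bool     # True for "B", False for "?"
    direction: bytes     # b"I" or b"O"
    referrer: bytes      # hostname (raw) or full URL (raw-url)
    host: bytes
    path: bytes


def read_records(stream: BinaryIO) -> Iterator[ClickRecord]:
    """Yield records from a decompressed hourly file opened in binary mode."""
    stream.readline()  # initial flags line, ignored
    while True:
        # The timestamp is binary and may contain a newline byte, so read
        # the 6 fixed-width bytes explicitly instead of splitting on lines.
        fixed = stream.read(6)
        if len(fixed) < 6:
            return
        seconds, agent, direction = struct.unpack("<Icc", fixed)
        referrer, host, path = (stream.readline().rstrip(b"\n") for _ in range(3))
        yield ClickRecord(
            datetime.fromtimestamp(seconds, tz=timezone.utc),
            agent == b"B",
            direction,
            referrer,
            host,
            path,
        )


# Usage with a synthetic, made-up record (not taken from the dataset).
sample = io.BytesIO(
    b"flags\n"
    + struct.pack("<I", 1159228800) + b"BO"
    + b"www.example.com\nexample.org\n/index.html\n"
)
for record in read_records(sample):
    print(record.timestamp.isoformat(), record.direction, record.host + record.path)
```

Fields are kept as bytes rather than decoded, since URLs captured from live traffic are not guaranteed to be valid in any particular text encoding.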

 

Frequently Asked Questions

How can I acknowledge use of this data?

The data was collected by Mark Meiss, with support from Indiana University. Collecting and making this data publicly available took a lot of work. If you use this data, acknowledge it by citing the following paper in your publications:

@inproceedings{Meiss08WSDM,
  title = {Ranking Web Sites with Real User Traffic},
  author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
  booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
  url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
  biburl = {http://www.bibsonomy.org/bibtex/2cfe4752489f4d3a0ab34927e72643dfd/fil},
  pages = {65--75},
  year = 2008
}

The following paper may also be of interest (however, the dataset used there is not available due to IRB limitations):

@incollection{Meiss2010WAW,
  title = {Modeling Traffic on the Web Graph},
  author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
  booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
  series = {Lecture Notes in Computer Science},
  url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
  biburl = {http://www.bibsonomy.org/bibtex/2153a97ee31620b74be37bb341f268dc1/fil},
  pages = {50--61},
  volume = 6516,
  year = 2010
}

Is the data available to commercial entities? Independent researchers?

The dataset is made available for research use only. Therefore, we are only allowed to consider requests from established academic or industry research labs/organizations with a proven track record of research published in peer-reviewed venues. It is sometimes hard to determine whether a particular individual, group, or organization can be considered a research lab. Many corporations have R&D labs, which may produce white papers and the like. An organization may employ people who conduct or have conducted research. Such situations do not imply that we can share the dataset with these kinds of organizations. As it is not feasible for the data steward to make fine distinctions, we will apply simple rules of thumb. If research (and publication in peer-reviewed venues) is not the primary purpose of your organization, you will probably not qualify. This means that, with rare exceptions, we will only be able to share the dataset with university research labs or with industry research labs whose work is autonomous from the for-profit activities of their corporate owners (such as MSR, IBM Research, Yahoo Research, etc.).

Can you tell me more about the data or show me a sample?

Unfortunately, we do not have the resources to provide more information than is available on this page or in the publications from our group (see above).

Does each HTTP request record correspond to a human click?

No. Many HTTP requests are generated by bots (such as search engines and other crawlers), or by browsers fetching resources embedded in a requested page (JavaScript, CSS, images and other media, etc.).
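
One rough way to narrow the data toward likely page views is to keep only browser-flagged requests whose path does not look like an embedded resource. The sketch below is a heuristic of our own for illustration, not a method used in the papers above; the extension list and the function name `looks_like_page_request` are assumptions, and the result will still include non-click requests.

```python
# Hypothetical, incomplete list of extensions typical of embedded resources.
EMBEDDED_EXTENSIONS = (
    b".js", b".css", b".png", b".gif", b".jpg", b".jpeg", b".ico", b".swf",
)


def looks_like_page_request(is_browser: bool, path: bytes) -> bool:
    """Heuristic: browser-flagged request whose path is not an obvious embedded resource."""
    if not is_browser:
        return False  # user-agent flag was "?" (other, including bots)
    # Ignore any query string before checking the file extension.
    resource = path.split(b"?", 1)[0].lower()
    return not resource.endswith(EMBEDDED_EXTENSIONS)
```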

How can I tell if any two requests are from the same person?

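You cannot. By design, the collection system retained no distinguishing information about the client (no MAC or IP addresses, nor any unique index), so requests cannot be linked to one another.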

What about human subjects and privacy?

The dataset has been approved by the Indiana University IRB for “non-human subjects research” (protocol 1110007144).

How can I get the data?

The Click Dataset is large (~2.5 TB compressed), which requires that it be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment. Additionally, the dataset might contain bits of stray personal data. Therefore, you will have to sign a data security agreement. We require that you follow these instructions to request the data.