IU Click Collection System

To foster the study of the structure and dynamics of Web traffic networks, we are making available to the research community a large Click Dataset of about 53.5 billion HTTP requests collected at Indiana University. Between 2006 and 2010, our system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. We hope that this data will help develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

 

To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of about 53.5 billion HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests. If a packet did contain a request, the collection system logged a record with the following fields:

  • a timestamp
  • the requested URL
  • the referring URL
  • a boolean classification of the user agent (browser or bot)
  • a boolean flag for whether the request was generated inside or outside IU.
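
The payload-matching step described above can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: the regular expressions, the `Request` type, the `parse_get` function, and the browser heuristic are all assumptions made for this example. The filter string uses standard pcap filter syntax for "traffic destined for TCP port 80". The timestamp and the inside/outside flag would come from packet metadata rather than the payload, so they are omitted here.

```python
import re
from typing import NamedTuple, Optional

# pcap filter expression for traffic destined for TCP port 80.
BPF_FILTER = "tcp dst port 80"

# Hypothetical patterns: the expressions used by the real collector are not
# given in this page, so these are illustrative stand-ins.
REQUEST_LINE = re.compile(rb"^GET (\S+) HTTP/1\.[01]\r\n")
HOST = re.compile(rb"\r\nHost:[ \t]*([^\r\n]+)", re.IGNORECASE)
REFERER = re.compile(rb"\r\nReferer:[ \t]*([^\r\n]+)", re.IGNORECASE)
USER_AGENT = re.compile(rb"\r\nUser-Agent:[ \t]*([^\r\n]+)", re.IGNORECASE)
# Crude assumption: treat agents that announce Mozilla or Opera as browsers.
# Many bots also do this, so a real classifier would need to be stricter.
BROWSER_HINT = re.compile(rb"Mozilla|Opera", re.IGNORECASE)


class Request(NamedTuple):
    host: bytes
    path: bytes
    referrer: bytes
    is_browser: bool


def _header(pattern: "re.Pattern[bytes]", payload: bytes) -> bytes:
    """Return the value of one header, or b"" if it is absent."""
    match = pattern.search(payload)
    return match.group(1).strip() if match else b""


def parse_get(payload: bytes) -> Optional[Request]:
    """Return the request fields if the payload starts with an HTTP GET."""
    match = REQUEST_LINE.match(payload)
    if match is None:
        return None  # not a GET request (or not the first packet of one)
    agent = _header(USER_AGENT, payload)
    return Request(
        host=_header(HOST, payload),
        path=match.group(1),
        referrer=_header(REFERER, payload),
        is_browser=bool(BROWSER_HINT.search(agent)),
    )
```

Because each packet is examined in isolation, a request whose headers span several packets would be seen only in part, which is consistent with the note below that no stream reassembly was attempted.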

Some important notes:

  1. Traffic generated outside IU only includes requests from outside IU for pages inside IU. Traffic generated inside IU only includes requests from people at IU (about 100,000 users) for resources outside IU. These two sets of requests have very different sampling biases.
  2. No distinguishing information about the client system was retained: no MAC or IP addresses nor any unique index were ever recorded.
  3. There was no attempt at stream reassembly, and server responses were not analyzed.

During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:

  1. raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
  2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.

The dataset is broken into hourly files. The initial line of each file has a set of flags that can be ignored. Each record looks like this:

 XXXXADreferrer
 host
 path

where XXXX is the timestamp (32-bit Unix epoch in seconds, in little endian order), A is the user-agent flag (“B” for browser or “?” for other, including bots), D is the direction flag (“I” for external traffic to IU, “O” for internal traffic to outside IU), referrer is the referrer hostname or URL (terminated by newline), host is the target hostname (terminated by newline), and path is the target path (terminated by newline). For further details, please refer to the paper below.
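
A reader for this record layout might look like the sketch below. It is a minimal illustration under stated assumptions, not an official tool: the names `ClickRecord` and `read_records` are made up for this example, the timestamp is assumed to be unsigned, the input is assumed to be an already decompressed hourly file opened in binary mode, and text fields are decoded as Latin-1 so that arbitrary bytes never raise a decoding error.

```python
import struct
from typing import BinaryIO, Iterator, NamedTuple


class ClickRecord(NamedTuple):
    timestamp: int    # Unix epoch, seconds
    is_browser: bool  # True for "B", False for "?" (other, including bots)
    to_iu: bool       # True for "I" (external traffic to IU), False for "O"
    referrer: str     # hostname (raw) or full URL (raw-url)
    host: str
    path: str


def _text_line(stream: BinaryIO) -> str:
    """Read one newline-terminated field and decode it permissively."""
    return stream.readline().rstrip(b"\n").decode("latin-1")


def read_records(stream: BinaryIO) -> Iterator[ClickRecord]:
    """Yield the records of one hourly file opened in binary mode."""
    stream.readline()  # the initial line holds flags that can be ignored
    while True:
        # The 4-byte timestamp is binary and may itself contain a 0x0A byte,
        # so the fixed-size prefix must be read by length, not by line.
        prefix = stream.read(6)
        if len(prefix) < 6:
            return  # end of file (or a truncated final record)
        timestamp, agent, direction = struct.unpack("<Icc", prefix)
        referrer = _text_line(stream)
        host = _text_line(stream)
        path = _text_line(stream)
        yield ClickRecord(
            timestamp, agent == b"B", direction == b"I", referrer, host, path
        )
```

A typical use would be `with open(filename, "rb") as f: for record in read_records(f): ...`, where `filename` is a placeholder for one decompressed hourly file. Splitting the whole file on newlines would be simpler but could break records whose timestamp bytes happen to include a newline character.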

 

Frequently Asked Questions

How can I acknowledge use of this data?

The data was collected by Mark Meiss, with support from Indiana University. Collecting and making this data publicly available took a lot of work. If you use this data, acknowledge it by citing the following paper in your publications:

@inproceedings{Meiss08WSDM,
  title = {Ranking Web Sites with Real User Traffic},
  author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
  booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
  url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
  biburl = {http://www.bibsonomy.org/bibtex/2cfe4752489f4d3a0ab34927e72643dfd/fil},
  pages = {65--75},
  year = 2008
}

The following paper may also be of interest; however, the dataset used there is not available due to IRB limitations:

@incollection{Meiss2010WAW,
  title = {Modeling Traffic on the Web Graph},
  author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
  booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
  series = {Lecture Notes in Computer Science},
  url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
  biburl = {http://www.bibsonomy.org/bibtex/2153a97ee31620b74be37bb341f268dc1/fil},
  pages = {50--61},
  volume = 6516,
  year = 2010
}

Is the data available to commercial entities? Independent researchers?

The dataset is made available for research use only. Therefore, we are only allowed to consider requests from established academic or industry research labs/organizations with a proven track record of research published in peer-reviewed venues. It is sometimes hard to determine whether a particular individual, group, or organization can be considered a research lab. Many corporations have R&D labs, which may produce white papers and the like. An organization may employ people who conduct or have conducted research. Such situations do not imply that we can share the dataset with these kinds of organizations. As it is not feasible for the data steward to make fine distinctions, we will apply simple rules of thumb. If research (and publication in peer-reviewed venues) is not the primary purpose of your organization, you will probably not qualify. This means that, with rare exceptions, we will only be able to share the dataset with university research labs, or industry research labs whose work is autonomous from the for-profit activities of their corporate owners (such as MSR, IBM Research, Yahoo Research, etc.).

Can you tell me more about the data or show me a sample?

Unfortunately, we do not have the resources to provide more information than is available on this page or in the publications from our group (see above).

Does each HTTP request record correspond to a human click?

No. Many HTTP requests are generated by bots (such as search engines and other crawlers), or by browsers fetching resources embedded in a requested page (JavaScript, CSS, images and other media, etc.).
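
One rough way to approximate human clicks is to keep only browser requests whose path does not look like an embedded resource. The sketch below is a heuristic under stated assumptions, not a method endorsed by the dataset's authors: the function name `looks_like_click` and the extension list are hypothetical, and the list is far from exhaustive.

```python
from posixpath import splitext

# Hypothetical, non-exhaustive set of extensions typical of embedded resources.
EMBEDDED_EXTENSIONS = frozenset(
    {".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".swf"}
)


def looks_like_click(is_browser: bool, path: str) -> bool:
    """Guess whether a request could be a page view rather than a bot hit
    or an automatically fetched embedded resource."""
    if not is_browser:
        return False  # user-agent flag "?" (other, including bots)
    # Drop any query string or fragment before inspecting the extension.
    path_only = path.split("?", 1)[0].split("#", 1)[0]
    extension = splitext(path_only)[1].lower()
    return extension not in EMBEDDED_EXTENSIONS
```

Such a filter will both miss some clicks (for example, direct links to images) and keep some non-clicks (for example, background requests without a telltale extension), so results based on it should be read with that bias in mind.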

How can I tell if any two requests are from the same person?

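You cannot. By design, the collection system retained no information that distinguishes clients: no MAC or IP addresses, nor any unique index.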

What about human subjects and privacy?

The dataset has been approved by the Indiana University IRB for “non-human subjects research” (protocol 1110007144).

How can I get the data?

The Click Dataset is large (~2.5 TB compressed), which requires that it be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment. Additionally, the dataset may contain bits of stray personal data. Therefore you will have to sign a data security agreement. We require that you follow these instructions to request the data.

WSL

Web Science Lab

In a wide range of areas, including digital libraries, knowledge management, data mining, social media, electronic commerce, and Semantic Web, Web technologies are becoming ever more important for the sharing of data and metadata and for the management of knowledge. As such, the Web Science Lab (WSL) works to develop better methods to model, share, link and integrate data and information to enhance knowledge discovery and dissemination.

Website: swl.slis.indiana.edu

WSTNet

We welcome the Web Science Lab to our center! This underscores our ongoing collaborations in the emerging discipline of Web Science. Since February 2012, CNetS has been a member of WSTNet, an international network bringing together world-class research laboratories to support the Web Science research and education program. The Web Science Network of Laboratories combines some of the world’s leading academic researchers in Web Science with academic programs that enhance the already growing influence of Web Science. The member labs, from institutions that also include USC, MIT, Northwestern, Oxford, and Southampton, among others, provide valuable support for the ongoing development of Web Science. Contributions from the labs include the organization and hosting of summer schools, workshops, and meetings, including the WebSci conference series.

Mark Meiss

Dr. Mark Meiss

On December 16, Mark Meiss presented our paper “Modeling Traffic on the Web Graph” (with Bruno, José, Sandro, and Fil) at the 7th Workshop on Algorithms and Models for the Web Graph (WAW 2010), at Stanford. In this paper we introduce an agent-based model that explains many statistical features of aggregate and individual Web traffic data through realistic elements such as bookmarks, tabbed browsing, and topical interests.

Online popularity can be thought of as analogous to an earthquake; it is sudden, unpredictable, and the effects are severe. While shifts in online popularity are not inherently destructive – consider the unprecedented magnitude of online giving via Twitter following the disaster in Haiti – they indicate radical swings in society’s collective attention. Given the increasingly profound effect that large-scale opinion formation has on important phenomena like public policy, culture, and advertising profits, understanding this behavior is essential to understanding how the world operates.

In this paper, Ratkiewicz and colleagues put forth a web-wide analysis that includes large-scale data sets of the online behaviors of millions of people. The paper offers a novel model that is capable of reproducing all of the observed dynamics of online popularity through a mechanism that causes sudden, nonlinear bursts of collective attention. These results have been mentioned on the APS and PhysOrg websites.

Fil received a $10k Yahoo! Faculty Research and Engagement award to study methodologies to infer user intents and goals based on social similarity measures. This project is in collaboration with Debora Donato of Yahoo! Labs, and Luca Aiello and Giancarlo Ruffo of the University of Torino, Italy.

Scholarometer is becoming a more mature tool. The idea behind Scholarometer, crowdsourcing scholarly data, was presented at the Web Science 2010 Conference in Raleigh, North Carolina, along with some promising preliminary results. Recently acquired functionality includes a Chrome version, percentile calculations for all impact measures, export of bibliographic data in various standard formats, heuristics to determine reliable tags and detect ambiguous names, etc. Next up: an API to share annotation and impact data, and an interactive visualization for the interdisciplinary network.

The IEEE Spectrum piece Real-Time Search Stumbles Out of the Gate discusses the recent integration of real-time search features, such as Twitter and other microblog entries, into major search engines. Professor Filippo Menczer, CNetS associate director, comments in the article on the challenges posed by real-time search. Here is an excerpt of the interview:

IU’s Menczer suggests that with all this user-generated content, the environment is more complex than the one Google’s PageRank algorithm had to deal with. While search used to be about relationships between pages, he explains, now it’s about relationships between “people, tags, Web pages, ratings, votes, and direct social links…. It may not be that page A points to page B but rather that user John follows Mary and replies to the tweet of Jane and retweets it.” That makes it “a more complicated ecosystem,” he says, “but a very rich one,” and search engines will need “more sophisticated ways to extract data from these relationships” [...]

Impact metrics based on user queries

CNetS graduate student Diep Thi Hoang and associate director Filippo Menczer have developed a tool (called Scholarometer, previously Tenurometer in beta version) for evaluating the impact of scholars in their field. Scholarometer uses the h-index, which combines the scholarly output with the influence of the work, but adds the universal h-index proposed by Radicchi et al. to compare the impact of research in different disciplines. This is enabled by a social mechanism in which users of the tool collaborate to tag the disciplines of the scholars. “We have computer scientists, physicists, social scientists, people from many different backgrounds, who publish in lots of different areas,” says Menczer. However, the various communities have different citation methods and different publishing traditions, making it difficult to compare the influence of a sociologist and a computer scientist, for example. The universal h-index controls for differences in the publishing traditions, as well as the amount of research scholars in various fields have to produce to make an impact. Menczer is especially excited about the potential to help show how the disciplines are merging into one another. More from Inside Higher Ed… (Also picked up by ACM TechNews and CACM.)