Tag Archives: traffic

Datasets

Web Science 2014 Data Challenge

The datasets described below are used in the Web Science 2014 Data Challenge. For more, information, please the call for participation. For updates, see the Data Challenge section of the Web Science 2014 website.

There are 4 datasets in this collection. Each is available as a .tar.gz file containing either .json or .csv files. When the JSON format is used, each .json file contains a single JSON object. The format of that object is dependent on the dataset. See below for details. The datasets have been prepared by Dimitar Nikolov.
clicks

1. Web Traffic

A collection of Web (HTTP) requests for the month of November 2009. This is a small sample of the larger click dataset, documented here. (More on Web Traffic project).

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'from': '...', # the referrer host
    'to': '...', # the target host
    'count': 1234 # the number of request between the referrer and target hosts that occurred within the given hour
}

The data has been aggregated for every hour of the day. Thus, if more than one request occurred from the same referrer host to the same target host between, say, 2pm and 3pm, this is reflected in the ‘count’ field of the JSON object with a timestamp for 2pm, rather than by a different JSON object with a different timestamp.

Dataset statistics:

  • Dataset size: 235M requests
  • File size: 2.7GB uncompressed
  • Time period: Nov 1, 2009 – Nov 22, 2009

Data: web-clicks-nov-2009.tgz (321MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Meiss08WSDM,
    title = {Ranking Web Sites with Real User Traffic},
    author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
    booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
    url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
    pages = {65--75},
    year = 2008
}
@incollection{Meiss2010WAW,
    title = {Modeling Traffic on the Web Graph},
    author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
    booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
    series = {Lecture Notes in Computer Science},
    url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
    pages = {50--61},
    volume = 6516,
    year = 2010
}

tcot

2. Twitter

A collection of records extracted from tweets for the month of November 2012 containing both #hashtags and URLs as part of the tweet. (More on Truthy project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'user_id': 12345, # an integer uniquely identifying the user who tweeted
    'hashtags': ['...', '...', '...'], # a list of hashtags used in the tweet
    'urls': ['...', '...', '...'] # a list of links used in the tweet
}

Dataset statistics:

  • Dataset size: 27.8M tweets
  • File size: 3.5GB uncompressed
  • Time Period: Nov 1, 2012 – Nov 30, 2012

Data: tweets-nov-2012.json.gz (865MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{McKelvey:2013:DPS:2487788.2488174,
    author = {McKelvey, Karissa and Menczer, Filippo},
    title = {Design and prototyping of a social media observatory},
    booktitle = {Proceedings of the 22nd international conference on World Wide Web companion},
    series = {WWW '13 Companion},
    pages = {1351--1358},
    url = {http://dl.acm.org/citation.cfm?id=2487788.2488174},
    year = 2013
}
@inproceedings{McKelvey2013cscw,
    Author = {Karissa McKelvey and Filippo Menczer},
    Title = {{Truthy: Enabling the Study of Online Social Networks}},
    Booktitle = {Proc. 16th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion (CSCW)},
    Url = {http://arxiv.org/abs/1212.4565},
    Year = 2013
}

givealink-logo

3. Social Bookmarking

A collection of bookmarks from GiveALink.org for the month of November 2009. (More on GiveALink project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp for when the URL was posted
    'url': '...', # the URL that was bookmarked
    'hashtags': ['...', '...', '...'] # a set of tags attached to the URL by the (anonymous) user
}

Dataset statistics:

  • Dataset size: 61,665 posts (approximately 430,000 triples)
  • File size: 12MB uncompressed
  • Time period: Nov 1, 2009 – Nov 30, 2009

Data: givealink-nov-2009.tgz (2MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Markines06GAL,
    author = {Markines, B. and Stoilova, L. and Menczer, F.},
    title = {Bookmark hierarchies and collaborative recommendation},
    booktitle = {Proc. 21st National Conference on Artificial Intelligence (AAAI-06)},
    pages = {1375--1380},
    publisher = {AAAI Press},
    url = {http://www.aaai.org/Papers/AAAI/2006/AAAI06-216.pdf},
    year = 2006
}
@inproceedings{Stoilova05GAL,
    Author = {Stoilova, Lubomira and Holloway, Todd and Markines, Ben and Maguitman, Ana G. and Menczer, Filippo},
    Title = {GiveALink: Mining a Semantic Network of Bookmarks for Web Search and Recommendation},
    Booktitle = {Proc. KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD)},
    Url = {http://informatics.indiana.edu/fil/Papers/givealink-linkkdd.pdf},
    Year = 2005
}

co-author-network

4. Publications

Metadata for the complete set of all PubMed records through 2012 (with part of 2013 available as well), including title, authors, and year of publication. All data provided originates from NLM’s PubMed database (as downloaded April 24, 2013 from the NLM FTP site) and was retrieved via the Scholarly Database.

CSV format:

PubMed ID1,title1,year of publication1,author1|author2|author3|…
PubMed ID2,title2,year of publication2,author4|author1|author5|…

Dataset statistics:

  • Dataset size: 21.5 mil publications and 10.8 mil authors
  • File size: 3.1GB uncompressed
  • Time period: 1809 – 2013

Data: publications-1809-2013.tar.gz (1.4GB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Light2013ISSI,
  author    = {Light, Robert P., David E. Polley and Katy Börner},
  title     = {Open Data and Open Code for Big Science of Science Studies},
  booktitle = {Proceedings of International Society of Scientometrics and Informetrics Conference},
  year      = {2013},
  pages     = {1342--1356},
  url       = {http://cns.iu.edu/docs/publications/2013-light-sdb-sci2-issi.pdf}
}
@article{Rowe2009Scien,
  author  = {Rowe, Gavin La, Sumeet Adinath Ambre, John W. Burgoon, Weimao Ke, and Katy Börner},
  title   = {The Scholarly Database and its Utility for Scientometrics Research"},
  journal = {Scientometrics},
  year   = {2009},
  volume = {79},
  number = {2},
  month  = {May},
  url    = {http://cns.iu.edu/docs/publications/2009-larowe-sdb.pdf}
}

Dataset of 53.5 billion clicks available

IU Click Collection System
IU Click Collection System

To foster the study of the structure and dynamics of Web traffic networks, we are making available to the research community a large Click Dataset of 13 53.5 billion HTTP requests collected at Indiana University. Between 2006 and 2010, our system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. We hope that this data will help develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

 

Click Dataset

IU Click Collection System
IU Click Collection System

To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of about 53.5 billion HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests. If a packet did contain a request, the collection system logged a record with the following fields:

  • a timestamp
  • the requested URL
  • the referring URL
  • a boolean classification of the user agent (browser or bot)
  • a boolean flag for whether the request was generated inside or outside IU.

Some important notes:

  1. Traffic generated outside IU only includes requests from outside IU for pages inside IU. Traffic generated inside IU only includes requests from people at IU (about 100,000 users) for resources outside IU. These two sets of requests have very different sampling biases.
  2. No distinguishing information about the client system was retained: no MAC or IP addresses nor any unique index were ever recorded.
  3. There was no attempt at stream reassembly, and server responses were not analyzed.

During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:

  1. raw: About 25 billion requests, where  only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
  2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.

The dataset is broken into hourly files. The initial line of each file has a set of flags that can be ignored. Each record looks like this:

 XXXXADreferrer
 host
 path

where XXXX is the timestamp (32-bit Unix epoch in seconds, in little endian order), A is the user-agent flag (“B” for browser or “?” for other, including bots), D is the direction flag (“I” for external traffic to IU, “O” for internal traffic to outside IU), referrer is the referrer hostname or URL (terminated by newline), host is the target hostname (terminated by newline), and path is the target path (terminated by newline). For further details, please refer to the paper below.

 

Frequently Asked Questions

How can I acknowledge use of this data?

The data was collected by Mark Meiss, with support from Indiana University. Collecting and making this data publicly available took a lot of work. If you use this data, acknowledge it by citing the following paper in your publications:

@inproceedings{Meiss08WSDM,
  title = {Ranking Web Sites with Real User Traffic},
  author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
  booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
  url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
  biburl = {http://www.bibsonomy.org/bibtex/2cfe4752489f4d3a0ab34927e72643dfd/fil},
  pages = {65--75},
  year = 2008
}

The following paper may also be of interest (however the dataset used there is not available due to IRB limitations):

@incollection{Meiss2010WAW,
  title = {Modeling Traffic on the Web Graph},
  author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
  booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
  series = {Lecture Notes in Computer Science},
  url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
  biburl = {http://www.bibsonomy.org/bibtex/2153a97ee31620b74be37bb341f268dc1/fil},
  pages = {50--61},
  volume = 6516,
  year = 2010
}

Is the data available to commercial entities? Independent researchers?

The dataset is made available for research use only. Therefore, we are only allowed to consider requests from established academic or industry research labs/organizations with a proven track record of research published in peer-reviewed venues. It is sometimes hard to determine whether a particular individual, group, or organization can be considered a research lab. Many corporations have R&D labs, which may produce white papers and the like. An organization may employ people who conduct or have conducted research. Such situations do not imply that we can share the dataset with these kinds of organizations. As it is not feasible for the data steward to make fine distinctions, we will apply simple rules of thumb. If research (and publication in peer-reviewed venues) is not the primary purpose of your organization, you will probably not qualify. This means that with rare exceptions, we will only be able to share the dataset with university research labs, or industry research labs whose work is autonomous from the for-profit activities of their corporate owners (such as MSR, IBM Research, Yahoo Research, etc).

Can you tell me more about the data or show me a sample?

Unfortunately we do not have the resources to provide more information than is available in this page or the publications from our group (see above).

Does each HTTP request record correspond to a human click?

No. Many HTTP requests are generated by bots (such as search engines and other crawlers), or by browsers fetching resources embedded in a requested page (javascript, css, images and other media, etc.).

How can I tell if any two requests are from the same person?

You cannot, by protocol design.

What about human subjects and privacy?

The dataset has been approved by the Indiana University IRB for “non-human subjects research” (protocol 1110007144).

How can I get the data?

The Click Dataset is large (~2.5 TB compressed), which requires that it be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment. Additionally,  the dataset might potentially contain bits of stray personal data. Therefore you will have to sign a data security agreement. We require that you follow these instructions to request the data.

Paper on Web traffic modeling presented at WAW 2010

Mark Meiss
Dr. Mark Meiss

On December 16, Mark Meiss presented our paper “Modeling Traffic on the Web Graph” (with Bruno, José, Sandro, and Fil) at the 7th Workshop on Algorithms and Models for the Web Graph (WAW 2010), at Stanford. In this paper we introduce an agent-based model that explains many statistical features of aggregate and individual Web traffic data through realistic elements such as bookmarks, tabbed browsing, and topical interests.

New article published in Physical Review Letters puts forth new model of online popularity dynamics

Online popularity can be thought of as analogous to an earthquake; it is sudden, unpredictable, and the effects are severe. While shifts in online popularity are not inherently destructive – consider the unprecedented magnitude of online giving via Twitter following the disaster in Haiti – they indicate radical swings in society’s collective attention. Given the increasingly profound effect that large-scale opinion formation has on important phenomena like public policy, culture, and advertising profits, understanding this behavior is essential to understanding how the world operates.

In this paper by Ratkiewicz and colleagues, the authors put forth a web-wide analysis that includes large-scale data sets of the online behaviors of millions of people. The paper offers a novel model that is is capable of reproducing all of the observed dynamics of online popularity through a mechanism that causes sudden, nonlinear bursts of collective attention. These results have been mentioned in the APS and PhysOrg websites.

NaN’s strong presence at HT09

ht09NaN had a strong presence at Hypertext 2009 in Torino:

NaN talk for May 5, 2009: Mark Meiss, pre-Alpha Thesis Defense

This week at NaN I’ll be running through a very preliminary version of my thesis talk, “Structural Mining of Large-Scale Behavioral Data from the Internet,” which is all about the things you can discover using network flow data and Web clicks. The big things that I’ll be looking to get as feedback have to do with organization and pruning — I’m not terribly confident of the order in which I present the material, and I have a LOT more things to say than time to say it in, so I can use some suggestions on what to keep and what to just point to the actual document for.

(And, yes, there will be cookies.)

Web Traffic Analysis & Modeling

structure of a logical web sessionWe study the structure and dynamics of Web traffic networks based on data from HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows us to examine large volumes of traffic data while minimizing biases associated with other data sources. It also gives us valuable referrer information that we can use to reconstruct the subset of the Web graph actually traversed by users.

Our Web traffic (click) dataset is available!

Our goal is to develop a better understanding of user behavior online and creating more realistic models of Web traffic. The potential applications of this analysis include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

Among our more intriguing findings are that server traffic (as measured by number of clicks) and site popularity (as measured by distinct users) both follow distributions so broad that they lack any well-defined mean. Actual Web traffic turns out to violate three assumptions of the random surfer model: users don’t start from any page at random, they don’t follow outgoing links with equal probability, and their probability of jumping is dependent on their current location. Search engines appear to be directly responsible for a smaller share of Web traffic than often supposed. These results were presented at WSDM2008 (paper | talk).

Another paper (also here; presented at Hypertext 2009) examined the conventional notion of a Web session as a sequence of requests terminated by an inactivity timeout. Such a definition turns out to yield statistics dependent primarily on the timeout value selected, which we find to be arbitrary. For that reason, we have proposed logical sessions defined by the target and referrer URLs present in a user’s Web requests.

Inspired by these findings, we designed a model of Web surfing able to recreate not only the broad distribution of traffic, but also the basic statistics of logical sessions. Late breaking results were presented at WSDM2009. Our final report in the ABC model was presented at WAW 2010.

Project Participants

Mark Meiss
Mark Meiss
Bruno Gonçalves
Bruno Gonçalves
Fil Menczer, PI
Fil Menczer
Sandro Flammini
Sandro Flammini
Jose Ramasco
Jose Ramasco
Alex Vespignani
Alex Vespignani
Santo Fortunato
Santo Fortunato

Support

Mark Meiss was supported by the Advanced Network Management Laboratory, one of the Pervasive Technology Labs established at Indiana University with the assistance of the Lilly Endowment.
Nsf_logo This research was also supported in part by the National Science Foundation under awards 0348940, 0513650, and 0705676.
DHS Logo This research was also supported in part from the Institute for Information Infrastructure Protection research program. The I3P is managed by Dartmouth College and supported under Award Number 2003-TK-TX-0003 from the U.S. DHS, Science and Technology Directorate.

Opinions, findings, conclusions, recommendations or points of view of this group are those of the authors and do not necessarily represent the official position of the U.S. Department of Homeland Security, Science and Technology Directorate, I3P, National Science Foundation, or Indiana University.

Talks: Torino, Padova, Genova, Roma

This sabbatical is providing wonderful opportunities for me to present our work and establish/strengthen collaborations with several groups in Italy. Recently I have given invited seminars on social search at the Department of Informatics at the University of Torino (hosts Matteo Sereno and Mino Anglano) and on Web traffic at the Department of Math at the University of Padova (host Massimo Marchiori). In the next few weeks I will give a talk on social search at the Department of Informatics and Information Science at the University of Genova (host Marina Ribaudo) and one on search engine bias and Web modeling at my old stomping ground, the Institute of Cognitive Sciences and Technologies of the National Research Council in Rome (host my undergraduate advisor and mentor Domenico Parisi).

Mark’s talk at WSDM, Stanford

Mark Meiss presented our work on Web click traffic analysis at the First International Conference on Web Search and Data Mining (WSDM 2008) on the beautiful Stanford campus. His talk was very well received, as reported in Greg Linden’s blog. The quality of the conference itself was very good, with several excellent presentations. Those I found most interesting were Qiaozhu Mei’s talk on search log entropy, Nick Craswell on click position bias, Carlos Castillo on social media, and Paul Heymann on social bookmarks. I think future WSDM conferences should have an award for the best-delivered talk. This year I would have voted for Carlos Castillo. The next WSDM will be in Barcelona.