Tag Archives: web

Talk by Ricardo Baeza-Yates: Data and Algorithmic Bias in the Web

Speaker: Ricardo Baeza-Yates, Universitat Pompeu Fabra, Spain & Universidad de Chile
Title: Data and Algorithmic Bias in the Web
Date: 04/22/2016
Time: 9am
Room: Info East 122
Abstract: The Web is the largest public big data repository that humankind has created. In this overwhelming data ocean we need to be aware of the quality and in particular, of biases that exist in this data, such as redundancy, spam, etc. These biases affect the algorithms that we design to improve the user experience. This problem is further exacerbated by biases that are added by these algorithms, especially in the context of search and recommendation systems. They include ranking bias, presentation bias, position bias, etc. We give several examples and their relation to sparsity, novelty, and privacy, stressing the importance of the user context to avoid these biases.
Bio: Ricardo Baeza-Yates's areas of expertise are information retrieval, web search and data mining, data science, and algorithms. He was VP of Research at Yahoo Labs, based in Barcelona, Spain, and later in Sunnyvale, California, from January 2006 to February 2016. He is a part-time Professor at DTIC of the Universitat Pompeu Fabra in Barcelona, Spain. Until 2004 he was Professor and founding director of the Center for Web Research at the Dept. of Computing Science of the University of Chile. He obtained a Ph.D. in CS from the University of Waterloo, Canada, in 1989. He is co-author of the best-selling textbook Modern Information Retrieval, published by Addison-Wesley in 2011 (2nd ed.), which won the ASIST 2012 Book of the Year award. From 2002 to 2004 he was an elected member of the board of governors of the IEEE Computer Society, and in 2012 he was elected to the ACM Council. Since 2010 he has been a founding member of the Chilean Academy of Engineering. In 2009 he was named ACM Fellow and in 2011 IEEE Fellow, among other awards and distinctions.

WebSci14

We are excited to announce that the ACM Web Science 2014 Conference will be hosted by our center on the beautiful IUB campus June 23–26, 2014. Web Science studies the vast information network of people, communities, organizations, applications, and policies that shape and are shaped by the Web, the largest artifact constructed by humans in history. Computing, physical, and social sciences come together, complementing each other in understanding how the Web affects our interactions and behaviors. Previous editions of the conference were held in Athens, Raleigh, Koblenz, Evanston, and Paris. The conference is organized on behalf of the Web Science Trust by general co-chairs Fil Menczer, Jim Hendler, and Bill Dutton. Follow us on Twitter and see you in Bloomington!

Datasets

Web Science 2014 Data Challenge

The datasets described below are used in the Web Science 2014 Data Challenge. For more information, please see the call for participation. For updates, see the Data Challenge section of the Web Science 2014 website.

There are 4 datasets in this collection. Each is available as a .tar.gz file containing either .json or .csv files. When the JSON format is used, each .json file contains a single JSON object. The format of that object is dependent on the dataset. See below for details. The datasets have been prepared by Dimitar Nikolov.

1. Web Traffic

A collection of Web (HTTP) requests for the month of November 2009. This is a small sample of the larger click dataset, documented here. (More on Web Traffic project).

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'from': '...', # the referrer host
    'to': '...', # the target host
    'count': 1234 # the number of requests between the referrer and target hosts that occurred within the given hour
}

The data has been aggregated for every hour of the day. Thus, if more than one request occurred from the same referrer host to the same target host between, say, 2pm and 3pm, this is reflected in the ‘count’ field of the JSON object with a timestamp for 2pm, rather than by a different JSON object with a different timestamp.
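The hourly aggregation described above can be rolled up further in a few lines. The sketch below, using hypothetical records that mirror the documented object format (the field names come from the description above; the values are made up), sums the per-hour counts into a total per referrer–target pair:

```python
from collections import defaultdict

# Hypothetical records in the documented format (values are illustrative)
records = [
    {"timestamp": 1257087600, "from": "example.edu", "to": "example.org", "count": 3},
    {"timestamp": 1257087600, "from": "example.edu", "to": "example.org", "count": 2},
    {"timestamp": 1257091200, "from": "example.edu", "to": "example.org", "count": 5},
]

# Total requests per (referrer, target) pair across all hours
totals = defaultdict(int)
for rec in records:
    totals[(rec["from"], rec["to"])] += rec["count"]

print(totals[("example.edu", "example.org")])  # 10
```

Summing the `count` field in this way yields the weighted edge list of the traffic network.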

Dataset statistics:

  • Dataset size: 235M requests
  • File size: 2.7GB uncompressed
  • Time period: Nov 1, 2009 – Nov 22, 2009

Data: web-clicks-nov-2009.tgz (321MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Meiss08WSDM,
    title = {Ranking Web Sites with Real User Traffic},
    author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
    booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
    url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
    pages = {65--75},
    year = 2008
}
@incollection{Meiss2010WAW,
    title = {Modeling Traffic on the Web Graph},
    author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
    booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
    series = {Lecture Notes in Computer Science},
    url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
    pages = {50--61},
    volume = 6516,
    year = 2010
}


2. Twitter

A collection of records extracted from tweets from November 2012 that contain both #hashtags and URLs. (More on the Truthy project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'user_id': 12345, # an integer uniquely identifying the user who tweeted
    'hashtags': ['...', '...', '...'], # a list of hashtags used in the tweet
    'urls': ['...', '...', '...'] # a list of links used in the tweet
}
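Because each record lists all hashtags used in a tweet, a natural first analysis is hashtag co-occurrence. The sketch below uses hypothetical records in the documented format (field names from the description above; the tweets themselves are invented):

```python
from collections import Counter
from itertools import combinations

# Hypothetical tweet records in the documented format
tweets = [
    {"timestamp": 1352000000, "user_id": 42,
     "hashtags": ["opendata", "websci"],
     "urls": ["http://example.org/a"]},
    {"timestamp": 1352000060, "user_id": 7,
     "hashtags": ["websci", "opendata", "iu"],
     "urls": ["http://example.org/b"]},
]

# Count how often each pair of hashtags appears together in a single tweet
cooccur = Counter()
for t in tweets:
    for a, b in combinations(sorted(set(t["hashtags"])), 2):
        cooccur[(a, b)] += 1

print(cooccur[("opendata", "websci")])  # 2
```

The resulting counter is a weighted hashtag co-occurrence network, a common starting point for studying meme diffusion.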

Dataset statistics:

  • Dataset size: 27.8M tweets
  • File size: 3.5GB uncompressed
  • Time Period: Nov 1, 2012 – Nov 30, 2012

Data: tweets-nov-2012.json.gz (865MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{McKelvey:2013:DPS:2487788.2488174,
    author = {McKelvey, Karissa and Menczer, Filippo},
    title = {Design and prototyping of a social media observatory},
    booktitle = {Proceedings of the 22nd international conference on World Wide Web companion},
    series = {WWW '13 Companion},
    pages = {1351--1358},
    url = {http://dl.acm.org/citation.cfm?id=2487788.2488174},
    year = 2013
}
@inproceedings{McKelvey2013cscw,
    Author = {Karissa McKelvey and Filippo Menczer},
    Title = {{Truthy: Enabling the Study of Online Social Networks}},
    Booktitle = {Proc. 16th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion (CSCW)},
    Url = {http://arxiv.org/abs/1212.4565},
    Year = 2013
}


3. Social Bookmarking

A collection of bookmarks from GiveALink.org for the month of November 2009. (More on GiveALink project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp for when the URL was posted
    'url': '...', # the URL that was bookmarked
    'hashtags': ['...', '...', '...'] # a set of tags attached to the URL by the (anonymous) user
}
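Each post contributes one (post, URL, tag) triple per tag, which is where the triple count in the statistics below comes from. A minimal sketch, using invented posts in the documented format, that tallies tag frequencies across the collection:

```python
from collections import Counter

# Hypothetical bookmark posts in the documented format
posts = [
    {"timestamp": 1257087600, "url": "http://example.org",
     "hashtags": ["news", "science"]},
    {"timestamp": 1257091200, "url": "http://example.com",
     "hashtags": ["science"]},
]

# Each tag on each post is one triple; counting tags gives a
# rough popularity ranking of the folksonomy's vocabulary.
tag_counts = Counter(tag for p in posts for tag in p["hashtags"])
print(tag_counts["science"])  # 2
```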

Dataset statistics:

  • Dataset size: 61,665 posts (approximately 430,000 triples)
  • File size: 12MB uncompressed
  • Time period: Nov 1, 2009 – Nov 30, 2009

Data: givealink-nov-2009.tgz (2MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Markines06GAL,
    author = {Markines, B. and Stoilova, L. and Menczer, F.},
    title = {Bookmark hierarchies and collaborative recommendation},
    booktitle = {Proc. 21st National Conference on Artificial Intelligence (AAAI-06)},
    pages = {1375--1380},
    publisher = {AAAI Press},
    url = {http://www.aaai.org/Papers/AAAI/2006/AAAI06-216.pdf},
    year = 2006
}
@inproceedings{Stoilova05GAL,
    Author = {Stoilova, Lubomira and Holloway, Todd and Markines, Ben and Maguitman, Ana G. and Menczer, Filippo},
    Title = {GiveALink: Mining a Semantic Network of Bookmarks for Web Search and Recommendation},
    Booktitle = {Proc. KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD)},
    Url = {http://informatics.indiana.edu/fil/Papers/givealink-linkkdd.pdf},
    Year = 2005
}


4. Publications

Metadata for the complete set of PubMed records through 2012 (with part of 2013 available as well), including title, authors, and year of publication. All data originates from NLM’s PubMed database (as downloaded April 24, 2013 from the NLM FTP site) and was retrieved via the Scholarly Database.

CSV format:

PubMed ID1,title1,year of publication1,author1|author2|author3|…
PubMed ID2,title2,year of publication2,author4|author1|author5|…
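A typical use of this layout is extracting co-authorship edges from the pipe-delimited author column. A minimal sketch, assuming standard CSV quoting for titles that contain commas (the rows below are invented, not real PubMed records):

```python
import csv
import io
from itertools import combinations

# Hypothetical rows in the documented layout: PubMed ID, title,
# year of publication, and a pipe-delimited author list.
data = io.StringIO(
    '10001,"A study of X",1999,Smith J|Doe A|Lee K\n'
    '10002,"A study of Y",2001,Doe A|Smith J\n'
)

edges = set()
for pmid, title, year, authors in csv.reader(data):
    names = authors.split("|")
    # Every unordered pair of co-authors on a paper is an edge
    for a, b in combinations(sorted(names), 2):
        edges.add((a, b))

print(len(edges))  # 3
```

Keeping edges in a set deduplicates pairs that co-author multiple papers; switch to a Counter if edge weights are needed.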

Dataset statistics:

  • Dataset size: 21.5 million publications and 10.8 million authors
  • File size: 3.1GB uncompressed
  • Time period: 1809 – 2013

Data: publications-1809-2013.tar.gz (1.4GB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Light2013ISSI,
  author    = {Light, Robert P. and Polley, David E. and Börner, Katy},
  title     = {Open Data and Open Code for Big Science of Science Studies},
  booktitle = {Proceedings of International Society of Scientometrics and Informetrics Conference},
  year      = {2013},
  pages     = {1342--1356},
  url       = {http://cns.iu.edu/docs/publications/2013-light-sdb-sci2-issi.pdf}
}
@article{Rowe2009Scien,
  author  = {LaRowe, Gavin and Ambre, Sumeet Adinath and Burgoon, John W. and Ke, Weimao and Börner, Katy},
  title   = {The Scholarly Database and its Utility for Scientometrics Research},
  journal = {Scientometrics},
  year   = {2009},
  volume = {79},
  number = {2},
  month  = {May},
  url    = {http://cns.iu.edu/docs/publications/2009-larowe-sdb.pdf}
}

Dataset of 53.5 billion clicks available

IU Click Collection System

To foster the study of the structure and dynamics of Web traffic networks, we are making available to the research community a large Click Dataset of 53.5 billion HTTP requests collected at Indiana University. Between 2006 and 2010, our system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. We hope that this data will help develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

 

Click Dataset

IU Click Collection System

To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of about 53.5 billion HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests. If a packet did contain a request, the collection system logged a record with the following fields:

  • a timestamp
  • the requested URL
  • the referring URL
  • a boolean classification of the user agent (browser or bot)
  • a boolean flag for whether the request was generated inside or outside IU.

Some important notes:

  1. Traffic generated outside IU only includes requests from outside IU for pages inside IU. Traffic generated inside IU only includes requests from people at IU (about 100,000 users) for resources outside IU. These two sets of requests have very different sampling biases.
  2. No distinguishing information about the client system was retained: no MAC or IP addresses nor any unique index were ever recorded.
  3. There was no attempt at stream reassembly, and server responses were not analyzed.

During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:

  1. raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
  2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.

The dataset is broken into hourly files. The initial line of each file has a set of flags that can be ignored. Each record looks like this:

 XXXXADreferrer
 host
 path

where XXXX is the timestamp (32-bit Unix epoch in seconds, in little endian order), A is the user-agent flag (“B” for browser or “?” for other, including bots), D is the direction flag (“I” for external traffic to IU, “O” for internal traffic to outside IU), referrer is the referrer hostname or URL (terminated by newline), host is the target hostname (terminated by newline), and path is the target path (terminated by newline). For further details, please refer to the paper below.
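The record layout above can be decoded with a short parser. This is a sketch based solely on the description in this page, not the official tooling; the synthetic record at the bottom is invented for illustration:

```python
import struct

def parse_record(buf, pos=0):
    """Parse one record in the documented layout: a 4-byte
    little-endian Unix timestamp, a user-agent flag byte ('B' or '?'),
    a direction flag byte ('I' or 'O'), then three newline-terminated
    strings (referrer, host, path). Returns (record, next_position)."""
    ts, = struct.unpack_from("<I", buf, pos)       # XXXX, little endian
    agent = chr(buf[pos + 4])                      # A flag
    direction = chr(buf[pos + 5])                  # D flag
    pos += 6
    fields = []
    for _ in range(3):                             # referrer, host, path
        end = buf.index(b"\n", pos)
        fields.append(buf[pos:end].decode("latin-1"))
        pos = end + 1
    referrer, host, path = fields
    return {"timestamp": ts, "agent": agent, "direction": direction,
            "referrer": referrer, "host": host, "path": path}, pos

# Synthetic record: browser request from inside IU to an outside host
raw = (struct.pack("<I", 1160000000) + b"BO"
       + b"example.edu\nexample.org\n/index.html\n")
rec, _ = parse_record(raw)
print(rec["host"])  # example.org
```

Calling `parse_record` repeatedly with the returned position walks through an hourly file record by record.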

 

Frequently Asked Questions

How can I acknowledge use of this data?

The data was collected by Mark Meiss, with support from Indiana University. Collecting and making this data publicly available took a lot of work. If you use this data, acknowledge it by citing the following paper in your publications:

@inproceedings{Meiss08WSDM,
  title = {Ranking Web Sites with Real User Traffic},
  author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
  booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
  url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
  biburl = {http://www.bibsonomy.org/bibtex/2cfe4752489f4d3a0ab34927e72643dfd/fil},
  pages = {65--75},
  year = 2008
}

The following paper may also be of interest (however the dataset used there is not available due to IRB limitations):

@incollection{Meiss2010WAW,
  title = {Modeling Traffic on the Web Graph},
  author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
  booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
  series = {Lecture Notes in Computer Science},
  url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
  biburl = {http://www.bibsonomy.org/bibtex/2153a97ee31620b74be37bb341f268dc1/fil},
  pages = {50--61},
  volume = 6516,
  year = 2010
}

Is the data available to commercial entities? Independent researchers?

The dataset is made available for research use only. Therefore, we are only allowed to consider requests from established academic or industry research labs/organizations with a proven track record of research published in peer-reviewed venues. It is sometimes hard to determine whether a particular individual, group, or organization can be considered a research lab. Many corporations have R&D labs, which may produce white papers and the like. An organization may employ people who conduct or have conducted research. Such situations do not imply that we can share the dataset with these kinds of organizations. As it is not feasible for the data steward to make fine distinctions, we will apply simple rules of thumb. If research (and publication in peer-reviewed venues) is not the primary purpose of your organization, you will probably not qualify. This means that with rare exceptions, we will only be able to share the dataset with university research labs, or industry research labs whose work is autonomous from the for-profit activities of their corporate owners (such as MSR, IBM Research, Yahoo Research, etc).

Can you tell me more about the data or show me a sample?

Unfortunately we do not have the resources to provide more information than is available in this page or the publications from our group (see above).

Does each HTTP request record correspond to a human click?

No. Many HTTP requests are generated by bots (such as search engines and other crawlers), or by browsers fetching resources embedded in a requested page (javascript, css, images and other media, etc.).

How can I tell if any two requests are from the same person?

You cannot, by protocol design.

What about human subjects and privacy?

The dataset has been approved by the Indiana University IRB for “non-human subjects research” (protocol 1110007144).

How can I get the data?

The Click Dataset is large (~2.5 TB compressed), so it must be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment. Additionally, because the dataset might contain bits of stray personal data, you will have to sign a data security agreement. We require that you follow these instructions to request the data.


Web Science Lab

In a wide range of areas, including digital libraries, knowledge management, data mining, social media, electronic commerce, and Semantic Web, Web technologies are becoming ever more important for the sharing of data and metadata and for the management of knowledge. As such, the Web Science Lab (WSL) works to develop better methods to model, share, link and integrate data and information to enhance knowledge discovery and dissemination.

Website:  swl.slis.indiana.edu

Web Science Lab and Web Science Network


We welcome the Web Science Lab to our center! This underscores our ongoing collaborations in the emerging discipline of Web Science. Since February 2012, CNetS has been a member of WSTNet, an international network bringing together world-class research laboratories to support the Web Science research and education program. The Web Science Network of Laboratories combines some of the world’s leading academic researchers in Web Science with academic programs that enhance the already growing influence of Web Science. The member labs, from institutions that also include USC, MIT, Northwestern, Oxford, and Southampton among others, provide valuable support for the ongoing development of Web Science. Contributions from the labs include the organization and hosting of summer schools, workshops, and meetings, including the WebSci conference series.

Paper on Web traffic modeling presented at WAW 2010

Dr. Mark Meiss

On December 16, Mark Meiss presented our paper “Modeling Traffic on the Web Graph” (with Bruno, José, Sandro, and Fil) at the 7th Workshop on Algorithms and Models for the Web Graph (WAW 2010), at Stanford. In this paper we introduce an agent-based model that explains many statistical features of aggregate and individual Web traffic data through realistic elements such as bookmarks, tabbed browsing, and topical interests.

New article published in Physical Review Letters puts forth new model of online popularity dynamics

Online popularity can be thought of as analogous to an earthquake; it is sudden, unpredictable, and the effects are severe. While shifts in online popularity are not inherently destructive – consider the unprecedented magnitude of online giving via Twitter following the disaster in Haiti – they indicate radical swings in society’s collective attention. Given the increasingly profound effect that large-scale opinion formation has on important phenomena like public policy, culture, and advertising profits, understanding this behavior is essential to understanding how the world operates.

In this paper by Ratkiewicz and colleagues, the authors put forth a web-wide analysis that includes large-scale data sets of the online behaviors of millions of people. The paper offers a novel model that is capable of reproducing all of the observed dynamics of online popularity through a mechanism that causes sudden, nonlinear bursts of collective attention. These results have been mentioned on the APS and PhysOrg websites.