Tag Archives: bias

Talk by Ricardo Baeza-Yates: Data and Algorithmic Bias in the Web

Speaker: Ricardo Baeza-Yates, Universitat Pompeu Fabra, Spain & Universidad de Chile
Title: Data and Algorithmic Bias in the Web
Date: 04/22/2016
Time: 9am
Room: Info East 122
Abstract: The Web is the largest public big data repository that humankind has created. In this overwhelming data ocean we need to be aware of the quality and in particular, of biases that exist in this data, such as redundancy, spam, etc. These biases affect the algorithms that we design to improve the user experience. This problem is further exacerbated by biases that are added by these algorithms, especially in the context of search and recommendation systems. They include ranking bias, presentation bias, position bias, etc. We give several examples and their relation to sparsity, novelty, and privacy, stressing the importance of the user context to avoid these biases.
Bio: Ricardo Baeza-Yates's areas of expertise are information retrieval, web search and data mining, data science, and algorithms. He was VP of Research at Yahoo Labs, based in Barcelona, Spain, and later in Sunnyvale, California, from January 2006 to February 2016. He is a part-time Professor at DTIC of the Universitat Pompeu Fabra in Barcelona, Spain. Until 2004 he was Professor and founding director of the Center for Web Research at the Dept. of Computing Science of the University of Chile. He obtained a Ph.D. in CS from the University of Waterloo, Canada, in 1989. He is co-author of the best-selling textbook Modern Information Retrieval, published by Addison-Wesley in 2011 (2nd ed.), which won the ASIST 2012 Book of the Year award. From 2002 to 2004 he served as an elected member of the board of governors of the IEEE Computer Society, and in 2012 he was elected to the ACM Council. Since 2010 he has been a founding member of the Chilean Academy of Engineering. In 2009 he was named ACM Fellow and in 2011 IEEE Fellow, among other awards and distinctions.


The aim of this project is to characterize, study and model various sources of bias that emerge from the complex network structure of the Web, social media, and search engines.

Social bias

Social media have become a prevalent channel to access information, spread ideas, and influence opinions. However, it has been suggested that social and algorithmic filtering may cause exposure to less diverse points of view. In the paper Measuring Online Social Bubbles we quantitatively measure this kind of social bias at the collective level by mining a massive dataset of web clicks. Our analysis shows that collectively, people access information from a significantly narrower spectrum of sources through social media and email, compared to a search baseline. To assess what this finding means for individual exposure, we investigate the relationship between the diversity of information sources experienced by users at the collective and individual levels in two datasets where individual users can be analyzed: Twitter posts and search logs. There is a strong correlation between collective and individual diversity, supporting the notion that when we use social media we find ourselves inside “social bubbles.” Our results could lead to a deeper understanding of how technology biases our exposure to new information. A release about this work got some press coverage and an extended version of this paper is in preparation.
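To make the idea of a "narrower spectrum of sources" concrete, here is a minimal sketch of one common diversity measure, the Shannon entropy of the distribution of clicks over sources. It is an illustration only: the function name `source_entropy` and the toy click lists are hypothetical, and the paper should be consulted for the exact measure and data it uses.

```python
import math
from collections import Counter


def source_entropy(clicks):
    """Shannon entropy (in bits) of the distribution of sources in a click list.

    Higher values mean clicks are spread over more sources more evenly.
    An empty list is treated as having zero diversity.
    """
    counts = Counter(clicks)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data: each entry is the source (domain) of one click.
social_clicks = ["a.com", "a.com", "a.com", "b.com"]   # concentrated on one source
search_clicks = ["a.com", "b.com", "c.com", "d.com"]   # spread evenly over four

print("social:", source_entropy(social_clicks))
print("search:", source_entropy(search_clicks))
```

In this toy example the search clicks have the higher entropy because they are spread evenly over four sources, while the social clicks concentrate on one.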

Gender bias in Wikipedia

Contributing to the writing of history has never been as easy as it is today. Anyone with access to the Web is able to play a part on Wikipedia, an open and free encyclopedia, and arguably one of the primary sources of knowledge on the Web. In our paper First Women, Second Sex: Gender Bias in Wikipedia we study gender bias in Wikipedia in terms of how women and men are characterized in their biographies. To do so, we analyze biographical content in three aspects: meta-data, language, and network structure. Our results show that, indeed, there are differences in characterization and structure. Some of these differences are reflected from the off-line world documented by Wikipedia, but other differences can be attributed to gender bias in Wikipedia content. We contextualize these differences in social theory and discuss their implications for Wikipedia policy. This work was covered in Wikimedia Research Newsletter. An extended journal version titled Women through the glass ceiling: gender asymmetries in Wikipedia also shows that women in Wikipedia are more notable than men, which we interpret as the outcome of a subtle glass ceiling effect.

Popularity bias in search


The feedback loops between users searching information, users creating content, and the ranking algorithms of search engines that mediate between them, lead to surprising results. We are studying how all these systems and communities influence and feed on each other in a dynamic information ecology, and how these interactions affect their evolution and their impact on the global processes of information discovery, retrieval, and utilization.

For example, studying the relationship between Web traffic and PageRank, we have shown that given the heterogeneity of topical interests expressed by search queries, search engines mitigate the popularity bias generated by the rich-get-richer structure of the Web graph. These results, dispelling the feared Googlearchy effect, have been published in Proc. Natl. Acad. Sci. USA, presented at the WAW 2006 keynote (slides), and generated some media attention. You can see some movies demonstrating the finding. The result also inspired a robust rank-based model of scale-free network growth, published in Phys. Rev. Lett. (press release).
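As a rough illustration of what rank-based network growth can look like, the sketch below attaches each new node to an existing node chosen with probability proportional to a negative power of that node's rank. Everything specific here is an assumption made for the example and not a reproduction of the published model: ranking by current degree, one link per new node, the exponent `alpha`, and the function name `grow_by_rank`.

```python
import random


def grow_by_rank(n_nodes, alpha=1.0, seed=0):
    """Grow a network one node at a time using rank-based attachment.

    Assumptions (illustrative only): nodes are ranked by current degree,
    each new node creates exactly one link, and the chance of linking to
    the node of rank r is proportional to r ** -alpha. Requires n_nodes >= 2.
    """
    rng = random.Random(seed)
    degree = [1, 1]       # start from two nodes joined by a single edge
    edges = [(0, 1)]
    for new in range(2, n_nodes):
        # Rank existing nodes by degree, highest first (rank 1 = most connected).
        ranked = sorted(range(new), key=lambda v: -degree[v])
        weights = [(r + 1) ** -alpha for r in range(new)]
        target = rng.choices(ranked, weights=weights, k=1)[0]
        edges.append((new, target))
        degree[target] += 1
        degree.append(1)  # the new node starts with its one outgoing link
    return edges, degree


edges, degree = grow_by_rank(50)
print("largest degree:", max(degree))
```

Because only the ordering of nodes matters, not the exact values of the ranking criterion, a rule of this kind needs less information than attachment proportional to degree itself.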

We also study sources of bias that stem from legal, political, or economic factors. The CENSEARCHIP tool visualizes the differences between results obtained from different search engines, or different country versions of a search engine. This tool, based on a technique described in this paper in First Monday, generated a lot of reactions in the media and the blogosphere (press release).
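One simple way to quantify how much two result lists differ is the Jaccard overlap of their result sets. The sketch below is only an illustration of that general idea and is not the technique used by the tool, which is described in the First Monday paper; the function name `jaccard_overlap` and the example URLs are hypothetical.

```python
def jaccard_overlap(results_a, results_b):
    """Jaccard similarity of two result lists, ignoring rank order.

    Returns 1.0 for identical sets and 0.0 for disjoint ones.
    Two empty lists are treated as identical.
    """
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical results for the same query from two versions of a search engine.
version_one = ["site-x.example", "site-y.example", "site-z.example"]
version_two = ["site-y.example", "site-z.example", "site-w.example"]

print(jaccard_overlap(version_one, version_two))
```

A low overlap for the same query flags result sets worth inspecting side by side; the measure ignores ranking, so it says nothing about where in the list the differences occur.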

Project Participants (some of them)

Fil Menczer, PI
Sandro Flammini
Alex Vespignani
Santo Fortunato
Mark Meiss
Dimitar Nikolov


Mark Meiss is supported by the Advanced Network Management Laboratory, which is one of the Pervasive Technology Labs established at Indiana University with the assistance of the Lilly Endowment.
Santo Fortunato was supported by a Volkswagen Foundation grant.
This research is also supported in part by the National Science Foundation under awards 0348940, 0513650, and 0705676.

Opinions, findings, conclusions, recommendations or points of view of this group are those of the authors and do not necessarily represent the official position of the National Science Foundation, the Volkswagen Foundation, or Indiana University.

Talks: Torino, Padova, Genova, Roma

This sabbatical is providing wonderful opportunities for me to present our work and establish/strengthen collaborations with several groups in Italy. Recently I have given invited seminars on social search at the Department of Informatics at the University of Torino (hosts Matteo Sereno and Mino Anglano) and on Web traffic at the Department of Math at the University of Padova (host Massimo Marchiori). In the next few weeks I will give a talk on social search at the Department of Informatics and Information Science at the University of Genova (host Marina Ribaudo) and one on search engine bias and Web modeling at my old stomping ground, the Institute of Cognitive Sciences and Technologies of the National Research Council in Rome (host my undergraduate advisor and mentor Domenico Parisi).

CSI Piemonte

No, it’s not an Italian spin-off of the popular TV show. CSI Piemonte is organizing a meeting on Understanding Complexity: a Journey through Science to be held November 22-23 at the Lingotto Convention Center here in Torino. We will have demos and posters on 6S, GiveALink, and the egalitarian effect of search engines. I look forward in particular to seeing my good old friend Dario and my mentor, Domenico.

Researchers throttle notion of search engine dominance

Search engines are not biased towards well-known Web sites. In fact, they actually produce an egalitarian effect as to where traffic is directed, say researchers at the Indiana University School of Informatics. Their study, Topical interests and the mitigation of search engine bias, appears in the Aug. 7-11 issue of the Proceedings of the National Academy of Sciences and challenges the “Googlearchy” theory – the perception that search engines push Web traffic toward popular sites, thus creating a monopoly over lesser-known sites.

The study was cited by New Scientist, MIT Technology Review, Scientific American MIND, New Scientist Online, UPI, VNUnet, Forskning & Framsteg (Sweden), Sole 24 Ore (Italy), Ars Technica, and Slashdot. Interviews aired on BBC World Service (MP3), Deutschlandradio (MP3), WFHB (MP3), and WFIU. Earlier, preliminary reports of our findings appeared in The Economist, Slashdot, PhysicsWeb, IDS, Le Scienze (Italian Edition of Scientific American), and IEEE Spectrum Online (see also our piece in IEEE Spectrum). Radio interviews were broadcast by Italian Radio (MP3 in Italian) and Swiss Radio (MP3 in Italian). Other news sources that picked up the story include Monsters and Critics, PhysOrg, TechNews Daily, Political Gateway, Daily India, ACM TechNews (Aug 9, Aug 28 2006), IT Week, Science Daily, EurekAlert, computing, LaboratoryTalk, PC World, SDA Asia, What PC, BrightSurf, PC Authority, TRN, and hundreds of blogs.


CenSEARCHip received intense coverage, including in Slashdot, Network World, PhysOrg, IDS, ACM TechNews, Technology News Daily, Computer World, CCNews, ePrairie, PC World, LaboratoryTalk, Search Engine Journal, USA Today, dozens of news sources around the world (including France, Sweden, Norway, Poland, Russia, Italy, Mexico, etc.), and many blogs around the world (list from technorati or google). A radio interview aired on WFIU, WIBC and other NPR affiliates (20 March 2006).