Fil received a $10k Yahoo! Faculty Research and Engagement award to study methodologies to infer user intents and goals based on social similarity measures. This project is in collaboration with Debora Donato of Yahoo! Labs, and Luca Aiello and Giancarlo Ruffo of the University of Torino, Italy.
The IEEE Spectrum piece Real-Time Search Stumbles Out of the Gate discusses the recent integration of real-time search features, such as Twitter and other microblog entries, into major search engines. Professor Filippo Menczer, CNetS associate director, comments in the article on the challenges posed by real-time search. Here is an excerpt of the interview:
IU’s Menczer suggests that with all this user-generated content, the environment is more complex than the one Google’s PageRank algorithm had to deal with. While search used to be about relationships between pages, he explains, now it’s about relationships between “people, tags, Web pages, ratings, votes, and direct social links…. It may not be that page A points to page B but rather that user John follows Mary and replies to the tweet of Jane and retweets it.” That makes it “a more complicated ecosystem,” he says, “but a very rich one,” and search engines will need “more sophisticated ways to extract data from these relationships” […]
The aim of this project is to characterize, study and model various sources of bias that emerge from the complex network structure of the Web, social media, and search engines. Some of the questions we’re currently exploring concern how social, cognitive, and algorithmic biases lead to the emergence of information overload and online echo chambers that make us more vulnerable to abuse and manipulation.
Social and cognitive biases
Social media have become a prevalent channel to access information, spread ideas, and influence opinions. However, it has been suggested that social and algorithmic filtering may cause exposure to less diverse points of view. In the paper Measuring Online Social Bubbles we quantitatively measure this kind of social bias at the collective level by mining a massive dataset of web clicks. Our analysis shows that collectively, people access information from a significantly narrower spectrum of sources through social media and email, compared to a search baseline. To assess what this finding means for individual exposure, we investigate the relationship between the diversity of information sources experienced by users at the collective and individual levels in two datasets where individual users can be analyzed: Twitter posts and search logs. There is a strong correlation between collective and individual diversity, supporting the notion that when we use social media we find ourselves inside “social bubbles.” Our results could lead to a deeper understanding of how technology biases our exposure to new information. A release about this work got some press coverage and an extended version of this paper is in preparation.
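To make the idea of "diversity of information sources" concrete, here is a minimal sketch that scores a click log by the Shannon entropy of its distribution over sources. Entropy is one common diversity measure and is used here as an assumption, not necessarily the exact measure from the paper. The function name and the toy click logs are hypothetical and carry no real data.

```python
import math
from collections import Counter


def source_entropy(clicks):
    """Shannon entropy (in bits) of how clicks are spread over sources.

    Assumption: diversity is measured as entropy; a higher value means
    traffic is spread more evenly over more sources.
    """
    counts = Counter(clicks)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p), summed over sources, with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical toy click logs, one entry per click (not real data).
search_clicks = ["a.com", "b.com", "c.com", "d.com"]
social_clicks = ["a.com", "a.com", "a.com", "b.com"]

print(source_entropy(search_clicks))  # evenly spread over four sources
print(source_entropy(social_clicks))  # concentrated on one source
```

In this toy example the social log is concentrated on a single source, so it scores lower than the search log. That is the kind of gap the collective-level comparison looks for.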
We also find that the combination of social media mechanisms and cognitive biases such as limited attention and information overload may explain the viral spread of low-quality information, such as the digital misinformation that threatens our democracy. We develop a stylized model of an online social network, where individual agents prefer quality information, but have behavioral limitations in managing a heavy flow of information. The model predicts that in realistic conditions, low-quality information is just as likely to go viral as high-quality information, providing an interpretation for the high volume of misinformation we observe online.
Contributing to the writing of history has never been as easy as it is today. Anyone with access to the Web is able to play a part on Wikipedia, an open and free encyclopedia, and arguably one of the primary sources of knowledge on the Web. In our paper First Women, Second Sex: Gender Bias in Wikipedia we study gender bias in Wikipedia in terms of how women and men are characterized in their biographies. To do so, we analyze biographical content in three aspects: meta-data, language, and network structure. Our results show that, indeed, there are differences in characterization and structure. Some of these differences are reflected from the off-line world documented by Wikipedia, but other differences can be attributed to gender bias in Wikipedia content. We contextualize these differences in social theory and discuss their implications for Wikipedia policy. This work was covered in Wikimedia Research Newsletter. An extended journal version titled Women through the glass ceiling: gender asymmetries in Wikipedia also shows that women in Wikipedia are more notable than men, which we interpret as the outcome of a subtle glass ceiling effect.
The feedback loops between users searching for information, users creating content, and the ranking algorithms of search engines that mediate between them lead to surprising results. We are studying how all these systems and communities influence and feed on each other in a dynamic information ecology, and how these interactions affect their evolution and their impact on the global processes of information discovery, retrieval, and utilization.
For example, studying the relationship between Web traffic and PageRank, we have shown that given the heterogeneity of topical interests expressed by search queries, search engines mitigate the popularity bias generated by the rich-get-richer structure of the Web graph. These results, dispelling the feared Googlearchy effect, have been published in Proc. Natl. Acad. Sci. USA, presented at the WAW 2006 keynote (slides), and generated some media attention. You can see some movies demonstrating the finding. The result also inspired a robust rank-based model of scale-free network growth, published in Phys. Rev. Lett. (press release).
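As a rough illustration of what a rank-based growth rule can look like, the sketch below adds nodes one at a time and links each new node to an existing node chosen with probability proportional to its rank raised to the power -alpha. Several things here are assumptions made for illustration rather than details taken from the published model: ranking by current in-degree, breaking ties by age, adding one link per new node, and all function and parameter names.

```python
import random


def grow_rank_based_network(n_nodes, alpha=1.0, seed=0):
    """Grow a directed network with a rank-based attachment rule.

    Assumptions: existing nodes are ranked by in-degree (rank 1 = highest,
    ties broken by age), and each new node creates exactly one link.
    """
    rng = random.Random(seed)
    in_degree = [0]  # node 0 starts alone
    edges = []
    for new in range(1, n_nodes):
        # Sort existing nodes so that position 0 holds the rank-1 node.
        ranked = sorted(range(new), key=lambda v: (-in_degree[v], v))
        # Linking probability is proportional to rank ** -alpha.
        weights = [(r + 1) ** -alpha for r in range(new)]
        target = rng.choices(ranked, weights=weights, k=1)[0]
        edges.append((new, target))
        in_degree[target] += 1
        in_degree.append(0)
    return edges, in_degree


edges, in_degree = grow_rank_based_network(200, alpha=1.0, seed=1)
print(max(in_degree), sum(in_degree) / len(in_degree))
```

A rule of this kind needs only the relative ordering of nodes, not the exact value of the quantity used to rank them.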
Most recently we have identified the conditions in which popularity may be a viable proxy for quality content by studying a simple model of a cultural market endowed with an intrinsic notion of quality. A parameter representing the cognitive cost of exploration controls the critical trade-off between quality and popularity. There is a regime of intermediate exploration cost where an optimal balance exists, such that choosing what is popular actually promotes high-quality items to the top. Outside of these limits, however, popularity bias is more likely to hinder quality.
We also studied sources of bias that stem from legal, political, or economic factors. The CENSEARCHIP tool visualizes the differences between results obtained from different search engines, or different country versions of a search engine. This tool, based on a technique described in this paper in First Monday, generated a lot of reactions in the media and the blogosphere (press release).
Mark Meiss was supported by the Advanced Network Management Laboratory, one of the Pervasive Technology Labs established at Indiana University with funding from the Lilly Endowment.
Santo Fortunato was supported by a Volkswagen Foundation grant.
Diego Fregolente was supported by the J.S. McDonnell Foundation.
This research was also supported in part by the National Science Foundation under awards 0348940, 0513650, and 0705676.
Opinions, findings, conclusions, recommendations or points of view of this group are those of the authors and do not necessarily represent the official position of the National Science Foundation, the Volkswagen Foundation, the McDonnell Foundation, or Indiana University.
The goal of Sixearch (carl.cs.indiana.edu/6S) is to provide an open-source platform for developing a context-aware, personalized, peer-to-peer (P2P) distributed information retrieval system. The application currently supports scalable collaborative Web search.
Sixearch uses the idea of modeling neighbor nodes by their content but without assuming the presence of special directory hubs. As shown on the left, each peer is both a (limited) directory hub and a content provider; it has its own local search engine and a topical crawler guided by its user’s information content. Peer communication is built on the JXTA platform. When a user submits a query, it is first matched against the local engine, and then routed to neighbor peers to obtain more results. Ideally, the peer network should lead to the emergence of a clustered topology by intelligent collaboration between the peers. While traditional search engines such as Google and Yahoo provide access to very large document collections, the Sixearch P2P Web search application provides a complementary way for users to actively and collaboratively share their own document collections. However, the Sixearch framework allows traditional search engines to naturally be included as peers; such peers would quickly emerge as reliable, trustworthy, and general authority nodes.
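The query flow described above (match locally first, then route to the neighbors whose content looks most relevant) can be sketched in a few lines. This is a simplified illustration, not the Sixearch implementation: the Peer class, the term-overlap scoring, and the fanout parameter are all hypothetical, and the sketch leaves out the JXTA layer and any learning of neighbor profiles.

```python
class Peer:
    """A toy peer that is both a content provider and a small directory."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents  # maps doc id -> text
        self.neighbors = []

    def profile(self):
        # Model a peer by the set of terms in its content (an assumption).
        terms = set()
        for text in self.documents.values():
            terms.update(text.lower().split())
        return terms

    def local_search(self, query):
        # Return (peer name, doc id) for documents sharing a query term.
        q = set(query.lower().split())
        return [(self.name, doc_id)
                for doc_id, text in self.documents.items()
                if q & set(text.lower().split())]

    def search(self, query, fanout=2):
        # Step 1: match the query against the local engine.
        results = self.local_search(query)
        # Step 2: route to the neighbors whose profiles best match the query.
        q = set(query.lower().split())
        ranked = sorted(self.neighbors,
                        key=lambda p: len(q & p.profile()), reverse=True)
        for peer in ranked[:fanout]:
            if q & peer.profile():  # skip neighbors with no term overlap
                results.extend(peer.local_search(query))
        return results


# Hypothetical three-peer network.
alice = Peer("alice", {"d1": "complex networks"})
bob = Peer("bob", {"d2": "web search engines"})
carol = Peer("carol", {"d3": "cooking recipes"})
alice.neighbors = [bob, carol]

print(alice.search("search engines"))
```

Here the query finds nothing in alice's own collection and is answered by bob, while carol is skipped because her profile shares no terms with the query.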
The right figure displays a screenshot of the queries being sent among peers. Peer interactions are visualized by an applet. The area of each node is proportional to the size of its Web index. The edges represent the queries exchanged between two peers. The connectivity of each peer is an indirect measure of centrality, authority, and/or reliability of the peer as learned by the other peers.
Our work on Sixearch has been published in AAAI Magazine (preprint), and presented at Hypertext 2009 (demo), ACM SAC2009 (paper), RIAO2007 (demo), ACM CIKM P2PIR2006 (paper), WTAS2005 (paper), WWW2005 (poster), and WWW2004 (poster). Unfortunately, we have also been the victims of shameless plagiarism.
Visit Sixearch to learn more and download the application or contribute to it!
Members & Collaborators
The Sixearch project is based upon work supported by the National Science Foundation under awards No. IIS-0133124 and IIS-0348940, CAREER: Scalable Search Engines Via Adaptive Topic-Driven Crawlers.
Recently, the Sixearch project received the IBM 2007 UIMA award for 6S: A Collaborative Web Search Network.
Networks & agents Network (NaN)
NaN is a research group exploring the modeling, simulation, and analysis of complex social and information networks, and the human and artificial agents who live in these networks. Broadly speaking our research spans network science, data science, web science, and computational social science. Recently our focus has been on modeling the dynamic processes that occur online (how information networks grow and evolve, how memes go viral, how social media can be manipulated for the spread of misinformation, how attention bursts and other traffic patterns emerge, etc.) and on the design of tools to make the Web and social media ‘better’ (more trustworthy, reliable, intelligent, autonomous, robust, personalized, contextual, scalable, adaptive, and so on). We collaborate with colleagues at the IU Network Science Institute (IUNI), ISI Foundation, Yahoo Research, and many other institutions.
Active NaN projects
Archived NaN projects
This sabbatical is providing wonderful opportunities for me to present our work and establish/strengthen collaborations with several groups in Italy. Recently I have given invited seminars on social search at the Department of Informatics at the University of Torino (hosts Matteo Sereno and Mino Anglano) and on Web traffic at the Department of Math at the University of Padova (host Massimo Marchiori). In the next few weeks I will give a talk on social search at the Department of Informatics and Information Science at the University of Genova (host Marina Ribaudo) and one on search engine bias and Web modeling at my old stomping ground, the Institute of Cognitive Sciences and Technologies of the National Research Council in Rome (host my undergraduate advisor and mentor Domenico Parisi).
I just got back from a visit to Yahoo! Research Silicon Valley. I gave two talks presenting our work on social search and web traffic analysis, and met lots of interesting people. They have an amazing group and of course mountains of data to lust after. Hopefully this will lead to collaborations in the future, given the many intersecting research interests.
I will give a talk on social search at the Workshop on Social Data Mining and Knowledge Building, part III of the Mathematics of Knowledge and Search Engines program. The workshop, organized by IPAM, will be held 5–9 November 2007 at UCLA. Joining me as speakers are Luis Rocha and Stan Wasserman from IU and Santo Fortunato and Jose Ramasco from ISI/CNLL. Should be fun!