Tag Archives: Truthy

Hoaxy: A Platform for Tracking Online Misinformation

diffusion networks of hoaxes in Twitter
Misinformation (yellow/brown) spreads within the healthy (blue) Twittersphere network. Left: chemtrails conspiracies mix with conversations about the sky. Right: antivax campaigns penetrate discussions about the flu.

UPDATE (21 Dec 2016): we just launched Hoaxy, our open platform to visualize the online spread of claims and fact checking.


DESPIC team presents Bot Or Not demo and six posters at DoD meeting

The DESPIC team at the Center for Complex Networks and Systems Research (CNetS) presented a demo of a new tool named BotOrNot at a DoD meeting held in Arlington, Virginia on April 23-25, 2014. BotOrNot (truthy.indiana.edu/botornot) is a tool to automatically detect whether a given Twitter user is a social bot or a human. Trained on Twitter bots collected by our lab and the infolab at Texas A&M University, BotOrNot analyzes over a thousand features from the user’s friendship network, content, and temporal information in real time and estimates the degree to which the account may be a bot. In addition to the demo, the DESPIC team (including colleagues at the University of Michigan) presented several posters on Scalable Architecture for Social Media Observatory, Meme Clustering in Streaming Data, Persuasion Detection in Social Streams, High-Resolution Anomaly Detection in Social Streams, and Early Detection and Analysis of Rumors. See more coverage of BotOrNot on PCWorld, IDS, BBC, Politico, and MIT Technology Review.

Datasets

Web Science 2014 Data Challenge

The datasets described below are used in the Web Science 2014 Data Challenge. For more information, please see the call for participation. For updates, see the Data Challenge section of the Web Science 2014 website.

There are 4 datasets in this collection. Each is available as a .tar.gz file containing either .json or .csv files. When the JSON format is used, each .json file contains a single JSON object. The format of that object is dependent on the dataset. See below for details. The datasets have been prepared by Dimitar Nikolov.

1. Web Traffic

A collection of Web (HTTP) requests for the month of November 2009. This is a small sample of the larger click dataset, documented here. (More on Web Traffic project).

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'from': '...', # the referrer host
    'to': '...', # the target host
    'count': 1234 # the number of requests between the referrer and target hosts that occurred within the given hour
}

The data has been aggregated for every hour of the day. Thus, if more than one request occurred from the same referrer host to the same target host between, say, 2pm and 3pm, this is reflected in the ‘count’ field of the JSON object with a timestamp for 2pm, rather than by a different JSON object with a different timestamp.
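The hourly aggregation described above can be sketched as follows. This is only an illustration, not the code used to prepare the dataset: the function name `aggregate_hourly`, the raw `(timestamp, referrer, target)` tuples, and the example hosts are all assumptions, and the sketch assumes hour boundaries fall on multiples of 3600 seconds.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600

def aggregate_hourly(requests):
    """Collapse raw (timestamp, referrer, target) requests into hourly records.

    Hypothetical helper: the input tuples are an assumed raw format,
    not something the dataset itself provides.
    """
    counts = Counter()
    for ts, referrer, target in requests:
        # Round the timestamp down to the start of its hour.
        hour_start = ts - ts % SECONDS_PER_HOUR
        counts[(hour_start, referrer, target)] += 1
    return [
        {"timestamp": hour, "from": ref, "to": tgt, "count": n}
        for (hour, ref, tgt), n in sorted(counts.items())
    ]

# Made-up requests: two within the same hour, one in the next hour.
raw = [
    (1257084000, "a.example", "b.example"),  # exactly on an hour boundary
    (1257085800, "a.example", "b.example"),  # 30 minutes later, same hour
    (1257087600, "a.example", "b.example"),  # 60 minutes later, next hour
]
records = aggregate_hourly(raw)
```

Here the first two requests collapse into a single record with a count of 2, and the third becomes its own record for the following hour.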

Dataset statistics:

  • Dataset size: 235M requests
  • File size: 2.7GB uncompressed
  • Time period: Nov 1, 2009 – Nov 22, 2009

Data: web-clicks-nov-2009.tgz (321MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Meiss08WSDM,
    title = {Ranking Web Sites with Real User Traffic},
    author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
    booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
    url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
    pages = {65--75},
    year = 2008
}
@incollection{Meiss2010WAW,
    title = {Modeling Traffic on the Web Graph},
    author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
    booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
    series = {Lecture Notes in Computer Science},
    url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
    pages = {50--61},
    volume = 6516,
    year = 2010
}


2. Twitter

A collection of records extracted from tweets for the month of November 2012 containing both #hashtags and URLs as part of the tweet. (More on Truthy project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'user_id': 12345, # an integer uniquely identifying the user who tweeted
    'hashtags': ['...', '...', '...'], # a list of hashtags used in the tweet
    'urls': ['...', '...', '...'] # a list of links used in the tweet
}
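As one possible way to work with records in this shape, the sketch below counts how often pairs of hashtags appear in the same tweet. The function name `hashtag_cooccurrence` and the sample records are hypothetical, and the sketch assumes hashtags should be compared case-insensitively, which the dataset description does not state.

```python
from collections import Counter
from itertools import combinations

def hashtag_cooccurrence(records):
    """Count unordered hashtag pairs that occur together in a tweet record."""
    pairs = Counter()
    for rec in records:
        # Deduplicate and normalize case (an assumption, see lead-in).
        tags = sorted({tag.lower() for tag in rec["hashtags"]})
        for a, b in combinations(tags, 2):
            pairs[(a, b)] += 1
    return pairs

# Made-up records following the documented field names.
sample = [
    {"timestamp": 1351728000, "user_id": 1, "hashtags": ["tcot", "p2"], "urls": ["http://example.org/a"]},
    {"timestamp": 1351728060, "user_id": 2, "hashtags": ["P2", "tcot", "teaparty"], "urls": ["http://example.org/b"]},
]
pairs = hashtag_cooccurrence(sample)
```

Because each pair is stored in sorted order, `("p2", "tcot")` and `("tcot", "p2")` are counted under the same key.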

Dataset statistics:

  • Dataset size: 27.8M tweets
  • File size: 3.5GB uncompressed
  • Time Period: Nov 1, 2012 – Nov 30, 2012

Data: tweets-nov-2012.json.gz (865MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{McKelvey:2013:DPS:2487788.2488174,
    author = {McKelvey, Karissa and Menczer, Filippo},
    title = {Design and prototyping of a social media observatory},
    booktitle = {Proceedings of the 22nd international conference on World Wide Web companion},
    series = {WWW '13 Companion},
    pages = {1351--1358},
    url = {http://dl.acm.org/citation.cfm?id=2487788.2488174},
    year = 2013
}
@inproceedings{McKelvey2013cscw,
    Author = {Karissa McKelvey and Filippo Menczer},
    Title = {{Truthy: Enabling the Study of Online Social Networks}},
    Booktitle = {Proc. 16th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion (CSCW)},
    Url = {http://arxiv.org/abs/1212.4565},
    Year = 2013
}


3. Social Bookmarking

A collection of bookmarks from GiveALink.org for the month of November 2009. (More on GiveALink project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp for when the URL was posted
    'url': '...', # the URL that was bookmarked
    'hashtags': ['...', '...', '...'] # a set of tags attached to the URL by the (anonymous) user
}

Dataset statistics:

  • Dataset size: 61,665 posts (approximately 430,000 triples)
  • File size: 12MB uncompressed
  • Time period: Nov 1, 2009 – Nov 30, 2009

Data: givealink-nov-2009.tgz (2MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Markines06GAL,
    author = {Markines, B. and Stoilova, L. and Menczer, F.},
    title = {Bookmark hierarchies and collaborative recommendation},
    booktitle = {Proc. 21st National Conference on Artificial Intelligence (AAAI-06)},
    pages = {1375--1380},
    publisher = {AAAI Press},
    url = {http://www.aaai.org/Papers/AAAI/2006/AAAI06-216.pdf},
    year = 2006
}
@inproceedings{Stoilova05GAL,
    Author = {Stoilova, Lubomira and Holloway, Todd and Markines, Ben and Maguitman, Ana G. and Menczer, Filippo},
    Title = {GiveALink: Mining a Semantic Network of Bookmarks for Web Search and Recommendation},
    Booktitle = {Proc. KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD)},
    Url = {http://informatics.indiana.edu/fil/Papers/givealink-linkkdd.pdf},
    Year = 2005
}


4. Publications

Metadata for the complete set of all PubMed records through 2012 (with part of 2013 available as well), including title, authors, and year of publication. All data provided originates from NLM’s PubMed database (as downloaded April 24, 2013 from the NLM FTP site) and was retrieved via the Scholarly Database.

CSV format:

PubMed ID1,title1,year of publication1,author1|author2|author3|…
PubMed ID2,title2,year of publication2,author4|author1|author5|…
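A minimal sketch of reading rows in this layout and deriving co-authorship pairs is shown below. It assumes that every row has exactly four fields and that titles containing commas are quoted in the usual CSV way; neither is confirmed by the description above. The function name `coauthor_edges` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from itertools import combinations

def coauthor_edges(csv_file):
    """Count how many publications each unordered author pair shares."""
    edges = Counter()
    for pubmed_id, title, year, authors in csv.reader(csv_file):
        # Authors are pipe-separated; drop empty names and duplicates.
        names = sorted({name for name in authors.split("|") if name})
        for pair in combinations(names, 2):
            edges[pair] += 1
    return edges

# Made-up rows in the documented column order.
sample = io.StringIO(
    '1,"A title, with a comma",2001,Smith J|Lee K|Doe A\n'
    "2,Another title,2002,Lee K|Smith J\n"
)
edges = coauthor_edges(sample)
```

Using the `csv` module rather than splitting on commas keeps quoted titles intact; for the real 3.1GB file the same loop can be run over an open file handle so rows are streamed rather than loaded at once.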

Dataset statistics:

  • Dataset size: 21.5 million publications and 10.8 million authors
  • File size: 3.1GB uncompressed
  • Time period: 1809 – 2013

Data: publications-1809-2013.tar.gz (1.4GB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Light2013ISSI,
  author    = {Light, Robert P. and Polley, David E. and Börner, Katy},
  title     = {Open Data and Open Code for Big Science of Science Studies},
  booktitle = {Proceedings of International Society of Scientometrics and Informetrics Conference},
  year      = {2013},
  pages     = {1342--1356},
  url       = {http://cns.iu.edu/docs/publications/2013-light-sdb-sci2-issi.pdf}
}
@article{Rowe2009Scien,
  author  = {La Rowe, Gavin and Ambre, Sumeet Adinath and Burgoon, John W. and Ke, Weimao and Börner, Katy},
  title   = {The Scholarly Database and its Utility for Scientometrics Research},
  journal = {Scientometrics},
  year   = {2009},
  volume = {79},
  number = {2},
  month  = {May},
  url    = {http://cns.iu.edu/docs/publications/2009-larowe-sdb.pdf}
}

Truthy Team Wins WICI Data Challenge

Congratulations to Przemyslaw Grabowicz, Luca Aiello, and Fil Menczer for winning the WICI Data Challenge. A prize of $10,000 CAD accompanies this award from the Waterloo Institute for Complexity and Innovation at the University of Waterloo. The Challenge called for tools and methods that improve the exploration, analysis, and visualization of complex-systems data. The winning entry, titled Fast visualization of relevant portions of large dynamic networks, is an algorithm that selects subsets of nodes and edges that best represent an evolving graph and visualizes it either by creating a movie, or by streaming it to an interactive network visualization tool. The algorithm is deployed in the movie generation tool of the Truthy system, which allows users to create, in near-real time, YouTube videos that illustrate the spread and co-occurrence of memes on Twitter. Przemek and Luca worked on this project while visiting CNetS in 2011 and collaborating with the Truthy team. Bravo!

Postdoctoral Researcher in Analysis and Modeling of Social Networks

Network of Political Retweets

[UPDATE: this position has been filled.]

The Center for Complex Networks and Systems Research has an open postdoctoral position to study how ideas propagate through complex online social networks. The position is funded by a McDonnell Foundation grant in Complex Systems. The appointment starts as early as possible after January 2013 for one year and is renewable for up to 2 additional years. The salary is competitive and benefits are generous.

The postdoc will join a dynamic and interdisciplinary team that includes computer, physical, and cognitive scientists. The postdoc will work with PIs Filippo Menczer and Alessandro Flammini, other postdocs, and several PhD students on analysis and modeling of social media data. Areas of focus will include information diffusion patterns, epidemic models for the spread of ideas, interactions between network traffic and structure dynamics, and agent-based models to explain the emergence of viral bursts of attention. Domains of study will include politics, scientific knowledge, and world events. Go to the grant page or project page for further details on the team and project.

The ideal candidate will have a PhD in computing or physical sciences; a strong background in analysis and modeling of complex systems and networks; and solid programming skills necessary to handle big data and develop large scale simulations.

To apply, email/send a CV and names and emails of three references to Tara Holbrook. Applications received by 15 December 2012 will receive full consideration, but applications will be considered until the position is filled.

Indiana University is an Equal Opportunity/Affirmative Action employer. Applications from women and minorities are strongly encouraged. IU Bloomington is vitally interested in the needs of Dual Career couples.

Thanks to KDnuggets, SOCNET, Gephi, DBWorld, Air-L, CITASA and others for help in advertising this position.

2011 Truthy Updates

WSJ video on Truthy project
Mike Conover in the WSJ's report on the Truthy project

We’re pleased to report several exciting developments in our interdisciplinary project studying information diffusion in complex online social networks. The past year has resulted in several publications. Our results on the Truthy astroturf monitoring and detection system were presented at WWW 2011 and ICWSM 2011. Research into the polarized network structure of political communication on Twitter was presented at ICWSM and received the 2011 CITASA Best Student Paper Honorable Mention. We demonstrated the feasibility of the prediction of individuals’ political affiliation from network and text data (SocialCom 2011), a machine learning application that enables large-scale instrumentation of nearly 20,000 individuals’ political behaviors, policy foci, and geospatial distribution (Journal of Information Technology and Politics). We’re also working on a paper on partisan asymmetries in online political activity surrounding the 2010 U.S. congressional midterm elections.

Our results have been widely covered in the press, including the Wall Street Journal, Science, Communications of the ACM, NPR [1,2], The Chronicle of Higher Education, Discover Magazine, The Atlantic, New Scientist, MIT Technology Review, and many more.

Current and future research is supported by an award from the NSF Interface between Computer Science and Economics & Social Sciences program, and a McDonnell Foundation grant. The former will focus on building an infrastructure for the study of information diffusion in social media, the characterization of meme spread patterns, and the development of sentiment analysis tools for social media. The latter will focus on modeling efforts, especially agent-based models of information diffusion, competition for attention, and the relationship between information sharing events and social network evolution.

Postdoctoral Researcher in Analysis and Modeling of Social Networks

Network of Political Retweets

The Center for Complex Networks and Systems Research has an open postdoctoral position to study how ideas propagate through complex online social networks. The position is funded by a McDonnell Foundation grant in Complex Systems. The appointment starts in January 2012 for one year and is renewable for up to 3 additional years. The salary is competitive and benefits are generous.

The postdoc will join a dynamic and interdisciplinary team that includes computer, physical, and cognitive scientists. The postdoc will work with PIs Filippo Menczer and Alessandro Flammini and several PhD students on analysis and modeling of social media data. Areas of focus will include information diffusion patterns, epidemic models for the spread of ideas, interactions between network traffic and structure dynamics, and agent-based models to explain the emergence of viral bursts of attention. Domains of study will include politics, scientific knowledge, and world events. Go to the grant page or project page for further details on the team and project.

The ideal candidate will have a PhD in computing or physical sciences; a strong background in analysis and modeling of complex systems and networks; and solid programming skills necessary to handle big data and develop large scale simulations.

To apply, email/send a CV and names and emails of three references to Tara Holbrook. Applications received by Oct. 24, 2011 will be given full consideration, but the position will remain open until a successful candidate is identified.

Indiana University is an Equal Opportunity/Affirmative Action employer. Applications from women and minorities are strongly encouraged. IU Bloomington is vitally interested in the needs of Dual Career couples.

Thanks to KDnuggets, SOCNET, Gephi, DBWorld, Air-L, CITASA and others for help in advertising this position.

Visualizing the Political Discourse on Twitter

Overview

Social media play an important role in shaping political discourse in the U.S. and around the world. However, empirical evidence suggests that politically active web users tend to organize into insular, homogeneous communities segregated along partisan lines [1, 2].

In its own right, the formation of online communities is not necessarily a serious problem. However, a deliberative democracy relies on a broadly informed public and a healthy ecosystem of competing ideas. The concern is that if politically active individuals can avoid people and information they would not have chosen in advance, their opinions are likely to become increasingly extreme as a result of being exposed to more homogeneous viewpoints and fewer credible opposing opinions [3].

As part of a series of ongoing studies, we have examined two networks of political communication on Twitter, made up of more than 250,000 tweets from the six weeks prior to the 2010 U.S. congressional midterm elections. Using a combination of network clustering algorithms and manually-annotated data we demonstrate that the retweet network exhibits a highly partisan structure, segregating users into two distinct communities of politically likeminded individuals. In contrast, we find that the mention network does not exhibit this kind of partisan divide. Instead, mentions form a bridge between the two communities, resulting in users being exposed to people and information they would not have been likely to choose in advance.

We hypothesize that these network structures result in part from politically motivated individuals annotating tweets with hashtags that target ideologically opposed users. We argue that this process results in users being exposed to content they are not likely to rebroadcast, but to which they may respond using mentions. In a forthcoming article we provide statistical evidence in support of this hypothesis [4].

Everyone’s an Editor


Partisan Composition of Content Streams
This chart shows the relative number of tweets produced by left- and right-leaning users across a variety of popular information streams. Users from both sides of the political divide are able to contribute content that reflects their own political views.

Hashtags on Twitter, like a radio frequency or television channel, identify content streams associated with different topics and audiences. In contrast to the mass media model, where a single organization can exercise complete editorial control over a stream’s content, hashtags allow anyone to inject their own content into an information stream. Moreover, with hashtag streams, the marginal cost of contributing content to channels with which you might not otherwise engage is almost zero. As a result, a content stream about a well-defined topic, #teaparty for example, can include information that reflects a diversity of views on the subject.

The chart above shows that for many of the most popular political hashtags, users from both sides of the political spectrum contribute a substantial volume of content. The result is that users who sample information from these streams are likely to be exposed to people, information and opinions with which they might not agree.

Networks of Political Communication

In addition to understanding what people say when they talk about politics on Twitter, one of our primary goals is to understand how people communicate with one another. To this end we collected more than 250,000 tweets containing political hashtags from the six weeks leading up to the 2010 US congressional midterm elections. By recording interactions between users we can create networks of political communication corresponding to the two primary modes of public user-user engagement, retweets and mentions.
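The construction of the two networks can be sketched as below. This is an illustrative assumption, not the study's actual pipeline: the record fields `user`, `retweet_of`, and `mentions`, the edge direction (from the acting user to the user being retweeted or mentioned), and the choice to ignore mentions inside retweets are all hypothetical.

```python
from collections import Counter

def build_networks(tweets):
    """Split user-user interactions into weighted retweet and mention edges."""
    retweets, mentions = Counter(), Counter()
    for tw in tweets:
        source = tw["user"]
        if tw.get("retweet_of"):
            # One retweet edge; mentions inside a retweet are ignored (assumption).
            retweets[(source, tw["retweet_of"])] += 1
        else:
            for target in tw.get("mentions", []):
                if target != source:
                    mentions[(source, target)] += 1
    return retweets, mentions

# Made-up interaction records.
tweets = [
    {"user": "alice", "retweet_of": "bob", "mentions": ["bob"]},
    {"user": "alice", "retweet_of": "bob", "mentions": []},
    {"user": "carol", "retweet_of": None, "mentions": ["alice", "bob"]},
]
retweet_net, mention_net = build_networks(tweets)
```

Keeping the two edge sets separate is what allows their community structure to be compared later.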

Using network clustering algorithms we identified two highly segregated communities of users in the retweet network (below). To understand whether this structure had a meaningful political interpretation we had two of the study’s authors review the tweets produced by 1,000 random users. The authors, working independently, were asked to decide whether the user expressed a ‘left-leaning’, ‘right-leaning’, or ‘undecidable’ political identity in the content of their tweets. To make sure an unbiased party could reproduce our results we compared these annotations with those of a non-author judge, and for a random 200-user subset we report excellent agreement between the authors’ annotations and those of the judge.

These annotations, taken together with the cluster data, reveal a highly partisan structure. In the retweet network, 80% of labeled users in the blue community express a left-leaning political identity, whereas 93% of labeled users in the red community express a right-leaning identity. In contrast, the mention network does not exhibit this kind of partisan structure, meaning that ideologically opposed users interact much more often using this mode of communication. This difference is particularly important with respect to political communication because it indicates that mentions act as a conduit through which users are exposed to information and opinions that reflect a diversity of political perspectives. Despite these findings, we emphasize that it’s premature to say conclusively whether this inter-ideological communication represents a constructive civil discourse, or whether it’s simply partisan flamebaiting.
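Per-community percentages of this kind amount to a tally of annotator labels within each cluster. The sketch below shows one way to compute them; the function name, the toy data, and in particular the decision to leave ‘undecidable’ users out of the denominator are assumptions, since the text does not spell out how "labeled users" was defined.

```python
from collections import Counter, defaultdict

def cluster_composition(cluster_of, label_of):
    """Return, per cluster, the fraction of labeled users with each label."""
    counts = defaultdict(Counter)
    for user, label in label_of.items():
        # Skip users without a usable label or cluster (assumption).
        if label == "undecidable" or user not in cluster_of:
            continue
        counts[cluster_of[user]][label] += 1
    return {
        cluster: {label: n / sum(tally.values()) for label, n in tally.items()}
        for cluster, tally in counts.items()
    }

# Toy data: five labeled users in the blue cluster, one undecidable.
cluster_of = {"u1": "blue", "u2": "blue", "u3": "blue", "u4": "blue", "u5": "blue", "u6": "blue"}
label_of = {"u1": "left", "u2": "left", "u3": "left", "u4": "left", "u5": "right", "u6": "undecidable"}
composition = cluster_composition(cluster_of, label_of)
```

With this toy input, four of the five labeled blue users are left-leaning, so the blue cluster's left fraction is 0.8.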


Composite Communication Network

The composite of the political mention and retweet networks (7-core shown). Mentions form a communication bridge between the two politically homogeneous retweet communities.
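For readers unfamiliar with the term, a k-core is the largest subgraph in which every node has at least k neighbors within that subgraph, so the 7-core keeps only the densely connected center of the network. A minimal sketch of the standard pruning procedure follows; it treats the graph as undirected, which is an assumption about how the figure was produced, and the function name and toy graph are illustrative.

```python
def k_core(adjacency, k):
    """Return the set of nodes in the k-core of an undirected graph.

    `adjacency` maps each node to the set of its neighbors and must be
    symmetric (if v is in adjacency[u], then u is in adjacency[v]).
    """
    adj = {node: set(neighbors) for node, neighbors in adjacency.items()}
    queue = [node for node, neighbors in adj.items() if len(neighbors) < k]
    while queue:
        node = queue.pop()
        if node not in adj:
            continue  # already removed via an earlier queue entry
        for neighbor in adj.pop(node):
            if neighbor in adj:
                adj[neighbor].discard(node)
                # Removing a node can push its neighbors below the threshold.
                if len(adj[neighbor]) < k:
                    queue.append(neighbor)
    return set(adj)

# Toy graph: a triangle (a, b, c) with a pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
two_core = k_core(graph, 2)
three_core = k_core(graph, 3)
```

In the toy graph the 2-core is the triangle alone, and no node survives at k = 3.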

Retweet Network
Among 1,000 manually-annotated users, 93% of users in the red cluster express a right-leaning political identity and 80% of users in the blue cluster express a left-leaning identity. Node colors reflect algorithmically-determined community assignments.


Mention Network

The mention network is dominated by a large politically heterogeneous cluster of users. Compared to retweets, ideologically-opposed users interact with one another much more frequently using mentions. Node colors reflect algorithmically-determined community assignments.

Combining the two networks to form a composite makes it clear that mentions form a bridge between users on the political left and right (shown in the ‘Composite Network’ figure). To explain this we come back to the fact that anyone can contribute content to a hashtag information stream. It’s quite common for users to produce tweets containing hashtags that target multiple politically opposed audiences. For example, consider the following real tweets:


User A: Please follow @Username for an outstanding progressive voice! #p2 #dems #prog #democrats #tcot

User B: Couple Aborts Twin Boys For Being Wrong Gender..http://bit.ly/xyz #tcot #christian #tlot #teaparty #p2 #prolife

 

Each of these users chose to contribute to multiple content streams whose primary audiences are likeminded individuals: #p2 (Progressives 2.0), #dems, #prog, and #democrats for A; #tcot (Top Conservatives on Twitter), #christian, #teaparty, etc. for B. The remarkable thing is that they both also chose to include one hashtag targeting users who would not likely seek out this kind of information on their own. In doing so these users were, with very little effort, able to expose ideologically-opposed consumers of the #p2 and #tcot content streams to their personal political views. Returning to the mass media model, the capacity for content injection creates a situation where literally anyone can decide what’s going to be on TV tonight.

We propose that when a user is exposed to content in this way, she will be unlikely to rebroadcast (retweet) it, but may choose to respond directly to the originator in the form of a mention. Consequently, the network of retweets would exhibit a politically segregated community structure, while the network of mentions would not. In the associated article we present statistical evidence in support of this hypothesis.

Looking Forward

This work is part of an ongoing project at the Center for Complex Networks and Systems Research at Indiana University’s School of Informatics and Computing. Various aspects of this work are slated for publication and release in venues dedicated to the computational, social and political sciences. Concurrent with the International Conference on Weblogs and Social Media we plan to release a network and hashtag dataset based on the information produced during the course of this study. If you have any questions about this or other related works, please contact Michael Conover or any of the other contributors to the project.

References

[1] Adamic, L., and Glance, N. 2005. The political blogosphere and the 2004 U.S. election: Divided they blog. In Proc. 3rd Intl. Workshop on Link Discovery (LinkKDD), 36–43.
[2] Hargittai, E.; Gallo, J.; and Kane, M. 2007. Cross-ideological discussions among conservative and liberal bloggers. Public Choice 134(1):67–86.
[3] Sunstein, C. R. 2007. Republic.com 2.0. Princeton University Press.
[4] Conover, M. D.; Ratkiewicz, J.; Francisco, M.; Gonçalves, B.; Flammini, A.; and Menczer, F. Political Polarization on Twitter. In Proc. 5th Intl. Conference on Weblogs and Social Media.