Tag Archives: Truthy

Hoaxy: A Platform for Tracking Online Misinformation

Misinformation (yellow/brown) spreads within the healthy (blue) Twittersphere network. Left: chemtrails conspiracies mix with conversations about the sky. Right: antivax campaigns penetrate discussions about the flu.

UPDATE (21 Dec 2016): we just launched Hoaxy, our open platform to visualize the online spread of claims and fact checking.


DESPIC team presents Bot Or Not demo and six posters at DoD meeting

The DESPIC team at the Center for Complex Networks and Systems Research (CNetS) presented a demo of a new tool named BotOrNot at a DoD meeting held in Arlington, Virginia on April 23-25, 2014. BotOrNot (truthy.indiana.edu/botornot) is a tool to automatically detect whether a given Twitter user is a social bot or a human. Trained on Twitter bots collected by our lab and the infolab at Texas A&M University, BotOrNot analyzes over a thousand features from the user’s friendship network, content, and temporal information in real time and estimates the degree to which the account may be a bot. In addition to the demo, the DESPIC team (including colleagues at the University of Michigan) presented several posters: Scalable Architecture for Social Media Observatory, Meme Clustering in Streaming Data, Persuasion Detection in Social Streams, High-Resolution Anomaly Detection in Social Streams, and Early Detection and Analysis of Rumors. See more coverage of BotOrNot on PCWorld, IDS, BBC, Politico, and MIT Technology Review.

Datasets

Web Science 2014 Data Challenge

The datasets described below are used in the Web Science 2014 Data Challenge. For more information, please see the call for participation. For updates, see the Data Challenge section of the Web Science 2014 website.

There are 4 datasets in this collection. Each is available as a .tar.gz file containing either .json or .csv files. When the JSON format is used, each .json file contains a single JSON object. The format of that object is dependent on the dataset. See below for details. The datasets have been prepared by Dimitar Nikolov.

1. Web Traffic

A collection of Web (HTTP) requests for the month of November 2009. This is a small sample of the larger click dataset, documented here. (More on Web Traffic project).

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'from': '...', # the referrer host
    'to': '...', # the target host
    'count': 1234 # the number of requests between the referrer and target hosts that occurred within the given hour
}

The data has been aggregated for every hour of the day. Thus, if more than one request occurred from the same referrer host to the same target host between, say, 2pm and 3pm, this is reflected in the ‘count’ field of the JSON object with a timestamp for 2pm, rather than by a different JSON object with a different timestamp.
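The hourly aggregation described above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the raw `(timestamp, from_host, to_host)` tuple shape, the function names, and the choice of UTC hour boundaries are all assumptions made for the example.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def hour_bucket(timestamp: int) -> int:
    """Round a Unix timestamp down to the start of its hour (UTC assumed)."""
    return timestamp - timestamp % SECONDS_PER_HOUR


def aggregate_hourly(requests):
    """Count requests per (hour, referrer host, target host).

    `requests` is an iterable of (timestamp, from_host, to_host) tuples,
    a hypothetical raw-log shape used only for this illustration.
    """
    counts = Counter()
    for timestamp, from_host, to_host in requests:
        counts[(hour_bucket(timestamp), from_host, to_host)] += 1
    # Emit records in the same shape as the dataset's JSON objects.
    return [
        {"timestamp": hour, "from": src, "to": dst, "count": n}
        for (hour, src, dst), n in sorted(counts.items())
    ]


raw = [
    (1257084000, "a.example", "b.example"),  # start of an hour
    (1257085800, "a.example", "b.example"),  # 30 minutes later, same hour
    (1257087600, "a.example", "b.example"),  # start of the next hour
]
records = aggregate_hourly(raw)
```

The two requests that fall within the same hour collapse into one record with a count of 2, while the third request produces a separate record for the following hour.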

Dataset statistics:

  • Dataset size: 235M requests
  • File size: 2.7GB uncompressed
  • Time period: Nov 1, 2009 – Nov 22, 2009

Data: web-clicks-nov-2009.tgz (321MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Meiss08WSDM,
    title = {Ranking Web Sites with Real User Traffic},
    author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
    booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
    url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
    pages = {65--75},
    year = 2008
}
@incollection{Meiss2010WAW,
    title = {Modeling Traffic on the Web Graph},
    author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
    booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
    series = {Lecture Notes in Computer Science},
    url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
    pages = {50--61},
    volume = 6516,
    year = 2010
}


2. Twitter

A collection of records extracted from tweets for the month of November 2012 containing both #hashtags and URLs as part of the tweet. (More on Truthy project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'user_id': 12345, # an integer uniquely identifying the user who tweeted
    'hashtags': ['...', '...', '...'], # a list of hashtags used in the tweet
    'urls': ['...', '...', '...'] # a list of links used in the tweet
}
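
As a rough sketch of how records in this format might be used, the snippet below counts how often each hashtag appears together with each URL. It assumes the records have already been loaded into Python dictionaries with the fields shown above; the function name, the sample values, and the decision to lowercase hashtags are illustrative choices, not part of the dataset.

```python
from collections import Counter
from itertools import product


def hashtag_url_pairs(records):
    """Count (hashtag, url) co-occurrences across tweet records."""
    counts = Counter()
    for record in records:
        # Lowercasing is an assumption: it treats '#TCOT' and '#tcot' as one tag.
        tags = {tag.lower() for tag in record["hashtags"]}
        urls = set(record["urls"])
        for tag, url in product(tags, urls):
            counts[(tag, url)] += 1
    return counts


# Made-up sample records in the documented shape.
sample = [
    {"timestamp": 1351728000, "user_id": 1,
     "hashtags": ["tcot", "p2"], "urls": ["http://example.com/a"]},
    {"timestamp": 1351728060, "user_id": 2,
     "hashtags": ["TCOT"], "urls": ["http://example.com/a"]},
]
pair_counts = hashtag_url_pairs(sample)
```

Using sets per tweet means a hashtag or link repeated within a single tweet is counted once for that tweet.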

Dataset statistics:

  • Dataset size: 27.8M tweets
  • File size: 3.5GB uncompressed
  • Time Period: Nov 1, 2012 – Nov 30, 2012

Data: tweets-nov-2012.json.gz (865MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{McKelvey:2013:DPS:2487788.2488174,
    author = {McKelvey, Karissa and Menczer, Filippo},
    title = {Design and prototyping of a social media observatory},
    booktitle = {Proceedings of the 22nd international conference on World Wide Web companion},
    series = {WWW '13 Companion},
    pages = {1351--1358},
    url = {http://dl.acm.org/citation.cfm?id=2487788.2488174},
    year = 2013
}
@inproceedings{McKelvey2013cscw,
    Author = {Karissa McKelvey and Filippo Menczer},
    Title = {{Truthy: Enabling the Study of Online Social Networks}},
    Booktitle = {Proc. 16th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion (CSCW)},
    Url = {http://arxiv.org/abs/1212.4565},
    Year = 2013
}


3. Social Bookmarking

A collection of bookmarks from GiveALink.org for the month of November 2009. (More on GiveALink project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp for when the URL was posted
    'url': '...', # the URL that was bookmarked
    'hashtags': ['...', '...', '...'] # a set of tags attached to the URL by the (anonymous) user
}

Dataset statistics:

  • Dataset size: 61,665 posts (approximately 430,000 triples)
  • File size: 12MB uncompressed
  • Time period: Nov 1, 2009 – Nov 30, 2009
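
The page does not define what a "triple" is. One plausible reading, assumed here purely for illustration, is that each post expands into one (timestamp, URL, tag) triple per attached tag, which would be consistent with roughly seven tags per post. A minimal sketch under that assumption:

```python
def post_to_triples(post):
    """Expand one bookmark record into (timestamp, url, tag) triples.

    The triple layout is an assumption; the dataset description does not
    spell it out.
    """
    return [(post["timestamp"], post["url"], tag) for tag in post["hashtags"]]


# Made-up sample record in the documented shape.
sample_post = {
    "timestamp": 1257033600,
    "url": "http://example.org/",
    "hashtags": ["science", "networks"],
}
triples = post_to_triples(sample_post)
```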

Data: givealink-nov-2009.tgz (2MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Markines06GAL,
    author = {Markines, B. and Stoilova, L. and Menczer, F.},
    title = {Bookmark hierarchies and collaborative recommendation},
    booktitle = {Proc. 21st National Conference on Artificial Intelligence (AAAI-06)},
    pages = {1375--1380},
    publisher = {AAAI Press},
    url = {http://www.aaai.org/Papers/AAAI/2006/AAAI06-216.pdf},
    year = 2006
}
@inproceedings{Stoilova05GAL,
    Author = {Stoilova, Lubomira and Holloway, Todd and Markines, Ben and Maguitman, Ana G. and Menczer, Filippo},
    Title = {GiveALink: Mining a Semantic Network of Bookmarks for Web Search and Recommendation},
    Booktitle = {Proc. KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD)},
    Url = {http://informatics.indiana.edu/fil/Papers/givealink-linkkdd.pdf},
    Year = 2005
}


4. Publications

Metadata for the complete set of all PubMed records through 2012 (with part of 2013 available as well), including title, authors, and year of publication. All data provided originates from NLM’s PubMed database (as downloaded April 24, 2013 from the NLM FTP site) and was retrieved via the Scholarly Database.

CSV format:

PubMed ID1,title1,year of publication1,author1|author2|author3|…
PubMed ID2,title2,year of publication2,author4|author1|author5|…
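
A minimal sketch of reading this format and deriving co-authorship pairs is shown below. It assumes that titles containing commas are quoted in the standard CSV way and that the year field is always an integer; neither is confirmed by the description above, and the function names and sample rows are made up for the example.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def parse_publications(lines):
    """Yield one dict per CSV row: PubMed ID, title, year, pipe-separated authors."""
    for pubmed_id, title, year, authors in csv.reader(lines):
        yield {
            "pubmed_id": pubmed_id,
            "title": title,
            "year": int(year),  # assumes the year is always present and numeric
            "authors": authors.split("|") if authors else [],
        }


def coauthor_pairs(publications):
    """Count how many publications each unordered author pair shares."""
    pairs = Counter()
    for pub in publications:
        # Sorting makes (A, B) and (B, A) map to the same key.
        for a, b in combinations(sorted(set(pub["authors"])), 2):
            pairs[(a, b)] += 1
    return pairs


# Made-up sample rows in the documented layout.
sample = io.StringIO(
    '1,"A title, with a comma",2009,Smith J|Doe A|Lee K\n'
    "2,Another title,2010,Doe A|Smith J\n"
)
pairs = coauthor_pairs(parse_publications(sample))
```

Author names are compared as plain strings here, so two different people with the same name would be merged; disambiguation is outside the scope of this sketch.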

Dataset statistics:

  • Dataset size: 21.5 mil publications and 10.8 mil authors
  • File size: 3.1GB uncompressed
  • Time period: 1809 – 2013

Data: publications-1809-2013.tar.gz (1.4GB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Light2013ISSI,
  author    = {Light, Robert P. and Polley, David E. and Börner, Katy},
  title     = {Open Data and Open Code for Big Science of Science Studies},
  booktitle = {Proceedings of International Society of Scientometrics and Informetrics Conference},
  year      = {2013},
  pages     = {1342--1356},
  url       = {http://cns.iu.edu/docs/publications/2013-light-sdb-sci2-issi.pdf}
}
@article{Rowe2009Scien,
  author  = {La Rowe, Gavin and Ambre, Sumeet Adinath and Burgoon, John W. and Ke, Weimao and Börner, Katy},
  title   = {The Scholarly Database and its Utility for Scientometrics Research},
  journal = {Scientometrics},
  year   = {2009},
  volume = {79},
  number = {2},
  month  = {May},
  url    = {http://cns.iu.edu/docs/publications/2009-larowe-sdb.pdf}
}

Truthy Team Wins WICI Data Challenge

Congratulations to Przemyslaw Grabowicz, Luca Aiello, and Fil Menczer for winning the WICI Data Challenge. A prize of $10,000 CAD accompanies this award from the Waterloo Institute for Complexity and Innovation at the University of Waterloo. The Challenge called for tools and methods that improve the exploration, analysis, and visualization of complex-systems data. The winning entry, titled Fast visualization of relevant portions of large dynamic networks, is an algorithm that selects the subsets of nodes and edges that best represent an evolving graph and visualizes them either by creating a movie or by streaming them to an interactive network visualization tool. The algorithm is deployed in the movie generation tool of the Truthy system, which allows users to create, in near-real time, YouTube videos that illustrate the spread and co-occurrence of memes on Twitter. Przemek and Luca worked on this project while visiting CNetS in 2011 and collaborating with the Truthy team. Bravo!

Postdoctoral Researcher in Analysis and Modeling of Social Networks


[UPDATE: this position has been filled.]

The Center for Complex Networks and Systems Research has an open postdoctoral position to study how ideas propagate through complex online social networks. The position is funded by a McDonnell Foundation grant in Complex Systems. The appointment starts as early as possible after January 2013 for one year and is renewable for up to 2 additional years. The salary is competitive and benefits are generous.

The postdoc will join a dynamic and interdisciplinary team that includes computer, physical, and cognitive scientists. The postdoc will work with PIs Filippo Menczer and Alessandro Flammini, other postdocs, and several PhD students on analysis and modeling of social media data. Areas of focus will include information diffusion patterns, epidemic models for the spread of ideas, interactions between network traffic and structure dynamics, and agent-based models to explain the emergence of viral bursts of attention. Domains of study will include politics, scientific knowledge, and world events. Go to the grant page or project page for further details on the team and project.

The ideal candidate will have a PhD in computing or physical sciences; a strong background in analysis and modeling of complex systems and networks; and solid programming skills necessary to handle big data and develop large scale simulations.

To apply, email/send a CV and names and emails of three references to Tara Holbrook. Applications received by 15 December 2012 will receive full consideration, but applications will be considered until the position is filled.

Indiana University is an Equal Opportunity/Affirmative Action employer. Applications from women and minorities are strongly encouraged. IU Bloomington is vitally interested in the needs of Dual Career couples.

Thanks to KDnuggets, SOCNET, Gephi, DBWorld, Air-L, CITASA and others for help in advertising this position.

2011 Truthy Updates

Mike Conover in the WSJ's report on the Truthy project

We’re pleased to report several exciting developments in our interdisciplinary project studying information diffusion in complex online social networks. The past year has resulted in several publications. Our results on the Truthy astroturf monitoring and detection system were presented at WWW 2011 and ICWSM 2011. Research into the polarized network structure of political communication on Twitter was presented at ICWSM and received the 2011 CITASA Best Student Paper Honorable Mention. We demonstrated the feasibility of the prediction of individuals’ political affiliation from network and text data (SocialCom 2011), a machine learning application that enables large-scale instrumentation of nearly 20,000 individuals’ political behaviors, policy foci, and geospatial distribution (Journal of Information Technology and Politics). We’re also working on a paper on partisan asymmetries in online political activity surrounding the 2010 U.S. congressional midterm elections.

Our results have been widely covered in the press, including the Wall Street Journal, Science, Communications of the ACM, NPR, The Chronicle of Higher Education, Discover Magazine, The Atlantic, New Scientist, MIT Technology Review, and many more.

Current and future research is supported by an award from the NSF Interface between Computer Science and Economics & Social Sciences program, and a McDonnell Foundation grant. The former will focus on building an infrastructure for the study of information diffusion in social media, the characterization of meme spread patterns, and the development of sentiment analysis tools for social media. The latter will focus on modeling efforts, especially agent-based models of information diffusion, competition for attention, and the relationship between information sharing events and social network evolution.

Postdoctoral Researcher in Analysis and Modeling of Social Networks


The Center for Complex Networks and Systems Research has an open postdoctoral position to study how ideas propagate through complex online social networks. The position is funded by a McDonnell Foundation grant in Complex Systems. The appointment starts in January 2012 for one year and is renewable for up to 3 additional years. The salary is competitive and benefits are generous.

The postdoc will join a dynamic and interdisciplinary team that includes computer, physical, and cognitive scientists. The postdoc will work with PIs Filippo Menczer and Alessandro Flammini and several PhD students on analysis and modeling of social media data. Areas of focus will include information diffusion patterns, epidemic models for the spread of ideas, interactions between network traffic and structure dynamics, and agent-based models to explain the emergence of viral bursts of attention. Domains of study will include politics, scientific knowledge, and world events. Go to the grant page or project page for further details on the team and project.

The ideal candidate will have a PhD in computing or physical sciences; a strong background in analysis and modeling of complex systems and networks; and solid programming skills necessary to handle big data and develop large scale simulations.

To apply, email/send a CV and names and emails of three references to Tara Holbrook. Applications received by Oct. 24, 2011 will be given full consideration, but the position will remain open until a successful candidate is identified.

Indiana University is an Equal Opportunity/Affirmative Action employer. Applications from women and minorities are strongly encouraged. IU Bloomington is vitally interested in the needs of Dual Career couples.

Thanks to KDnuggets, SOCNET, Gephi, DBWorld, Air-L, CITASA and others for help in advertising this position.