Tag Archives: Twitter

Cracking the stealth political influence of bots

Among the millions of real people tweeting about the presidential race, there are also a lot of accounts operated by fake people, or “bots.” Politicians and regular users alike use these accounts to increase their follower bases and push messages. PBS NewsHour science correspondent Miles O’Brien reports on how CNetS computer scientists can analyze Twitter handles to determine whether or not they are bots.

CASCI alumnus makes Fast Company’s most creative list

Ahmed Abdeen Hamed

Congratulations to CASCI alumnus Dr. Ahmed Abdeen Hamed, who was recognized by Fast Company magazine as one of the most creative people in the world in 2016 for his research publication entitled Twitter K-H networks in action: Advancing biomedical literature for drug search. Dr. Hamed completed his Computer Science MS degree at Indiana University in May 2005 and joined our Complex Networks & Systems track of the PhD in Informatics in the Fall of 2008. For personal reasons, he finished his PhD at the University of Vermont, but started his research in biomedical text mining with the CASCI group.

The Truth about Truthy

The Truthy project was misrepresented in ‘The Kelly File’ and several other Fox News broadcasts. Public domain photo by MattGagnon via Wikimedia Commons.

For the past four years, researchers at the Center for Complex Networks and Systems Research at the Indiana University School of Informatics and Computing have been studying the ways in which information spreads on social media networks such as Twitter. This basic research project is federally funded, like a large percentage of university research across the country.

The project, informally dubbed “Truthy,” makes use of complex computer models to analyze the sharing of information on social media to determine how popular sentiment, user influence, attention, social network structure, and other factors affect the manner in which information is disseminated. Additionally, an important goal of the Truthy project is to better understand how social media can be abused.

Since 25 Aug 2014, when a first misleading article was posted on a conservative blog, the Truthy project has come under criticism from some, including The Kelly File and Fox and Friends broadcasts by Fox News on 26 and 28 Aug 2014, who have misrepresented its goals. Contrary to these claims, the target is the study of the structural patterns of information diffusion. For example, an email sent simultaneously to a million addresses is likely spam, even if we have no automatic way to determine whether its content is true or false. The assumption behind the Truthy effort is that an understanding of the spreading patterns may facilitate the identification of abuse, independent from the nature or political color of the communication.

While the Truthy platform provides support to study the evolution of communication in all portions of the political spectrum, it is not informed by political partisanship. The machine learning algorithms used to identify suspicious patterns of information diffusion are entirely oblivious to the possible political partisanship of the messages.

Read the facts below for a primer on Truthy. More detailed information can be found on the Truthy website and in our publications.

Timeline and updates:

8/28/2014: Despite the clarifications in this post, Fox News and others continued their attacks on our research project and on the PI personally. Their accusations are based on false claims, supported by bits of text and figures selectively extracted from our writings and presented completely out of context, in misleading ways. None of the researchers were contacted for comments before these outlandish conspiracy theories were aired and published. There is a good dose of irony in a research project that studies the diffusion of misinformation becoming the target of such a powerful disinformation machine. (The video of the first segment on “The Kelly File” with misinformation about our project was later removed from the Fox News website.)

9/3/2014: David Uberti wrote an accurate account of recent events in Columbia Journalism Review.

10/18/2014: Unfortunately, the smear campaign against our research project continues, with unsupported allegations echoed in a misleading op-ed by FCC Commissioner Ajit Pai, who did not contact any of the researchers with questions about the accuracy of his allegations.

10/22/2014: Amid news reports that the chairman of the House Science, Space and Technology Committee initiated an investigation into the NSF grant supporting our project, read our interview in the Washington Post’s Monkey Cage setting the record straight about our research.

CRA, ACM, AAAI, USENIX, and SIAM write to congress about Truthy project

10/23/2014: While the House Majority Leader joins the fray, IU releases a statement in support of our work.

10/24/2014: Fox News and FCC Commissioner Pai continue to spread disinformation about our research. (The video of the interview about our project, to which we were not invited, was later removed from the Fox News website.)

10/27/2014: Some accurate coverage of the controversy appeared in Physics Today, Motherboard, Motherboard, and Indianapolis Star over the past few days.

11/3/2014: Jeffrey Mervis covers the controversy about this project in Science. We also provided additional information about our research in a slide deck embedded at the bottom of this post. 

11/4/2014: Five leading computing societies and associations (CRA, ACM, AAAI, USENIX, and SIAM) wrote a joint letter to the chairman and the committee ranking member of the House Committee on Science, Space, and Technology expressing their concern over mischaracterizations of our research.

11/7/2014: Over the past few days we have seen more coverage in Computer World, The Hill, Information Week, and Science about the reactions of the computing and science communities to the Truthy controversy.

11/11/2014: The House Science Committee Chairman sent a letter to the director of the NSF on November 10, stating that our grant “was intended to create standards for online political discussion” and that a web service developed under the grant “targeted conservative social media messages.” These allegations are false, as we have explained in this post, in the slides embedded below, and in our publications, including the one quoted in the Chairman’s letter. On the same day, the Association of American Universities released a statement on the grant inquiries by the House Science Committee.

11/21/2014: False rumors about our research continue to be spread. Some of the questions we have received suggested that our two separate project and demo websites were generating confusion, so we merged them into a redesigned research website with information and highlights about the research project, publications, demos, data, etc.

11/25/2014: Rep. Johnson and Rep. Lofgren, respectively ranking member and member of the House Committee on Science, write a letter to the committee chairman, Rep. Smith, in response to his accusations.

Facts about Truthy:

  1. Truthy is an informal nickname associated with a research project of the Center for Complex Networks and Systems Research at the IU School of Informatics and Computing. The project aims to study how information spreads on social media, such as Twitter.
  2. The project has focused on domains such as news, politics, social movements, scientific results, and trending social media topics. Researchers develop theoretical computer models and validate them by analyzing public data, mainly from the Twitter streaming API.
  3. Social media posts available through public APIs are processed without human intervention or judgment to visualize and study the spread of millions of memes. We aim to build a platform to make these analytic tools easily accessible to social scientists, reporters, and the general public.
  4. An important goal of the project is to help mitigate misuse and abuse of social media by better understanding how social media can be abused, for example when social bots are used to create the appearance of human-generated communication (hence the name “truthy”). We study whether it is possible to automatically differentiate between organic content and so-called “astroturf.”
  5. Examples of research to date include analyses of geographic and temporal patterns in movements like Occupy Wall Street, societal unrest in Turkey, the polarization of online political discourse, the use of social media data to predict election outcomes and stock market movements, and the geographic diffusion of trending topics.
  6. On the more theoretical side, we have studied how individuals’ limited attention span affects what information we propagate and what social connections we make, and how the structure of social networks can help predict which memes are likely to become viral.
  7. Hundreds of researchers across the U.S. and the world are studying similar issues based on the same data and with analogous goals — these topics were studied well before the advent of social media. In the US these research efforts are supported not only by the NSF but also by other federal funding agencies such as DoD, DARPA, and IARPA.
  8. The results of our research have been covered widely in the press, published in top peer-reviewed journals, and presented at top conferences worldwide. All papers are publicly available.


Finally, the Truthy research project is not and never was:

  • a political watchdog
  • a database to be used by the federal government to monitor the activities of those who oppose its policies
  • a government probe of social media
  • an attempt to suppress free speech or limit political speech or develop standards for online political speech
  • a way to define “misinformation”
  • a partisan political effort
  • a system targeting political messages and commentary connected to conservative groups
  • a mechanism to terminate any social media accounts
  • a database tracking hate speech

DESPIC team presents Bot Or Not demo and six posters at DoD meeting

The DESPIC team at the Center for Complex Networks and Systems Research (CNetS) presented a demo of a new tool named BotOrNot at a DoD meeting held in Arlington, Virginia on April 23-25, 2014. BotOrNot (truthy.indiana.edu/botornot) is a tool to automatically detect whether a given Twitter user is a social bot or a human. Trained on Twitter bots collected by our lab and the infolab at Texas A&M University, BotOrNot analyzes over a thousand features from the user’s friendship network, content, and temporal information in real time and estimates the degree to which the account may be a bot. In addition to the demo, the DESPIC team (including colleagues at the University of Michigan) presented several posters: Scalable Architecture for Social Media Observatory; Meme Clustering in Streaming Data; Persuasion Detection in Social Streams; High-Resolution Anomaly Detection in Social Streams; and Early Detection and Analysis of Rumors. See more coverage of BotOrNot on PCWorld, IDS, BBC, Politico, and MIT Technology Review.

Congratulations to Dr. Lilian Weng!

Lilian Weng with her PhD committee

Congratulations to Lilian Weng, who successfully defended her Informatics PhD dissertation titled Information diffusion on online social networks. The thesis provides insights into information diffusion on online social networks from three aspects: people who share information, features of transmissible content, and the mutual effects between network structure and diffusion process. The first part examines the role of limited human attention. The second part investigates properties of transmissible content, particularly in the topic space. Finally, the thesis presents studies of how network structure, particularly community structure, influences the propagation of Internet memes and how the information flow in turn affects social link formation. Dr. Weng’s work can contribute to a better and more comprehensive understanding of information diffusion in online socio-technical systems and yield applications to viral marketing, advertisement, and social media analytics. Congratulations from her colleagues and committee members: Alessandro Flammini, YY Ahn, Steve Myers, and Fil Menczer!

Datasets

Web Science 2014 Data Challenge

The datasets described below are used in the Web Science 2014 Data Challenge. For more information, please see the call for participation. For updates, see the Data Challenge section of the Web Science 2014 website.

There are 4 datasets in this collection. Each is available as a .tar.gz file containing either .json or .csv files. When the JSON format is used, each .json file contains a single JSON object. The format of that object is dependent on the dataset. See below for details. The datasets have been prepared by Dimitar Nikolov.
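
As a rough illustration of the packaging described above, the following Python sketch walks a .tar.gz archive and parses every .json member. It is not part of the official challenge tooling: the function names, the member name `sample/0001.json`, and the demo record are hypothetical, and it assumes the .json members are valid JSON (the format listings below use Python-style quotes and comments only for readability).

```python
import io
import json
import tarfile


def iter_json_objects(archive):
    """Yield (member name, parsed object) for each .json file in a .tar.gz.

    `archive` is either a filesystem path (str) or a binary file-like object.
    """
    if isinstance(archive, str):
        tar = tarfile.open(archive, mode="r:gz")
    else:
        tar = tarfile.open(fileobj=archive, mode="r:gz")
    with tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".json"):
                with tar.extractfile(member) as fh:
                    yield member.name, json.load(fh)


def build_demo_archive():
    """Build a tiny in-memory archive (hypothetical content) for the demo."""
    payload = json.dumps(
        {"timestamp": 1257033600, "from": "a.example", "to": "b.example", "count": 3}
    ).encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("sample/0001.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


for name, obj in iter_json_objects(build_demo_archive()):
    print(name, obj["count"])
```

Reading members one at a time avoids unpacking a multi-gigabyte archive to disk first; for the real files, pass the downloaded archive's path instead of the demo buffer.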

1. Web Traffic

A collection of Web (HTTP) requests for the month of November 2009. This is a small sample of the larger click dataset, documented here. (More on Web Traffic project).

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'from': '...', # the referrer host
    'to': '...', # the target host
    'count': 1234 # the number of requests between the referrer and target hosts that occurred within the given hour
}

The data has been aggregated for every hour of the day. Thus, if more than one request occurred from the same referrer host to the same target host between, say, 2pm and 3pm, this is reflected in the ‘count’ field of the JSON object with a timestamp for 2pm, rather than by a different JSON object with a different timestamp.
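
The hourly aggregation described above can be sketched in Python as follows. This is only an illustration of the idea, not the code used to prepare the dataset: the function name `aggregate_hourly`, the example hosts, and the assumption that hours are aligned to UTC are all hypothetical.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def aggregate_hourly(requests):
    """Collapse raw requests into one record per (hour, referrer, target).

    `requests` is an iterable of (unix_timestamp, referrer_host, target_host).
    Hours are assumed to be aligned to UTC.
    """
    counts = Counter()
    for ts, referrer, target in requests:
        hour_start = ts - ts % SECONDS_PER_HOUR  # floor to the start of the hour
        counts[(hour_start, referrer, target)] += 1
    return [
        {"timestamp": hour, "from": referrer, "to": target, "count": n}
        for (hour, referrer, target), n in sorted(counts.items())
    ]


# Three hypothetical requests: two within the 2pm hour, one within the 3pm hour.
raw = [
    (1257084010, "a.example", "b.example"),  # Nov 1, 2009, 14:00:10 UTC
    (1257085800, "a.example", "b.example"),  # Nov 1, 2009, 14:30:00 UTC
    (1257087600, "a.example", "b.example"),  # Nov 1, 2009, 15:00:00 UTC
]
records = aggregate_hourly(raw)
for record in records:
    print(record)
```

The two requests in the 2pm hour end up in a single record with a count of 2, while the 3pm request gets its own record.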

Dataset statistics:

  • Dataset size: 235M requests
  • File size: 2.7GB uncompressed
  • Time period: Nov 1, 2009 – Nov 22, 2009

Data: web-clicks-nov-2009.tgz (321MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Meiss08WSDM,
    title = {Ranking Web Sites with Real User Traffic},
    author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
    booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
    url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
    pages = {65--75},
    year = 2008
}
@incollection{Meiss2010WAW,
    title = {Modeling Traffic on the Web Graph},
    author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
    booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
    series = {Lecture Notes in Computer Science},
    url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
    pages = {50--61},
    volume = 6516,
    year = 2010
}


2. Twitter

A collection of records extracted from tweets for the month of November 2012 containing both #hashtags and URLs as part of the tweet. (More on Truthy project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'user_id': 12345, # an integer uniquely identifying the user who tweeted
    'hashtags': ['...', '...', '...'], # a list of hashtags used in the tweet
    'urls': ['...', '...', '...'] # a list of links used in the tweet
}
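
As one possible starting point for working with records in this format, the Python sketch below counts how often each hashtag appears together with each URL. The function name, the sample records, and the choice to lowercase hashtags are assumptions made for illustration; how records are laid out inside the downloaded file is not specified here, so the sketch operates on already-parsed dictionaries.

```python
from collections import Counter
from itertools import product


def hashtag_url_pairs(records):
    """Count tweets in which a given hashtag and URL appear together.

    Hashtags are lowercased (an assumption) so that '#TCOT' and '#tcot'
    are treated as the same tag; each pair is counted once per tweet.
    """
    pairs = Counter()
    for rec in records:
        tags = {tag.lower() for tag in rec["hashtags"]}
        urls = set(rec["urls"])
        for tag, url in product(tags, urls):
            pairs[(tag, url)] += 1
    return pairs


# Hypothetical records following the format above.
sample = [
    {"timestamp": 1351728000, "user_id": 1,
     "hashtags": ["tcot", "TCOT"], "urls": ["http://x.example/a"]},
    {"timestamp": 1351728060, "user_id": 2,
     "hashtags": ["tcot", "p2"],
     "urls": ["http://x.example/a", "http://x.example/b"]},
]
pairs = hashtag_url_pairs(sample)
for (tag, url), n in pairs.most_common():
    print(tag, url, n)
```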

Dataset statistics:

  • Dataset size: 27.8M tweets
  • File size: 3.5GB uncompressed
  • Time Period: Nov 1, 2012 – Nov 30, 2012

Data: tweets-nov-2012.json.gz (865MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{McKelvey:2013:DPS:2487788.2488174,
    author = {McKelvey, Karissa and Menczer, Filippo},
    title = {Design and prototyping of a social media observatory},
    booktitle = {Proceedings of the 22nd international conference on World Wide Web companion},
    series = {WWW '13 Companion},
    pages = {1351--1358},
    url = {http://dl.acm.org/citation.cfm?id=2487788.2488174},
    year = 2013
}
@inproceedings{McKelvey2013cscw,
    Author = {Karissa McKelvey and Filippo Menczer},
    Title = {{Truthy: Enabling the Study of Online Social Networks}},
    Booktitle = {Proc. 16th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion (CSCW)},
    Url = {http://arxiv.org/abs/1212.4565},
    Year = 2013
}


3. Social Bookmarking

A collection of bookmarks from GiveALink.org for the month of November 2009. (More on GiveALink project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp for when the URL was posted
    'url': '...', # the URL that was bookmarked
    'hashtags': ['...', '...', '...'] # a set of tags attached to the URL by the (anonymous) user
}

Dataset statistics:

  • Dataset size: 61,665 posts (approximately 430,000 triples)
  • File size: 12MB uncompressed
  • Time period: Nov 1, 2009 – Nov 30, 2009

Data: givealink-nov-2009.tgz (2MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Markines06GAL,
    author = {Markines, B. and Stoilova, L. and Menczer, F.},
    title = {Bookmark hierarchies and collaborative recommendation},
    booktitle = {Proc. 21st National Conference on Artificial Intelligence (AAAI-06)},
    pages = {1375--1380},
    publisher = {AAAI Press},
    url = {http://www.aaai.org/Papers/AAAI/2006/AAAI06-216.pdf},
    year = 2006
}
@inproceedings{Stoilova05GAL,
    Author = {Stoilova, Lubomira and Holloway, Todd and Markines, Ben and Maguitman, Ana G. and Menczer, Filippo},
    Title = {GiveALink: Mining a Semantic Network of Bookmarks for Web Search and Recommendation},
    Booktitle = {Proc. KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD)},
    Url = {http://informatics.indiana.edu/fil/Papers/givealink-linkkdd.pdf},
    Year = 2005
}


4. Publications

Metadata for the complete set of all PubMed records through 2012 (with part of 2013 available as well), including title, authors, and year of publication. All data provided originates from NLM’s PubMed database (as downloaded April 24, 2013 from the NLM FTP site) and was retrieved via the Scholarly Database.

CSV format:

PubMed ID1,title1,year of publication1,author1|author2|author3|…
PubMed ID2,title2,year of publication2,author4|author1|author5|…
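
A minimal Python sketch of reading this layout and counting co-authorships is given below. It assumes four columns per row and that titles containing commas are quoted in the usual CSV way, which is not confirmed by the description above; the function name and the sample rows are hypothetical. Author strings are used as-is, so different people with the same name are not distinguished.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def coauthor_counts(lines):
    """Count how many publications each pair of author names shares.

    `lines` is any iterable of CSV lines (for example, an open text file).
    """
    pairs = Counter()
    for row in csv.reader(lines):
        if len(row) != 4:
            continue  # skip rows that do not match the assumed four-column layout
        pmid, title, year, authors = row
        names = sorted({a.strip() for a in authors.split("|") if a.strip()})
        for first, second in combinations(names, 2):
            pairs[(first, second)] += 1
    return pairs


# Hypothetical rows following the format above.
sample = io.StringIO(
    '1,"A title, with a comma",2001,Smith J|Doe A|Lee K\n'
    "2,Another title,2003,Doe A|Smith J\n"
)
counts = coauthor_counts(sample)
for pair, n in counts.most_common():
    print(pair, n)
```

Sorting the names before pairing keeps each pair in a canonical order, so (Doe A, Smith J) and (Smith J, Doe A) are counted together.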

Dataset statistics:

  • Dataset size: 21.5 mil publications and 10.8 mil authors
  • File size: 3.1GB uncompressed
  • Time period: 1809 – 2013

Data: publications-1809-2013.tar.gz (1.4GB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Light2013ISSI,
  author    = {Light, Robert P. and Polley, David E. and Börner, Katy},
  title     = {Open Data and Open Code for Big Science of Science Studies},
  booktitle = {Proceedings of International Society of Scientometrics and Informetrics Conference},
  year      = {2013},
  pages     = {1342--1356},
  url       = {http://cns.iu.edu/docs/publications/2013-light-sdb-sci2-issi.pdf}
}
@article{Rowe2009Scien,
  author  = {La Rowe, Gavin and Ambre, Sumeet Adinath and Burgoon, John W. and Ke, Weimao and Börner, Katy},
  title   = {The Scholarly Database and its Utility for Scientometrics Research},
  journal = {Scientometrics},
  year   = {2009},
  volume = {79},
  number = {2},
  month  = {May},
  url    = {http://cns.iu.edu/docs/publications/2009-larowe-sdb.pdf}
}

National Coverage for “More Tweets, More Votes”

Findings by CNetS researchers on social media indicators of election results received significant coverage in the national press. The paper More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior by Joseph Digrazia, Karissa McKelvey, Johan Bollen, and Fabio Rojas was presented at the 2013 Meeting of the American Sociological Association in NYC. It was covered by NPR, The Wall Street Journal, MSNBC, C-SPAN, The Washington Post, The Atlantic, and many other media.