
Datasets

Web Science 2014 Data Challenge

The datasets described below are used in the Web Science 2014 Data Challenge. For more information, please see the call for participation. For updates, see the Data Challenge section of the Web Science 2014 website.

There are 4 datasets in this collection. Each is available as a .tar.gz file containing either .json or .csv files. When the JSON format is used, each .json file contains a single JSON object. The format of that object is dependent on the dataset. See below for details. The datasets have been prepared by Dimitar Nikolov.

1. Web Traffic

A collection of Web (HTTP) requests for the month of November 2009. This is a small sample of the larger click dataset, documented here. (More on Web Traffic project).

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'from': '...', # the referrer host
    'to': '...', # the target host
    'count': 1234 # the number of requests between the referrer and target hosts that occurred within the given hour
}

The data has been aggregated for every hour of the day. Thus, if more than one request occurred from the same referrer host to the same target host between, say, 2pm and 3pm, this is reflected in the ‘count’ field of the JSON object with a timestamp for 2pm, rather than by a different JSON object with a different timestamp.
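
The hourly aggregation described above can be sketched in Python. This is only an illustration of the idea, not the pipeline that produced the dataset: the function names and the raw `(timestamp, referrer, target)` input tuples are assumptions, and the host names are placeholders.

```python
from collections import Counter

def hour_bucket(ts):
    """Round a Unix timestamp down to the start of its hour."""
    return ts - (ts % 3600)

def aggregate_hourly(requests):
    """Sum requests per (hour, referrer host, target host).

    `requests` is an iterable of (timestamp, referrer, target) tuples,
    a hypothetical raw form assumed for this sketch. The result uses
    the record layout documented above.
    """
    counts = Counter()
    for ts, referrer, target in requests:
        counts[(hour_bucket(ts), referrer, target)] += 1
    return [
        {"timestamp": hour, "from": referrer, "to": target, "count": n}
        for (hour, referrer, target), n in sorted(counts.items())
    ]

# 1257084000 is 14:00 UTC on Nov 1, 2009 (placeholder hosts).
raw = [
    (1257084000, "a.example", "b.example"),  # 14:00
    (1257085800, "a.example", "b.example"),  # 14:30, same hour
    (1257087600, "a.example", "b.example"),  # 15:00, next hour
]
records = aggregate_hourly(raw)
```

The two requests between 2pm and 3pm collapse into one record with a count of 2, while the 3pm request gets its own record.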

Dataset statistics:

  • Dataset size: 235M requests
  • File size: 2.7GB uncompressed
  • Time period: Nov 1, 2009 – Nov 22, 2009

Data: web-clicks-nov-2009.tgz (321MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Meiss08WSDM,
    title = {Ranking Web Sites with Real User Traffic},
    author = {Meiss, M. and Menczer, F. and Fortunato, S. and Flammini, A. and Vespignani, A.},
    booktitle = {Proc. First ACM International Conference on Web Search and Data Mining (WSDM)},
    url = {http://informatics.indiana.edu/fil/Papers/click.pdf},
    pages = {65--75},
    year = 2008
}
@incollection{Meiss2010WAW,
    title = {Modeling Traffic on the Web Graph},
    author = {Meiss, M. and Goncalves, B. and Ramasco, J. and Flammini, A. and Menczer, F.},
    booktitle = {Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW)},
    series = {Lecture Notes in Computer Science},
    url = {http://informatics.indiana.edu/fil/Papers/abc.pdf},
    pages = {50--61},
    volume = 6516,
    year = 2010
}


2. Twitter

A collection of records extracted from tweets for the month of November 2012 containing both #hashtags and URLs as part of the tweet. (More on Truthy project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp
    'user_id': 12345, # an integer uniquely identifying the user who tweeted
    'hashtags': ['...', '...', '...'], # a list of hashtags used in the tweet
    'urls': ['...', '...', '...'] # a list of links used in the tweet
}
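
Since every record carries both hashtags and URLs, a natural first step is to count how often each hashtag appears together with each URL. The sketch below assumes the records have already been parsed into Python dictionaries with the fields shown above. The function name is hypothetical, and lowercasing hashtags is a choice made for this example, not something the dataset prescribes.

```python
from collections import Counter
from itertools import product

def hashtag_url_pairs(records):
    """Count (hashtag, url) co-occurrences across tweet records."""
    counts = Counter()
    for rec in records:
        # Assumption: treat hashtags case-insensitively and count each
        # distinct hashtag/URL pair at most once per tweet.
        tags = {tag.lower() for tag in rec["hashtags"]}
        urls = set(rec["urls"])
        counts.update(product(tags, urls))
    return counts

# Made-up records that follow the documented layout.
sample = [
    {"timestamp": 1351728000, "user_id": 1,
     "hashtags": ["TCOT", "news"], "urls": ["http://example.com/a"]},
    {"timestamp": 1351728060, "user_id": 2,
     "hashtags": ["tcot"], "urls": ["http://example.com/a", "http://example.com/b"]},
]
pair_counts = hashtag_url_pairs(sample)
```

The resulting counter can be read as a weighted bipartite graph between hashtags and URLs.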

Dataset statistics:

  • Dataset size: 27.8M tweets
  • File size: 3.5GB uncompressed
  • Time period: Nov 1, 2012 – Nov 30, 2012

Data: tweets-nov-2012.json.gz (865MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{McKelvey:2013:DPS:2487788.2488174,
    author = {McKelvey, Karissa and Menczer, Filippo},
    title = {Design and prototyping of a social media observatory},
    booktitle = {Proceedings of the 22nd international conference on World Wide Web companion},
    series = {WWW '13 Companion},
    pages = {1351--1358},
    url = {http://dl.acm.org/citation.cfm?id=2487788.2488174},
    year = 2013
}
@inproceedings{McKelvey2013cscw,
    Author = {Karissa McKelvey and Filippo Menczer},
    Title = {{Truthy: Enabling the Study of Online Social Networks}},
    Booktitle = {Proc. 16th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion (CSCW)},
    Url = {http://arxiv.org/abs/1212.4565},
    Year = 2013
}


3. Social Bookmarking

A collection of bookmarks from GiveALink.org for the month of November 2009. (More on GiveALink project)

JSON object format:

{
    'timestamp': 123456789, # Unix timestamp for when the URL was posted
    'url': '...', # the URL that was bookmarked
    'hashtags': ['...', '...', '...'] # a set of tags attached to the URL by the (anonymous) user
}

Dataset statistics:

  • Dataset size: 61,665 posts (approximately 430,000 triples)
  • File size: 12MB uncompressed
  • Time period: Nov 1, 2009 – Nov 30, 2009
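
The statistics above count triples as well as posts. One plausible reading, assumed here and not confirmed by the dataset description, is that each post expands into one (timestamp, URL, tag) triple per attached tag. That reading would be consistent with roughly 7 tags per post (430,000 / 61,665 ≈ 6.97). A minimal sketch of the expansion, with a hypothetical function name:

```python
def to_triples(post):
    """Expand one bookmark post into (timestamp, url, tag) triples.

    Assumes `post` is a dict with the fields documented above; the
    definition of a triple used here is this sketch's assumption.
    """
    return [(post["timestamp"], post["url"], tag) for tag in post["hashtags"]]

# A made-up post in the documented layout.
post = {
    "timestamp": 1257033600,
    "url": "http://example.com/",
    "hashtags": ["web", "science"],
}
triples = to_triples(post)
```

A post with no tags yields no triples under this definition.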

Data: givealink-nov-2009.tgz (2MB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Markines06GAL,
    author = {Markines, B. and Stoilova, L. and Menczer, F.},
    title = {Bookmark hierarchies and collaborative recommendation},
    booktitle = {Proc. 21st National Conference on Artificial Intelligence (AAAI-06)},
    pages = {1375--1380},
    publisher = {AAAI Press},
    url = {http://www.aaai.org/Papers/AAAI/2006/AAAI06-216.pdf},
    year = 2006
}
@inproceedings{Stoilova05GAL,
    Author = {Stoilova, Lubomira and Holloway, Todd and Markines, Ben and Maguitman, Ana G. and Menczer, Filippo},
    Title = {GiveALink: Mining a Semantic Network of Bookmarks for Web Search and Recommendation},
    Booktitle = {Proc. KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD)},
    Url = {http://informatics.indiana.edu/fil/Papers/givealink-linkkdd.pdf},
    Year = 2005
}


4. Publications

Metadata for the complete set of all PubMed records through 2012 (with part of 2013 available as well), including title, authors, and year of publication. All data provided originates from NLM’s PubMed database (as downloaded April 24, 2013 from the NLM FTP site) and was retrieved via the Scholarly Database.

CSV format:

PubMed ID1,title1,year of publication1,author1|author2|author3|…
PubMed ID2,title2,year of publication2,author4|author1|author5|…
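
A row in this layout can be read with Python's standard `csv` module and the author field split on `|`. The sketch below assumes that titles containing commas are quoted in the usual CSV way, which the description above does not state. The function name, the output field names, and the sample rows are all made up for illustration.

```python
import csv
import io

def parse_publications(lines):
    """Yield one dict per publication row.

    `lines` is any iterable of CSV text lines (for example, an open file).
    Rows that do not have exactly four fields are skipped.
    """
    for row in csv.reader(lines):
        if len(row) != 4:
            continue
        pmid, title, year, authors = row
        yield {
            "pmid": pmid,
            "title": title,
            # Keep the year as None if it is missing or not numeric.
            "year": int(year) if year.isdigit() else None,
            "authors": authors.split("|") if authors else [],
        }

# Made-up rows in the documented layout.
sample = io.StringIO(
    '123,"A title, with a comma",1999,Smith J|Doe A\n'
    "456,Another title,2005,Doe A\n"
)
publications = list(parse_publications(sample))
```

Because the same author string can appear in many rows, the `authors` lists are a starting point for building a co-authorship network.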

Dataset statistics:

  • Dataset size: 21.5M publications and 10.8M authors
  • File size: 3.1GB uncompressed
  • Time period: 1809 – 2013

Data: publications-1809-2013.tar.gz (1.4GB)

If you use this dataset in your research, please cite either or both of these papers:

@inproceedings{Light2013ISSI,
  author    = {Light, Robert P. and Polley, David E. and Börner, Katy},
  title     = {Open Data and Open Code for Big Science of Science Studies},
  booktitle = {Proceedings of International Society of Scientometrics and Informetrics Conference},
  year      = {2013},
  pages     = {1342--1356},
  url       = {http://cns.iu.edu/docs/publications/2013-light-sdb-sci2-issi.pdf}
}
@article{Rowe2009Scien,
  author  = {La Rowe, Gavin and Ambre, Sumeet Adinath and Burgoon, John W. and Ke, Weimao and Börner, Katy},
  title   = {The Scholarly Database and its Utility for Scientometrics Research},
  journal = {Scientometrics},
  year   = {2009},
  volume = {79},
  number = {2},
  month  = {May},
  url    = {http://cns.iu.edu/docs/publications/2009-larowe-sdb.pdf}
}

National Coverage for “More Tweets, More Votes”

Findings by CNetS researchers on social media indicators of election results received significant coverage in the national press. The paper More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior by Joseph Digrazia, Karissa McKelvey, Johan Bollen, and Fabio Rojas was presented at the 2013 Meeting of the American Sociological Association in NYC. It was covered by NPR, The Wall Street Journal, MSNBC, C-SPAN, The Washington Post, The Atlantic, and many other media.

Truthy Team Wins WICI Data Challenge

Congratulations to Przemyslaw Grabowicz, Luca Aiello, and Fil Menczer for winning the WICI Data Challenge. A prize of $10,000 CAD accompanies this award from the Waterloo Institute for Complexity and Innovation at the University of Waterloo. The Challenge called for tools and methods that improve the exploration, analysis, and visualization of complex-systems data. The winning entry, titled Fast visualization of relevant portions of large dynamic networks, is an algorithm that selects subsets of nodes and edges that best represent an evolving graph and visualizes it either by creating a movie, or by streaming it to an interactive network visualization tool. The algorithm is deployed in the movie generation tool of the Truthy system, which allows users to create, in near-real time, YouTube videos that illustrate the spread and co-occurrence of memes on Twitter. Przemek and Luca worked on this project while visiting CNetS in 2011 and collaborating with the Truthy team. Bravo!

Postdoctoral Researcher in Analysis and Modeling of Social Networks


[UPDATE: this position has been filled.]

The Center for Complex Networks and Systems Research has an open postdoctoral position to study how ideas propagate through complex online social networks. The position is funded by a McDonnell Foundation grant in Complex Systems. The appointment starts as early as possible after January 2013 for one year and is renewable for up to 2 additional years. The salary is competitive and benefits are generous.

The postdoc will join a dynamic and interdisciplinary team that includes computer, physical, and cognitive scientists. The postdoc will work with PIs Filippo Menczer and Alessandro Flammini, other postdocs, and several PhD students on analysis and modeling of social media data. Areas of focus will include information diffusion patterns, epidemic models for the spread of ideas, interactions between network traffic and structure dynamics, and agent-based models to explain the emergence of viral bursts of attention. Domains of study will include politics, scientific knowledge, and world events. Go to the grant page or project page for further details on the team and project.

The ideal candidate will have a PhD in computing or physical sciences; a strong background in analysis and modeling of complex systems and networks; and solid programming skills necessary to handle big data and develop large scale simulations.

To apply, email/send a CV and names and emails of three references to Tara Holbrook. Applications received by 15 December 2012 will receive full consideration, but applications will be considered until the position is filled.

Indiana University is an Equal Opportunity/Affirmative Action employer. Applications from women and minorities are strongly encouraged. IU Bloomington is vitally interested in the needs of Dual Career couples.

Thanks to KDnuggets, SOCNET, Gephi, DBWorld, Air-L, CITASA and others for help in advertising this position.

PLEAD 2012 keynote

I was honored to give a keynote presentation at PLEAD 2012, the CIKM Workshop on Politics, Elections and Data. My talk was titled The diffusion of political memes in social media. The workshop was held in beautiful Maui, Hawaii, but alas, I could not attend in person and gave the presentation remotely via Skype 🙁

IARPA contract to study new ways to forecast critical societal events

University and industry scientists are determining how to forecast significant societal events, ranging from violent protests to nationwide credit-rate crashes, by analyzing the billions of pieces of information in the ocean of public communications, such as tweets, web queries, oil prices, and daily stock market activity.

“We are automating the generation of alerts, so that intelligence analysts can focus on interpreting the discoveries rather than on the mechanics of integrating information,” said Naren Ramakrishnan, the Thomas L. Phillips Professor of Engineering in the computer science department at Virginia Tech. He is leading the team of computer scientists and subject-matter experts from Virginia Tech, the University of Maryland, Cornell University, Children’s Hospital of Boston, San Diego State University, University of California at San Diego, and Indiana University, and from the companies CACI International Inc. and Basis Technology.

CNetS Professors Bollen and Rocha from the School of Informatics and Computing at Indiana University are members of this project. Prof. Bollen has devised a way to evaluate the tone of tweets (calm, alert, vital, etc.) to predict stock market trends. Prof. Rocha has developed bio-inspired methods to predict associations in biochemical, social, and knowledge networks, including web and e-mail systems.

Additional details: Researchers study new ways to forecast critical societal events.

DARPA award

Profs. Flammini (PI) and Menczer have been awarded a three-year, $2M grant from DARPA in the context of the Social Media in Strategic Communication (SMISC) program, whose primary goal is “to develop a new science of social networks built on an emerging technology base.” Our IU unit leads a three-group team that includes collaborators at Lockheed-Martin Advanced Technology Lab and the University of Michigan. The funded project is aimed at designing and implementing a system to detect online persuasion campaigns.

2011 Truthy Updates

Mike Conover in the WSJ's report on the Truthy project

We’re pleased to report several exciting developments in our interdisciplinary project studying information diffusion in complex online social networks. The past year has resulted in several publications. Our results on the Truthy astroturf monitoring and detection system were presented at WWW 2011 and ICWSM 2011. Research into the polarized network structure of political communication on Twitter was presented at ICWSM and received the 2011 CITASA Best Student Paper Honorable Mention. We demonstrated the feasibility of the prediction of individuals’ political affiliation from network and text data (SocialCom 2011), a machine learning application that enables large-scale instrumentation of nearly 20,000 individuals’ political behaviors, policy foci, and geospatial distribution (Journal of Information Technology and Politics). We’re also working on a paper on partisan asymmetries in online political activity surrounding the 2010 U.S. congressional midterm elections.

Our results have been widely covered in the press, including the Wall Street Journal, Science, Communications of the ACM, NPR [1,2], The Chronicle of Higher Education, Discover Magazine, The Atlantic, New Scientist, MIT Technology Review, and many more.

Current and future research is supported by an award from the NSF Interface between Computer Science and Economics & Social Sciences program, and a McDonnell Foundation grant. The former will focus on building an infrastructure for the study of information diffusion in social media, the characterization of meme spread patterns, and the development of sentiment analysis tools for social media. The latter will focus on modeling efforts, especially agent-based models of information diffusion, competition for attention, and the relationship between information sharing events and social network evolution.

Postdoctoral Researcher in Analysis and Modeling of Social Networks


The Center for Complex Networks and Systems Research has an open postdoctoral position to study how ideas propagate through complex online social networks. The position is funded by a McDonnell Foundation grant in Complex Systems. The appointment starts in January 2012 for one year and is renewable for up to 3 additional years. The salary is competitive and benefits are generous.

The postdoc will join a dynamic and interdisciplinary team that includes computer, physical, and cognitive scientists. The postdoc will work with PIs Filippo Menczer and Alessandro Flammini and several PhD students on analysis and modeling of social media data. Areas of focus will include information diffusion patterns, epidemic models for the spread of ideas, interactions between network traffic and structure dynamics, and agent-based models to explain the emergence of viral bursts of attention. Domains of study will include politics, scientific knowledge, and world events. Go to the grant page or project page for further details on the team and project.

The ideal candidate will have a PhD in computing or physical sciences; a strong background in analysis and modeling of complex systems and networks; and solid programming skills necessary to handle big data and develop large scale simulations.

To apply, email/send a CV and names and emails of three references to Tara Holbrook. Applications received by Oct. 24, 2011 will be given full consideration, but the position will remain open until a successful candidate is identified.

Indiana University is an Equal Opportunity/Affirmative Action employer. Applications from women and minorities are strongly encouraged. IU Bloomington is vitally interested in the needs of Dual Career couples.

Thanks to KDnuggets, SOCNET, Gephi, DBWorld, Air-L, CITASA and others for help in advertising this position.

Truthy tool identifies smear tactics on Twitter

Astroturfers, Twitter-bombers and smear campaigners need beware this election season as a group of leading Indiana University information and computer scientists today unleashed Truthy.indiana.edu, a sophisticated new Twitter-based research tool that combines data mining, social network analysis and crowdsourcing to uncover deceptive tactics and misinformation leading up to the Nov. 2 elections. Combing through thousands of tweets per hour in search of political keywords, the team based out of IU’s School of Informatics and Computing will isolate patterns of interest and then insert those memes (ideas or patterns passed by imitation) into Twitter’s application programming interface (API) to obtain more information about the meme’s history.

In the run-up to the mid-term elections, Truthy uncovered a number of abuses such as robot-driven traffic to politician websites and networks of bot accounts controlled by individuals to promote fake news. These findings have been widely covered in the press, with mentions in The Atlantic, MIT Technology Review, PC World, New Scientist, NPR, Ars Technica, Fast Company, The Chronicle of Higher Education, The New York Times Magazine, and many other media. Read more here and here.