NaN is a research group exploring complex systems, adaptive agents, modeling, simulation, artificial life, and complex (information, biological, and social) networks. We especially focus on the Web as a complex information network in which we leave abundant traces of our social and semantic activities: what we do, what we are interested in, whom we talk to, what knowledge we acquire and contribute.
Here we list our public datasets and data-processing tools for researchers and other interested readers. Most datasets were collected and prepared for research projects by NaN members. If you use our datasets, please acknowledge our effort by citing the corresponding papers. Thank you and enjoy!
Datasets:
Fact checking |
Web traffic (WebSci14) |
Twitter (WebSci14) |
Social bookmarking (WebSci14) |
Publications (WebSci14) |
Topic diversity |
Virality prediction |
Legitimate classification |
Political polarization |
Web Click Data |
Last.fm
Data tools:
Klatsch |
Fast Visualization of network |
WebGraph++ |
Java Crawler |
Topical Crawler Evaluation |
Latent Energy Environments |
OAMulator |
Recruit
For questions, contact Dominic DiFranzo or Jim Hendler at RPI/WSTNet.
→ Fact Checking Dataset [Download][README] 
DBpedia 2016-10 knowledge graph in a format used by the Knowledge Linker (KL), Relational Knowledge Linker (KL-REL), and Knowledge Stream (KS) algorithms, and a collection of synthetic and real datasets used to evaluate them and other state-of-the-art methods applicable to fact checking. Some of the synthetic datasets were created by us, while others were downloaded from KGMiner.
- Source: DBpedia 2016
- File size: 538 MB compressed; ~3 GB uncompressed
- Please cite:
Prashant Shiralkar, Alessandro Flammini, Filippo Menczer, and Giovanni Luca Ciampaglia. Finding Streams in Knowledge Graphs to Support Fact Checking. In Proc. of the Intl. Conf. on Data Mining, 2017.
→ Web Traffic Dataset (WebSci14) [Download][README]
A collection of Web (HTTP) requests for the month of November 2009. This is a small sample of the larger click dataset.
- Source: Generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University.
- Date range: November 1, 2009 to November 22, 2009.
- File size: 235M requests; 2.7GB uncompressed
- Please cite:
Mark R. Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, and Alessandro Vespignani. Ranking web sites with real user traffic. In Proc. 2008 Intl. Conf. on Web Search and Data Mining, pp. 65-76. ACM, 2008.
Mark R. Meiss, Bruno Gonçalves, José J. Ramasco, Alessandro Flammini, and Filippo Menczer. Modeling traffic on the web graph. In Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW), pp. 50-61. Springer Berlin Heidelberg, 2010.
→ Twitter Dataset (WebSci14) [Download][README]
A collection of records extracted from tweets posted in November 2012 that contain both #hashtags and URLs.
- Source: Sampled public tweets from Twitter streaming API.
- Date range: November 2012.
- File size: 27.8M tweets; 3.6GB uncompressed.
- Please cite:
Karissa McKelvey and Filippo Menczer. Truthy: Enabling the Study of Online Social Networks. In Proc. 16th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion (CSCW), 2013.
→ Social Bookmarking Dataset (WebSci14) [Download][README]
A collection of bookmarks from GiveALink.org for the month of November 2009.
- Source: GiveALink.org.
- Date range: November 1, 2009 to November 30, 2009.
- File size: 61,665 posts (approximately 430,000 triples); 12MB uncompressed
- Please cite:
Ben Markines, Lubomira Stoilova, and Filippo Menczer. Bookmark hierarchies and collaborative recommendation. In Proc. 21st National Conference on Artificial Intelligence (AAAI), pp. 1375-1380, 2006.
Lubomira Stoilova, Todd Holloway, Ben Markines, Ana G. Maguitman, and Filippo Menczer. GiveALink: Mining a Semantic Network of Bookmarks for Web Search and Recommendation. In Proc. KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD), 2006.
→ Publications Dataset (WebSci14) [Download][README]
Metadata for the complete set of all PubMed records through 2012 (with part of 2013 available as well), including title, authors, and year of publication. All data provided originates from NLM’s PubMed database (as downloaded April 24, 2013 from the NLM FTP site) and was retrieved via the Scholarly Database.
- Source: Scholarly Database.
- Date range: 1809 to 2013.
- File size: 21.5M publications and 10.8M authors; 3.1GB uncompressed
- Please cite:
Robert P. Light, David E. Polley, and Katy Börner. Open Data and Open Code for Big Science of Science Studies. In Proc. Intl. Society of Scientometrics and Informetrics Conf., pp. 1342-1356, 2013.
Gavin La Rowe, Sumeet Adinath Ambre, John W. Burgoon, Weimao Ke, and Katy Börner. The Scholarly Database and its Utility for Scientometrics Research. Scientometrics, 79.2 (2009): 219-234.
→ Topical diversity of user interests and content [Download][README] 
- Source: Sampled public tweets from Twitter streaming API.
- Date range: January 1, 2013 to March 31, 2013.
- Data size: 6.4 GB; about 490 million tweets.
- Contains:
- Sampled tweets during 3 months.
- Each tweet is associated with a timestamp, anonymized user ID, and a list of hashtags.
- Please cite:
Lilian Weng and Filippo Menczer. Topicality and Social Impact: Diverse Messages but Focused Messengers. Under review. 2014.
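As an illustration of how records of this shape might be analyzed, the sketch below computes the Shannon entropy of each user's hashtag usage. It assumes the records have already been parsed into (timestamp, user_id, hashtags) tuples; the on-disk layout is described in the README and is not assumed here. The function name and the choice of entropy as a diversity measure are illustrative and not necessarily the measure used in the paper.

```python
import math
from collections import Counter, defaultdict


def hashtag_entropy_by_user(records):
    """Return {user_id: Shannon entropy in bits of that user's hashtag usage}.

    `records` is an iterable of (timestamp, user_id, hashtags) tuples.
    This tuple layout is an assumption made for illustration.
    """
    counts = defaultdict(Counter)
    for _timestamp, user_id, hashtags in records:
        counts[user_id].update(hashtags)

    entropy = {}
    for user_id, tag_counts in counts.items():
        total = sum(tag_counts.values())
        if total == 0:
            continue  # skip users whose tweets carried no hashtags
        entropy[user_id] = sum(
            -(n / total) * math.log2(n / total) for n in tag_counts.values()
        )
    return entropy
```

A user who always uses the same hashtag gets entropy 0, while a user who spreads tweets evenly over many hashtags gets a higher value.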
→ Astroturf/Legitimate Classification [Download][README] 
This is the training data used to produce the results shown in the paper listed below.
- Source: Sampled public tweets from Twitter streaming API.
- Date range: September 14 to October 27, 2010.
- Contains:
- data.arff: holds the un-resampled training data.
- data_balanced.arff: holds the resampled training data.
- data.instance_to_id.pickle: holds a Python pickle relating instance IDs in the data.arff file with Meme IDs in the Truthy database. To view the page for a particular meme ID, go to http://truthy.indiana.edu/m?id=
- Please cite:
Jacob Ratkiewicz, Michael Conover, Mark Meiss, Bruno Gonçalves, Alessandro Flammini, and Filippo Menczer. Detecting and Tracking Political Abuse in Social Media. Proc. 5th International AAAI Conference on Weblogs and Social Media (ICWSM), 2011.
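A minimal sketch of reading the pickle file in Python 3 follows. It assumes the pickle holds a dict-like mapping from instance ID to meme ID, that it may have been written by Python 2 (hence `encoding="latin1"`), and that the meme ID is appended directly to the URL prefix given above. The function names are hypothetical; check the README for the actual structure.

```python
import pickle


def load_instance_to_id(path):
    """Load the instance-ID -> meme-ID mapping from a pickle file.

    Assumption: the pickle stores a dict-like object. encoding="latin1"
    lets Python 3 read pickles that were written by Python 2.
    Only unpickle files from a source you trust: unpickling can run code.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle, encoding="latin1")


def meme_page_url(meme_id, base_url="http://truthy.indiana.edu/m?id="):
    """Build a meme page URL by appending the meme ID (assumed URL scheme)."""
    return f"{base_url}{meme_id}"
```

For example, `meme_page_url(load_instance_to_id("data.instance_to_id.pickle")[0])` would give the page for instance 0, if the mapping is keyed by integers. The .arff files use the ARFF text format; `scipy.io.arff.loadarff` is one way to read them, provided it supports the attribute types they contain.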
→ Political Polarization on Twitter [Download][README] 
This is the training data used to produce the results shown in the paper listed below.
- Source: Sampled public tweets from Twitter streaming API.
- Date range: 6 weeks prior to the 2010 Congressional midterm elections.
- Contains:
- Three networks of political communication between Twitter users
- Please cite:
Michael Conover, Jacob Ratkiewicz, Matthew Francisco, Bruno Gonçalves, Alessandro Flammini, and Filippo Menczer. Political Polarization on Twitter. Proc. 5th International AAAI Conference on Weblogs and Social Media (ICWSM), 2011.
→ Web Click Dataset [Download] 
This is the data used to produce the results shown in the paper listed below.
- Source: Generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University.
- Date range: September 2006 to May 2010.
- Contains 2 collections:
- raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
- raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.
- Please cite:
Mark R. Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, and Alessandro Vespignani. Ranking web sites with real user traffic. In Proc. 2008 Intl. Conf. on Web Search and Data Mining, pp. 65-76. ACM, 2008.
Mark R. Meiss, Bruno Gonçalves, José J. Ramasco, Alessandro Flammini, and Filippo Menczer. Modeling traffic on the web graph. In Proc. 7th Workshop on Algorithms and Models for the Web Graph (WAW), pp. 50-61. Springer Berlin Heidelberg, 2010.
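The two collections differ in how much of the referrer is kept: only the host name in raw, the full URL in raw-url. The sketch below shows one way to reduce a full referrer URL to its host name so that records from both collections can be compared at the same granularity. The function name is hypothetical, and nothing is assumed here about the record layout of either collection.

```python
from urllib.parse import urlsplit


def referrer_host(referrer_url):
    """Return the lower-cased host name of a referrer URL.

    Returns None when the string has no network location, for example
    an empty referrer or a URL without a scheme.
    """
    return urlsplit(referrer_url).hostname
```

The port, path, and query string are dropped, so two requests referred by different pages of the same site map to the same host.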
→ Last.fm Dataset [Download][README]
This is the data used to produce the results shown in the paper below.
- Source: A crawl of Last.fm users, their annotations, friends and neighborhood relations, and group membership.
- Date range: First half of 2009.
- Please cite:
Schifanella, R., Barrat, A., Cattuto, C., Markines, B., and Menczer, F. (2010). Folks in Folksonomies: Social Link Prediction from Shared Metadata. Proc. 3rd ACM International Conference on Web Search and Data Mining (WSDM). arXiv Preprint
→ Klatsch [Download][README] 
Klatsch is a framework and language for exploring and analyzing feeds of social media data.
- The purpose of the Klatsch framework is to provide an easy-to-program, flexible interface for exploring and analyzing feeds of social media data. It’s meant to be easy to interface to existing algorithms and graph representations and to produce pretty pictures in a variety of formats. The language itself is somewhere between Python and Scheme: dynamic types, procedures as first-class data, call-by-value semantics, and a nod toward object orientation.
- Klatsch is built around a scripting language implemented by an interpreter written in Java. You don’t need to know how to program Java in order to develop Klatsch scripts; that’s only necessary if you want to extend or modify the interpreter itself.
- Please cite:
Jacob Ratkiewicz, Michael Conover, Mark Meiss, Bruno Gonçalves, Snehal Patil, Alessandro Flammini, and Filippo Menczer. Truthy: Mapping the Spread of Astroturf in Microblog Streams. Proc. 20th Intl. World Wide Web Conf. Companion (WWW), 2011.
→ Fast visualization of large dynamic networks [Download][README] 
- This is a collection of two tools for visualizing large dynamic networks. They perform the following functions, respectively:
- From a chronological sequence of graph links in the form of sdnet files, produce differential updates, formatted as JSON events, to the subgraph of the network selected for visualization (src/visualize_tweets_finitefile.cpp).
- Produce movies of evolving graphs from a feed of the JSON events (scripts/DynamicGraph_wici.py).
- Please cite:
Grabowicz, Przemyslaw A., Luca Maria Aiello, and Filippo Menczer. Fast filtering and animation of large dynamic networks. EPJ Data Science 3(1), 2014.
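To make the idea of differential updates concrete, here is a simplified sketch. It is not the filtering method implemented by the tools above: it keeps only the most recent links in a fixed-size sliding window and emits one JSON event whenever a link enters or leaves that window. The event fields ("t", "op", "source", "target") and the function name are hypothetical; the real sdnet and JSON event formats are documented in the README.

```python
import json
from collections import deque


def differential_events(links, window=2):
    """Yield JSON strings describing changes to a sliding-window subgraph.

    `links` is a chronological iterable of (timestamp, source, target).
    Only the `window` most recent links stay visible. Repeated links are
    treated as separate entries. The event schema is hypothetical.
    """
    visible = deque()
    for timestamp, source, target in links:
        visible.append((source, target))
        yield json.dumps(
            {"t": timestamp, "op": "add", "source": source, "target": target}
        )
        if len(visible) > window:
            # The oldest link falls out of the window at this time step.
            old_source, old_target = visible.popleft()
            yield json.dumps(
                {"t": timestamp, "op": "remove",
                 "source": old_source, "target": old_target}
            )
```

A renderer that consumes such a feed only has to apply each change to its current picture rather than redraw the whole network at every step.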
→ WebGraph++ [Github][README] 
This software is a translation into C++ of the excellent Webgraph library by P. Boldi and S. Vigna. The original library, written in Java, is easy to use but hampered by some requirements of the Java virtual machine. This C++ translation attempts to preserve much of the ease of use (through integration with the Boost Graph Library), but bypasses requirements imposed by a virtual machine.
Like the original Webgraph library, this work is available under the GNU General Public License.
- Please read more about the tool here.
- Please cite:
J. Ratkiewicz and F. Menczer. Text snippets from the DomGraph. Proc. SIGIR Workshop on Focused Retrieval. 2008.
→ Multi-threaded crawlers in Java [Download][README] 
The code implements a multi-threaded Web crawler. Please read more about the tool here.
- Please cite:
G. Pant, P. Srinivasan, F. Menczer. Crawling the Web. In M. Levene and A. Poulovassilis, eds.: Web Dynamics, Springer, 2004.
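For readers unfamiliar with the technique, the following Python sketch shows the general shape of a multi-threaded breadth-first crawler. It is not the Java code distributed here. The `fetch_links` callable is a hypothetical helper supplied by the caller that returns the outgoing links of a page, so the sketch itself performs no network access and ignores politeness rules such as robots.txt and rate limiting.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(seeds, fetch_links, max_pages=100, workers=4):
    """Breadth-first crawl; pages on the same level are fetched in parallel.

    `fetch_links(url)` must return an iterable of outgoing URLs. It is a
    caller-supplied (hypothetical) helper. Returns the visited URLs in
    crawl order, at most `max_pages` of them.
    """
    frontier = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(frontier)
    visited = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(visited) < max_pages:
            batch = frontier[: max_pages - len(visited)]
            results = pool.map(fetch_links, batch)  # preserves batch order
            frontier = []
            for url, links in zip(batch, results):
                visited.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return visited
```

Fetching level by level keeps the crawl order deterministic, at the cost of waiting for the slowest page in each batch; a shared work queue would use the threads more evenly.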
→ A General Evaluation Framework for Topical Crawlers. [Download][README] 
The script and data files accompany the following paper and implement and illustrate the algorithms it describes. Please refer to the paper for a detailed description of the procedures implemented in the script and of the data files.
- Please cite:
P. Srinivasan, G. Pant, and F. Menczer. A General Evaluation Framework for Topical Crawlers. Information Retrieval 8(3):417-447, 2005.
→ Latent Energy Environments [Download][README] 
An artificial life model and simulator of controlled complexity, using endogenous fitness. Software and documentation available for Unix and Macintosh.
Please read more about the tool here.
- Please cite:
F Menczer and RK Belew. Latent Energy Environments. In: R. Belew and M. Mitchell, editors, Adaptive Individuals in Evolving Populations: Models and Algorithms. Addison Wesley, Reading, MA, 1996.
→ OAMulator [Download][README] 
The OAMulator is a Web-based resource to support the teaching of instruction set architecture, assembly languages, memory, addressing, high-level programming, and compilation. The tool is based on a simple, virtual CPU architecture called the One Address Machine (OAM). A compiler translates programs written in a special programming language, called OAMPL, into OAM assembly. An OAM assembler/emulator interprets and executes OAM assembly code (produced by the compiler or written directly). The OAMulator is targeted at students in introductory courses in information technology or information systems. It is designed to take the mystery out of CPU architecture and to let students gain confidence with the concepts of compilers and binary execution.
Please read more about the tool here.
- Please cite:
Menczer, Filippo, and Alberto Maria Segre. OAMulator: A Teaching Resource to Introduce Computer Architecture Concepts. Journal on Educational Resources in Computing (JERIC) 1.4: 18-30, 2001.
→ Recruit [Download][README] 
Recruit is a free and open source Web-based software system to support a faculty search committee in its academic recruiting/hiring tasks. Recruit makes it easy for a department, division, or school faculty search committee to accept, manage, review and annotate job applications on the Web.
Please read more about the tool here.