How to use the computing infrastructure
Thanks to funding from NSF (Award No. IIS-0811994) and the School of
Informatics, NaN and other CNetS researchers, under the supervision of PIs Fil
Menczer, Alex Vespignani, and YY Ahn, can avail themselves of an advanced
computing and storage infrastructure.
The infrastructure is composed of five servers (
snowball), each with 16 cores and memory ranging from 32GB
(burns) to 190GB (snowball). Some of these machines mount a 20 TB disk array,
integrated into a fibre-channel Storage Area Network using GFS. The servers are
shared among several researchers, and the resources are therefore limited. In
order to limit accidents (such as out-of-memory situations, denials of service,
etc.) and resource overload, we worked out a few simple usage guidelines for the
servers.
To get an account on the servers: given the source of funding, this
infrastructure is reserved for projects managed by PIs Menczer, Vespignani, and Ahn
only. PIs must approve requests. Once a request is approved, it can be submitted
to our support team. You must explicitly state in your request that you read
this document completely and agree to these guidelines.
IMPORTANT: if you are caught breaching these guidelines, your account will be
terminated.
1. Purpose of the servers
The servers are intended mainly for jobs that manage large datasets related to
CNetS projects. In particular:
- Do NOT use lenny for any computational job. Lenny runs the database of the
Botometer project. Carl is where the NaN group runs its production Web and
database services (such as Scholarometer, etc.), as well as where the CNetS web
site and these guides are hosted, so there must be no computationally intensive
jobs running on carl, for any reason.
- The other servers connected to the GFS are reserved mainly for jobs that
access data on the GFS. The machines are:
In particular, please do NOT use Hadoop on the CNetS servers. Instead, please use
FutureGrid. We have a special project
created for the NSF grant here.
Please ask one of the existing people on the project to be added to it.
- For jobs requiring MySQL, please use
burns. Do not install MySQL on
- For other jobs, please use Big Red II or
FutureGrid. For a list of available
computing systems at IU, see here.
2. The GFS is ONLY intended for storing results from the Moe API
The GFS is mounted under
/home/gfs and also accessible under the symbolic link
/l/cnets. This space is used to store results of queries from our new
IndexedHBase infrastructure on Moe (moe.soic.indiana.edu). No other files or
directories are allowed anywhere under
/l/cnets. If you need to store large
datasets, there are various options,
including long-term backup on tape (MDSS). Note also that the GFS is not
available on snowball. You can check what filesystems are available on a
machine by using the command
Unfortunately, this measure became necessary after widespread misuse of the GFS,
whose primary purpose is to store Twitter data. We are currently (09/2013)
considering a new quota system that would allow users to have small directories
on the GFS. Once the new infrastructure for storing Twitter data is available,
it will be possible to use the GFS for other purposes, but for now users are not
allowed to use the GFS for storing their own data.
Note also that for small data sets you can use your home directory. Home folders
are subject to a 10GB disk quota (typing
quota in the console will show your current usage).
3. Never run du (disk usage) on the GFS
This is due to a performance problem of the GFS. For further information on how
to use the GFS efficiently, have a look at the specific howto on
this website. You are strongly encouraged to follow the best practices described
there.
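As an illustration of the rule above, here is a minimal Python sketch of a disk-usage helper that refuses to scan anything on the GFS. The two mount paths come from these guidelines; the function names is_on_gfs and safe_disk_usage are hypothetical, and the helper sums apparent file sizes rather than allocated blocks, so its numbers may differ from what du or quota report.

```python
import os

# GFS locations named in these guidelines; never walk them.
GFS_ROOTS = ("/home/gfs", "/l/cnets")


def is_on_gfs(path):
    """Return True if path resolves to a location under a GFS mount point."""
    real = os.path.realpath(path)
    for root in GFS_ROOTS:
        real_root = os.path.realpath(root)
        if real == real_root or real.startswith(real_root + os.sep):
            return True
    return False


def safe_disk_usage(path):
    """Sum file sizes (in bytes) under path, refusing to scan the GFS."""
    if is_on_gfs(path):
        raise ValueError("refusing to scan %s: it is on the GFS" % path)
    total = 0
    # os.walk does not follow symbolic links to directories by default.
    for dirpath, dirnames, filenames in os.walk(path):
        # Prune subdirectories that resolve onto the GFS.
        dirnames[:] = [
            d for d in dirnames if not is_on_gfs(os.path.join(dirpath, d))
        ]
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if not os.path.islink(full):
                    total += os.lstat(full).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return total
```

Calling safe_disk_usage(os.path.expanduser("~")) would give a rough estimate of how much of the home quota is in use, under the assumptions stated above.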
4. Share resources and do not overload the machines
- You MUST always use nice to run your computations. Any process that does not
use nice may be terminated. Look at this
under “Caveats and Tips.”
- Never take more than 50% of the memory for your computations (use vmstat to
monitor memory consumption).
- Do not load huge datasets in main memory. You might think that 190GB of memory
is a lot and therefore that the tweets file you need to sort will fit in
memory. It is not and it will not.
- Do not launch more processes/threads than the number of cores on a machine,
unless you know what you are doing (e.g. the jobs are I/O bound). To
check how many cores are available on a machine, do
top and then press
in the console.
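For jobs driven from a script, the nice and core-count rules above could be applied roughly as follows. This is only a sketch: max_workers and run_niced are hypothetical helper names, the niceness value of 19 is an assumption (the guidelines do not prescribe one), and the code assumes the standard nice utility is on the PATH.

```python
import os
import subprocess


def max_workers(requested):
    """Cap the requested worker count at the number of cores on this machine."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, cores))


def run_niced(command, niceness=19):
    """Run command (a list of arguments) through nice; return its exit code."""
    full_command = ["nice", "-n", str(niceness)] + list(command)
    return subprocess.run(full_command).returncode
```

For example, run_niced(["python", "my_job.py"]) would start a hypothetical my_job.py at the lowest scheduling priority, and max_workers(64) would return at most the number of cores that Python reports for the machine.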
If you are caught launching a large number of processes and/or not using nice,
and/or running a machine out of memory, your account may be terminated.
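To make the 50% memory rule concrete, a job could check its expected footprint against the machine's total memory before it starts. The sketch below assumes a Linux system that exposes /proc/meminfo with a MemTotal line in kB; the function names are hypothetical, and the estimate of how much memory a job needs has to come from you.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict mapping field name to kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def fits_memory_budget(needed_kb, meminfo_text, max_fraction=0.5):
    """Return True if needed_kb is within max_fraction of total memory."""
    total_kb = parse_meminfo(meminfo_text)["MemTotal"]
    return needed_kb <= max_fraction * total_kb


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory summary (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

A script could call fits_memory_budget(estimated_kb, read_meminfo()) at startup and exit early when the result is False. This check only looks at total memory, not at what other users are consuming at the moment, so vmstat remains the tool for monitoring a running job.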
Use the Unix utility man (manual) to learn more about each of these commands
(man vmstat, etc.). If you are new to Linux, you can start from this guide. The
Linux Documentation Project (TLDP) is in general a good starting point, though a
bit outdated, for learning about Linux. Googling linux tutorial will also return
a list of high-quality learning material.
5. Each user must pay attention to all announcements related to the infrastructure
You will be added to a low-traffic announcement mailing list when your account
request is accepted.
6. Technical requests must be sent to the SOIC Help Desk
7. Acknowledge the funding agencies for their financial support in your papers
If you use this infrastructure to support your research, be sure to acknowledge
NSF Award No. IIS-0811994 in your papers.