Usage Guidelines

How to use the computing infrastructure

Thanks to funding from NSF (Award No. IIS-0811994) and the School of
Informatics, NaN and other CNetS researchers, under the supervision of PIs Fil
Menczer, Alex Vespignani, and YY Ahn, can avail themselves of an advanced
computing and storage infrastructure.

The infrastructure is composed of five servers (burns, smithers, lenny,
carl, and snowball), each with 16 cores and memory ranging from 32GB
(burns) to 190GB (snowball). Some of these machines mount a 20 TB disk array,
integrated into a fibre-channel Storage Area Network using GFS. The servers are
shared among several researchers, and the resources are therefore limited. To
limit accidents (such as out-of-memory conditions, denials of service, etc.)
and resource overload, we worked out a few simple usage guidelines for the
servers.

To get an account on the servers: given the source of funding, this
infrastructure is reserved for projects managed by PIs Menczer, Vespignani, and
Ahn only. PIs must approve requests. Once a request is approved, it can be
submitted to our support team. You must explicitly state in your request that
you have read this document completely and agree to these guidelines.

The guidelines

IMPORTANT: if you are caught breaching these guidelines, your account will be
terminated.

1. Purpose of the servers

The servers are intended mainly for jobs that manage large datasets related to
CNetS projects. In particular:

  • Do NOT use carl or lenny for any computational job. Lenny runs the
    database of the Botometer project. Carl is where the NaN group runs its
    production Web and database services (such as Scholarometer), and where
    the CNetS web site and these guides are hosted, so there must be no
    computationally intensive job running on carl, for any reason.
  • The other servers connected to the GFS, smithers and burns, are reserved
    mainly for jobs that access data on the GFS. In particular, please do NOT
    use Hadoop on the cnets servers. Instead, please use FutureGrid. We have a
    special project created for the NSF grant here. Please ask one of the
    existing members of the project to add you to it.
  • For jobs requiring MySQL, please use burns. Do not install MySQL on
    other machines.
  • For other jobs, please use Big Red II or Karst. For a list of available
    computing systems at IU, see here (campus-wide).

2. The GFS is ONLY intended for storing results from the Moe API

The GFS is mounted under /home/gfs and also accessible under the symbolic link
/l/cnets. This space is used to store results of queries from our new
IndexedHBase infrastructure on Moe. No other files or
directories are allowed anywhere under /l/cnets. If you need to store large
datasets, there are various options,
including long-term backup on tape (MDSS). Note also that the GFS is not
available on snowball. You can check what filesystems are available on a
machine by using the command df.
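As a rough illustration of the df check above, the following Python sketch reports the size of the filesystem that holds a given path. The helper name filesystem_report is hypothetical, and the paths in the demo are simply the ones mentioned in this section; on a machine without the GFS (such as snowball) the first one is expected to be reported as missing.

```python
import os
import shutil


def filesystem_report(path):
    """Return (total, used, free) in GiB for the filesystem holding `path`.

    Returns None when the path does not exist, e.g. when the GFS is not
    mounted on the current machine.
    """
    if not os.path.exists(path):
        return None
    usage = shutil.disk_usage(path)  # same figures df reports for that mount
    gib = 1024 ** 3
    return tuple(round(v / gib, 1) for v in (usage.total, usage.used, usage.free))


if __name__ == "__main__":
    # Paths taken from the text above; adjust as needed.
    for path in ("/l/cnets", os.path.expanduser("~")):
        print(path, filesystem_report(path))
```

This only queries filesystem-level totals, as df does; it does not walk any directory tree.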

Unfortunately, this measure became necessary after widespread misuse of the
GFS, whose primary purpose is to store Twitter data. We are currently (09/2013)
considering a new quota system that would allow users to have small
directories on the GFS. Once the new infrastructure for storing Twitter data
is available, it will be possible to use the GFS for other purposes. For now,
users are not allowed to use the GFS for storing their own data.

Note also that for small data sets you can use your home directory. Home folders
are subject to a 10GB disk quota (typing quota in the console will show your
current usage).

3. Never run du (disk usage) on the GFS

This is due to a performance problem of the GFS. For further information on how
to use the GFS efficiently, have a look at the specific howto on
this website. We strongly recommend that you follow the best practices
described there.

4. Share resources and do not overload the machines

In particular:

  • You MUST always use nice to run your computations. Any process that does not
    use nice may be terminated. Look at this

    under “Caveats and Tips.”
  • Never take more than 50% of the memory for your computations (use top and
    vmstat to monitor memory consumption).
  • Do not load huge datasets in main memory. You might think that 190GB of memory
    is a lot and therefore that the tweets file you need to sort will fit in
    memory. It is not and it will not.
  • Do not launch more processes/threads than the number of cores on a machine,
    unless you know what you are doing (e.g. the jobs are I/O bound). To
    check how many cores are available on a machine, run top and then press 1
    in the console.
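For jobs written in Python, the rules in this list can also be applied from inside the script. The sketch below is only an illustration under a few assumptions: it targets Linux, the helper names (polite_worker_count, within_memory_budget, total_physical_memory, lower_priority) are hypothetical, and the niceness increment of 10 is an arbitrary choice, not a value prescribed by these guidelines.

```python
import os


def polite_worker_count(requested, cores=None):
    """Cap the number of worker processes at the number of cores."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(requested, cores))


def within_memory_budget(used_bytes, total_bytes, limit=0.5):
    """True if used_bytes is at most `limit` (default 50%) of total_bytes."""
    return used_bytes <= limit * total_bytes


def total_physical_memory():
    """Total RAM in bytes (assumes the Linux sysconf names are available)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def lower_priority(increment=10):
    """Raise this process's niceness by `increment`; return the new value."""
    return os.nice(increment)


if __name__ == "__main__":
    print("niceness now:", lower_priority())
    print("workers to start:", polite_worker_count(requested=64))
    print("memory budget (bytes):", total_physical_memory() // 2)
```

Starting the whole script under nice from the shell (for example, nice python myscript.py, where myscript.py is a placeholder name) has the same effect as calling os.nice at the top of the program.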

If you are caught launching a large number of processes and/or not using nice,
and/or running a machine out of memory, your account may be terminated.
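For the sorting case mentioned in the list above, one way to avoid loading a whole file in memory is an external merge sort: sort fixed-size chunks, write each to a temporary file, then merge the sorted chunks. The sketch below is a minimal illustration, not a tuned tool. The function name external_sort and the default chunk size are hypothetical, the temporary files go to the system default temporary directory (which may be small), and a very large number of chunks could hit the open-file limit.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(in_path, out_path, lines_per_chunk=100_000):
    """Sort the lines of in_path into out_path.

    At most `lines_per_chunk` lines are held in memory while chunking;
    the final merge streams one line per chunk at a time.
    """
    chunk_paths = []
    try:
        with open(in_path, "r", encoding="utf-8") as src:
            while True:
                chunk = list(islice(src, lines_per_chunk))
                if not chunk:
                    break
                # Make sure the last line ends with a newline so that merged
                # records never get glued together.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
        try:
            with open(out_path, "w", encoding="utf-8") as dst:
                # heapq.merge lazily merges already-sorted iterables.
                dst.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

Lines are compared as plain strings here; sorting on a particular field would need a key function applied consistently in both the chunk sort and the merge.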

Use the Unix utility man (manual) to learn more about each of these commands
(e.g. man nice, man vmstat, etc.). If you are new to Linux, you can start
from this guide. The Linux Documentation Project (TLDP) is in general a good
starting point for learning about Linux, though a bit outdated. Searching
Google for "linux tutorial" will also return a list of high-quality learning
material.

5. Each user must pay attention to all announcements related to the infrastructure

You will be added to a low-traffic announcement mailing list when your account
request is accepted.

6. Technical requests must be sent to the SOIC Help Desk

The help desk is accessible at this address. Note that other systems, such as
Quarry, Big Red II, or the Data Capacitor, have their own support teams. Look
them up on the IU Knowledge Base.

7. Acknowledge the funding agencies for their financial support on your papers

If you use this infrastructure to support your research, be sure to acknowledge
NSF Award No. IIS-0811994 in your papers.

8. NEVER share your account, password, or access keys with other people

This is very uncool and, most importantly, constitutes a breach of the
university-wide policies (for students and employees), which might
result in you being kicked out of your program and/or your position.