How to organize your code

Introduction

This is a short recipe for organizing your Python code efficiently in a
scientific workflow. Keeping your files organized rationally from day one
helps you keep track of your progress at all stages of a research project,
from inception to publication of the results. It also makes it simpler to
reuse your code (thus reducing the number of bugs and headaches!) and easier
to collaborate with other people in the lab.

As an example, let us imagine that we started a research project for analyzing
the Bitcoin transaction network.

Instructions

First, create a folder called bitcoin_analysis (or whatever name you like)
and add an empty file called __init__.py to it. This will be the main package
for your code. Put nothing in it except source code files.

Second, create a sub-folder called parser under bitcoin_analysis and add an
__init__.py there as well.
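
The two steps above can be sketched in a few lines of Python. This is only an
illustration: the helper name create_package_skeleton is made up for this
example, and the demo runs in a temporary directory rather than in a real
repository.

```python
import tempfile
from pathlib import Path


def create_package_skeleton(repo_root, package="bitcoin_analysis",
                            subpackages=("parser",)):
    """Create the package folder, its sub-packages, and empty __init__.py files."""
    package_dir = Path(repo_root) / package
    folders = [package_dir] + [package_dir / name for name in subpackages]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
        # An empty __init__.py marks the folder as a Python package.
        (folder / "__init__.py").touch()
    return package_dir


# Demo in a throwaway directory; in practice repo_root is your repository.
repo_root = Path(tempfile.mkdtemp())
package_dir = create_package_skeleton(repo_root)
created = sorted(
    path.relative_to(repo_root).as_posix()
    for path in repo_root.rglob("__init__.py")
)
print(created)
# ['bitcoin_analysis/__init__.py', 'bitcoin_analysis/parser/__init__.py']
```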

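In one of your Python site directories (for example site-packages), create a
file with extension .pth and add to it the full path to the folder that
contains bitcoin_analysis. Python reads .pth files at startup and appends
each path listed in them to sys.path, so the package becomes importable from
anywhere.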
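
As a minimal sketch of how a .pth file works, the example below writes one and
lets the standard site module process it. Note that the line in the .pth file
names the folder that contains bitcoin_analysis, since that is the folder
Python must search for "import bitcoin_analysis" to succeed. The helper
write_pth, the file name bitcoin_analysis.pth, and all directories are
assumptions made for the demo; a temporary folder stands in for a real
site-packages directory so that nothing on your system is modified.

```python
import site
import sys
import tempfile
from pathlib import Path


def write_pth(site_dir, project_root, name="bitcoin_analysis.pth"):
    """Write a one-line .pth file in site_dir that points at project_root."""
    pth_file = Path(site_dir) / name
    pth_file.write_text(str(Path(project_root).resolve()) + "\n")
    return pth_file


# Hypothetical repository containing the bitcoin_analysis package.
demo = Path(tempfile.mkdtemp())
project_root = demo / "repo"
package_dir = project_root / "bitcoin_analysis"
package_dir.mkdir(parents=True)
(package_dir / "__init__.py").touch()

# Stand-in for a real site directory.
fake_site = demo / "site-packages"
fake_site.mkdir()
write_pth(fake_site, project_root)

# addsitedir adds the directory to sys.path and processes its .pth files,
# which is what Python does for real site directories at startup.
site.addsitedir(str(fake_site))

import bitcoin_analysis  # now found through the .pth entry
```

For a real installation, site.getusersitepackages() returns the per-user site
directory, which is one place where such a file could live.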

Create a scripts/ folder, move your executable scripts (like the parser)
there, and add the folder to your PATH variable.

Fix the imports in those scripts. For example, assuming that in the earlier
step you put your parser code under bitcoin_analysis.parser:

from BCDataStream import ...

becomes:

from bitcoin_analysis.parser.BCDataStream import ...

Now any Python module will correctly import your code from the
bitcoin_analysis package, regardless of where that module resides.

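At the top level of the repository, create a file called setup.py. This is
the distutils script that will be used to install the package at other sites.
(distutils was removed from the standard library in Python 3.12; on those
versions, setuptools provides the same setup and Extension names.) You can
customize the following script to your needs: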

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
from numpy import get_include

# Directory with the NumPy C headers, needed to compile the extension.
_incl = [get_include()]

setup(
    name="knowledge_clouse",
    description="Graph-theoretic measures of knowledge confidence",
    version="0.0.1pre",
    author="Giovanni Luca Ciampaglia",
    author_email="gciampag@indiana.edu",
    # Pure-Python packages to install.
    packages=["knowledge_measure"],
    # Let Cython compile the .pyx sources during build_ext.
    cmdclass={"build_ext": build_ext},
    ext_modules=[
        Extension(
            "knowledge_measure.cmaxmin",
            ["knowledge_measure/cmaxmin.pyx"],
            include_dirs=_incl,
            # Compile and link with OpenMP support.
            extra_compile_args=["-fopenmp"],
            extra_link_args=["-fopenmp"],
        ),
    ],
    # Executable scripts, copied to bin/ under the installation prefix.
    scripts=[
        "scripts/closure.py",
        "scripts/cycles.py",
        "scripts/ontoparse.py",
        "scripts/test_dag.py",
        "scripts/prep.py",
    ],
)

Notes:

  • you don’t have C extension modules, so skip the ext_modules part, the
    cmdclass argument with build_ext, the imports of Extension and of
    build_ext from Cython, and the import of get_include from numpy (as well
    as the line assigning to the _incl variable).
  • in the list that you pass to the scripts argument, add the paths to all
    the scripts you have. When you run python setup.py install, the setup.py
    script will copy the scripts under bin/ in the installation prefix that
    you specify (e.g. a virtualenv, /usr/local, or wherever you pass as
    --prefix to setup.py).
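
Following these notes, the arguments for a pure-Python project shrink to a
handful. The sketch below collects them in a dictionary so that they can be
inspected without running an installation; the helper name setup_arguments,
the version string, and the idea of gathering scripts with a glob are
assumptions made for illustration, not part of the original script. One
detail worth noting is that sub-packages such as bitcoin_analysis.parser have
to be listed explicitly, because distutils does not discover them on its own.

```python
from pathlib import Path


def setup_arguments(repo_root="."):
    """Collect keyword arguments for setup() in a pure-Python project."""
    root = Path(repo_root)
    # Every *.py file under scripts/ is installed as an executable script.
    scripts = sorted(
        path.relative_to(root).as_posix()
        for path in (root / "scripts").glob("*.py")
    )
    return {
        "name": "bitcoin_analysis",
        "version": "0.0.1",  # placeholder version
        # Sub-packages are not discovered automatically; list each one.
        "packages": ["bitcoin_analysis", "bitcoin_analysis.parser"],
        "scripts": scripts,
    }


# In setup.py, the dictionary would be passed on like this:
#     from distutils.core import setup
#     setup(**setup_arguments())
```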

Create an experiment folder, and start populating it as in the example from
my previous mail. In each individual experiment, keep only the code related
to that experiment. Also store there any output files that you produce, as
long as they are not too big, e.g. plots or textual dumps. Large data files
(like the *.blk files) are instead kept somewhere else, entirely outside of
the repository, and are never checked in. That way, anyone who wants to clone
your repository doesn’t have to wait hours to download gigabytes of data.
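
One way to keep experiment code independent of where the large files live is
to look the location up at run time. The following is a minimal sketch under
stated assumptions: the environment variable BITCOIN_DATA_DIR, the default
path, and both function names are hypothetical and should be adapted to your
own setup.

```python
import os
from pathlib import Path


def data_dir(default="~/data/bitcoin"):
    """Return the folder holding large data files, outside the repository.

    BITCOIN_DATA_DIR is a hypothetical environment variable, and the default
    location is only a placeholder.
    """
    return Path(os.environ.get("BITCOIN_DATA_DIR", default)).expanduser()


def block_files(directory=None):
    """List the *.blk files in the data folder (empty list if it is missing)."""
    directory = Path(directory) if directory is not None else data_dir()
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.blk"))
```

An experiment script can then call block_files() without hard-coding any
path, and the repository itself never needs to contain the data.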

When you are working on an experiment and realize that you need to reuse some
code written for another experiment, move that code into the main
bitcoin_analysis package and import it from there. This increases code reuse
for this and future projects.