Common Python programming mistakes

Python is a wonderful language but, like any programming language, it has its
own idiosyncrasies that, when overlooked, may lead to awful bugs and/or very
poor performance. This is a collection of common traps that people fall into
when writing software in Python. It is specifically focused on scientific tasks
involving large-scale problems, including simulation, data analysis, and
visualization. This guide presupposes a working knowledge of Python and of
its standard library.

A note on versions: at the time of writing, the Python community is still in
the middle of a major transition from the old Python 2 to Python 3. Many of the
problems described here stem from idiosyncrasies or limitations of Python 2
that have been solved or strongly mitigated in more recent versions of the
language and its standard library. In general, if you are approaching Python
for the first time, it's simpler to start coding in Python 3 right away, since
the most important libraries we use (e.g. NumPy, SciPy, Matplotlib) have
already been ported to it, so there should be no compatibility problems with
your code. That said, I'll cover several examples from the perspective of the
legacy version, since there is still a lot of legacy code around and, I
confess, it is the one I am most familiar with. I'll try to point out whenever
a given programming mistake has been defused in Python 3 and what has been
specifically changed in the language in order to do so.

Mistake #1: read instead of readline

This is perhaps the single most common mistake when dealing with large datasets:
trying to load all your data into memory. Let’s look at a common example, a
script that prints the 10 most frequent words in a text file.

import pprint
counts = {}
f = open('./hugedocument.txt')
data = f.read().split()
f.close()
for word in data:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1
counts = sorted(counts.items(), key=lambda k : k[1], reverse=True)
pprint.pprint(counts[:10])

The line data = f.read().split() reads the whole file into a temporary string,
breaks the string at whitespace, and puts all the resulting non-empty words
into a list called data. With a very large data file, such as a daily
compressed dump of tweets, this will use up all available memory, which will in
turn trigger the out-of-memory (OOM) killer. This is a component of the Linux
operating system whose purpose is to identify and terminate the running
processes responsible for excessive resource consumption. Unfortunately, the
OOM killer can only make a heuristic guess, and so it may kill other, innocent
processes before catching the actual resource hog.

To understand how Linux manages memory and how the OOM killer works, read here:

This mistake can be fixed by processing the text file one line at a time:

import pprint
counts = {}
f = open('./hugedocument.txt')
while True:
    line = f.readline()
    if line:
        for word in line.split():
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1
    else:
        break
f.close()
counts = sorted(counts.items(), key=lambda k : k[1], reverse=True)
pprint.pprint(counts[:10])

Or better, use the fact that file objects in Python are iterable, yielding one line at a time:

import pprint
counts = {}
f = open('./hugedocument.txt')
for line in f:
    for word in line.split():
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
f.close()
counts = sorted(counts.items(), key=lambda k : k[1], reverse=True)
pprint.pprint(counts[:10])
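
As an aside, the counting itself can be written more compactly with
collections.Counter (available since Python 2.7), while still reading one line
at a time. A minimal sketch, assuming the same hugedocument.txt:

import pprint
from collections import Counter

counts = Counter()
f = open('./hugedocument.txt')
for line in f:
    counts.update(line.split())  # add this line's words to the tally
f.close()
pprint.pprint(counts.most_common(10))  # the 10 most frequent words and their counts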

Mistake #2: membership test of lists instead of sets

This is a mistake that results in wasted CPU time. Often you want to check
whether an element appears in a sequence of elements. In Python this is done with:

seq = range(10)
elem = 1
flag = elem in seq

However, seq is a Python list (under Python 2, range returns a list), and
Python lists offer only O(N) time complexity for membership tests, i.e. you
have to go through seq element by element. Since you just want to test whether
elem belongs to seq, what you can do is use a set, whose hash-based membership
tests take O(1) time on average:

seq = set(range(10))
elem = 1
flag = elem in seq
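
To get a feeling for the difference, here is a rough timing sketch using the
standard timeit module (the sizes and the number of repetitions are arbitrary;
only the relative gap matters):

import timeit

# membership test on an element near the end of the data: the worst case for a list
print(timeit.timeit('99999 in seq', setup='seq = list(range(100000))', number=1000))
print(timeit.timeit('99999 in seq', setup='seq = set(range(100000))', number=1000))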

Mistake #3: range instead of xrange

This needs little introduction: if you are iterating over a billion
elements, the following:

N = int(1e9)
for i in range(N):
    ...

will first allocate in memory a list of the integers from 0 to 999,999,999, and
only then iterate over it. In contrast, xrange returns a lazy object that goes
over the sequence one element at a time.

Note: this has been solved in Python 3 by making range behave like
xrange.
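
Under Python 2 you can get a rough idea of the difference with sys.getsizeof. A
small sketch (the reported sizes cover only the container objects, not the
integer objects they refer to):

import sys

N = int(1e6)
print(sys.getsizeof(range(N)))   # a full list: several megabytes
print(sys.getsizeof(xrange(N)))  # a small constant-size object: a few dozen bytes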

Mistake #4: too many function calls

Often we need to perform the same computation over a large number of elements.
Each computation is independent of the others, and so we write a function that
takes a single argument. Then, we apply the function to each element:

def compute(element):
    ... # various steps

for elem in inputdata:
    compute(elem)

Because calling a Python function has a significant overhead, with sequences of
several million elements this results in very poor performance. Instead, change
the function to accept multiple elements:

def compute(*elements):
    for element in elements:
        ... # various steps

compute(*inputdata)

Also, note the syntax used to define the function: *elements means that the
function accepts a variable number of arguments, which will be automatically
packed into a tuple called elements. A classic example is the built-in function
max, which can be called with any number of arguments.
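
A rough way to see the per-call overhead is to time both versions on the same
trivial workload. A sketch, where compute_one and compute_many are hypothetical
stand-ins for the real computation:

import timeit

def compute_one(element):
    return element * 2  # trivial stand-in for the real work

def compute_many(*elements):
    return [e * 2 for e in elements]  # same work, but a single function call

inputdata = list(range(100000))
print(timeit.timeit(lambda: [compute_one(e) for e in inputdata], number=10))
print(timeit.timeit(lambda: compute_many(*inputdata), number=10))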

Mistake #5: map vs imap, filter vs ifilter, etc.

Again, we have a sequence of elements, and we need to apply the same operation
to each element in it (also known as “mapping”), filter a subset of elements,
and so on and so forth. Sometimes we do several rounds of mapping/filtering
before being able to compute the actual result we want.

For example, let us imagine that we want to count the number of zeros in a
random sequence of integers from 0 to 9. Of course this can be done in a much
simpler way using numpy.random.randint and numpy.histogram, but just for
illustrative purposes let us give an implementation using numpy.random.rand
(which returns real numbers instead of integers) together with map and filter:

import numpy as np
data = np.random.rand(int(1e9)) * 10
result = len(filter(lambda k : k == 0, map(round, data)))

The last line will first round the data to the nearest integers (our random
integer sequence), then build a list of all the zeros in it, and finally take
its length. The intermediate list produced by map holds as many elements as the
original data, and the list produced by filter adds the matching elements on
top of that, so memory consumption at least doubles. Replacing map with
itertools.imap and filter with itertools.ifilter solves the issue, with the
caveat that an iterator has no length, so the final count must be computed
differently, e.g. with sum(1 for _ in ...).

Note: this issue does not exist anymore in Python 3: both map and filter
now return iterators instead of lists.
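
For reference, a lazy version of the same computation under Python 2 could look
like the sketch below (with a smaller array so that it runs comfortably on a
laptop; since iterators have no length, the zeros are counted with a generator
expression):

import numpy as np
from itertools import ifilter, imap

data = np.random.rand(int(1e7)) * 10
# imap and ifilter yield one element at a time, so no intermediate lists are built
result = sum(1 for _ in ifilter(lambda k: k == 0, imap(round, data)))

Under Python 3 the same code would simply use the built-in map and filter,
dropping the itertools imports.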