Python code profiling and accelerating your calculations with numba

You wrote up your excellent idea as a Python program/module, but you are unsatisfied with its performance. Chances are high most of us have been there at least once. I was there last week.

I found an excellent method for outlier detection (Extended Isolation Forest). eIF was initially written in Python and later optimised in Cython (using C++). The C++ version is ~40x faster than the vanilla Python one, but it lacks the ability to save the model (which is crucial for my project). Since adding model saving to the C++ version is rather complicated business, I decided to optimise the Python code. Initially I hoped for a ~5-10x speed improvement. The final effect surprised me, as the rewritten Python code was ~40x faster than the initial version, matching the C++ version's performance!

How is that possible? Speeding up your code isn't trivial. First you need to find out which parts of your code are slow (so-called code profiling). Once you know that, you can start tinkering with the code itself (code optimisation).

Code profiling

Traditionally I've been relying on %timeit, which reports the precise execution time of Python expressions.

%timeit F3.fit(X)
# 1.25 s ± 792 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

As awesome as %timeit is, it won’t really tell you which parts of your code are time consuming. At least not directly. For that you’ll need something more advanced.

Code profiling became easier thanks to line_profiler. You can install, load and use it in Jupyter notebook as follows:

# install line_profiler in your system
!pip install line_profiler 
# load the module into current Jupyter notebook
%load_ext line_profiler

# evaluate populate_nodes function of F3.fit program
%lprun -f F3.populate_nodes F3.fit(X)

The example above tells you that although line 134 takes just 11.7 µs per single execution, overall it accounts for 42.5% of the execution time, as it's executed over 32k times. So starting the optimisation from this single line could have a dramatic effect on the overall execution time.
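
If you prefer to run the profiler outside Jupyter, line_profiler can also be driven directly from Python. A minimal sketch, assuming the same F3 object and X data as above:

from line_profiler import LineProfiler

lp = LineProfiler()
lp.add_function(F3.populate_nodes)  # report this function line by line
fit = lp(F3.fit)                    # wrap fit so the profiler is active while it runs
fit(X)
lp.print_stats()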

Code optimisation

The first thing I noticed in the original Python code was that, in order to calculate the outlier score, individual samples were streamed one by one through the individual trees of the iForest.

        for i in range(len(X_in)):
            h_temp = 0
            for j in range(self.ntrees):
                h_temp += PathFactor(X_in[i],self.Trees[j]).path*1.0            # Compute path length for each point
            Eh = h_temp/self.ntrees                                             # Average of path length travelled by the point in all trees.
            S[i] = 2.0**(-Eh/self.c)                                            # Anomaly Score
        return S

Since those are array operations, a lot of time can be saved if either all samples are processed by each tree at once, or each sample is processed by all trees at once. Implementing this wasn't difficult and, combined with cleaning the code of unnecessary variables & classes, resulted in a ~6-7x speed-up.
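
For illustration, the vectorised scoring loop looks roughly like the sketch below. This is not the actual eif_new.py code; path_lengths() is a hypothetical helper that would return the path lengths of all samples in X for a single tree as a 1D numpy array:

import numpy as np

def anomaly_scores(X, trees, c):
    """Score all samples at once instead of looping over them one by one."""
    Eh = np.zeros(len(X))
    for tree in trees:                 # stream the whole array through one tree at a time
        Eh += path_lengths(tree, X)    # hypothetical per-tree, all-samples path lengths
    Eh /= len(trees)                   # average path length per sample
    return 2.0 ** (-Eh / c)            # anomaly score, computed for the whole array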

Speeding up array operations with numba

Further improvements were much milder and required detailed code profiling. As mentioned above, a single line took 42% of the overall execution time. Upon closer inspection, I realised that calling X.min(axis=0) and X.max(axis=0) was really time-consuming.

x = np.random.random(size=(256, 12))
%timeit x.min(axis=0), x.max(axis=0)
# 15.6 µs ± 43.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Python code can be optimised with numba. For example, calculating min and max simultaneously using the numba just-in-time compiler results in over 7x faster execution!

from numba import jit

@jit
def minmax(x):
    """np.min(x, axis=0), np.max(x, axis=0) for 2D array but faster"""
    m, n = len(x), len(x[0])
    mi, ma = np.empty(n), np.empty(n)
    mi[:] = ma[:] = x[0]
    for i in range(1, m):
        for j in range(n):
            if x[i, j]>ma[j]: ma[j] = x[i, j]
            elif x[i, j]<mi[j]: mi[j] = x[i, j]
    return mi, ma

%timeit minmax(x) 
# 2.19 µs ± 4.61 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# make sure the results are the same
mi, ma = minmax(x)
assert np.array_equal(mi, x.min(axis=0)) and np.array_equal(ma, x.max(axis=0))

Apart from that, there were several other parts that could be optimised with numba. You can have a look at eif_new.py and compare it with the older and the C++ versions using this notebook. If you want to know the details, just comment below – I'll be more than happy to discuss them 🙂

If you're looking for ways of speeding up array operations, definitely check numexpr besides numba. The eIF case didn't really need numexpr optimisations, but it's a really impressive project and I can imagine many people benefiting from it. So spread the word!
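
To give a flavour of numexpr, the anomaly-score formula from above could be evaluated in a single (potentially multi-threaded) pass like this; Eh and c are made-up example values here, not the eIF internals:

import numpy as np
import numexpr as ne

Eh = np.random.random(100000)
c = 10.0
S = ne.evaluate("2.0**(-Eh/c)")  # same result as plain numpy, evaluated in one pass over the array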

Reducing the size of large git repository

The GitHub repository of the #NGSchool website has grown to over 5GB. I wanted to reduce the size & simplify this repository, but this task turned out to be quite complicated. Instead, I decided to leave the current repo as is (and probably remove it soon) and start a new repo for the existing version. I could do that, as I don't care about versions earlier than the one I'm currently using. This is a short how-to:

  1. Push all changes and remove the .git folder
    git push origin master
    rm -rI .git

  2. Rename the existing repo on GitHub
    Settings > Repository name > RENAME

  3. Start a new repository under the old repo name
    There is no need to create any files, as everything already exists locally.

  4. Init your local repo and add the new remote
    git init
    git remote add origin git@github.com:USER/REPO

  5. Commit the changes and push
    git add --all . && git commit -m "fresh" && git push origin master

Having done so, my new repo is below 1GB, which is much better than the previous 5GB.

Github push fails due to large files

Lately, I have had lots of problems with pushing large files to GitHub. I maintain a compilation of materials and software deposited by other people, so I cannot control the size of the files… and this makes pushes fail often.

git push
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: error: Trace: 6f0f7f66995a394598595375954732db
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File chip_seq/reads/sox2_chip.fastq.gz is 109.69 MB; this exceeds GitHub's file size limit of 100.00 MB

To remove the large file from your commits, execute

git filter-branch -f --index-filter 'git rm --cached --ignore-unmatch chip_seq/reads/sox2_chip.fastq.gz'
git push

To add large files using git-lfs, execute

# track with git lfs all files larger than 50MB, skipping those in the .git folder
find . -type f -size +50M ! -iwholename "*.git*" | rev | cut -f1 -d'/' | rev | xargs git lfs track
# then add, commit & push as usual
git add --all . && git commit -m "final" && git push origin

Make sure that your files are smaller than 2GB, otherwise your push will fail again 😉

Then, before pulling on another machine, make sure to install git-lfs:

git lfs install
git pull

Working with large binary files in git

Git is great, there is no doubt about that. Being able to revert any changes and recover lost data is simply priceless. But recently, I have started to be concerned about the size of some of my repositories. Some, especially those containing changing binary files, were really large!!!
You can check the size of your repository with a simple command:

git count-objects -vH

Here, Git Large File Storage (LFS) comes into action. Below, I'll describe how to install it and mark large binary files, so that they are stored outside the normal git history and only lightweight pointers are kept in the repository.

  1. Install git-lfs
    # add packagecloud repo
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash

    # install git-lfs
    sudo apt-get install git-lfs

    # and enable it
    git lfs install

  2. Mark and commit the binary file
    # mark the large binary file
    git lfs track some.file

    # add, commit & push changes
    git add some.file
    git commit -m "some.file as LFS"
    git push origin master

Using docker for application development

I found Docker super useful, but going through the manual is quite time-consuming. Here is a very stripped-down guide to creating your first image and pushing it online 🙂

# install docker
wget -qO- https://get.docker.com/ | sh
 
# add your user to docker group
sudo usermod -aG docker $USER
 
# check if it's working
docker run docker/whalesay cowsay "hello world!"
 
# create an account on https://hub.docker.com
# and login
docker login -u $USER --email=EMAIL
 
# run image
docker run -it ubuntu
 
# make some changes, e.g. create a user, install needed software, etc.
 
# finally open new terminal & commit changes (SESSIONID=HOSTNAME)
docker commit SESSIONID $USER/image:version
 
# mount local directory `pwd`/test as /test in read/write mode
docker run -it -v `pwd`/test:/test:rw $USER/image:version some command with arguments
 
# push image
docker push $USER/image:version

From now on, you can get your image on any other machine connected to the Internet by executing:

docker run -it $USER/image:version
# e.g. the redundans image
docker run -it -w /root/src/redundans lpryszcz/redundans:v0.11b ./redundans.py -v -i test/{600,5000}_{1,2}.fq.gz -f test/contigs.fa -o test/run1
 
# you can create the alias latest, then the version can be skipped when running
docker tag lpryszcz/redundans:v0.11b lpryszcz/redundans:latest
docker push lpryszcz/redundans:latest
 
docker run -it lpryszcz/redundans

You can add info about your repository at https://hub.docker.com/r/$USER/image/

Working efficiently with millions of files

Working with millions of intermediate files can be very challenging, especially if you need to store them on a distributed / network file system (NFS). This will make listing / navigating the directories take ages… and removing these files very time-consuming.
While building the metaPhOrs DB, I needed to store some ~7.5 million intermediate files that were subsequently processed on an HPC cluster. Saving this amount of files on the NFS would seriously affect not only me, but also overall system performance.
One could store the files in an archive, but then retrieving the data would mean parsing rather huge archives (tens to hundreds of GB) just to get back rather small portions of data.
I realised that TAR archives are natively supported in Python and can be indexed (see `tar_indexer`), which provides easy integration into existing code and random access. If you work with text data, you can even zlib.compress the data stored inside your archives!
Below, I’m providing relevant parts of my code:
BASH

# index content of multiple tar archives
tar2index.py -v -i db_*/*.tar -d archives.db3
 
# search for some_file in multiple archives
tar2index.py -v -f some_file -d archives.db3

Python

import sqlite3, time
import tarfile, zlib, cStringIO
 
###
# lookup function
def tar_lookup(dbpath, file_name):
    """Return file name inside tar, tar file name, offset and file size."""
    cur = sqlite3.connect(dbpath).cursor()
    cur.execute("""SELECT o.file_name, f.file_name, offset, file_size
                FROM offset_data as o JOIN file_data as f ON o.file_id=f.file_id
                WHERE o.file_name like ?""", (file_name,))
    return cur.fetchall()
 
###
# saving to archive
    # open tarfile
    tar = tarfile.open(tarpath, "w")
    # save files to tar
    for fname, txt in files_generator:
        # compress file content (optionally)
        gztxt = zlib.compress(txt)
        # get tarinfo
        ti = tarfile.TarInfo(fname)
        ti.size  = len(gztxt)
        ti.mtime = time.time()
        # add to tar
        tar.addfile(ti, cStringIO.StringIO(gztxt))
    # close the archive so the end-of-archive record gets written
    tar.close()
 
###
# reading from indexed archive(s)
# NOTE: you need to run tar2index.py on your archives beforehand
    tarfnames = tar_lookup(index_path, file_name)
    for i, (name, tarfn, offset, file_size) in enumerate(tarfnames, 1):
        tarf = open(tarfn, "rb")
        # move pointer to right archive place
        tarf.seek(offset)
        # read tar fragment & uncompress
        txt = zlib.decompress(tarf.read(file_size))

Tracing exceptions in multiprocessing in Python

I had problems with debugging my programme using multiprocessing.Pool.

Traceback (most recent call last):
  File "src/homologies2mysql_multi.py", line 294, in <module>
    main()
  File "src/homologies2mysql_multi.py", line 289, in main
    o.noupload, o.verbose)
  File "src/homologies2mysql_multi.py", line 242, in homologies2mysql
    for i, data in enumerate(p.imap_unordered(worker, pairs), 1):
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 520, in next
    raise value
ValueError: need more than 1 value to unpack

I could run it without multiprocessing, but then I’d have to wait some days for the program to reach the point where it crashes.
Luckily, Python is equipped with the traceback module, which allows handy tracing of exceptions.
You can then add a decorator to the problematic function that will report a nice error message:

import traceback, functools, multiprocessing
 
def trace_unhandled_exceptions(func):
    @functools.wraps(func)
    def wrapped_func(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except:
            print('Exception in ' + func.__name__)
            traceback.print_exc()
    return wrapped_func
 
@trace_unhandled_exceptions
def go():
    print(1)
    raise Exception()
    print(2)
 
p = multiprocessing.Pool(1)
 
p.apply_async(go)
p.close()
p.join()

The error message will look like:

1
Exception in go
Traceback (most recent call last):
  File "<stdin>", line 5, in wrapped_func
  File "<stdin>", line 4, in go
Exception

Solution found on StackOverflow.

Connecting to MySQL without passwd prompt

If you are (like me) annoyed by providing a password at every mysql login, you can skip it. It also makes programmatic access to any MySQL db easier, as no password prompt is necessary 🙂
Create a `~/.my.cnf` file:

[client]
user=username
password="pass"
 
[mysql]
user=username
password="pass"

And login without `-p` parameter:

mysql -h host -u username dbname

If you want to use `~/.my.cnf` file in MySQLdb, just connect using this:

import MySQLdb
cnx = MySQLdb.connect(host=host, port=port, read_default_file="~/.my.cnf")

Batch convert of .xlsx (Microsoft Office) to .tsv (tab-delimited) files

I had to retrieve data from multiple .xlsx files with multiple sheets. This can be done manually, but it would be a rather time-consuming task, plus Office quotes text fields, which is not very convenient for downstream analysis…
I found a handy script, xlsx2tsv.py, that does the job, but it reports only one sheet at a time. Thus, I rewrote xlsx2tsv.py a little to save all sheets of a given .xlsx file into a separate folder. In addition, multiple .xlsx files can be processed at once. My version can be found on GitHub.

xlsx2tsv.py *.xlsx
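
For reference, the core idea can be sketched with openpyxl; this is a simplified illustration only, not the actual xlsx2tsv.py script:

import os, sys
from openpyxl import load_workbook

def xlsx2tsv(fname):
    """Dump every sheet of an .xlsx file as an unquoted .tsv inside a folder named after the file."""
    outdir = os.path.splitext(fname)[0]
    os.makedirs(outdir, exist_ok=True)
    wb = load_workbook(fname, read_only=True, data_only=True)
    for ws in wb.worksheets:
        with open(os.path.join(outdir, ws.title + ".tsv"), "w") as out:
            for row in ws.iter_rows(values_only=True):
                out.write("\t".join("" if v is None else str(v) for v in row) + "\n")

if __name__ == "__main__":
    for fname in sys.argv[1:]:
        xlsx2tsv(fname)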