On simplifying dependencies

Lately, to make Redundans more user friendly, I have simplified it’s dependencies, by replacing Biopython, numpy, scipy and SQLite with some (relatively) simple functions or modules.

Here, I will just focus on replacing Biopython, particularly SeqIO.index_db with FastaIndex. You may ask yourself, why I have invested time in reinventing the wheel. I’m big fan of Biopython, yet it’s huge project and some solutions are not optimal or require problematic dependencies. This is the case with SeqIO.db_index, that relies on SQLite3. Here again, I’m a big fan of SQLite, yet building Biopython with SQLite enabled proved not to be very straightforward for non-standard systems or less experience users. Beside, on some NFS settings, the SQLite3 db cannot be created at all.

Ok, let’s start from the basics. SeqIO.index_db allows random access to sequence files, so for example you can rapidly retrieve any entry from very large file. This is achieved by storing the ID and position of each entry from particular file in database, SQLite3 db. Then, if you want to retrieve particular record, SeqIO.index_db looks up if this record is present in SQLite3 db, retrieves record position in the file and reads only small chunk of this file instead of parsing entire file every time you want to get some record(s).
Similar feature is offered by samtools faidx, but in this case, the coordinates of each entry are stored in tab-delimited file .fai (more info about .fai). This format can be easily read & write by any programme, so I have decided to use it. In addition, I have realised, that samtools faidx is flexible enough, so you can add additional columns to the .fai without interrupting its functionality, but about that later…

In Redundans, I’ve been using SeqIO.index_db during assembly reduction (fasta2homozygous.py). Additionally, beside storing index, I’ve been also generating statistics for every FastA file, like number of contigs, cumulative size, N50, N90, GC and so on. I have realised, that these two can be easily combined, by extending .fai with four additional columns, storing number of occurencies for A, C, G & T in every sequence. Such .fai is compatible with samtools faidx and provides very easy way of calculating bunch of statistics about this file.
All of these, I’ve implemented in FastaIndex. Beside being dependency-free & very handy indexer, it can be used also as alternative to samtools faidx to retrieve sequences from large FastA files.

# retrieve bases from 20 to 60 from NODE_2
./FastaIndex.py -i test/run1/contigs.fa -r NODE_2_length_7674_cov_46.7841_ID_3:20-60
#Time elapsed: 0:00:00.014243

samtools faidx test/run1/contigs.fa NODE_2_length_7674_cov_46.7841_ID_3:20-60

Using docker for application development

I found Docker super useful, but going through a manual is quite time consuming. Here, very stripped manual to create your first image and push it online 🙂

# install docker
wget -qO- https://get.docker.com/ | sh
# add your user to docker group
sudo usermod -aG docker $USER
# check if it's working
docker run docker/whalesay cowsay "hello world!"
# create an account on https://hub.docker.com
# and login
docker login -u $USER --email=EMAIL
# run image
docker run -it ubuntu
# make some changes ie. create user, install needed software etc
# finally open new terminal & commit changes (SESSIONID=HOSTNAME)
docker commit SESSIONID $USER/image:version
# mount local directory `pwd`/test as /test in read/write mode
docker run -it -v `pwd`/test:/test:rw $USER/image:version some command with arguments
# push image
docker push $USER/image:version

From now, you can get your image from any other machine connected to Internet by executing:

docker run -it $USER/image:version
# ie. redundans image
docker run -it -w /root/src/redundans lpryszcz/redundans:v0.11b ./redundans.py -v -i test/{600,5000}_{1,2}.fq.gz -f test/contigs.fa -o test/run1
# you can create alias latest, then version can be skipped on running
docker tag lpryszcz/redundans:v0.11b lpryszcz/redundans:latest
docker push lpryszcz/redundans:latest
docker run -it lpryszcz/redundans

You can add info about your repository at https://hub.docker.com/r/$USER/image/

Working efficiently with millions of files

Working with millions of intermediate files can be very challenging, especially if you need to store them in distributed / network file system (NFS). This will make listing / navigating the directories to take ages… and removing of these files very time-consuming.
During building metaPhOrs DB, I needed to store some ~7.5 million of intermediate files that were subsequently processed in HPC. Saving these amount of files in the NFS would seriously affect not only myself, but also overall system performance.
One could store files in an archive, but then if you want to retrieve the data you would need to parse rather huge archives (tens-to-hundreds of GB) in order to retrieve rather small portions of data.
I have realised that TAR archives are natively supported in Python and can be indexed (see `tar_indexer`), which provide easy integration into existing code and random-access. If you work with text data, you can even zlib.compress the data stored inside you archives!
Below, I’m providing relevant parts of my code:

# index content of multiple tar archives
tar2index.py -v -i db_*/*.tar -d archives.db3
# search for some_file in mutliple archives
tar2index.py -v -f some_file -d archives.db3


import sqlite3, time
import tarfile, zlib, cStringIO
# lookup function
def tar_lookup(dbpath, file_name):
    """Return file name inside tar, tar file name, offset and file size."""
    cur = sqlite3.connect(dbpath).cursor()
    cur.execute("""SELECT o.file_name, f.file_name, offset, file_size
                FROM offset_data as o JOIN file_data as f ON o.file_id=f.file_id
                WHERE o.file_name like ?""", (file_name,))
    return cur.fetchall()
# saving to archive
    # open tarfile
    tar = tarfile.open(tarpath, "w")
    # save files to tar
    for fname, txt in files_generator:
        # compress file content (optionally)
        gztxt = zlib.compress(txt)
        # get tarinfo
        ti = tarfile.TarInfo(fname)
        ti.size  = len(gztxt)
        ti.mtime = time.time()
        # add to tar
        tar.addfile(ti, cStringIO.StringIO(gztxt))
# reading from indexed archive(s)
# NOTE: before you need to run tar2index.py on your archives
    tarfnames = tar_lookup(index_path, file_name)
    for i, (name, tarfn, offset, file_size) in enumerate(tarfnames, 1):
        tarf = open(tarfn)
        # move pointer to right archive place
        # read tar fragment & uncompress
        txt = zlib.decompress(tarf.read(file_size))

Tracing exceptions in multiprocessing in Python

I had problems with debugging my programme using multiprocessing.Pool.

Traceback (most recent call last):
  File "src/homologies2mysql_multi.py", line 294, in <module>
  File "src/homologies2mysql_multi.py", line 289, in main
    o.noupload, o.verbose)
  File "src/homologies2mysql_multi.py", line 242, in homologies2mysql
    for i, data in enumerate(p.imap_unordered(worker, pairs), 1):
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 520, in next
    raise value
ValueError: need more than 1 value to unpack

I could run it without multiprocessing, but then I’d have to wait some days for the program to reach the point where it crashes.
Luckily, Python is equipped with traceback, that allows handy tracing of exceptions.
Then, you can add a decorator to problematic function, that will report nice error message:

import traceback, functools, multiprocessing
def trace_unhandled_exceptions(func):
    def wrapped_func(*args, **kwargs):
            return func(*args, **kwargs)
            print 'Exception in '+func.__name__
    return wrapped_func
def go():
    raise Exception()
p = multiprocessing.Pool(1)

The error message will look like:

Exception in go
Traceback (most recent call last):
  File "<stdin>", line 5, in wrapped_func
  File "<stdin>", line 4, in go

Solution found on StackOverflow.

Connecting to MySQL without passwd prompt

If you are (like me) annoyed by providing password at every mysql login, you can skip it. Also it makes easier programmatic access to any MySQL db, as not passwd prompting is necessary 🙂
Create `~/.my.cnf` file:


And login without `-p` parameter:

mysql -h host -u username dbname

If you want to use `~/.my.cnf` file in MySQLdb, just connect using this:

import MySQLdb
cnx = MySQLdb.connect(host=host, port=port, read_default_file="~/.my.cnf")

Batch convert of .xlsx (Microsoft Office) to .tsv (tab-delimited) files

I had to retrieve data from multiple .xlsx files with multiple sheets. This can be done manually, but it will be rather time-consuming tasks, plus Office quotes text fields, which is not very convenient for downstream analysis…
I have found handy script, xlsx2tsv.py, that does the job, but it reports only one sheet at the time. Thus, I have rewritten xlsx2tsv.py a little to save all sheets from given .xlsx file into separate folder. In addition, multiple .xlsx files can be process at once. My version can be found on github.

xlsx2tsv.py *.xlsx

Easy citation in LibreOffice / OpenOffice with Mendeley

Creating reference list is always a nightmare. Mendeley and its handy LibreOffice / OpenOffice plugin may be of great help to many. It was for me. Below, I’ll describe how to make it working.

# get & install mendeley from https://www.mendeley.com/download-mendeley-desktop/

# check version of your mendeley
#  Help > About Mendeley Desktop

# clone repo and build plugin
git clone git@github.com:Mendeley/openoffice-plugin.git
cd openoffice-plugin/
python build.py 1.15.2 false

# add to LibreOffice
#  Tools > Extension Manager > Add...
#   and look for `Mendeley-1.15.2.oxt`

After OpenOffice / LibreOffice restart, you should see new bar. Note, in order for the plugin to work, Mendeley has to be running.

What’s great about this plugin, you can adjust citation style by just a few clicks by clicking on `Choose Citation Style`. There is quite extensive database of predefined citation styles, so adjusting the reference style to your favourite journal will take just a few seconds 🙂
More info about the plugin on github.

Installing Gene Cluster on Ubuntu

Gene Cluster is a program for clustering. I wanted to use it to analyse gene expression data. However, I had problems during installation under Ubuntu 14.04. This is how I solved it:

# install dependencies: Motif libraries
sudo apt-get install libxext-dev libmotif-dev

Get GeneCluster3.0 source code and unpack it.

# configure to install in local dir
./configure --prefix=`pwd` --program-prefix=gene_ && make && make install

# add install dir to ~/.bashrc

Installing new version of Python without root

Some time ago I was recommending to use Python virtual environment to install local version of Python packages. However this will not solve the issue of outdated version Python in the server your are working in. Here, pythonbrew may be help for you.

# install pythonbrew to ~/.pythonbrew
curl -kL http://xrl.us/pythonbrewinstall | bash

# add to ~/.bashrc to automatically activate pythonbrew
[[ -s "$HOME/.pythonbrew/etc/bashrc" ]] && source "$HOME/.pythonbrew/etc/bashrc"                                                         

# open new terminal tab (Ctrl+Shift+T) or window (Ctrl+Shift+N)

# install python 2.7.10
pythonbrew install 2.7.10

# and enable the new version
pythonbrew switch 2.7.10

# from now on, you can enjoy the version of your choice and install dependencies
which python
python --version
#Python 2.7.10
which pip