Generating word clouds in Python

I often need to visualise the functions associated with a set of genes. word_cloud is a very handy word cloud generator written in Python.
Install it:

sudo pip install git+git://github.com/amueller/word_cloud.git

Incorporate it into your Python code:

import matplotlib.pyplot as plt
from wordcloud import WordCloud
 
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

# generate the word cloud and display it
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

# or save it as PNG
wordcloud.to_file("wordcloud.png")

It's used in metaPhOrs.

Fast interval selection in Python

Lately, I've been working on a program that computes depth of coverage from a BAM file for a set of genome intervals, i.e. genes or exons (bam2cov.py). The performance of this task is heavily affected by the time needed to select the intervals of interest.

At first, I stored the intervals as a list of tuples and used the filter function to select the intervals of interest, but the performance of this implementation wasn't satisfactory (1m29s for the test set).

# populate intervals
c2i[chrom] = [(start, end, strand, name), ]
 
# select intervals
ivals = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])

Later, I switched to numpy.array, which turned out to be much faster (0m07s). Notably, the numpy implementation also uses less memory, as objects are stored more efficiently in the array (13 bytes per interval: 3x 4 bytes for 'uint32' + 1 byte for 'bool_') than in a list of tuples (88 bytes per interval; cf. sys.getsizeof((100, 1000, 1, 100))).

import numpy as np
dtype = np.dtype({'names':   ['start',  'end',    'strand', 'entry_id'],
                  'formats': ['uint32', 'uint32', 'bool_',  'uint32']})
# populate intervals
c2i[chrom] = np.array(data, dtype=dtype)
 
# select intervals
ivals = c2i[chrom][np.any([np.all([ c2i[chrom]['start']>=s, c2i[chrom]['start']<=e ], axis=0),
                           np.all([ c2i[chrom]['end']  >=s, c2i[chrom]['end']  <=e ], axis=0),
                           np.all([ c2i[chrom]['start']< s, c2i[chrom]['end']  > e ], axis=0)], axis=0)]
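
You can verify the memory claim yourself; here is a minimal sketch (the exact numbers assume 64-bit CPython 2):

import sys
import numpy as np

dtype = np.dtype({'names':   ['start',  'end',    'strand', 'entry_id'],
                  'formats': ['uint32', 'uint32', 'bool_',  'uint32']})
a = np.zeros(1000, dtype=dtype)
print a.itemsize                          # 13 bytes per interval in the array
print sys.getsizeof((100, 1000, 1, 100))  # 88 bytes for a single 4-element tuple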

Benchmarking code:

import random
import numpy as np
from datetime import datetime
from bam2cov import load_intervals
 
# load data
c2i, entries = load_intervals('stranded/DANRE.gtf', 0)
 
# define chromosome name and length
chrom, chrlen = '1', 60348388
 
# test
sample = random.sample(xrange(0, chrlen), 10000)
counts = np.zeros(entries, dtype='uint')
t0 = datetime.now()
for ii,s in enumerate(sample, 1):
  if not ii%1000:
    print ii
  e = s+100
  # filter from list - uncomment if needed
  #selected = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])
  # select from numpy.array
  selected = c2i[chrom][np.any([np.all([s<c2i[chrom]['start'], c2i[chrom]['start']<e], axis=0),
                                np.all([s<c2i[chrom]['end'],   c2i[chrom]['end']<e],   axis=0)], axis=0)]
  for s, e, strand, i in selected:
    counts[i] += 1
     
dt=datetime.now()-t0
print dt, counts.sum()

Install jedi-mode in emacs

Today, I went through the rather complicated process of installing jedi-mode in emacs. I'm posting here what worked for me. First of all, make sure you run a recent emacs (24+), then follow these steps:

# install python-virtualenv
sudo apt-get install python-virtualenv 

# run emacs, enable MELPA repository, refresh packages and install jedi-mode
emacs 
M-: (add-to-list 'package-archives '("melpa" . "http://melpa.org/packages/") t)
M-x package-refresh-contents [RET]
M-x package-install [RET] jedi [RET]

# install python server
M-x jedi:install-server

# add below lines to ~/.emacs.d/init.el & restart emacs
;; Standard Jedi.el setting
(add-hook 'python-mode-hook 'jedi:setup)
(setq jedi:complete-on-dot t)

BAM2bigWig conversion using pybedtools

Today I was looking for a handy way of converting BAM to bigWig. Biostars turned out to be a handy place to start. Guess what, there is a ready-made module in pybedtools that does exactly that:

from pybedtools.contrib.bigwig import bam_to_bigwig
bam_to_bigwig(bam='path/to/bam', genome='hg19', output='path/to/bigwig')

Unfortunately, my genome of interest is not hosted at UCSC, so I needed to alter bam_to_bigwig slightly. In addition, I've replaced the read counting function mapped_read_count with my own implementation that relies on the BAM index and therefore returns the number of alignments nearly instantly. This saves a few minutes for large BAMs.
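
The index-based counting boils down to something like this minimal sketch, assuming pysam is installed (the function name is mine; the actual code in bam2bigwig.py may differ):

import pysam

def mapped_count(bam):
    """Return the number of mapped alignments, read from the BAM (.bai) index."""
    sam = pysam.Samfile(bam)  # the index must exist alongside the BAM
    return sam.mapped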

My own bam2bigwig.py implementation can be found in my github repo:

bam2bigwig.py -i SOME.BAM -g GENOME.fa -o SOME.BAM.bw

Multiprocessing in Python and garbage collection

Working with multiple processes in Python often leads to high RAM consumption. Unfortunately, automatic garbage collection in child processes doesn't work well. But there are two alternatives:

  • When using Pool(), you can specify the number of tasks after which each child will be restarted, releasing its memory:
p = Pool(processes=4, maxtasksperchild=1000)
  • If you use Process(), you can simply delete unwanted objects and call gc.collect() inside the child, as in the sketch below. Note, this may slow down your child process substantially!
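
A minimal sketch of the Process() variant (the worker and its data are made up for illustration):

import gc
from multiprocessing import Process

def worker(chunk):
    # some memory-hungry computation
    big = [x * 2 for x in chunk]
    print sum(big)
    # drop the large objects and force garbage collection in the child
    del big, chunk
    gc.collect()

if __name__ == '__main__':
    p = Process(target=worker, args=(range(10**6),))
    p.start()
    p.join()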

TAR random access

I was often challenged with accessing thousands/millions of files on a network file system (NFS). As I update some of the stored files once in a while, I decided to store these files in multiple TAR archives, which reduced the data complexity. Still, there was an issue with random access to the files within each archive.

First, I had a look at tar indexer. Its simplicity is brilliant, yet it stores the index in a raw text file and can handle only a single tar file. Therefore, I ended up writing my own tar_indexer tool, which uses sqlite3 to store the index and allows indexing of multiple tar archives. This can be easily incorporated into any Python project.

Note, only raw (uncompressed) tar files are accepted, as native tar.gz cannot be randomly accessed. But you can compress each file using zlib before adding it to the tar; at least, this is what I do.
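
To illustrate the idea, here's a minimal sketch of offset-based random access, relying on TarInfo's offset_data attribute (the schema and function names are simplified; tar_indexer itself does more):

import sqlite3, tarfile, zlib

def index_tar(tar_fn, db_fn):
    """Store the data offset and size of every member of a raw tar in sqlite3."""
    db = sqlite3.connect(db_fn)
    db.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, tar TEXT, offset INTEGER, size INTEGER)")
    tar = tarfile.open(tar_fn)  # raw .tar only; .tar.gz cannot be seeked into
    for m in tar:
        db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                   (m.name, tar_fn, m.offset_data, m.size))
    db.commit()

def get_file(db_fn, name):
    """Fetch a single member by seeking straight to its offset."""
    db = sqlite3.connect(db_fn)
    tar_fn, offset, size = db.execute(
        "SELECT tar, offset, size FROM files WHERE name=?", (name,)).fetchone()
    f = open(tar_fn, 'rb')
    f.seek(offset)
    data = f.read(size)
    f.close()
    return zlib.decompress(data)  # assuming files were zlib-compressed before adding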

Hope someone will find it useful :)