Fast interval selection in Python

Lately, I’ve been working on a program that computes depth of coverage from BAM file given some genome intervals i.e. genes or exons (bam2cov.py). The performance of this task is heavily affected by the time needed for selecting interesting intervals.

At first, I have stored intervals as a list of tuples and I’ve used filter function to select intervals of interest, but the performance of this implementation wasn’t satisfactory (1m29s for test set).

# populate intervals
c2i[chrom] = [(start, end, strand, name), ]
 
# select intervals
ivals = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])

Later, I’ve used numpy.array, which turned out to be much faster (0m07s). Noteworthy, numpy implementation also uses less memory as object are store more efficiently in array (13 bytes per interval; 3x 4b for ‘uint32′ + 1b for ‘bool_’) than in list of tuples (88 bytes per interval; sys.getsizeof((100,1000,1,100))).

import numpy as np
dtype = np.dtype({'names':   ['start',  'end',    'strand', 'entry_id'], \
                  'formats': ['uint32', 'uint32', 'bool_', 'uint32']})
# populate intervals
chr2intervals[chrom] = np.array(data, dtype=dtype)
 
# select intervals
ivals = c2i[chrom][np.any([np.all([ c2i[chrom]['start']>=s, c2i[chrom]['start']<=e ], axis=0),
                           np.all([ c2i[chrom]['end']  >=s, c2i[chrom]['end']  <=e ], axis=0),
                           np.all([ c2i[chrom]['start']< s, c2i[chrom]['end']  > e ], axis=0)], axis=0)]

Benchmarking code:

import numpy as np
from datetime import datetime
from bam2cov import load_intervals
 
# load data
chr2intervals, entries = load_intervals('stranded/DANRE.gtf', 0)
 
# define chromosome name and length
chrom, chrlen = '1', 60348388
 
# test
sample = random.sample(xrange(0, chrlen), 10000)
counts = np.zeros(entries, dtype='uint')
t0 = datetime.now()
for ii,s in enumerate(sample, 1):
  if not ii%1000:
    print ii
  e = s+100
  # filter from list - uncomment if needed
  #selected = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])
  # select from numpy.array
  selected = c2i[chrom][/c][np.any([np.all([s<c2i[chrom][/c]['start'], c2i[chrom][/c]['start']<e], axis=0), np.all([s<c2i[chrom][/c]['stop'], c2i[chrom][/c]['stop']<e], axis=0)], axis=0)]
  for s, e, strand, i in selected:
    counts[i] += 1
     
dt=datetime.now()-t0
print dt, counts.sum()