Fast interval selection in Python

Lately, I’ve been working on a program that computes depth of coverage from BAM file given some genome intervals i.e. genes or exons ( The performance of this task is heavily affected by the time needed for selecting interesting intervals.

At first, I have stored intervals as a list of tuples and I’ve used filter function to select intervals of interest, but the performance of this implementation wasn’t satisfactory (1m29s for test set).

# populate intervals
c2i[chrom] = [(start, end, strand, name), ]
# select intervals
ivals = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])

Later, I’ve used numpy.array, which turned out to be much faster (0m07s). Noteworthy, numpy implementation also uses less memory as object are store more efficiently in array (13 bytes per interval; 3x 4b for ‘uint32′ + 1b for ‘bool_’) than in list of tuples (88 bytes per interval; sys.getsizeof((100,1000,1,100))).

import numpy as np
dtype = np.dtype({'names':   ['start',  'end',    'strand', 'entry_id'], \
                  'formats': ['uint32', 'uint32', 'bool_', 'uint32']})
# populate intervals
chr2intervals[chrom] = np.array(data, dtype=dtype)
# select intervals
ivals = c2i[chrom][np.any([np.all([ c2i[chrom]['start']>=s, c2i[chrom]['start']<=e ], axis=0),
                           np.all([ c2i[chrom]['end']  >=s, c2i[chrom]['end']  <=e ], axis=0),
                           np.all([ c2i[chrom]['start']< s, c2i[chrom]['end']  > e ], axis=0)], axis=0)]

Benchmarking code:

import numpy as np
from datetime import datetime
from bam2cov import load_intervals
# load data
chr2intervals, entries = load_intervals('stranded/DANRE.gtf', 0)
# define chromosome name and length
chrom, chrlen = '1', 60348388
# test
sample = random.sample(xrange(0, chrlen), 10000)
counts = np.zeros(entries, dtype='uint')
t0 =
for ii,s in enumerate(sample, 1):
  if not ii%1000:
    print ii
  e = s+100
  # filter from list - uncomment if needed
  #selected = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])
  # select from numpy.array
  selected = c2i[chrom][/c][np.any([np.all([s<c2i[chrom][/c]['start'], c2i[chrom][/c]['start']<e], axis=0), np.all([s<c2i[chrom][/c]['stop'], c2i[chrom][/c]['stop']<e], axis=0)], axis=0)]
  for s, e, strand, i in selected:
    counts[i] += 1
print dt, counts.sum()

Leave a Reply

Your email address will not be published. Required fields are marked *