Lately, I’ve been working on a program that computes depth of coverage from BAM file given some genome intervals i.e. genes or exons (bam2cov.py). The performance of this task is heavily affected by the time needed for selecting interesting intervals.
At first, I have stored intervals as a list of tuples and I’ve used filter function to select intervals of interest, but the performance of this implementation wasn’t satisfactory (1m29s for test set).
# populate intervals c2i[chrom] = [(start, end, strand, name), ] # select intervals ivals = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])
Later, I’ve used numpy.array, which turned out to be much faster (0m07s). Noteworthy, numpy implementation also uses less memory as object are store more efficiently in array (13 bytes per interval; 3x 4b for ‘uint32′ + 1b for ‘bool_’) than in list of tuples (88 bytes per interval; sys.getsizeof((100,1000,1,100))
).
import numpy as np dtype = np.dtype({'names': ['start', 'end', 'strand', 'entry_id'], \ 'formats': ['uint32', 'uint32', 'bool_', 'uint32']}) # populate intervals chr2intervals[chrom] = np.array(data, dtype=dtype) # select intervals ivals = c2i[chrom][np.any([np.all([ c2i[chrom]['start']>=s, c2i[chrom]['start']<=e ], axis=0), np.all([ c2i[chrom]['end'] >=s, c2i[chrom]['end'] <=e ], axis=0), np.all([ c2i[chrom]['start']< s, c2i[chrom]['end'] > e ], axis=0)], axis=0)]
Benchmarking code:
import numpy as np from datetime import datetime from bam2cov import load_intervals # load data chr2intervals, entries = load_intervals('stranded/DANRE.gtf', 0) # define chromosome name and length chrom, chrlen = '1', 60348388 # test sample = random.sample(xrange(0, chrlen), 10000) counts = np.zeros(entries, dtype='uint') t0 = datetime.now() for ii,s in enumerate(sample, 1): if not ii%1000: print ii e = s+100 # filter from list - uncomment if needed #selected = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom]) # select from numpy.array selected = c2i[chrom][/c][np.any([np.all([s<c2i[chrom][/c]['start'], c2i[chrom][/c]['start']<e], axis=0), np.all([s<c2i[chrom][/c]['stop'], c2i[chrom][/c]['stop']<e], axis=0)], axis=0)] for s, e, strand, i in selected: counts[i] += 1 dt=datetime.now()-t0 print dt, counts.sum()