SOLiD RNA-Seq & splice-aware mapping

I’ve lost quite a lot of time trying to align color-space RNA-Seq reads. SHRiMP paper explains nicely, why it’s important to align SOLiD reads in color-space, instead of converting color-space directly into sequence-space. Below, you can find the simplest solution I have found, using tophat, relying on bowtie mapper (bowtie2 doesn’t support color-space) and color-space reads in .csfasta.

# generate genome index in color-space
bowtie-build --color GENOME.fa GENOME
# get SOLiD reads from SRA if you don't have them already in .csfasta
abi-dump SRR062662
# tophat splice-aware mapping in color-space
mkdir tophat
for f in READS_DIR/*.csfasta; do
  s=`echo $f | cut -f2 -d'/' | cut -f1 -d'.'`
  if [ ! -d tophat/$s ]; then
    echo `date` $f $s
    tophat -p 4 --no-coverage-search --color -o tophat/$s --quals $ref $f READS_DIR/${s}_QV.qual

NGS alignment efficiency is highly affected by adaptor / PCR primer contamination

In one of my recent projects, I’ve been analysing RNA-Seq libraries with high variability of alignment efficiency.

I decided to spend some time and try to find the reason why there is such high variability in reads that can be mapped. I have ran FastQC on all .fastq.gz files. I have looked at overall library quality and later focused on some specific measures.

  1. I haven’t found any clear association between number of warns/fails and alignment efficiency.
  2. Similarly, there is no association between alignment efficiency and any group of quality measures.
  3. But, libraries with the highest fraction of uniquely aligned reads tend to pass `Overrepresented sequences` filter.
  4. Finally, I’ve realised that alignment efficiency anti-correlates with adapter / PCR primer contamination levels
  5. Below, you can find some BASH code I’ve used.

    # run fastqc using 4 threads
    mkdir fastqc
    fastqc -t 4 -i *.fq.gz -o fastqc
    # get fraction of reads affected by all over-represented sequences
    for f in fastqc/*.fq_fastqc/fastqc_data.txt; do
      echo $f `grep -A100 ">>Overrepresented sequences" $f | \
       grep -m1 -B100 ">>END_MODULE" | awk '{sum+=$3} END {print sum}'`;
    # get fraction of reads affected by Adapter or PCR primers
    for f in fastqc/*.fq_fastqc/fastqc_data.txt; do
      echo $f `grep -A100 ">>Overrepresented sequences" $f | \
       grep -m1 -B100 ">>END_MODULE" | \
       grep -P "Adapter|PCR" | awk 'BEGIN {sum=0} {sum+=$3} END {print sum}'`;

Selecting stable genes from RNA-Seq experiment

A friend of mine was interested in selecting stable genes from RNA-Seq experiment, stable meaning genes with expression not affected by changing conditions.
I think the easiest solution is using partitioning function implemented in cummeRbund (K-means clustering):


Of course, it’s good to play with `k` value, so you indeed get some cluster with genes that are of interest.
Then, once you have some candidates, you may want to look for genes having similar expression patterns.


Fast interval selection in Python

Lately, I’ve been working on a program that computes depth of coverage from BAM file given some genome intervals i.e. genes or exons ( The performance of this task is heavily affected by the time needed for selecting interesting intervals.

At first, I have stored intervals as a list of tuples and I’ve used filter function to select intervals of interest, but the performance of this implementation wasn’t satisfactory (1m29s for test set).

# populate intervals
c2i[chrom] = [(start, end, strand, name), ]
# select intervals
ivals = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])

Later, I’ve used numpy.array, which turned out to be much faster (0m07s). Noteworthy, numpy implementation also uses less memory as object are store more efficiently in array (13 bytes per interval; 3x 4b for ‘uint32′ + 1b for ‘bool_’) than in list of tuples (88 bytes per interval; sys.getsizeof((100,1000,1,100))).

import numpy as np
dtype = np.dtype({'names':   ['start',  'end',    'strand', 'entry_id'], \
                  'formats': ['uint32', 'uint32', 'bool_', 'uint32']})
# populate intervals
chr2intervals[chrom] = np.array(data, dtype=dtype)
# select intervals
ivals = c2i[chrom][np.any([np.all([ c2i[chrom]['start']>=s, c2i[chrom]['start']<=e ], axis=0),
                           np.all([ c2i[chrom]['end']  >=s, c2i[chrom]['end']  <=e ], axis=0),
                           np.all([ c2i[chrom]['start']< s, c2i[chrom]['end']  > e ], axis=0)], axis=0)]

Benchmarking code:

import numpy as np
from datetime import datetime
from bam2cov import load_intervals
# load data
chr2intervals, entries = load_intervals('stranded/DANRE.gtf', 0)
# define chromosome name and length
chrom, chrlen = '1', 60348388
# test
sample = random.sample(xrange(0, chrlen), 10000)
counts = np.zeros(entries, dtype='uint')
t0 =
for ii,s in enumerate(sample, 1):
  if not ii%1000:
    print ii
  e = s+100
  # filter from list - uncomment if needed
  #selected = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])
  # select from numpy.array
  selected = c2i[chrom][/c][np.any([np.all([s<c2i[chrom][/c]['start'], c2i[chrom][/c]['start']<e], axis=0), np.all([s<c2i[chrom][/c]['stop'], c2i[chrom][/c]['stop']<e], axis=0)], axis=0)]
  for s, e, strand, i in selected:
    counts[i] += 1
print dt, counts.sum()

Integrate bigWig files into GBrowse

Lately, I have waisted quite a lot of time trying to integrate bigWig files into GBrowse.

At first, I had problems with installation of Bio::DB::BigWig in Ubuntu 14.04 x64. I had to combine tips from several places (this was the most helpful):

tar xfz userApps.src.tgz
cd userApps/kent/src/

# in inc/ replace `CFLAGS=` line with `CFLAGS=-fPIC` 

cd lib/
cd ..

sudo -i
export MACHTYPE=x86_64
export KENT_SRC=`pwd`
cpan Bio::DB::BigWig

Then, I have spent quite a lot of time trying to incorporate multiple bigWig files at once. This can accomplished with Bio::DB::BigWigSet. Just add this to your genome.conf:

db_adaptor      = Bio::DB::BigWigSet
db_args         = -dir /path/to/your/bigWig/files
                  -feature_type summary
database        = RibosomalRNASeq
feature         = summary
glyph           = wiggle_density
height          = 15
category        = Ribosomal RNA-Seq

Note, your bigWig files should be placed in one directory and end with .bw.
Additionally, you can create file that will define additional information about your samples.

BAM2bigWig conversion using pybedtools

Today I was looking for a handy way of converting BAM to bigWig. Biostars turned out to be handy place for a start. Guess what, there is ready module implemented in pybedtools that does exactly that:

from pybedtools.contrib.bigwig import bam_to_bigwig
bam_to_bigwig(bam='path/to/bam', genome='hg19', output='path/to/bigwig')

Unfortunately, my genome of interested in not hosted at UCSC, so I needed to alter bam_to_bigwig slightly. In addition, I’ve replaced the read counting function mapped_read_count with my own implementation that rely on BAM index and therefore return number of alignments nearly instantly. This reduces the time by some minutes for large BAMs.

My own implementation can be found github repo: -i SOME.BAM -g GENOME.fa -o