After identifying some antisense reads in stranded RNA-Seq experiments, I got interested how stranded is stranded RNA-Seq. Here, I’ll describe my methodology in more details.
- Quantify reads for all genes / exons
- Calculate sense-strand enrichment
Initially, I’ve been using bedtools2 for this task, but due to its limitations (bedtools coverage
needed to be run twice for sense & antisense, while bedtools multicov
was terribly slow for large .gtf
files) I have written my own tool for coverage calculation (bam2cov.py & bam2cov_pool.py).
mkdir stranded for f in *.bam; do s=`echo $f | cut -f1 -d'.'` if [ ! -s stranded/$s.gtf.gz ]; then echo `date` $s bam2cov.py -i $s.bam -b REF.gtf | gzip > stranded/$s.gtf.gz fi done; date
Here, I start from selecting interesting features from .gtf.gz
and saving these in tmp.bed
. Subsequently, I scan tmp.bed for overlapping features on antisense and print only those with no overlap (bedtools intersect -S -v -wa
). Finally, sense-strand enrichment is calculated and reported.
This is done for all features (`.`), genes (`gene`) and exons (`exon`).
cd stranded for cat in . gene exon; do echo `date` category $cat; for f in *.gtf.gz; do # select interesting features and save to bed zgrep -P "\t${cat}\t" $f | awk -F'\t' -v OFS='\t' '{print $1,$4,$5,$10,$11,$7}'> tmp.bed; # select features with no overlap on antisense & calculate enrichment echo -e "${f}\t"`bedtools intersect -S -v -wa -a tmp.bed -b tmp.bed | \ awk -F'\t' -v OFS='\t' '{sum1+=$4; sum2+=$5} END {print sum1,sum2,sum1/sum2}'`; done; done; rm tmp.bed; date
Summarising, I have found highly variable sense-strand read enrichment (6.39-27.05; median: 9.85) in our RNA-Seq stranded data (6 time points with 3 RNA fractions in colours). I have obtained slightly lower enrichment for genes (5.98-24.07; median: 8.74), probably due to some antisense genes / exons not annotated.
Overall, red RNA fraction tends to have higher sense-strand read enrichment that blue/orange. The question `why there is such high variability in enrichment between time points / RNA fractions` remains to be addressed…
Visit Biostars for discussion.