How stranded is stranded RNA-Seq protocol?

After identifying some antisense reads in stranded RNA-Seq experiments, I got interested how stranded is stranded RNA-Seq. Here, I’ll describe my methodology in more details.

  1. Quantify reads for all genes / exons
  2. Initially, I’ve been using bedtools2 for this task, but due to its limitations (bedtools coverage needed to be run twice for sense & antisense, while bedtools multicov was terribly slow for large .gtf files) I have written my own tool for coverage calculation (bam2cov.py & bam2cov_pool.py).

    mkdir stranded
    for f in *.bam; do
      s=`echo $f | cut -f1 -d'.'`
      if [ ! -s stranded/$s.gtf.gz ]; then
        echo `date` $s
        bam2cov.py -i $s.bam -b REF.gtf | gzip > stranded/$s.gtf.gz
      fi
    done; date
    
  3. Calculate sense-strand enrichment
  4. Here, I start from selecting interesting features from .gtf.gz and saving these in tmp.bed. Subsequently, I scan tmp.bed for overlapping features on antisense and print only those with no overlap (bedtools intersect -S -v -wa). Finally, sense-strand enrichment is calculated and reported.
    This is done for all features (`.`), genes (`gene`) and exons (`exon`).

    cd stranded
    for cat in . gene exon; do
      echo `date` category $cat;
      for f in *.gtf.gz; do
        # select interesting features and save to bed
        zgrep -P "\t${cat}\t" $f | awk -F'\t' -v OFS='\t' '{print $1,$4,$5,$10,$11,$7}'> tmp.bed;
        # select features with no overlap on antisense & calculate enrichment
        echo -e "${f}\t"`bedtools intersect -S -v -wa -a tmp.bed -b tmp.bed | \
          awk -F'\t' -v OFS='\t' '{sum1+=$4; sum2+=$5} END {print sum1,sum2,sum1/sum2}'`;
      done;
    done;
    rm tmp.bed; date
    

Summarising, I have found highly variable sense-strand read enrichment (6.39-27.05; median: 9.85) in our RNA-Seq stranded data (6 time points with 3 RNA fractions in colours). I have obtained slightly lower enrichment for genes (5.98-24.07; median: 8.74), probably due to some antisense genes / exons not annotated.
Overall, red RNA fraction tends to have higher sense-strand read enrichment that blue/orange. The question `why there is such high variability in enrichment between time points / RNA fractions` remains to be addressed…

Visit Biostars for discussion.