What fraction of the genome is expressed?

Recently, I was interested to find out what fraction of genome is expressed. This can be computed quite easily from RNA-Seq alignments using bedtools and some simple scripting. Note, it’s crucial to use -split parameter in bedtools genomecov to report correct coverage from split alignments. Otherwise, the introns that are spliced-out will be treated as expressed regions.

# get genomic regions with at least n reads aligned
n=3
ref=genome
for f in *.bam; do
  echo `date` $f;
  # get genomic intervals covered by n reads
  bedtools genomecov -split -ibam $f -g $ref.fa.fai -dz | awk -F'\t' 'BEGIN {OFS = FS} {if ($3>='$n') {print $1,$2,$2+1}}' | bedtools merge > $f.${n}reads.bed;
  # report bp covered by n reads
  awk '{sum+=$3-$2} END {print sum}' $f.${n}reads.bed
done; date

How stranded is stranded RNA-Seq protocol?

After identifying some antisense reads in stranded RNA-Seq experiments, I got interested how stranded is stranded RNA-Seq. Here, I’ll describe my methodology in more details.

  1. Quantify reads for all genes / exons
  2. Initially, I’ve been using bedtools2 for this task, but due to its limitations (bedtools coverage needed to be run twice for sense & antisense, while bedtools multicov was terribly slow for large .gtf files) I have written my own tool for coverage calculation (bam2cov.py & bam2cov_pool.py).

    mkdir stranded
    for f in *.bam; do
      s=`echo $f | cut -f1 -d'.'`
      if [ ! -s stranded/$s.gtf.gz ]; then
        echo `date` $s
        bam2cov.py -i $s.bam -b REF.gtf | gzip > stranded/$s.gtf.gz
      fi
    done; date
    
  3. Calculate sense-strand enrichment
  4. Here, I start from selecting interesting features from .gtf.gz and saving these in tmp.bed. Subsequently, I scan tmp.bed for overlapping features on antisense and print only those with no overlap (bedtools intersect -S -v -wa). Finally, sense-strand enrichment is calculated and reported.
    This is done for all features (`.`), genes (`gene`) and exons (`exon`).

    cd stranded
    for cat in . gene exon; do
      echo `date` category $cat;
      for f in *.gtf.gz; do
        # select interesting features and save to bed
        zgrep -P "\t${cat}\t" $f | awk -F'\t' -v OFS='\t' '{print $1,$4,$5,$10,$11,$7}'> tmp.bed;
        # select features with no overlap on antisense & calculate enrichment
        echo -e "${f}\t"`bedtools intersect -S -v -wa -a tmp.bed -b tmp.bed | \
          awk -F'\t' -v OFS='\t' '{sum1+=$4; sum2+=$5} END {print sum1,sum2,sum1/sum2}'`;
      done;
    done;
    rm tmp.bed; date
    

Summarising, I have found highly variable sense-strand read enrichment (6.39-27.05; median: 9.85) in our RNA-Seq stranded data (6 time points with 3 RNA fractions in colours). I have obtained slightly lower enrichment for genes (5.98-24.07; median: 8.74), probably due to some antisense genes / exons not annotated.
Overall, red RNA fraction tends to have higher sense-strand read enrichment that blue/orange. The question `why there is such high variability in enrichment between time points / RNA fractions` remains to be addressed…

Visit Biostars for discussion.