What fraction of the genome is expressed?

Recently, I was interested to find out what fraction of genome is expressed. This can be computed quite easily from RNA-Seq alignments using bedtools and some simple scripting. Note, it’s crucial to use -split parameter in bedtools genomecov to report correct coverage from split alignments. Otherwise, the introns that are spliced-out will be treated as expressed regions.

# get genomic regions with at least n reads aligned
for f in *.bam; do
  echo `date` $f;
  # get genomic intervals covered by n reads
  bedtools genomecov -split -ibam $f -g $ref.fa.fai -dz | awk -F'\t' 'BEGIN {OFS = FS} {if ($3>='$n') {print $1,$2,$2+1}}' | bedtools merge > $f.${n}reads.bed;
  # report bp covered by n reads
  awk '{sum+=$3-$2} END {print sum}' $f.${n}reads.bed
done; date

Visualisation of repeats alongside NGS data

In one of the previous projects we were interested to look for association between deletions and repeats. At that time, I’ve written small program, repeats2sam.py, identifying perfect direct and inverted repeats in chromosomes. repeats2sam.py uses repeat-match from great MUMmer package and reports SAM-formatted repeats. This can be then easily converted to BAM. It takes several minutes for small genomes (20Mb) and several hours for large genomes (~8 hours for 1.5Gb).

# identify direct & inverted repeats
repeats2sam.py -v --inverted -i $ref.fa | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats
samtools index $ref.repeats.bam

Repeats are stored as paired-end reads, so they can be easily visualised alongside any NGS data in IGV:

  • direct & inverted repeats
  • inverted repeats + RNAseq
  • inverted repeats + RNAseq zommed in

If you are interested in DNA direct repeats, either skip –inverted parameter, or select only direct repeats from your BAM file. In addition, you may want to select only repeats fulfilling certain length and distance criteria.

# select only inverted repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.inverted.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.inverted.${i}bp && samtools index $ref.repeats.inverted.${i}bp.bam
# select only direct repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.direct.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.direct.${i}bp && samtools index $ref.repeats.direct.${i}bp.bam

All above mentioned programs can be found in github.

How stranded is stranded RNA-Seq protocol?

After identifying some antisense reads in stranded RNA-Seq experiments, I got interested how stranded is stranded RNA-Seq. Here, I’ll describe my methodology in more details.

  1. Quantify reads for all genes / exons
  2. Initially, I’ve been using bedtools2 for this task, but due to its limitations (bedtools coverage needed to be run twice for sense & antisense, while bedtools multicov was terribly slow for large .gtf files) I have written my own tool for coverage calculation (bam2cov.py & bam2cov_pool.py).

    mkdir stranded
    for f in *.bam; do
      s=`echo $f | cut -f1 -d'.'`
      if [ ! -s stranded/$s.gtf.gz ]; then
        echo `date` $s
        bam2cov.py -i $s.bam -b REF.gtf | gzip > stranded/$s.gtf.gz
    done; date
  3. Calculate sense-strand enrichment
  4. Here, I start from selecting interesting features from .gtf.gz and saving these in tmp.bed. Subsequently, I scan tmp.bed for overlapping features on antisense and print only those with no overlap (bedtools intersect -S -v -wa). Finally, sense-strand enrichment is calculated and reported.
    This is done for all features (`.`), genes (`gene`) and exons (`exon`).

    cd stranded
    for cat in . gene exon; do
      echo `date` category $cat;
      for f in *.gtf.gz; do
        # select interesting features and save to bed
        zgrep -P "\t${cat}\t" $f | awk -F'\t' -v OFS='\t' '{print $1,$4,$5,$10,$11,$7}'> tmp.bed;
        # select features with no overlap on antisense & calculate enrichment
        echo -e "${f}\t"`bedtools intersect -S -v -wa -a tmp.bed -b tmp.bed | \
          awk -F'\t' -v OFS='\t' '{sum1+=$4; sum2+=$5} END {print sum1,sum2,sum1/sum2}'`;
    rm tmp.bed; date

Summarising, I have found highly variable sense-strand read enrichment (6.39-27.05; median: 9.85) in our RNA-Seq stranded data (6 time points with 3 RNA fractions in colours). I have obtained slightly lower enrichment for genes (5.98-24.07; median: 8.74), probably due to some antisense genes / exons not annotated.
Overall, red RNA fraction tends to have higher sense-strand read enrichment that blue/orange. The question `why there is such high variability in enrichment between time points / RNA fractions` remains to be addressed…

Visit Biostars for discussion.