Detection of RNA editing

I’ve been asked to help analysing data from RNA editing experiment. Typically, such experiments consists of DNA and RNA sequencing. The purpose of these is to identify the differences between DNA template and resulting RNA product.
The canonical RNA editing enzyme, ADAR, deaminase adenine to inosine (A-to-I editing). Inosine is subsequently detected as guanine by the sequencer. Thus the identification of RNA editing can be simplified down to genotyping of DNA and RNA samples and finding differences between these two.
Importantly, RNA editing can be promiscuous, meaning multiple As are changed into Is in given region. This makes an alignment challenging. In addition, you may ignore most of edited sites due to quality filtering (many SNP clustered together), if you genotype your libraries with standard tools/pipelines ie. GATK. Finally, RNAseq is more error-prone than DNAseq, due to low fidelity of reverse-transcriptase, so you may want to account for that as well.
Ideally, you want stranded RNAseq, then only A>G editing will be observed. But in reality, you may analyse also unstranded RNAseq libraries, then you expect both A>G and T>C edited sites, from transcripts of genes on Watson (+) and Crick (-) strand, respectively.

DNA/RNAseq alignment

For the alignment part, I’ve been using BWA MEM (DNAseq) and STAR (RNAseq). If you use STAR, you should reassign mapping qualities (mapQ). It’s also recommended to split reads at N CIGAR. These two can be accomplished with GATK:

java -jar ~/src/GATK/GenomeAnalysisTK.jar -T SplitNCigarReads -R REF.fa -I SAMPLE.bam -o SAMPLE.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS > SAMPLE.split.log 2>&1 &

Finally, you may want to realign both DNAseq and RNAseq around indels using GATK IndelRealigner.

RNA editing identification from BAM files

I’ve written program, (currently REDiscover), to detect RNA editing in a set of DNAseq and RNAseq experiments. It’s in early developmental phase, but it’s functional already. It can be run simply by: -d bwamem/DNAseq*.bam -r star/RNAseq*.bam > bam2RNAediting.txt

Quality control

Simple quality control for RNA editing experiment could be looking at the enrichment of A>G (and T>C) edited sites over other changes. If RNA editing is present in your sample you would expect to find quite strong enrichment of A>G (and T>C) changes between DNAseq and RNAseq. This is indeed what I’ve found in my samples.

Let me know how it works for you. I’m looking for your feedback!

What fraction of the genome is expressed?

Recently, I was interested to find out what fraction of genome is expressed. This can be computed quite easily from RNA-Seq alignments using bedtools and some simple scripting. Note, it’s crucial to use -split parameter in bedtools genomecov to report correct coverage from split alignments. Otherwise, the introns that are spliced-out will be treated as expressed regions.

# get genomic regions with at least n reads aligned
for f in *.bam; do
  echo `date` $f;
  # get genomic intervals covered by n reads
  bedtools genomecov -split -ibam $f -g $ref.fa.fai -dz | awk -F'\t' 'BEGIN {OFS = FS} {if ($3>='$n') {print $1,$2,$2+1}}' | bedtools merge > $f.${n}reads.bed;
  # report bp covered by n reads
  awk '{sum+=$3-$2} END {print sum}' $f.${n}reads.bed
done; date

Wicked-fast transcript quantification with Salmon

Traditionally, in order to quantify transcript abundance from RNA-Seq, one has to first align the reads onto the reference and analyse these alignments. While widely accepted, this approach has several disadvantages:

  • alignment step is slow, especially in splice-aware mode
  • spliced alignments are error-prone
  • huge intermediate files are produced (.bam)
  • transcript quantification from these intermediate files is also slow

At the RECOMB2015, Rob Patro presented two algorithms enabling rapid transcript quantifications from RNA-Seq. Sailfish is alignment-free method, while its successor, Salmon, perform light-weight alignment, identifying just the super maximal exact matches (SMEMs).

On my data, Salmon is ~3 times faster (6-7min) and uses ~20 times less memory (1.1GB) than STAR (~20min / ~24GB). Note, STAR results (.bam) need to be further analysed (ie. cufflinks) in order to quantify transcripts abundances. This step takes another 10min, thus we get transcript abundances after 6-7 min from Salmon and 30min from STAR + cufflinks. Just for comparison, similar analysis done with tophat2 takes above 6 hours!

  1. Install dependencies & salmon
  2. # install dependencies
    ## here it's important to install boost1.55 & remove older version before
    sudo apt-get remove libboost-all-dev
    sudo apt-get install libbz2-dev libtbb-dev libboost1.55-all-dev
    # clone salmon repo
    git clone
    # and build
    cd salmon
    cmake -DBOOST_ROOT=/usr/include/boost -DTBB_INSTALL_DIR=/usr/include/tbb
    make && make install && make test
    # add to .bashrc
    echo "# salmon" >> ~/.bashrc
    echo "export PATH=$PATH:"`pwd`"/bin" >> ~/.bashrc
    # open new BASH window (Ctrl + Shift + T) or reload environmental variables
    source ~/.bashrc
  3. Index transcriptome
  4. salmon index -t transcripts.fa -i transcripts.index
  5. Quantify transcript abundances
  6. You have to specify library correctly (-l).

    salmon quant -p 4 -l SF -i ref/transcripts.index -r <(zcat sample1.fq.gz) -o sample1

Note, here the reads are decompressed using process substitution – this is very handy way of providing the preprocess data as input through Unix pipes.

Detecting isoform switching from RNA-Seq

Alternative splicing produces an array of transcripts from individual gene. Some of these alternative transcripts are expressed in tissue-specific fashion and may play different roles in the cell.

Recently, I got interested in identifying isoform switching from RNA-Seq data, this is detecting changes in the most expressed isoform between conditions. As I couldn’t find any ready solution, I have written my own script for this task ( This script simply parse isoforms.fpkm_tracking output from cufflinks and report genes with evidence of isoform switching into tab-delimited output file.

For each gene, number of transcripts, cumulative expression, major isoforms for each condition and each major isoform are reported. Lowly expressed genes in given condition are coded with `0` (--minFPKM < 1), while genes without clear major isoform are marked with `-1` (--frac 0.25). -v -i sample1/isoforms.fpkm_tracking -o major_isoforms.txt
# combine data from multiple samples
paste sample1/isoforms.fpkm_tracking <(cut -f10- sample2/isoforms.fpkm_tracking) \
 | -v -o major_isoforms2.txt
# visualise using cummeRbund
cuff <- readCufflinks("sample1")
# expression plot for isoforms of given gene
gene<-getGene(cuff, geneid)
expressionPlot(isoforms(gene), showErrorbars=T)
dev.copy(svg, paste(geneid,".sample1.svg"), width=12, height=12);

SOLiD RNA-Seq & splice-aware mapping

I’ve lost quite a lot of time trying to align color-space RNA-Seq reads. SHRiMP paper explains nicely, why it’s important to align SOLiD reads in color-space, instead of converting color-space directly into sequence-space. Below, you can find the simplest solution I have found, using tophat, relying on bowtie mapper (bowtie2 doesn’t support color-space) and color-space reads in .csfasta.

# generate genome index in color-space
bowtie-build --color GENOME.fa GENOME
# get SOLiD reads from SRA if you don't have them already in .csfasta
abi-dump SRR062662
# tophat splice-aware mapping in color-space
mkdir tophat
for f in READS_DIR/*.csfasta; do
  s=`echo $f | cut -f2 -d'/' | cut -f1 -d'.'`
  if [ ! -d tophat/$s ]; then
    echo `date` $f $s
    tophat -p 4 --no-coverage-search --color -o tophat/$s --quals $ref $f READS_DIR/${s}_QV.qual

NGS alignment efficiency is highly affected by adaptor / PCR primer contamination

In one of my recent projects, I’ve been analysing RNA-Seq libraries with high variability of alignment efficiency.

I decided to spend some time and try to find the reason why there is such high variability in reads that can be mapped. I have ran FastQC on all .fastq.gz files. I have looked at overall library quality and later focused on some specific measures.

  1. I haven’t found any clear association between number of warns/fails and alignment efficiency.
  2. Similarly, there is no association between alignment efficiency and any group of quality measures.
  3. But, libraries with the highest fraction of uniquely aligned reads tend to pass `Overrepresented sequences` filter.
  4. Finally, I’ve realised that alignment efficiency anti-correlates with adapter / PCR primer contamination levels
  5. Below, you can find some BASH code I’ve used.

    # run fastqc using 4 threads
    mkdir fastqc
    fastqc -t 4 -i *.fq.gz -o fastqc
    # get fraction of reads affected by all over-represented sequences
    for f in fastqc/*.fq_fastqc/fastqc_data.txt; do
      echo $f `grep -A100 ">>Overrepresented sequences" $f | \
       grep -m1 -B100 ">>END_MODULE" | awk '{sum+=$3} END {print sum}'`;
    # get fraction of reads affected by Adapter or PCR primers
    for f in fastqc/*.fq_fastqc/fastqc_data.txt; do
      echo $f `grep -A100 ">>Overrepresented sequences" $f | \
       grep -m1 -B100 ">>END_MODULE" | \
       grep -P "Adapter|PCR" | awk 'BEGIN {sum=0} {sum+=$3} END {print sum}'`;

How stranded is stranded RNA-Seq protocol?

After identifying some antisense reads in stranded RNA-Seq experiments, I got interested how stranded is stranded RNA-Seq. Here, I’ll describe my methodology in more details.

  1. Quantify reads for all genes / exons
  2. Initially, I’ve been using bedtools2 for this task, but due to its limitations (bedtools coverage needed to be run twice for sense & antisense, while bedtools multicov was terribly slow for large .gtf files) I have written my own tool for coverage calculation ( &

    mkdir stranded
    for f in *.bam; do
      s=`echo $f | cut -f1 -d'.'`
      if [ ! -s stranded/$s.gtf.gz ]; then
        echo `date` $s -i $s.bam -b REF.gtf | gzip > stranded/$s.gtf.gz
    done; date
  3. Calculate sense-strand enrichment
  4. Here, I start from selecting interesting features from .gtf.gz and saving these in tmp.bed. Subsequently, I scan tmp.bed for overlapping features on antisense and print only those with no overlap (bedtools intersect -S -v -wa). Finally, sense-strand enrichment is calculated and reported.
    This is done for all features (`.`), genes (`gene`) and exons (`exon`).

    cd stranded
    for cat in . gene exon; do
      echo `date` category $cat;
      for f in *.gtf.gz; do
        # select interesting features and save to bed
        zgrep -P "\t${cat}\t" $f | awk -F'\t' -v OFS='\t' '{print $1,$4,$5,$10,$11,$7}'> tmp.bed;
        # select features with no overlap on antisense & calculate enrichment
        echo -e "${f}\t"`bedtools intersect -S -v -wa -a tmp.bed -b tmp.bed | \
          awk -F'\t' -v OFS='\t' '{sum1+=$4; sum2+=$5} END {print sum1,sum2,sum1/sum2}'`;
    rm tmp.bed; date

Summarising, I have found highly variable sense-strand read enrichment (6.39-27.05; median: 9.85) in our RNA-Seq stranded data (6 time points with 3 RNA fractions in colours). I have obtained slightly lower enrichment for genes (5.98-24.07; median: 8.74), probably due to some antisense genes / exons not annotated.
Overall, red RNA fraction tends to have higher sense-strand read enrichment that blue/orange. The question `why there is such high variability in enrichment between time points / RNA fractions` remains to be addressed…

Visit Biostars for discussion.

Selecting stable genes from RNA-Seq experiment

A friend of mine was interested in selecting stable genes from RNA-Seq experiment, stable meaning genes with expression not affected by changing conditions.
I think the easiest solution is using partitioning function implemented in cummeRbund (K-means clustering):


Of course, it’s good to play with `k` value, so you indeed get some cluster with genes that are of interest.
Then, once you have some candidates, you may want to look for genes having similar expression patterns.