Traditionally, in order to quantify transcript abundance from RNA-Seq, one has to first align the reads onto the reference and analyse these alignments. While widely accepted, this approach has several disadvantages:
alignment step is slow, especially in splice-aware mode
spliced alignments are error-prone
huge intermediate files are produced (.bam)
transcript quantification from these intermediate files is also slow
At the RECOMB2015, Rob Patro presented two algorithms enabling rapid transcript quantifications from RNA-Seq. Sailfish is alignment-free method, while its successor, Salmon, perform light-weight alignment, identifying just the super maximal exact matches (SMEMs).
On my data, Salmon is ~3 times faster (6-7min) and uses ~20 times less memory (1.1GB) than STAR (~20min / ~24GB). Note, STAR results (.bam) need to be further analysed (ie. cufflinks) in order to quantify transcripts abundances. This step takes another 10min, thus we get transcript abundances after 6-7 min from Salmon and 30min from STAR + cufflinks. Just for comparison, similar analysis done with tophat2 takes above 6 hours!
Install dependencies & salmon
# install dependencies
## here it's important to install boost1.55 & remove older version before
sudo apt-get remove libboost-all-dev
sudo apt-get install libbz2-dev libtbb-dev libboost1.55-all-dev
# clone salmon repo
git clone email@example.com:COMBINE-lab/salmon.git
# and build
cmake -DBOOST_ROOT=/usr/include/boost -DTBB_INSTALL_DIR=/usr/include/tbb
make && make install && make test
# add to .bashrc
echo "# salmon" >> ~/.bashrc
echo "export PATH=$PATH:"`pwd`"/bin" >> ~/.bashrc
# open new BASH window (Ctrl + Shift + T) or reload environmental variables
salmon index -t transcripts.fa -i transcripts.index
Alternative splicing produces an array of transcripts from individual gene. Some of these alternative transcripts are expressed in tissue-specific fashion and may play different roles in the cell.
Recently, I got interested in identifying isoform switching from RNA-Seq data, this is detecting changes in the most expressed isoform between conditions. As I couldn’t find any ready solution, I have written my own script for this task (fpkm2major.py). This script simply parse isoforms.fpkm_tracking output from cufflinks and report genes with evidence of isoform switching into tab-delimited output file.
For each gene, number of transcripts, cumulative expression, major isoforms for each condition and each major isoform are reported. Lowly expressed genes in given condition are coded with `0` (--minFPKM < 1), while genes without clear major isoform are marked with `-1` (--frac 0.25).
fpkm2major.py -v -i sample1/isoforms.fpkm_tracking -o major_isoforms.txt
# combine data from multiple samples
paste sample1/isoforms.fpkm_tracking <(cut -f10- sample2/isoforms.fpkm_tracking) \
| fpkm2major.py -v -o major_isoforms2.txt
# visualise using cummeRbund
cuff <- readCufflinks("sample1")
# expression plot for isoforms of given gene
dev.copy(svg, paste(geneid,".sample1.svg"), width=12, height=12); dev.off()
A friend of mine was interested in selecting stable genes from RNA-Seq experiment, stable meaning genes with expression not affected by changing conditions.
I think the easiest solution is using partitioning function implemented in cummeRbund (K-means clustering):
Of course, it’s good to play with `k` value, so you indeed get some cluster with genes that are of interest.
Then, once you have some candidates, you may want to look for genes having similar expression patterns.