Traditionally, in order to quantify transcript abundance from RNA-Seq, one has to first align the reads onto the reference and analyse these alignments. While widely accepted, this approach has several disadvantages:
- alignment step is slow, especially in splice-aware mode
- spliced alignments are error-prone
- huge intermediate files are produced (.bam)
- transcript quantification from these intermediate files is also slow
At RECOMB 2015, Rob Patro presented two algorithms enabling rapid transcript quantification from RNA-Seq. Sailfish is an alignment-free method, while its successor, Salmon, performs lightweight alignment, identifying just the super maximal exact matches (SMEMs).
On my data, Salmon is ~3 times faster (6-7 min) and uses ~20 times less memory (1.1 GB) than STAR (~20 min / ~24 GB). Note that the STAR results (.bam) need to be analysed further (e.g. with cufflinks) in order to quantify transcript abundances. This step takes another 10 min, so we get transcript abundances after 6-7 min from Salmon and after 30 min from STAR + cufflinks. Just for comparison, a similar analysis done with tophat2 takes over 6 hours!
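To put the timings above side by side, here is a small sketch that computes the ratios. The `ratio` helper is a hypothetical name introduced only for this illustration, and 6.5 min is my assumed midpoint of the 6-7 min range; all other figures are the ones quoted above.

```shell
#!/usr/bin/env bash
# ratio A B: print A/B rounded to one decimal place (hypothetical helper)
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", a / b }'; }

salmon_min=6.5     # assumed midpoint of the quoted 6-7 min
star_min=20        # STAR alignment only
cufflinks_min=10   # quantification from the .bam
salmon_gb=1.1
star_gb=24

echo "vs STAR alone:       $(ratio "$star_min" "$salmon_min")x faster"
echo "vs STAR + cufflinks: $(ratio "$((star_min + cufflinks_min))" "$salmon_min")x faster"
echo "memory:              $(ratio "$star_gb" "$salmon_gb")x less"
```

So once the cufflinks step is counted, the gap in time to abundances is closer to 4-5 times than to 3 times.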
- Install dependencies & salmon
- Index transcriptome
- Quantify transcript abundances
# install dependencies
## it's important to remove any older boost version first, then install boost 1.55
sudo apt-get remove libboost-all-dev
sudo apt-get install libbz2-dev libtbb-dev libboost1.55-all-dev
# clone salmon repo
git clone git@github.com:COMBINE-lab/salmon.git
# and build (in the source directory)
cd salmon
cmake -DBOOST_ROOT=/usr/include/boost -DTBB_INSTALL_DIR=/usr/include/tbb .
make && make install && make test
# add to .bashrc (\$PATH is escaped so it is expanded at login, not now)
echo "# salmon" >> ~/.bashrc
echo "export PATH=\$PATH:$(pwd)/bin" >> ~/.bashrc
# open new BASH window (Ctrl + Shift + T) or reload environment variables
source ~/.bashrc
salmon index -t transcripts.fa -i transcripts.index
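You have to specify the library type correctly (-l). For example, SF in the command below denotes stranded single-end reads originating from the forward strand.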
salmon quant -p 4 -l SF -i ref/transcripts.index -r <(zcat sample1.fq.gz) -o sample1
Note that here the reads are decompressed using process substitution. This is a very handy way of providing preprocessed data as input through Unix pipes, without writing the decompressed file to disk.
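If process substitution is new to you, here is a minimal sketch that you can run without salmon. It assumes bash (plain sh does not support `<(...)`), and the file name `demo.fq.gz` and the two reads are made up for the illustration. I use `gzip -dc`, which does the same job as zcat.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a tiny gzipped FASTQ with two made-up reads in a temporary directory.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$tmp/demo.fq.gz"

# bash replaces <(...) with a path such as /dev/fd/63, so any program that
# expects a file name can read the decompressed stream directly.
echo "awk sees the stream as: " <(true)

# Count the lines in the decompressed stream; FASTQ uses four lines per read.
lines=$(awk 'END { print NR }' <(gzip -dc "$tmp/demo.fq.gz"))
reads=$((lines / 4))
echo "reads: $reads"

rm -rf "$tmp"
```

The same trick works for any preprocessing step that writes to stdout, as long as the receiving program reads its input sequentially.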