In one of my recent projects, I’ve been analysing RNA-Seq libraries whose alignment efficiency varies widely.
I decided to spend some time looking for the reason why the fraction of mappable reads varies so much. I ran FastQC on all .fastq.gz files, looked at overall library quality first, and then focused on some specific measures.
- I haven’t found any clear association between the number of warnings/failures and alignment efficiency.
- Similarly, there is no association between alignment efficiency and any group of quality measures.
- However, libraries with the highest fraction of uniquely aligned reads tend to pass the `Overrepresented sequences` filter.
- Finally, I’ve realised that alignment efficiency anti-correlates with adapter / PCR primer contamination levels.
Below is the Bash code I used.
    # run FastQC using 4 threads; --extract unpacks each report so fastqc_data.txt can be read
    mkdir -p fastqc
    fastqc -t 4 --extract -o fastqc *.fq.gz

    # get percentage of reads covered by all over-represented sequences
    # (column 3 of the module is a percentage; header lines add 0 to the sum)
    for f in fastqc/*_fastqc/fastqc_data.txt; do
        echo "$f" $(grep -A100 ">>Overrepresented sequences" "$f" | \
            grep -m1 -B100 ">>END_MODULE" | awk '{sum+=$3} END {print sum}')
    done

    # get percentage of reads covered by Adapter or PCR primer sequences
    for f in fastqc/*_fastqc/fastqc_data.txt; do
        echo "$f" $(grep -A100 ">>Overrepresented sequences" "$f" | \
            grep -m1 -B100 ">>END_MODULE" | \
            grep -E "Adapter|PCR" | awk 'BEGIN {sum=0} {sum+=$3} END {print sum}')
    done
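The `grep -A100` / `grep -B100` pair above only looks at the first 100 lines of the module, so a library with a very long list of over-represented sequences could be under-counted. Here is a minimal sketch of an alternative that lets `awk` track the module boundaries itself and reports both numbers per sample in one table. The function names `overrep_pct` and `fastqc_summary` are my own hypothetical helpers, not part of FastQC, and the sketch assumes the usual tab-separated `fastqc_data.txt` layout (sequence, count, percentage, possible source).

```shell
# Sum the Percentage column (3rd tab-separated field) of the
# "Overrepresented sequences" module in a fastqc_data.txt file.
# Usage: overrep_pct FILE [REGEX]     (hypothetical helper)
# The optional REGEX restricts the sum to rows whose "Possible Source"
# column (4th field) matches it; by default every data row is counted.
overrep_pct() {
    awk -F'\t' -v pat="${2:-.}" '
        /^>>Overrepresented sequences/ { inside = 1; next }  # module starts
        inside && /^>>END_MODULE/      { inside = 0 }        # module ends
        inside && !/^#/ && $4 ~ pat    { sum += $3 }         # data rows only
        END { printf "%.2f\n", sum + 0 }
    ' "$1"
}

# Print a tab-separated table with one row per extracted FastQC report.
# Usage: fastqc_summary DIR           (hypothetical helper)
fastqc_summary() {
    printf 'sample\tall_overrepresented_pct\tadapter_or_pcr_pct\n'
    for f in "$1"/*_fastqc/fastqc_data.txt; do
        [ -e "$f" ] || continue       # the glob matched nothing
        sample=$(basename "$(dirname "$f")" _fastqc)
        printf '%s\t%s\t%s\n' "$sample" \
            "$(overrep_pct "$f")" "$(overrep_pct "$f" 'Adapter|PCR')"
    done
}
```

Calling `fastqc_summary fastqc` after the FastQC run should print one line per library, which is easy to join with alignment statistics in a spreadsheet or R. The values are percentages of reads as reported by FastQC, not fractions between 0 and 1.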