NGS alignment efficiency is highly affected by adapter / PCR primer contamination

In one of my recent projects, I’ve been analysing RNA-Seq libraries with high variability of alignment efficiency.

I decided to spend some time finding out why the fraction of mappable reads varies so much. I ran FastQC on all .fastq.gz files, looked at overall library quality first, and then focused on some specific measures.

  1. I haven’t found any clear association between the number of warns/fails and alignment efficiency.
  2. Similarly, there is no association between alignment efficiency and any group of quality measures.
  3. However, libraries with the highest fraction of uniquely aligned reads tend to pass the `Overrepresented sequences` check.
  4. Finally, I’ve realised that alignment efficiency anti-correlates with adapter / PCR primer contamination levels.

Below, you can find the Bash code I’ve used.

    # run FastQC using 4 threads; --extract unpacks each report
    # so that fastqc_data.txt can be read directly
    mkdir -p fastqc
    fastqc -t 4 --extract -o fastqc *.fq.gz
    
    # get percentage of reads affected by all over-represented sequences
    # (column 3 of the module is the Percentage column)
    for f in fastqc/*.fq_fastqc/fastqc_data.txt; do
      echo "$f" "$(grep -A100 ">>Overrepresented sequences" "$f" | \
       grep -m1 -B100 ">>END_MODULE" | awk '{sum+=$3} END {print sum}')"
    done
    
    # get percentage of reads affected by Adapter or PCR primers
    for f in fastqc/*.fq_fastqc/fastqc_data.txt; do
      echo "$f" "$(grep -A100 ">>Overrepresented sequences" "$f" | \
       grep -m1 -B100 ">>END_MODULE" | \
       grep -E "Adapter|PCR" | awk 'BEGIN {sum=0} {sum+=$3} END {print sum}')"
    done
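
The `grep -A100` window above only covers the first 100 lines after the module header, so a library with a very long list of over-represented sequences could be undercounted. A single-pass `awk` variant avoids that limit. This is only a sketch: `adapter_pct` is a hypothetical helper name of mine, and it assumes the module rows in `fastqc_data.txt` are tab-separated as Sequence, Count, Percentage, Possible Source.

```shell
# Hypothetical helper (not part of FastQC): sum the Percentage column of the
# "Overrepresented sequences" module for rows whose Possible Source mentions
# an adapter or a PCR primer. Assumes tab-separated columns.
adapter_pct() {
  awk -F'\t' '
    /^>>Overrepresented sequences/ { in_module = 1; next }  # module starts
    in_module && /^>>END_MODULE/   { in_module = 0 }        # module ends
    in_module && !/^#/ && $4 ~ /Adapter|PCR/ { sum += $3 }  # matching rows only
    END { printf "%.2f\n", sum }                            # 0.00 if no hits
  ' "$1"
}
```

It can be dropped into the same loop as before, e.g. `for f in fastqc/*.fq_fastqc/fastqc_data.txt; do echo "$f" "$(adapter_pct "$f")"; done`.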
    
