Visualisation of repeats alongside NGS data

In one of the previous projects we were interested to look for association between deletions and repeats. At that time, I’ve written small program, repeats2sam.py, identifying perfect direct and inverted repeats in chromosomes. repeats2sam.py uses repeat-match from great MUMmer package and reports SAM-formatted repeats. This can be then easily converted to BAM. It takes several minutes for small genomes (20Mb) and several hours for large genomes (~8 hours for 1.5Gb).

# identify direct & inverted repeats
ref=human
repeats2sam.py -v --inverted -i $ref.fa | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats
samtools index $ref.repeats.bam

Repeats are stored as paired-end reads, so they can be easily visualised alongside any NGS data in IGV:

  • direct & inverted repeats
  • inverted repeats + RNAseq
  • inverted repeats + RNAseq zommed in

If you are interested in DNA direct repeats, either skip –inverted parameter, or select only direct repeats from your BAM file. In addition, you may want to select only repeats fulfilling certain length and distance criteria.

# select only inverted repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.inverted.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.inverted.${i}bp && samtools index $ref.repeats.inverted.${i}bp.bam
# select only direct repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.direct.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.direct.${i}bp && samtools index $ref.repeats.direct.${i}bp.bam

All above mentioned programs can be found in github.

Leave a Reply

Your email address will not be published. Required fields are marked *