In one of the previous projects we were interested to look for association between deletions and repeats. At that time, I’ve written small program, repeats2sam.py, identifying perfect direct and inverted repeats in chromosomes. repeats2sam.py uses repeat-match from great MUMmer package and reports SAM-formatted repeats. This can be then easily converted to BAM. It takes several minutes for small genomes (20Mb) and several hours for large genomes (~8 hours for 1.5Gb).
# identify direct & inverted repeats ref=human repeats2sam.py -v --inverted -i $ref.fa | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats samtools index $ref.repeats.bam
Repeats are stored as paired-end reads, so they can be easily visualised alongside any NGS data in IGV:
- direct & inverted repeats
- inverted repeats + RNAseq
- inverted repeats + RNAseq zommed in
If you are interested in DNA direct repeats, either skip –inverted parameter, or select only direct repeats from your BAM file. In addition, you may want to select only repeats fulfilling certain length and distance criteria.
# select only inverted repeats; at least 21bp long and distant by less than 5000bp i=21; samtools view $ref.repeats.inverted.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.inverted.${i}bp && samtools index $ref.repeats.inverted.${i}bp.bam # select only direct repeats; at least 21bp long and distant by less than 5000bp i=21; samtools view $ref.repeats.direct.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.direct.${i}bp && samtools index $ref.repeats.direct.${i}bp.bam
All above mentioned programs can be found in github.