Visualisation of repeats alongside NGS data

In one of the previous projects we were interested to look for association between deletions and repeats. At that time, I’ve written small program, repeats2sam.py, identifying perfect direct and inverted repeats in chromosomes. repeats2sam.py uses repeat-match from great MUMmer package and reports SAM-formatted repeats. This can be then easily converted to BAM. It takes several minutes for small genomes (20Mb) and several hours for large genomes (~8 hours for 1.5Gb).

# identify direct & inverted repeats
ref=human
repeats2sam.py -v --inverted -i $ref.fa | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats
samtools index $ref.repeats.bam

Repeats are stored as paired-end reads, so they can be easily visualised alongside any NGS data in IGV:

  • direct & inverted repeats
  • inverted repeats + RNAseq
  • inverted repeats + RNAseq zommed in

If you are interested in DNA direct repeats, either skip –inverted parameter, or select only direct repeats from your BAM file. In addition, you may want to select only repeats fulfilling certain length and distance criteria.

# select only inverted repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.inverted.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.inverted.${i}bp && samtools index $ref.repeats.inverted.${i}bp.bam
# select only direct repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.direct.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.direct.${i}bp && samtools index $ref.repeats.direct.${i}bp.bam

All above mentioned programs can be found in github.

Detecting isoform switching from RNA-Seq

Alternative splicing produces an array of transcripts from individual gene. Some of these alternative transcripts are expressed in tissue-specific fashion and may play different roles in the cell.

Recently, I got interested in identifying isoform switching from RNA-Seq data, this is detecting changes in the most expressed isoform between conditions. As I couldn’t find any ready solution, I have written my own script for this task (fpkm2major.py). This script simply parse isoforms.fpkm_tracking output from cufflinks and report genes with evidence of isoform switching into tab-delimited output file.

For each gene, number of transcripts, cumulative expression, major isoforms for each condition and each major isoform are reported. Lowly expressed genes in given condition are coded with `0` (--minFPKM < 1), while genes without clear major isoform are marked with `-1` (--frac 0.25).

fpkm2major.py -v -i sample1/isoforms.fpkm_tracking -o major_isoforms.txt
 
# combine data from multiple samples
paste sample1/isoforms.fpkm_tracking <(cut -f10- sample2/isoforms.fpkm_tracking) \
 | fpkm2major.py -v -o major_isoforms2.txt
 
# visualise using cummeRbund
R
library(cummeRbund)
cuff <- readCufflinks("sample1")
 
# expression plot for isoforms of given gene
geneid="ENSDARG00000013615"
gene<-getGene(cuff, geneid)
expressionPlot(isoforms(gene), showErrorbars=T)
dev.copy(svg, paste(geneid,".sample1.svg"), width=12, height=12); dev.off()

BAM2bigWig conversion using pybedtools

Today I was looking for a handy way of converting BAM to bigWig. Biostars turned out to be handy place for a start. Guess what, there is ready module implemented in pybedtools that does exactly that:

from pybedtools.contrib.bigwig import bam_to_bigwig
bam_to_bigwig(bam='path/to/bam', genome='hg19', output='path/to/bigwig')

Unfortunately, my genome of interested in not hosted at UCSC, so I needed to alter bam_to_bigwig slightly. In addition, I’ve replaced the read counting function mapped_read_count with my own implementation that rely on BAM index and therefore return number of alignments nearly instantly. This reduces the time by some minutes for large BAMs.

My own bam2bigwig.py implementation can be found github repo:

bam2bigwig.py -i SOME.BAM -g GENOME.fa -o SOME.BAM.bw