Identification of potential transcription factor binding sites (TFBS) across species

My colleague asked me for help with identification of targets for some transcription factors (TFs). The complication is that target motifs for these TFs are known in human, (A/G)GGTGT(C/G/T)(A/G), but exact binding motif is not known in the species of interest. Nevertheless, we decided to scan the genome for matches of this motifs. To facilitate that, I’ve written small program,, finding sequence motifs in the genome. The program employs regex to find matches in forward and reverse complement and reports bed-formatted output. -vcs -i DANRE.fa -r "[AG]GGTGT[CGT][AG]" > tf.bed 2> tf.log is quite fast, scanning 1.5G genome in ~1 minute on modern desktop. The program reports some basic stats ie. number of matches in +/- strand for each chromosome to stderr.
Most likely, you will find hundred thousands of putative TFBS. Therefore, it’s good to filter some of them ie. focusing on these in proximity of some genes of interest. This can be accomplished using combination of awk, bedtools and two other scripts: and

# crosslink with genes within 100 kb upstream of coding genes
awk '$3=="gene"' genome.gtf > gene.gtf
cat tf.bed | 100000 | bedtools intersect -s -loj -a - -b gene.gtf  | > tf.genes100k.bed

And this is how example output will look like:

1       68669   68677   GGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578  f7; f10; F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       71354   71362   aggtgtgg        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       76322   76330   AGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a

All above mentioned programs can be found in github.
If you want to learn more about regular expression, have a look at Python re module.

What fraction of the genome is expressed?

Recently, I was interested to find out what fraction of genome is expressed. This can be computed quite easily from RNA-Seq alignments using bedtools and some simple scripting. Note, it’s crucial to use -split parameter in bedtools genomecov to report correct coverage from split alignments. Otherwise, the introns that are spliced-out will be treated as expressed regions.

# get genomic regions with at least n reads aligned
for f in *.bam; do
  echo `date` $f;
  # get genomic intervals covered by n reads
  bedtools genomecov -split -ibam $f -g $ref.fa.fai -dz | awk -F'\t' 'BEGIN {OFS = FS} {if ($3>='$n') {print $1,$2,$2+1}}' | bedtools merge > $f.${n}reads.bed;
  # report bp covered by n reads
  awk '{sum+=$3-$2} END {print sum}' $f.${n}reads.bed
done; date

How stranded is stranded RNA-Seq protocol?

After identifying some antisense reads in stranded RNA-Seq experiments, I got interested how stranded is stranded RNA-Seq. Here, I’ll describe my methodology in more details.

  1. Quantify reads for all genes / exons
  2. Initially, I’ve been using bedtools2 for this task, but due to its limitations (bedtools coverage needed to be run twice for sense & antisense, while bedtools multicov was terribly slow for large .gtf files) I have written my own tool for coverage calculation ( &

    mkdir stranded
    for f in *.bam; do
      s=`echo $f | cut -f1 -d'.'`
      if [ ! -s stranded/$s.gtf.gz ]; then
        echo `date` $s -i $s.bam -b REF.gtf | gzip > stranded/$s.gtf.gz
    done; date
  3. Calculate sense-strand enrichment
  4. Here, I start from selecting interesting features from .gtf.gz and saving these in tmp.bed. Subsequently, I scan tmp.bed for overlapping features on antisense and print only those with no overlap (bedtools intersect -S -v -wa). Finally, sense-strand enrichment is calculated and reported.
    This is done for all features (`.`), genes (`gene`) and exons (`exon`).

    cd stranded
    for cat in . gene exon; do
      echo `date` category $cat;
      for f in *.gtf.gz; do
        # select interesting features and save to bed
        zgrep -P "\t${cat}\t" $f | awk -F'\t' -v OFS='\t' '{print $1,$4,$5,$10,$11,$7}'> tmp.bed;
        # select features with no overlap on antisense & calculate enrichment
        echo -e "${f}\t"`bedtools intersect -S -v -wa -a tmp.bed -b tmp.bed | \
          awk -F'\t' -v OFS='\t' '{sum1+=$4; sum2+=$5} END {print sum1,sum2,sum1/sum2}'`;
    rm tmp.bed; date

Summarising, I have found highly variable sense-strand read enrichment (6.39-27.05; median: 9.85) in our RNA-Seq stranded data (6 time points with 3 RNA fractions in colours). I have obtained slightly lower enrichment for genes (5.98-24.07; median: 8.74), probably due to some antisense genes / exons not annotated.
Overall, red RNA fraction tends to have higher sense-strand read enrichment that blue/orange. The question `why there is such high variability in enrichment between time points / RNA fractions` remains to be addressed…

Visit Biostars for discussion.

BAM2bigWig conversion using pybedtools

Today I was looking for a handy way of converting BAM to bigWig. Biostars turned out to be handy place for a start. Guess what, there is ready module implemented in pybedtools that does exactly that:

from pybedtools.contrib.bigwig import bam_to_bigwig
bam_to_bigwig(bam='path/to/bam', genome='hg19', output='path/to/bigwig')

Unfortunately, my genome of interested in not hosted at UCSC, so I needed to alter bam_to_bigwig slightly. In addition, I’ve replaced the read counting function mapped_read_count with my own implementation that rely on BAM index and therefore return number of alignments nearly instantly. This reduces the time by some minutes for large BAMs.

My own implementation can be found github repo: -i SOME.BAM -g GENOME.fa -o