My colleague asked me for help identifying targets of some transcription factors (TFs). The complication is that the target motif for these TFs is known in human, (A/G)GGTGT(C/G/T)(A/G), but the exact binding motif is not known in the species of interest. Nevertheless, we decided to scan the genome for matches of this motif. To facilitate that, I’ve written a small program, regex2bed.py, that finds sequence motifs in the genome. The program uses regular expressions to find matches on the forward and reverse-complement strands and reports BED-formatted output.
regex2bed.py -vcs -i DANRE.fa -r "[AG]GGTGT[CGT][AG]" > tf.bed 2> tf.log
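The motif is written above as (A/G)GGTGT(C/G/T)(A/G), while the command line takes the equivalent character-class regex. A minimal sketch of that conversion follows; motif_to_regex is a hypothetical helper for illustration and is not part of regex2bed.py.

```python
import re

def motif_to_regex(motif):
    """Turn alternatives such as (A/G) into regex character classes such as [AG]."""
    # Hypothetical helper: only handles single-base alternatives separated by "/".
    return re.sub(
        r"\(([ACGT](?:/[ACGT])*)\)",
        lambda m: "[" + m.group(1).replace("/", "") + "]",
        motif,
    )

print(motif_to_regex("(A/G)GGTGT(C/G/T)(A/G)"))
```

For the motif in this post, this prints the same pattern that is passed to -r in the command above.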
regex2bed.py is quite fast, scanning a 1.5G genome in ~1 minute on a modern desktop. The program reports some basic stats, i.e. the number of matches on the + and - strands for each chromosome, to stderr.
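To illustrate the idea of scanning both strands with a regex, here is a minimal sketch. It is not the code of regex2bed.py: the function names, the case-insensitive matching and the format of the stats line are my assumptions. It searches the sequence and its reverse complement, converts reverse-strand hits back to forward coordinates, and writes BED6 lines (0-based, half-open).

```python
import re
import sys
from collections import Counter

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement; characters such as N are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]

def scan(chrom, seq, pattern):
    """Yield BED6 tuples for regex matches on the + and then the - strand."""
    regex = re.compile(pattern, re.IGNORECASE)  # assumption: ignore soft-masking
    n = len(seq)
    for m in regex.finditer(seq):
        yield (chrom, m.start(), m.end(), m.group(), 0, "+")
    for m in regex.finditer(reverse_complement(seq)):
        # A match at [s, e) on the reverse complement maps to [n - e, n - s).
        yield (chrom, n - m.end(), n - m.start(), m.group(), 0, "-")

def report(chrom, seq, pattern, out=sys.stdout, log=sys.stderr):
    """Write sorted BED lines to `out` and per-strand counts to `log`."""
    counts = Counter()
    for hit in sorted(scan(chrom, seq, pattern), key=lambda h: (h[1], h[2])):
        out.write("\t".join(map(str, hit)) + "\n")
        counts[hit[5]] += 1
    log.write("%s\t+:%d\t-:%d\n" % (chrom, counts["+"], counts["-"]))
    return counts

report("chr1", "ttAGGTGTCAccTGACACCTaa", "[AG]GGTGT[CGT][AG]")
```

Note that re.finditer returns non-overlapping matches only, so overlapping occurrences of a motif on the same strand would be missed by this sketch.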
Most likely, you will find hundreds of thousands of putative TF binding sites (TFBS). Therefore, it’s a good idea to filter them, e.g. by focusing on those in proximity to genes of interest. This can be accomplished using a combination of awk, bedtools and two other scripts: bed2region.py and intersect2bed.py.
# crosslink with genes within 100 kb upstream of coding genes
awk '$3=="gene"' genome.gtf > gene.gtf
cat tf.bed | bed2region.py 100000 | bedtools intersect -s -loj -a - -b gene.gtf | intersect2bed.py > tf.genes100k.bed
And this is what example output looks like:
1	68669	68677	GGGTGTGG	0	+	ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578	f7; f10; F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1	71354	71362	aggtgtgg	0	+	ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578	f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1	76322	76330	AGGTGTGG	0	+	ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578	f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
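The output above has one line per motif match, with gene IDs and gene names joined by "; ". Below is a rough sketch of how such a collapse could be done on bedtools intersect -loj output. It is not the code of intersect2bed.py. It assumes tab-separated input with 6 BED columns followed by 9 GTF columns, and gene_id and gene_name attributes in the GTF. The names gtf_attr and collapse are made up for this example.

```python
import re

def gtf_attr(attributes, key):
    """Return the value of `key "value"` in a GTF attribute string, or None."""
    m = re.search(re.escape(key) + r' "([^"]*)"', attributes)
    return m.group(1) if m else None

def collapse(lines):
    """Group -loj rows by BED6 site and join the gene IDs and names per site."""
    grouped = {}  # dicts keep insertion order, so input order is preserved
    for line in lines:
        f = line.rstrip("\n").split("\t")
        site = tuple(f[:6])
        ids, names = grouped.setdefault(site, ([], []))
        # Assumption: the GTF attribute column is the 15th field (index 14).
        attrs = f[14] if len(f) > 14 else "."
        gene_id = gtf_attr(attrs, "gene_id")
        if gene_id and gene_id not in ids:
            ids.append(gene_id)
            names.append(gtf_attr(attrs, "gene_name") or gene_id)
    for site, (ids, names) in grouped.items():
        yield "\t".join(site + ("; ".join(ids), "; ".join(names)))

# Usage in a pipeline: for row in collapse(sys.stdin): print(row)
```

Sites without any overlapping gene (reported by -loj with "." placeholders) are kept with empty gene columns in this sketch.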
All of the above-mentioned programs can be found on GitHub.
If you want to learn more about regular expressions, have a look at the Python re module.