Easy citation in LibreOffice / OpenOffice with Mendeley

Creating reference list is always a nightmare. Mendeley and its handy LibreOffice / OpenOffice plugin may be of great help to many. It was for me. Below, I’ll describe how to make it working.

# get & install mendeley from https://www.mendeley.com/download-mendeley-desktop/

# check version of your mendeley
#  Help > About Mendeley Desktop

# clone repo and build plugin
git clone git@github.com:Mendeley/openoffice-plugin.git
cd openoffice-plugin/
python build.py 1.15.2 false

# add to LibreOffice
#  Tools > Extension Manager > Add...
#   and look for `Mendeley-1.15.2.oxt`

After OpenOffice / LibreOffice restart, you should see new bar. Note, in order for the plugin to work, Mendeley has to be running.

What’s great about this plugin, you can adjust citation style by just a few clicks by clicking on `Choose Citation Style`. There is quite extensive database of predefined citation styles, so adjusting the reference style to your favourite journal will take just a few seconds 🙂
More info about the plugin on github.

Installing new version of Python without root

Some time ago I was recommending to use Python virtual environment to install local version of Python packages. However this will not solve the issue of outdated version Python in the server your are working in. Here, pythonbrew may be help for you.

# install pythonbrew to ~/.pythonbrew
curl -kL http://xrl.us/pythonbrewinstall | bash

# add to ~/.bashrc to automatically activate pythonbrew
[[ -s "$HOME/.pythonbrew/etc/bashrc" ]] && source "$HOME/.pythonbrew/etc/bashrc"                                                         

# open new terminal tab (Ctrl+Shift+T) or window (Ctrl+Shift+N)

# install python 2.7.10
pythonbrew install 2.7.10

# and enable the new version
pythonbrew switch 2.7.10

# from now on, you can enjoy the version of your choice and install dependencies
which python
python --version
#Python 2.7.10
which pip

Using pybedtools under Python 2.6

I encountered an import error from collections import OrderedDict, while using pybedtools under Python 2.6. It took me some time to find a workaround… I posted it below.

# install dependencies and pybedtools
pip install cython ordereddict
pip install pybedtools

# edit file ... lib/python2.6/site-packages/pybedtools/contrib/venn_maker.py
# comment import from collections and add try-except
#from collections import OrderedDict
    from collections import OrderedDict
except ImportError:
    from ordereddict import OrderedDict

Serving IPython notebook on public domain

I’ve been involved in teaching basic programming in Python. There are several good tutorials and on-line courses (just to mention Python@CodeCademy), but I’ve recognised there is a need for some interactive workplace for the students. I’ve got an idea to setup IPython in public domain, as many of the students don’t have Python installed locally or miss certain dependencies…
The task of installing IPython and serving it in publicly seems very easy… But I’ve encountered numerous difficulties on the way, caused by different versions of IPython (ie. split into Jupyter in v4), Apache configuration and firewall setup, just to mention a few. Anyway, I’ve succeeded and I’ve decided to share my experiences here 🙂
First of all, I strongly recommend setting up separate user for serving IPython, as only this way your personal files will be safe ?

  1. Install IPython notebook and prepare new user
  2. # install python-dev and build essentials
    sudo apt-get install build-essential python-dev
    # install ipython; v3 is recommended
    sudo pip install ipython[all]==3.2.1
    # create new user
    sudo adduser ipython
    # login as new user
    su ipython
  3. Configure IPython notebook
  4. # create new profile
    ipython profile create nbsever
    # generate pass and checksum
    ipython -c "from IPython.lib import passwd; passwd()"
    # enter your password twice, save it and copy password hash
    ## Out[1]: 'sha1:[your hashed password here]'
    # add to ~/.ipython/profile_nbserver/ipython_notebook_config.py after `c = get_config()`
    c.NotebookApp.ip = 'localhost'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = 8889
    c.NotebookApp.base_url = '/ipython'
    c.NotebookApp.password = u'sha1:[your hashed password here]'
    # create some directory for notebook files ie. ~/Public/ipython
    mkdir -p ~/Public/ipython
    cd ~/Public/ipython
    # start notebook server
    ipython notebook --profile=nbserver
  5. Configure Apache2
  6. # enable mods
    sudo a2enmod proxy proxy_http proxy_wstunnel
    sudo service apache2 restart
    # add ipython proxy config to your enabled site ie. /etc/apache2/sites-available/000-default.conf
        # IPython
        <Location "/ipython" >
            ProxyPass http://localhost:8889/ipython
            ProxyPassReverse http://localhost:8889/ipython
        <Location "/ipython/api/kernels/" >
            ProxyPass        ws://localhost:8889/ipython/api/kernels/
            ProxyPassReverse ws://localhost:8889/ipython/api/kernels/
    # restart apache2
    sudo service apache2 restart

Your public IPython will be accessible at http://yourdomain.com/ipython .
The longest time it took me to realise that c.NotebookApp.allow_origin='*' line is crucial in IPython notebook configuration, otherwise the kernel is loosing connection with an error ‘Connection failed‘ or ‘WebSocket error‘. Additionally, in one of the servers I’ve been trying, there is proxy setup that block some ports high ports, thus it was impossible to connect to WebSocket even with ApacheProxy setup…
If you want to read more especially about setting SSL-enabled notebook, have a look at jupyter documentation.

pysam installation error on Ubuntu 14.04

If you encounter error during pysam installation:

sudo easy_install -U pysam


pysam/csamtools.c:8:22: fatal error: pyconfig.h: No such file or directory
 #include "pyconfig.h"
compilation terminated.
error: Setup script exited with error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

try installing python-dev first.

sudo apt-get install python-dev

Solution found on github.

Identification of potential transcription factor binding sites (TFBS) across species

My colleague asked me for help with identification of targets for some transcription factors (TFs). The complication is that target motifs for these TFs are known in human, (A/G)GGTGT(C/G/T)(A/G), but exact binding motif is not known in the species of interest. Nevertheless, we decided to scan the genome for matches of this motifs. To facilitate that, I’ve written small program, regex2bed.py, finding sequence motifs in the genome. The program employs regex to find matches in forward and reverse complement and reports bed-formatted output.

regex2bed.py -vcs -i DANRE.fa -r "[AG]GGTGT[CGT][AG]" > tf.bed 2> tf.log

regex2bed.py is quite fast, scanning 1.5G genome in ~1 minute on modern desktop. The program reports some basic stats ie. number of matches in +/- strand for each chromosome to stderr.
Most likely, you will find hundred thousands of putative TFBS. Therefore, it’s good to filter some of them ie. focusing on these in proximity of some genes of interest. This can be accomplished using combination of awk, bedtools and two other scripts: bed2region.py and intersect2bed.py.

# crosslink with genes within 100 kb upstream of coding genes
awk '$3=="gene"' genome.gtf > gene.gtf
cat tf.bed | bed2region.py 100000 | bedtools intersect -s -loj -a - -b gene.gtf  | intersect2bed.py > tf.genes100k.bed

And this is how example output will look like:

1       68669   68677   GGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578  f7; f10; F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       71354   71362   aggtgtgg        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       76322   76330   AGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a

All above mentioned programs can be found in github.
If you want to learn more about regular expression, have a look at Python re module.

Virtual environment (venv) in Python

Working on the machines with no root access is sometimes annoying, especially if you depend on multiple Python packages and your distro is somehow outdated… You may find Python virtual environment (venv) very useful.
First, you need to create venv directory structure (this is done only once):

mkdir -p ~/src/venv
cd ~/src/venv
virtualenv py27

Then you can open new BASH terminal and activate your venv by:

source ~/src/venv/py27/bin/activate

After that, you can install / upgrade any packages using pip / easy_install (even including PIP ) ie.

pip install --upgrade pip
pip install --upgrade scipy

Insipired by python-guide.

What fraction of the genome is expressed?

Recently, I was interested to find out what fraction of genome is expressed. This can be computed quite easily from RNA-Seq alignments using bedtools and some simple scripting. Note, it’s crucial to use -split parameter in bedtools genomecov to report correct coverage from split alignments. Otherwise, the introns that are spliced-out will be treated as expressed regions.

# get genomic regions with at least n reads aligned
for f in *.bam; do
  echo `date` $f;
  # get genomic intervals covered by n reads
  bedtools genomecov -split -ibam $f -g $ref.fa.fai -dz | awk -F'\t' 'BEGIN {OFS = FS} {if ($3>='$n') {print $1,$2,$2+1}}' | bedtools merge > $f.${n}reads.bed;
  # report bp covered by n reads
  awk '{sum+=$3-$2} END {print sum}' $f.${n}reads.bed
done; date

Visualisation of repeats alongside NGS data

In one of the previous projects we were interested to look for association between deletions and repeats. At that time, I’ve written small program, repeats2sam.py, identifying perfect direct and inverted repeats in chromosomes. repeats2sam.py uses repeat-match from great MUMmer package and reports SAM-formatted repeats. This can be then easily converted to BAM. It takes several minutes for small genomes (20Mb) and several hours for large genomes (~8 hours for 1.5Gb).

# identify direct & inverted repeats
repeats2sam.py -v --inverted -i $ref.fa | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats
samtools index $ref.repeats.bam

Repeats are stored as paired-end reads, so they can be easily visualised alongside any NGS data in IGV:

  • direct & inverted repeats
  • inverted repeats + RNAseq
  • inverted repeats + RNAseq zommed in

If you are interested in DNA direct repeats, either skip –inverted parameter, or select only direct repeats from your BAM file. In addition, you may want to select only repeats fulfilling certain length and distance criteria.

# select only inverted repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.inverted.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.inverted.${i}bp && samtools index $ref.repeats.inverted.${i}bp.bam
# select only direct repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.direct.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.direct.${i}bp && samtools index $ref.repeats.direct.${i}bp.bam

All above mentioned programs can be found in github.