Running Jupyter as a public service

Some time ago, I wrote about setting up IPython as a public service. Today, I’ll cover setting up Jupyter, IPython’s descendant, which besides Python supports tons of other languages and frameworks.

The Jupyter notebook will run under a separate user, so your personal files are safe, but not as a system service, so you will need to restart it after each system reboot. I recommend running it in a SCREEN session, so you can easily log into the server and check on Jupyter’s state.

  1. Install & setup Jupyter
  2. # install python-dev and build essentials
    sudo apt-get install build-essential python-dev
    sudo pip install jupyter
    
    # create new user
    sudo adduser jupyter
     
    # login as new user
    su jupyter
    
    # make sure to add `unset XDG_RUNTIME_DIR` to ~/.bashrc
    # otherwise you'll encounter: OSError: [Errno 13] Permission denied: '/run/user/1003/jupyter'
    echo 'unset XDG_RUNTIME_DIR' >> ~/.bashrc
    source ~/.bashrc
    
    # generate ssl certificates
    mkdir ~/.ssl
    openssl req -x509 -nodes -days 999 -newkey rsa:1024 -keyout ~/.ssl/mykey.key -out ~/.ssl/mycert.pem
    
    # generate config
    jupyter notebook --generate-config
    
    # generate pass and checksum
    ipython -c "from IPython.lib import passwd; passwd()"
    # enter your password twice, save it and copy password hash
    ## Out[1]: 'sha1:[your hashed password here]'
     
    # add to ~/.jupyter/jupyter_notebook_config.py
    c.NotebookApp.ip = '*'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = 8881
    c.NotebookApp.password = u'sha1:[your hashed password here]'
    c.NotebookApp.certfile = u'/home/jupyter/.ssl/mycert.pem'
    c.NotebookApp.keyfile = u'/home/jupyter/.ssl/mykey.key'
    
    # create some directory for notebook files ie. ~/Public/jupyter
    mkdir -p ~/Public/jupyter && cd ~/Public/jupyter
     
    # start notebook server
    jupyter notebook
    
  3. Add kernels
  4. You can add multiple kernels to Jupyter. Here I’ll cover the installation of a few:

    • Python
    • sudo pip install ipykernel
      
      # if you wish to use matplotlib, make sure to add to 
      # ~/.ipython/profile_default/ipython_kernel_config.py
      c.InteractiveShellApp.matplotlib = 'inline'
      
    • BASH kernel
    • sudo pip install bash_kernel
      sudo python -m bash_kernel.install
      
    • Perl
    • This didn’t work for me :/

      sudo cpan Devel::IPerl
    • IRkernel
    • Follow this tutorial.

    • Haskell
    • sudo apt-get install cabal-install
      git clone http://www.github.com/gibiansky/IHaskell
      cd IHaskell
      ./ubuntu-install.sh
      

Then just navigate to https://YOURDOMAIN.COM:8881/, accept the self-signed certificate and enjoy!
Alternatively, you can obtain a certificate from Let’s Encrypt.

Using existing domain encryption, aka an Apache proxy
If your domain already uses HTTPS, you may consider running Jupyter on localhost and redirecting all incoming (already encrypted) traffic to a particular port on localhost (as suggested by @shebang).

# enable Apache mods
sudo a2enmod proxy proxy_http proxy_wstunnel && sudo service apache2 restart

# add to your Apache config
    <Location "/jupyter" >
        ProxyPass http://localhost:8881/jupyter
        ProxyPassReverse http://localhost:8881/jupyter
    </Location>
    <Location "/jupyter/api/kernels/" >
        ProxyPass        ws://localhost:8881/jupyter/api/kernels/
        ProxyPassReverse ws://localhost:8881/jupyter/api/kernels/
    </Location>
    <Location "/jupyter/api/kernels/">
        ProxyPass        ws://localhost:8881/jupyter/api/kernels/
        ProxyPassReverse ws://localhost:8881/jupyter/api/kernels/
    </Location>

# update your Jupyter config (~/.jupyter/jupyter_notebook_config.py)
c.NotebookApp.ip = 'localhost'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8881
c.NotebookApp.base_url = '/jupyter'
c.NotebookApp.password = u'sha1:[your hashed password here]'
c.NotebookApp.allow_origin = '*'

Note that it’s crucial to add the Apache proxy for the kernels (/jupyter/api/kernels/); otherwise you won’t be able to use terminals, failing with an Error during WebSocket handshake: Unexpected response code: 400 error.

On handy Docker images

Motivated by the successful stripping of problematic dependencies from Redundans, I decided to generate a smaller Docker image, starting from the Alpine Linux base image (2MB / 5MB after downloading) instead of Ubuntu (49MB / 122MB). Previously, I couldn’t really rely on Alpine Linux, because it was impossible to get those problematic dependencies running… But now it’s a whole new world of possibilities 😉

There are very few dependencies left, so I got started… (You can find all the commands below.)

  1. First, I checked what could be installed from the package manager.
    Only Python and Perl.

  2. Then I checked whether any of the precompiled binaries worked.
    For example, GapCloser is provided as a binary, and it took me some time to find its source code…
    Anyway, none of the binaries worked out of the box. This was expected, as Alpine Linux is super stripped-down…

  3. I installed build-base in order to be able to build things.
    Additionally, BWA needs zlib-dev.

  4. Alpine Linux doesn’t use the standard glibc library, but musl libc (you can read more about the differences between the two), so some programs (e.g. BWA) may be quite reluctant to compile.
    After some hours of trying, and thanks to the help of mp15, I found a solution that is not so complicated 🙂

  5. I realised that the Dockerfile doesn’t like standard BASH brace expansion, which otherwise works in the Docker Alpine console…
    so ls *.{c,h} should be ls *.c *.h

  6. After that, compiling LAST and GapCloser was relatively easy 😉

Below, you can find the code from the Dockerfile (without the RUN commands).

apk add --update --no-cache python perl bash wget build-base zlib-dev
mkdir -p /root/src && cd /root/src && wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.15.tar.bz2 && tar xpfj bwa-0.7.15.tar.bz2 && ln -s bwa-0.7.15 bwa && cd bwa && \
cp kthread.c kthread.c.org && echo "#include <stdint.h>" > kthread.c && cat kthread.c.org >> kthread.c && \
sed -ibak 's/u_int32_t/uint32_t/g' `grep -l u_int32_t *.c *.h` && make && cp bwa /bin/ && \
cd /root/src && wget http://liquidtelecom.dl.sourceforge.net/project/soapdenovo2/GapCloser/src/r6/GapCloser-src-v1.12-r6.tgz && tar xpfz GapCloser-src-v1.12-r6.tgz && ln -s v1.12-r6/ GapCloser && cd GapCloser && make && cp bin/GapCloser /bin/ && \
cd /root/src && wget http://last.cbrc.jp/last-744.zip && unzip last-744.zip && ln -s last-744 last && cd last && make && make install && \
cd /root/src && rm -r last* bwa* GapCloser* v* 

# SSPACE && redundans in /root/src
cd /root/src && wget -q http://www.baseclear.com/base/download/41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && tar xpfz 41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && ln -s SSPACE-STANDARD-3.0_linux-x86_64 SSPACE && wget -O- -q http://cpansearch.perl.org/src/GBARR/perl5.005_03/lib/getopts.pl > SSPACE/dotlib/getopts.pl && \
wget --no-check-certificate -q -O redundans.tgz https://github.com/lpryszcz/redundans/archive/master.tar.gz && tar xpfz redundans.tgz && mv redundans-master redundans && ln -s /root/src/redundans /redundans && rm *gz

apk del wget build-base zlib-dev 
apk add libstdc++

After building & pushing, I noticed that the Alpine-based image is only slightly smaller (99MB) than the one based on Ubuntu (127MB). Surprisingly, the Alpine-based image is larger (273MB) than the Ubuntu-based one (244MB) after downloading. So I’m afraid all of these hours didn’t really bring any substantial reduction in image size.

Conclusion?
I was very motivated to build my application on Alpine Linux and expected a substantial size reduction. But I’d say that relying on the Alpine Linux image doesn’t always pay off in terms of smaller image size, let alone production time… And this I know from my own experience.
But maybe I did something wrong? I’d be really glad for some advice/comments!

Nevertheless, stripping a few dependencies from my application (namely Biopython, numpy & scipy) resulted in a much more compact image even with the Ubuntu base (127MB vs 191MB; and 244MB vs 440MB after downloading). So I think this is the way to go 🙂

On simplifying dependencies

Lately, to make Redundans more user-friendly, I have simplified its dependencies by replacing Biopython, numpy, scipy and SQLite with (relatively) simple functions or modules.

Here, I will focus on replacing Biopython, particularly SeqIO.index_db, with FastaIndex. You may ask yourself why I invested time in reinventing the wheel. I’m a big fan of Biopython, yet it’s a huge project and some of its solutions are not optimal or require problematic dependencies. This is the case with SeqIO.index_db, which relies on SQLite3. Again, I’m a big fan of SQLite, yet building Biopython with SQLite enabled proved not to be very straightforward on non-standard systems or for less experienced users. Besides, on some NFS setups the SQLite3 database cannot be created at all.

OK, let’s start from the basics. SeqIO.index_db allows random access to sequence files, so you can, for example, rapidly retrieve any entry from a very large file. This is achieved by storing the ID and position of each entry of a given file in a database (an SQLite3 db). Then, if you want to retrieve a particular record, SeqIO.index_db checks whether the record is present in the SQLite3 db, retrieves the record’s position in the file and reads only a small chunk of the file, instead of parsing the entire file every time you want some record(s).
A similar feature is offered by samtools faidx, but in this case the coordinates of each entry are stored in a tab-delimited .fai file (more info about .fai). This format can easily be read & written by any program, so I decided to use it. In addition, I realised that samtools faidx is flexible enough that you can add extra columns to the .fai without breaking its functionality, but more about that later…
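
For reference, this is roughly how the Biopython functionality being replaced is used (standard SeqIO.index_db usage; the file names below are just placeholders):

# build (or reopen) an SQLite3-backed index of a FASTA file
from Bio import SeqIO
index = SeqIO.index_db("contigs.idx", "contigs.fa", "fasta")

# random access by record ID, without parsing the whole file
record = index["NODE_2_length_7674_cov_46.7841_ID_3"]
print("%s\t%s" % (record.id, len(record.seq)))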

In Redundans, I’ve been using SeqIO.index_db during assembly reduction (fasta2homozygous.py). Additionally, beside storing the index, I’ve also been generating statistics for every FastA file, like the number of contigs, cumulative size, N50, N90, GC and so on. I realised that these two can easily be combined by extending the .fai with four additional columns storing the number of occurrences of A, C, G & T in every sequence. Such a .fai is still compatible with samtools faidx and provides a very easy way of calculating a bunch of statistics about the file.
All of this I’ve implemented in FastaIndex. Beside being a dependency-free & very handy indexer, it can also be used as an alternative to samtools faidx to retrieve sequences from large FastA files.
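
To illustrate the statistics idea, here is a minimal sketch of parsing such an extended .fai, assuming the four extra columns hold the A, C, G and T counts in that order and the index sits next to the FASTA; the actual column layout and helpers in FastaIndex may differ:

# minimal sketch: whole-file statistics from an extended .fai
# assumed columns: name, length, offset, linebases, linewidth, A, C, G, T
def read_extended_fai(path):
    stats = {}
    for line in open(path):
        fields = line.rstrip("\n").split("\t")
        name, length = fields[0], int(fields[1])
        a, c, g, t = (int(x) for x in fields[5:9])
        stats[name] = (length, a, c, g, t)
    return stats

stats = read_extended_fai("test/run1/contigs.fa.fai")
total = sum(length for length, a, c, g, t in stats.values())
gc = sum(c + g for length, a, c, g, t in stats.values())
print("%s contigs, %s bases, %.2f%% GC" % (len(stats), total, 100.0 * gc / total))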

# retrieve bases from 20 to 60 from NODE_2
./FastaIndex.py -i test/run1/contigs.fa -r NODE_2_length_7674_cov_46.7841_ID_3:20-60
>NODE_2_length_7674_cov_46.7841_ID_3
CATAGAACGACTGGTATAAGCCAAACATGACCCATTGTTGC
#Time elapsed: 0:00:00.014243

samtools faidx test/run1/contigs.fa NODE_2_length_7674_cov_46.7841_ID_3:20-60
>NODE_2_length_7674_cov_46.7841_ID_3:20-60
CATAGAACGACTGGTATAAGCCAAACATGACCCATTGTTGC

Tracing exceptions in multiprocessing in Python

I had problems debugging my program that uses multiprocessing.Pool.

Traceback (most recent call last):
  File "src/homologies2mysql_multi.py", line 294, in <module>
    main()
  File "src/homologies2mysql_multi.py", line 289, in main
    o.noupload, o.verbose)
  File "src/homologies2mysql_multi.py", line 242, in homologies2mysql
    for i, data in enumerate(p.imap_unordered(worker, pairs), 1):
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 520, in next
    raise value
ValueError: need more than 1 value to unpack

I could run it without multiprocessing, but then I’d have to wait several days for the program to reach the point where it crashes.
Luckily, Python ships with the traceback module, which allows handy tracing of exceptions.
You can then add a decorator to the problematic function that will report a nice error message:

import traceback, functools, multiprocessing
 
def trace_unhandled_exceptions(func):
    @functools.wraps(func)
    def wrapped_func(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except:
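            # print the full traceback in the worker process; the Pool
            # would otherwise report only the bare exception value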
            print 'Exception in '+func.__name__
            traceback.print_exc()
    return wrapped_func
 
@trace_unhandled_exceptions
def go():
    print(1)
    raise Exception()
    print(2)
 
p = multiprocessing.Pool(1)
 
p.apply_async(go)
p.close()
p.join()

The error message will look like:

1
Exception in go
Traceback (most recent call last):
  File "<stdin>", line 5, in wrapped_func
  File "<stdin>", line 4, in go
Exception

Solution found on StackOverflow.

Batch conversion of .xlsx (Microsoft Office) to .tsv (tab-delimited) files

I had to retrieve data from multiple .xlsx files with multiple sheets. This can be done manually, but it would be a rather time-consuming task; plus, Office quotes text fields, which is not very convenient for downstream analysis…
I found a handy script, xlsx2tsv.py, that does the job, but it reports only one sheet at a time. Thus, I have rewritten xlsx2tsv.py a little to save all sheets from a given .xlsx file into a separate folder. In addition, multiple .xlsx files can be processed at once. My version can be found on GitHub.

xlsx2tsv.py *.xlsx
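
For reference, the core of the task can be sketched with openpyxl (a minimal sketch of the idea only, not the actual xlsx2tsv.py, which may rely on a different .xlsx parser; iter_rows(values_only=True) needs a reasonably recent openpyxl):

import glob, os
from openpyxl import load_workbook

# dump every sheet of every .xlsx into <file>/<sheet>.tsv
for fname in glob.glob("*.xlsx"):
    outdir = os.path.splitext(fname)[0]
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    wb = load_workbook(fname, read_only=True, data_only=True)
    for ws in wb.worksheets:
        with open(os.path.join(outdir, ws.title + ".tsv"), "w") as out:
            for row in ws.iter_rows(values_only=True):
                out.write("\t".join("" if v is None else str(v) for v in row) + "\n")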

Installing a new version of Python without root

Some time ago I recommended using a Python virtual environment to install local versions of Python packages. However, this will not solve the issue of an outdated Python version on the server you are working on. Here, pythonbrew may be of help.

# install pythonbrew to ~/.pythonbrew
curl -kL http://xrl.us/pythonbrewinstall | bash

# add to ~/.bashrc to automatically activate pythonbrew
[[ -s "$HOME/.pythonbrew/etc/bashrc" ]] && source "$HOME/.pythonbrew/etc/bashrc"                                                         

# open new terminal tab (Ctrl+Shift+T) or window (Ctrl+Shift+N)

# install python 2.7.10
pythonbrew install 2.7.10

# and enable the new version
pythonbrew switch 2.7.10

# from now on, you can enjoy the version of your choice and install dependencies
which python
#/home/.../.pythonbrew/pythons/Python-2.7.10/bin/python
python --version
#Python 2.7.10
which pip
#/home/.../.pythonbrew/pythons/Python-2.7.10/bin/pip

Serving IPython notebook on a public domain

I’ve been involved in teaching basic programming in Python. There are several good tutorials and online courses (to mention just Python@CodeCademy), but I’ve recognised there is a need for an interactive workspace for the students. I got the idea to set up IPython on a public domain, as many of the students don’t have Python installed locally or are missing certain dependencies…
The task of installing IPython and serving it publicly seems very easy… But I encountered numerous difficulties along the way, caused by different versions of IPython (e.g. the split into Jupyter in v4), Apache configuration and firewall setup, just to mention a few. Anyway, I’ve succeeded and decided to share my experiences here 🙂
First of all, I strongly recommend setting up a separate user for serving IPython, as only this way will your personal files be safe.

  1. Install IPython notebook and prepare new user
  2. # install python-dev and build essentials
    sudo apt-get install build-essential python-dev
    
    # install ipython; v3 is recommended
    sudo pip install ipython[all]==3.2.1
    
    # create new user
    sudo adduser ipython
    
    # login as new user
    su ipython
    
  3. Configure IPython notebook
  4. # create new profile
    ipython profile create nbserver
    
    # generate pass and checksum
    ipython -c "from IPython.lib import passwd; passwd()"
    # enter your password twice, save it and copy password hash
    ## Out[1]: 'sha1:[your hashed password here]'
    
    # add to ~/.ipython/profile_nbserver/ipython_notebook_config.py after `c = get_config()`
    c.NotebookApp.ip = 'localhost'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = 8889
    c.NotebookApp.base_url = '/ipython'
    c.NotebookApp.password = u'sha1:[your hashed password here]'
    c.NotebookApp.allow_origin='*'
    
    # create some directory for notebook files ie. ~/Public/ipython
    mkdir -p ~/Public/ipython
    cd ~/Public/ipython
    
    # start notebook server
    ipython notebook --profile=nbserver
    
  5. Configure Apache2
  6. # enable mods
    sudo a2enmod proxy proxy_http proxy_wstunnel
    sudo service apache2 restart
    
    # add ipython proxy config to your enabled site ie. /etc/apache2/sites-available/000-default.conf
        # IPython
        <Location "/ipython" >
            ProxyPass http://localhost:8889/ipython
            ProxyPassReverse http://localhost:8889/ipython
        </Location>
    
        <Location "/ipython/api/kernels/" >
            ProxyPass        ws://localhost:8889/ipython/api/kernels/
            ProxyPassReverse ws://localhost:8889/ipython/api/kernels/
        </Location>
        #END
          
    # restart apache2
    sudo service apache2 restart
    

Your public IPython will be accessible at http://yourdomain.com/ipython.
It took me the longest time to realise that the c.NotebookApp.allow_origin='*' line is crucial in the IPython notebook configuration; otherwise the kernel keeps losing its connection with a ‘Connection failed‘ or ‘WebSocket error‘ message. Additionally, one of the servers I tried has a proxy setup that blocks some high ports, so it was impossible to connect to the WebSocket even with the Apache proxy set up…
If you want to read more, especially about setting up an SSL-enabled notebook, have a look at the Jupyter documentation.

Identification of potential transcription factor binding sites (TFBS) across species

My colleague asked me for help with identifying targets of some transcription factors (TFs). The complication is that the target motif for these TFs is known in human, (A/G)GGTGT(C/G/T)(A/G), but the exact binding motif is not known in the species of interest. Nevertheless, we decided to scan the genome for matches of this motif. To facilitate that, I’ve written a small program, regex2bed.py, that finds sequence motifs in the genome. The program employs regular expressions to find matches in the forward sequence and its reverse complement, and reports BED-formatted output.

regex2bed.py -vcs -i DANRE.fa -r "[AG]GGTGT[CGT][AG]" > tf.bed 2> tf.log
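
Under the hood, the idea boils down to matching the regular expression against the forward sequence and its reverse complement, and printing BED lines; a minimal, self-contained sketch of that idea (not the actual regex2bed.py code) could look like this:

import re

# match the motif case-insensitively, so soft-masked (lowercase) bases count too
pattern = re.compile("[AG]GGTGT[CGT][AG]", re.IGNORECASE)
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}

def scan(chrom, seq):
    # forward strand: BED is 0-based, half-open, so match start/end map directly
    for m in pattern.finditer(seq):
        print("%s\t%s\t%s\t%s\t0\t+" % (chrom, m.start(), m.end(), m.group()))
    # reverse strand: scan the reverse complement and map coordinates back
    rc = "".join(complement.get(b, "N") for b in reversed(seq.upper()))
    for m in pattern.finditer(rc):
        start, end = len(seq) - m.end(), len(seq) - m.start()
        print("%s\t%s\t%s\t%s\t0\t-" % (chrom, start, end, m.group()))

scan("chr1", "ACGGGTGTCAAATTTTCACACCCTTT")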

regex2bed.py is quite fast, scanning a 1.5 Gb genome in ~1 minute on a modern desktop. The program reports some basic stats to stderr, e.g. the number of matches on the +/- strand for each chromosome.
Most likely, you will find hundreds of thousands of putative TFBS. Therefore, it’s good to filter them, e.g. by focusing on those in the proximity of genes of interest. This can be accomplished using a combination of awk, bedtools and two other scripts: bed2region.py and intersect2bed.py.

# crosslink with genes within 100 kb upstream of coding genes
awk '$3=="gene"' genome.gtf > gene.gtf
cat tf.bed | bed2region.py 100000 | bedtools intersect -s -loj -a - -b gene.gtf  | intersect2bed.py > tf.genes100k.bed

And this is what an example output looks like:

1       68669   68677   GGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578  f7; f10; F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       71354   71362   aggtgtgg        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       76322   76330   AGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a

All of the above-mentioned programs can be found on GitHub.
If you want to learn more about regular expressions, have a look at the Python re module.

Virtual environment (venv) in Python

Working on machines with no root access is sometimes annoying, especially if you depend on multiple Python packages and your distro is somewhat outdated… You may find a Python virtual environment (venv) very useful.
First, you need to create the venv directory structure (this is done only once):

mkdir -p ~/src/venv
cd ~/src/venv
virtualenv py27

Then you can open a new BASH terminal and activate your venv with:

source ~/src/venv/py27/bin/activate

After that, you can install / upgrade any packages using pip / easy_install (even pip itself), e.g.

pip install --upgrade pip
pip install --upgrade scipy

Inspired by python-guide.

Generating word clouds in Python

I have often needed to visualise the functions associated with a set of genes. word_cloud is a very handy word cloud generator written in Python.
Install it:

sudo pip install git+git://github.com/amueller/word_cloud.git

Incorporate it into your Python code:

import matplotlib.pyplot as plt
from wordcloud import WordCloud
 
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

wordcloud = WordCloud().generate(text)
img=plt.imshow(wordcloud)
plt.axis("off")
plt.show()
 
#or save as png
img.write_png("wordcloud.png")

It’s implemented in metaPhOrs.