GitHub push fails due to large files

Lately, I have had lots of problems with pushing large files to GitHub. I maintain a compilation of materials and software deposited by other people, so I cannot control the size of the files… and this makes the push fail often.

git push
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: error: Trace: 6f0f7f66995a394598595375954732db
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File chip_seq/reads/sox2_chip.fastq.gz is 109.69 MB; this exceeds GitHub's file size limit of 100.00 MB

To remove a large file from the commit, execute:

git filter-branch -f --index-filter 'git rm --cached --ignore-unmatch chip_seq/reads/sox2_chip.fastq.gz'
git push
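
Before pushing again, it may help to double-check that no other files exceed GitHub's 100MB limit (a minimal check; adjust the size threshold as needed):

# list files larger than 100MB, skipping the .git folder
find . -type f -size +100M ! -path "./.git/*"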

To add large files using git-lfs, execute:

# track with git-lfs all files larger than 50MB, skipping those in the .git folder
find . -type f -size +50M ! -iwholename "*.git*" | rev | cut -f1 -d'/' | rev | xargs git lfs track
# add everything (including the .gitattributes file created by git-lfs), commit & push
git add --all . && git commit -m "final" && git push origin
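
You can verify what git-lfs will handle before pushing (both subcommands are part of git-lfs):

# list tracked patterns & the files currently handled by git-lfs
git lfs track
git lfs ls-files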

Make sure that your files are smaller than 2GB, otherwise your push will fail again 😉

Then, before pulling on another machine, make sure to install git-lfs:

git lfs install
git pull

Stream audio & video from webcam using VLC

Yesterday, I posted about streaming a webcam image to the web using motion. That solution, although very simple, has many limitations: no sound, high bandwidth usage and low image quality, just to mention a few. In a way, a motion stream is just a series of JPEG files.
In order to solve all of these, I have spent quite some time playing with VLC, an open source, cross-platform multimedia player that is able to transcode and stream audio & video.
Streaming can be started from the graphical interface; just go to:

Media >> Stream… >> Capture Device, select your devices, add an HTTP destination (e.g. :8081/webcam.ogg), select the Video-Theora + Vorbis (OGG) profile & press Stream.

Your stream will be available at: http://localhost:8081/webcam.ogg

But under Linux, using the command line is usually preferred:

vlc v4l2:// :input-slave=alsa:// :v4l2-standard=1 :v4l2-dev=/dev/video0 :v4l2-width=1280 :v4l2-height=720 :sout="#transcode{vcodec=theo,vb=2000,acodec=vorb,ab=128,channels=2,samplerate=44100}:http{dst=:8081/webcam.ogg}" -I dummy

Initially, I had problems with streaming sound along with the video. Adding `:input-slave=alsa:// :v4l2-standard=1` solved this. You can try other values for `:v4l2-standard`, i.e. 0, 1 or 2, depending on which microphone you want to use.

The above command will stream HD video (1280×720) in the .ogg format (natively supported by most browsers) at ~2Mbps (2000kbps). If you have a slower connection, you can change `vb=2000` to `vb=1000` (~1Mbps) and play with lower resolutions. You can check the available resolutions of your camera by:

lsusb -v | egrep -B10 'Width|Height'
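
Alternatively, if you have the v4l-utils package installed, v4l2-ctl reports the supported formats and frame sizes directly:

# list supported pixel formats & resolutions of the camera
v4l2-ctl -d /dev/video0 --list-formats-ext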

This stream, however, is accessible to everyone. To limit it to localhost only, you can use iptables:

sudo iptables -A INPUT -p tcp -s localhost --dport 8081 -j ACCEPT && sudo iptables -A INPUT -p tcp --dport 8081 -j DROP && vlc v4l2:// :input-slave=alsa:// :v4l2-standard=1 :v4l2-dev=/dev/video0 :v4l2-width=1280 :v4l2-height=720 :sout="#transcode{vcodec=theo,vb=2000,acodec=vorb,ab=128,channels=2,samplerate=44100}:http{dst=:8081/webcam.ogg}" -I dummy
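
You can inspect these rules later and remove them by number if needed (a minimal sketch; the rule numbers on your system may differ):

# list INPUT rules with their numbers
sudo iptables -L INPUT -n --line-numbers
# remove a rule by its number, e.g. rule 1
sudo iptables -D INPUT 1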

Now, you can create an apache2 proxy, similarly to the previous post:

# install apache2-utils
sudo apt install apache2-utils
 
# setup new user & passwd
sudo htpasswd -c /etc/apache2/.htpasswd webcam

# configure apache2 - add to your VirtualHost config
    # webcam
    <Location "/webcam.ogg">
        ProxyPass http://localhost:8081/webcam.ogg
        ProxyPassReverse http://localhost:8081/webcam.ogg
        # htpasswd
        AuthType Basic
        AuthName "Restricted Content"
        AuthUserFile /etc/apache2/.htpasswd
        Require valid-user
    </Location>
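
To check that the proxy and the password protection work, you can query the endpoint with curl (YOUR_DOMAIN and YOUR_PASSWORD stand for your own server name and the htpasswd password):

# should return 401 without credentials and 200 OK with them
curl -I http://YOUR_DOMAIN/webcam.ogg
curl -I -u webcam:YOUR_PASSWORD http://YOUR_DOMAIN/webcam.ogg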

Malformed column reporting and joining in BASH by paste or awk

I've spent some hours trying to figure out why the heck my scripts using awk and paste were returning malformed output. Simply, lines were wrongly pasted together, some columns were missing, while others were malformed… and in the case of awk, trying to print columns in unsorted order (i.e. column #3 before column #2, awk '{print $3,$2}') was producing malformed output.
After some time, I realised it was due to the Windows-like new line escape \r\n, instead of the standard Linux-like \n (of course, I got this file from a third party using Windows…).

Below, you can find more details.

# first, let's create a dummy file of 3 lines and 5 columns, each line ending with \r\n
python -c "with open('wrong.tsv','w') as out: out.write(''.join('line%s\t%s\r\n'%(i, '\t'.join('column%s'%j for j in range(1,5))) for i in range(1,4)))"
# and another one ending just with \n
python -c "with open('correct.tsv','w') as out: out.write(''.join('line%s\t%s\n'%(i, '\t'.join('column%s'%j for j in range(1,5))) for i in range(1,4)))"

# now let's paste wrong and correct files
paste wrong.tsv wrong.tsv
line1	line1n1	column1	column2	column3	column4
line2	line2n1	column1	column2	column3	column4
line3	line3n1	column1	column2	column3	column4

paste correct.tsv correct.tsv
line1	column1	column2	column3	column4	line1	column1	column2	column3	column4
line2	column1	column2	column3	column4	line2	column1	column2	column3	column4
line3	column1	column2	column3	column4	line3	column1	column2	column3	column4

# can you see the difference?

Simply, \r is interpreted as a carriage return (go back to the beginning of the line), thus pasting lines containing such a character will fail.
In order to convert files containing \r\n into the Unix-style \n, simply execute one of:

# replaces file and creates backup: inputfile.bak
sed -i.bak 's/\r$//' inputfile

# creates outputfile with correct formatting
tr -d '\r' < inputfile > outputfile
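
To quickly check whether a file contains \r\n line endings in the first place, both file and cat can help (standard Unix tools):

# file reports "with CRLF line terminators" for such files
file wrong.tsv
# cat -A shows \r as ^M at the end of each line
cat -A wrong.tsv | head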

You can read more about newline characters at Wikipedia.

Enable HTTPS for your domains in 5 minutes & for free!

For a while, I've been thinking about encrypting domains like this one. But the cost & complications associated with enabling SSL encryption kept me from doing so…
Today, I realised that Let's Encrypt, a new certificate authority that is completely free, automated and open, makes SSL encryption super easy!
Try it yourself (this is for Ubuntu 14.04 & Apache; for other system configurations check https://certbot.eff.org/):

sudo apt-get install git

sudo git clone https://github.com/letsencrypt/letsencrypt /opt/letsencrypt
cd /opt/letsencrypt
sudo ./letsencrypt-auto --apache -d DOMAIN1 -d DOMAIN2

# set up weekly cron auto-renewal on Mondays at 2:30
sudo crontab -e
# and paste `30 2 * * 1 /opt/letsencrypt/letsencrypt-auto renew >> /var/log/le-renew.log`
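
You can also test the renewal procedure without touching the live certificates (assuming your client version supports certbot's --dry-run flag):

sudo /opt/letsencrypt/letsencrypt-auto renew --dry-run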

If you wish to redirect all domain traffic through HTTPS, do the following:

# enable mod_rewrite engine in apache2
sudo a2enmod rewrite

# add to your apache conf file
    # redirect to HTTPS
    RewriteEngine on
    RewriteCond %{HTTPS} off
    RewriteCond %{HTTP_HOST} ^YOUR_DOMAIN\.COM
    RewriteRule ^(.*)$ https://YOUR_DOMAIN.COM/$1 [L,R=301]

# reload apache2 configuration
sudo service apache2 reload
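
You can quickly verify the redirect with curl (YOUR_DOMAIN.COM is a placeholder):

# should report 301 Moved Permanently with a Location: https://... header
curl -sI http://YOUR_DOMAIN.COM | head -n 5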

Voilà!

Inspired by digitalocean.
Thanks to @sheebang for underlining the importance of renewing the certificates!

On handy docker images

Motivated by successfully stripping the problematic dependencies from Redundans, I decided to generate a smaller Docker image, starting from the Alpine Linux image (2Mb / 5Mb after downloading) instead of Ubuntu (49Mb / 122Mb). Previously, I couldn't really rely on Alpine Linux, because it was impossible to get these problematic dependencies running… But now it's a whole new world of possibilities 😉

There are very few dependencies left, so I got started… (You can find all the commands below.)

  1. First, I checked what can be installed from the package manager.
    Only Python and Perl.

  2. Then I checked whether any of the binaries worked.
    For example, GapCloser is provided as a binary. It took me some time to find the source code…
    Anyway, none of the binaries worked out of the box. That was expected, as Alpine Linux is super stripped-down…

  3. I installed build-base in order to be able to build things.
    Additionally, BWA needs zlib-dev.

  4. Alpine Linux doesn't use the standard glibc library, but musl-libc (you can read more about the differences between the two), so some programs (e.g. BWA) may be quite reluctant to compile.
    After some hours of trying & thanks to the help of mp15, I found a solution, and not such a complicated one 🙂

  5. I realised that a Dockerfile doesn't like standard BASH brace expansion, which otherwise works in the Docker Alpine console (RUN commands are executed with /bin/sh, which in Alpine doesn't support brace expansion)…
    so ls *.{c,h} should become ls *.c *.h

  6. After that, LAST and GapCloser compilation was relatively easy 😉

Below, you can find the code from the Dockerfile (without the RUN commands).

apk add --update --no-cache python perl bash wget build-base zlib-dev
mkdir -p /root/src && cd /root/src && wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.15.tar.bz2 && tar xpfj bwa-0.7.15.tar.bz2 && ln -s bwa-0.7.15 bwa && cd bwa && \
cp kthread.c kthread.c.org && echo "#include <stdint.h>" > kthread.c && cat kthread.c.org >> kthread.c && \
sed -ibak 's/u_int32_t/uint32_t/g' `grep -l u_int32_t *.c *.h` && make && cp bwa /bin/ && \
cd /root/src && wget http://liquidtelecom.dl.sourceforge.net/project/soapdenovo2/GapCloser/src/r6/GapCloser-src-v1.12-r6.tgz && tar xpfz GapCloser-src-v1.12-r6.tgz && ln -s v1.12-r6/ GapCloser && cd GapCloser && make && cp bin/GapCloser /bin/ && \
cd /root/src && wget http://last.cbrc.jp/last-744.zip && unzip last-744.zip && ln -s last-744 last && cd last && make && make install && \
cd /root/src && rm -r last* bwa* GapCloser* v* 

# SSPACE & redundans in /root/src
cd /root/src && wget -q http://www.baseclear.com/base/download/41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && tar xpfz 41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && ln -s SSPACE-STANDARD-3.0_linux-x86_64 SSPACE && wget -O- -q http://cpansearch.perl.org/src/GBARR/perl5.005_03/lib/getopts.pl > SSPACE/dotlib/getopts.pl && \
wget --no-check-certificate -q -O redundans.tgz https://github.com/lpryszcz/redundans/archive/master.tar.gz && tar xpfz redundans.tgz && mv redundans-master redundans && ln -s /root/src/redundans /redundans && rm *gz

apk del wget build-base zlib-dev 
apk add libstdc++

After building & pushing, I noticed that the Alpine-based image is slightly smaller (99Mb) than the one based on Ubuntu (127Mb). Surprisingly, though, the Alpine-based image is larger (273Mb) than the Ubuntu-based one (244Mb) after downloading. So, I'm afraid all of these hours didn't really bring any substantial reduction in image size.
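
For the record, local image sizes can be compared with docker images (lpryszcz/redundans is an assumed repository name; substitute your own):

# report size of the local images for a given repository
docker images lpryszcz/redundans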

Conclusion?
I was very motivated to build my application on Alpine Linux and expected a substantial size reduction. But I'd say that relying on the Alpine Linux image doesn't always pay off in terms of smaller image size, not to mention the production time… And this I know from my own experience.
But maybe I did something wrong? I'd be really glad for some advice/comments!

Nevertheless, stripping a few dependencies from my application (namely Biopython, numpy & scipy) resulted in a much more compact image even when using the Ubuntu base image (127Mb vs 191Mb; and 244Mb vs 440Mb after downloading). So I think this is the way to go 🙂

MinION fast5 to fastq

Finally, I got my first data from MinION. The first obstacle I encountered was the native MinION format, Fast5, while most programs require FastQ. Luckily, there is poretools, which makes the Fast5 -> FastQ conversion very easy.

# install poretools
sudo pip install poretools

# convert fast5 to fastq
poretools fastq fast5/ > out.fastq
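
poretools ships several other handy subcommands; for example, stats prints a quick summary of the run (a minimal sketch, assuming a recent poretools version):

# report read counts & length statistics
poretools stats fast5/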

If you want to achieve better basecalling accuracy, have a look at DeepNano, an alternative basecaller for MinION reads. It achieves better basecalling accuracy than the native MinION basecaller for both 1D (~7%) and 2D (~2%) reads.

Connecting to MySQL without passwd prompt

If you are (like me) annoyed by providing a password at every mysql login, you can skip it. This also makes programmatic access to any MySQL db easier, as no password prompting is necessary 🙂
Create a `~/.my.cnf` file:

[client]
user=username
password="pass"
 
[mysql]
user=username
password="pass"
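
Since this file stores your password in plain text, it's wise to make it readable only by you:

chmod 600 ~/.my.cnf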

And log in without the `-p` parameter:

mysql -h host -u username dbname

If you want to use the `~/.my.cnf` file in MySQLdb, just connect like this:

import MySQLdb
cnx = MySQLdb.connect(host=host, port=port, read_default_file="~/.my.cnf")

Get gene names for set of RefSeq IDs

Today I needed to annotate a set of RefSeq IDs in a .bed file with gene names.
Firstly, I was looking for a way to get gene names for RefSeq IDs. I found a simple solution on BioStars.

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D danRer10 \
  -e "select name,name2 from refGene" > refSeq2gene.txt

Secondly, I've written a simple Python script that adds the gene name to the .bed file in place of the score, which I don't need at all.

#!/usr/bin/env python
# Add gene name instead of score to BED file

# USAGE: cat bed | bed2gene.py refSeq2gene.txt > with_gene_names.bed

import sys

# load RefSeq -> gene name mapping
fn = sys.argv[1]
refSeq2gene = {}
for l in open(fn):
    refSeq, gene = l[:-1].split('\t')
    refSeq2gene[refSeq] = gene

sys.stderr.write(" %s accessions loaded!\n"%len(refSeq2gene))

# replace the score column (5th) with the gene name
for l in sys.stdin:
    ldata = l[:-1].split('\t')
    chrom, s, e, refSeq = ldata[:4]
    # make sure the score column exists
    if len(ldata) < 5:
        ldata.append("")
    if refSeq in refSeq2gene:
        ldata[4] = refSeq2gene[refSeq]
    else:
        ldata[4] = "-"
    sys.stdout.write("\t".join(ldata)+"\n")

Hope some will find it useful.

PhylomeDB: database for collections of gene phylogenies

PhylomeDB is a public database for complete collections of gene phylogenies (phylomes).
Users can interactively explore the evolutionary history of genes through the visualisation of phylogenetic trees and multiple sequence alignments. Moreover, PhylomeDB provides genome-wide orthology and paralogy predictions that are based on the analysis of the phylogenetic trees.

The automated pipeline used to reconstruct the trees aims at providing a high-quality phylogenetic analysis of different genomes, including Maximum Likelihood inference, alignment trimming and evolutionary model testing. PhylomeDB also includes a public download section with the complete set of trees, alignments and orthology predictions.

PhylomeDB uses metaPhOrs (meta-Phylogeny based Orthologs) as the method for predicting orthologs and paralogs. metaPhOrs combines the resources of several databases (PhylomeDB, EnsemblCompara, EggNOG, OrthoMCL, COG, Fungal Orthogroups, and TreeFam) to test the robustness of each prediction.
All PhylomeDB resources are also accessible through ETE2: a Python Environment for phylogenetic Tree Exploration.

Cross-posted from BioStars.

Generating word clouds in Python

Often I need to visualise the functions associated with a set of genes. word_cloud is a very handy word cloud generator written in Python.
Install it:

sudo pip install git+git://github.com/amueller/word_cloud.git

Incorporate it into your Python code:

import matplotlib.pyplot as plt
from wordcloud import WordCloud
 
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

wordcloud = WordCloud().generate(text)
img = plt.imshow(wordcloud)
plt.axis("off")
plt.show()

# or save as png (AxesImage.write_png works on older matplotlib; on newer versions use plt.savefig)
img.write_png("wordcloud.png")

It's implemented in metaPhOrs.