Raspberry Pi 2 with Ubuntu Server and Drupal?

I decided to celebrate the 25th birthday of Linux by putting the latest Ubuntu 16.04 on my Raspberry Pi 2 and setting up a webserver.
This is how I did it:

  1. First, get the Ubuntu armhf image and prepare the memory card:

    # get image
    wget http://cdimage.ubuntu.com/ubuntu/releases/16.04/release/ubuntu-16.04-preinstalled-server-armhf+raspi2.img.xz
    
    # make sure your SD card really is /dev/sdb, e.g. with df -h
    xzcat ubuntu-16.04-preinstalled-server-armhf+raspi2.img.xz | sudo dd of=/dev/sdb
    
  2. Configure a new user & set up the Drupal 8 webserver (a sketch of the Drupal part follows below):

    # create new user & change hostname
    sudo adduser USERNAME && sudo usermod -a -G sudo USERNAME
    # edit /etc/hostname and add `127.0.1.1 newHostname` to /etc/hosts
    sudo reboot
    
    # generate locales
    sudo locale-gen en_US.UTF-8
    sudo dpkg-reconfigure locales
    
    # install software
    sudo apt install htop apache2 mysql-server libapache2-mod-php php-mysql php-sqlite3 php-curl php-xml php-gd git sqlite3 emacs-nox

My first impressions?
sudo apt is veeery slow. At first I thought it was due to the old SD card I had been using, but it's also true for a newer SD card.
Some packages are missing (e.g. git-lfs), but you can get them using some workarounds.

But everything just works!
You can check out a mirror of https://ngschool.eu/ running on the RPi2.
Maybe it's no speed demon, but it's stable and uses almost no energy 🙂

Cheers!

Inspired by Ubuntu’s Insights.

On handy docker images

Motivated by the successful stripping of problematic dependencies from Redundans, I decided to generate a smaller Docker image, starting from the Alpine Linux image (2Mb / 5Mb after downloading) instead of Ubuntu (49Mb / 122Mb). Previously, I couldn't really rely on Alpine Linux, because it was impossible to get these problematic dependencies running… But now it's a whole new world of possibilities 😉

There were only a few dependencies left, so I got started… (you can find all the commands below).

  1. First, I checked what can be installed from the package manager.
    Only Python and Perl.

  2. Then I checked whether any of the binaries work.
    For example, GapCloser is provided as a binary. It took me some time to find the source code…
    Anyway, none of the binaries worked out of the box. That was expected, as Alpine Linux is super stripped-down…

  3. I installed build-base in order to be able to build things.
    Additionally, BWA needs zlib-dev.

  4. Alpine Linux doesn't use the standard glibc library, but musl libc (you can read more about the differences between the two), so some programmes (e.g. BWA) may be quite reluctant to compile.
    After some hours of trying & thanks to the help of mp15, I found a solution, and not such a complicated one 🙂

  5. I realised that a Dockerfile doesn't like standard Bash brace expansion, which otherwise works in the Docker Alpine console. This is because RUN commands are executed with /bin/sh (BusyBox ash on Alpine) by default…
    so ls *.{c,h} should be ls *.c *.h (see the sketch right below this list).

  6. After that, LAST and GapCloser compilation were relatively easy 😉
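
A quick way to see the difference, assuming your /bin/sh is BusyBox ash (as on Alpine) or dash rather than bash:

# ash/dash: braces are not expanded, the pattern is printed literally
sh -c 'echo test.{c,h}'     # test.{c,h}
# bash expands braces before globbing
bash -c 'echo test.{c,h}'   # test.c test.h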

Below, you can find the code from the Dockerfile (without the RUN statements).

# install runtime & build dependencies
apk add --update --no-cache python perl bash wget build-base zlib-dev
# BWA: patch for musl libc (add stdint.h & use the standard uint32_t) before compiling; then build GapCloser & LAST
mkdir -p /root/src && cd /root/src && wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.15.tar.bz2 && tar xpfj bwa-0.7.15.tar.bz2 && ln -s bwa-0.7.15 bwa && cd bwa && \
cp kthread.c kthread.c.org && echo "#include <stdint.h>" > kthread.c && cat kthread.c.org >> kthread.c && \
sed -ibak 's/u_int32_t/uint32_t/g' `grep -l u_int32_t *.c *.h` && make && cp bwa /bin/ && \
cd /root/src && wget http://liquidtelecom.dl.sourceforge.net/project/soapdenovo2/GapCloser/src/r6/GapCloser-src-v1.12-r6.tgz && tar xpfz GapCloser-src-v1.12-r6.tgz && ln -s v1.12-r6/ GapCloser && cd GapCloser && make && cp bin/GapCloser /bin/ && \
cd /root/src && wget http://last.cbrc.jp/last-744.zip && unzip last-744.zip && ln -s last-744 last && cd last && make && make install && \
cd /root/src && rm -r last* bwa* GapCloser* v* 

# SSPACE && redundans in /root/src
cd /root/src && wget -q http://www.baseclear.com/base/download/41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && tar xpfz 41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && ln -s SSPACE-STANDARD-3.0_linux-x86_64 SSPACE && wget -O- -q http://cpansearch.perl.org/src/GBARR/perl5.005_03/lib/getopts.pl > SSPACE/dotlib/getopts.pl && \
wget --no-check-certificate -q -O redundans.tgz https://github.com/lpryszcz/redundans/archive/master.tar.gz && tar xpfz redundans.tgz && mv redundans-master redundans && ln -s /root/src/redundans /redundans && rm *gz

# drop the build-time packages; keep libstdc++, needed at runtime by the C++ binaries
apk del wget build-base zlib-dev 
apk add libstdc++

After building & pushing, I noticed that the Alpine-based image is slightly smaller (99Mb) than the one based on Ubuntu (127Mb). Surprisingly, the Alpine-based image is larger (273Mb) than the Ubuntu-based one (244Mb) after downloading. So I'm afraid all of these hours didn't really bring any substantial reduction in image size.

Conclusion?
I was very motivated to build my application on Alpine Linux and expected a substantial size reduction. But I'd say that relying on the Alpine Linux image doesn't always pay off in terms of smaller image size, not to mention the production time… And this I know from my own experience.
But maybe I did something wrong? I'd be really glad for some advice/comments!

Nevertheless, stripping a few dependencies from my application (namely Biopython, numpy & scipy) resulted in a much more compact image even with the Ubuntu base (127Mb vs 191Mb; and 244Mb vs 440Mb after downloading). So I think this is the way to go 🙂

On simplifying dependencies

Lately, to make Redundans more user-friendly, I have simplified its dependencies by replacing Biopython, numpy, scipy and SQLite with some (relatively) simple functions or modules.

Here, I will just focus on replacing Biopython, particularly SeqIO.index_db, with FastaIndex. You may ask yourself why I invested time in reinventing the wheel. I'm a big fan of Biopython, yet it's a huge project and some solutions are not optimal or require problematic dependencies. This is the case with SeqIO.index_db, which relies on SQLite3. Here again, I'm a big fan of SQLite, yet building Biopython with SQLite enabled proved not to be very straightforward on non-standard systems or for less experienced users. Besides, on some NFS setups, the SQLite3 db cannot be created at all.

Ok, let's start from the basics. SeqIO.index_db allows random access to sequence files, so for example you can rapidly retrieve any entry from a very large file. This is achieved by storing the ID and position of each entry of a particular file in a database, an SQLite3 db. Then, if you want to retrieve a particular record, SeqIO.index_db checks whether this record is present in the SQLite3 db, retrieves the record's position in the file and reads only a small chunk of that file, instead of parsing the entire file every time you want to get some record(s).
A similar feature is offered by samtools faidx, but in this case the coordinates of each entry are stored in a tab-delimited file, .fai (more info about .fai). This format can be easily read & written by any programme, so I decided to use it. In addition, I realised that samtools faidx is flexible enough that you can add extra columns to the .fai without breaking its functionality, but more about that later…
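
For reference, a plain .fai holds five tab-delimited columns: sequence NAME, LENGTH, the byte OFFSET of the first base, LINEBASES (bases per line) and LINEWIDTH (bytes per line, newline included). A sketch with a hypothetical contigs.fa:

samtools faidx contigs.fa
cat contigs.fa.fai
# NAME   LENGTH  OFFSET  LINEBASES  LINEWIDTH
# chr1   100     6       60         61
# chr2   80      114     60         61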

In Redundans, I've been using SeqIO.index_db during assembly reduction (fasta2homozygous.py). Additionally, beside storing the index, I've also been generating statistics for every FastA file, like the number of contigs, cumulative size, N50, N90, GC and so on. I realised that these two can easily be combined by extending the .fai with four additional columns, storing the number of occurrences of A, C, G & T in every sequence. Such a .fai is still compatible with samtools faidx and provides a very easy way of calculating a bunch of statistics about the file.
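
For instance, assuming the four extra columns (6-9) hold the A, C, G & T counts (my assumption about the column order), the overall GC content can be computed with a one-liner like:

# GC% from the A/C/G/T counts stored in hypothetical columns 6-9 of the extended .fai
awk -F'\t' '{total+=$6+$7+$8+$9; gc+=$7+$8} END {printf "GC: %.2f%%\n", 100*gc/total}' contigs.fa.fai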
All of this I've implemented in FastaIndex. Beside being a dependency-free & very handy indexer, it can also be used as an alternative to samtools faidx for retrieving sequences from large FastA files.

# retrieve bases from 20 to 60 from NODE_2
./FastaIndex.py -i test/run1/contigs.fa -r NODE_2_length_7674_cov_46.7841_ID_3:20-60
>NODE_2_length_7674_cov_46.7841_ID_3
CATAGAACGACTGGTATAAGCCAAACATGACCCATTGTTGC
#Time elapsed: 0:00:00.014243

samtools faidx test/run1/contigs.fa NODE_2_length_7674_cov_46.7841_ID_3:20-60
>NODE_2_length_7674_cov_46.7841_ID_3:20-60
CATAGAACGACTGGTATAAGCCAAACATGACCCATTGTTGC

Using docker for application development

I find Docker super useful, but going through the manual is quite time-consuming. So here is a very stripped-down manual to create your first image and push it online 🙂

# install docker
wget -qO- https://get.docker.com/ | sh
 
# add your user to docker group
sudo usermod -aG docker $USER
 
# check if it's working
docker run docker/whalesay cowsay "hello world!"
 
# create an account on https://hub.docker.com
# and login
docker login -u $USER --email=EMAIL
 
# run image
docker run -it ubuntu
 
# make some changes e.g. create a user, install needed software, etc.
 
# finally open new terminal & commit changes (SESSIONID=HOSTNAME)
docker commit SESSIONID $USER/image:version
 
# mount local directory `pwd`/test as /test in read/write mode
docker run -it -v `pwd`/test:/test:rw $USER/image:version some command with arguments
 
# push image
docker push $USER/image:version

From now on, you can get your image on any other machine connected to the Internet by executing:

docker run -it $USER/image:version
# ie. redundans image
docker run -it -w /root/src/redundans lpryszcz/redundans:v0.11b ./redundans.py -v -i test/{600,5000}_{1,2}.fq.gz -f test/contigs.fa -o test/run1
 
# you can create the alias `latest`; then the version can be skipped when running
docker tag lpryszcz/redundans:v0.11b lpryszcz/redundans:latest
docker push lpryszcz/redundans:latest
 
docker run -it lpryszcz/redundans

You can add info about your repository at https://hub.docker.com/r/$USER/image/

pysam installation error on Ubuntu 14.04

If you encounter the following error during pysam installation:

sudo easy_install -U pysam

... 

pysam/csamtools.c:8:22: fatal error: pyconfig.h: No such file or directory
 #include "pyconfig.h"
                      ^
compilation terminated.
error: Setup script exited with error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

try installing python-dev first:

sudo apt-get install python-dev

Solution found on GitHub.

Detection of RNA editing

I've been asked to help analyse data from an RNA editing experiment. Typically, such experiments consist of DNA and RNA sequencing. The purpose of these is to identify differences between the DNA template and the resulting RNA product.
The canonical RNA editing enzyme, ADAR, deaminates adenosine to inosine (A-to-I editing). Inosine is subsequently read as guanine by the sequencer. Thus, the identification of RNA editing can be simplified down to genotyping the DNA and RNA samples and finding the differences between the two.
Importantly, RNA editing can be promiscuous, meaning multiple As are changed into Is in a given region. This makes the alignment challenging. In addition, if you genotype your libraries with standard tools/pipelines (e.g. GATK), you may lose most of the edited sites to quality filtering (many SNPs clustered together). Finally, RNAseq is more error-prone than DNAseq, due to the low fidelity of reverse transcriptase, so you may want to account for that as well.
Ideally, you want stranded RNAseq; then only A>G editing will be observed. But in reality you may also analyse unstranded RNAseq libraries, where you expect both A>G and T>C edited sites, from transcripts of genes on the Watson (+) and Crick (-) strands, respectively.

DNA/RNAseq alignment

For the alignment part, I've been using BWA MEM (DNAseq) and STAR (RNAseq). If you use STAR, you should reassign mapping qualities (mapQ), as STAR reports unique mappers with mapQ 255, which means "unavailable" in the SAM specification. It's also recommended to split reads at N CIGAR operations. These two can be accomplished with GATK:

java -jar ~/src/GATK/GenomeAnalysisTK.jar -T SplitNCigarReads -R REF.fa -I SAMPLE.bam -o SAMPLE.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS > SAMPLE.split.log 2>&1 &

Finally, you may want to realign both DNAseq and RNAseq reads around indels using GATK IndelRealigner, as sketched below.
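
A minimal sketch of the realignment step (GATK 3 syntax; file names are examples):

# define intervals around indels, then realign reads over them
java -jar ~/src/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R REF.fa -I SAMPLE.split.bam -o SAMPLE.intervals
java -jar ~/src/GATK/GenomeAnalysisTK.jar -T IndelRealigner -R REF.fa -I SAMPLE.split.bam -targetIntervals SAMPLE.intervals -o SAMPLE.realigned.bam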

RNA editing identification from BAM files

I've written a program, bam2RNAediting.py (currently REDiscover), to detect RNA editing in a set of DNAseq and RNAseq experiments. It's in an early phase of development, but it's already functional. It can be run simply by:

bam2RNAediting.py -d bwamem/DNAseq*.bam -r star/RNAseq*.bam > bam2RNAediting.txt

Quality control

A simple quality control for an RNA editing experiment is to look at the enrichment of A>G (and T>C) edited sites over all other changes. If RNA editing is present in your sample, you would expect quite a strong enrichment of A>G (and T>C) changes between DNAseq and RNAseq. This is indeed what I've found in my samples; one quick way to tally the changes is sketched below.
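
For example, assuming the output lists one site per line with the DNA and RNA base in two columns (hypothetical columns 3 & 4 here; adjust to the actual format), the substitution types can be counted with:

# tally substitution types; columns 3 & 4 (DNA base, RNA base) are hypothetical
awk -F'\t' '{print $3">"$4}' bam2RNAediting.txt | sort | uniq -c | sort -rn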

Let me know how it works for you. I’m looking for your feedback!

What fraction of the genome is expressed?

Recently, I was interested in finding out what fraction of the genome is expressed. This can be computed quite easily from RNA-Seq alignments using bedtools and some simple scripting. Note, it's crucial to use the -split parameter in bedtools genomecov to report correct coverage from split alignments. Otherwise, introns that are spliced out would be counted as expressed regions.

# get genomic regions with at least n reads aligned
n=3
ref=genome
for f in *.bam; do
  echo `date` $f;
  # get genomic intervals covered by n reads
  bedtools genomecov -split -ibam $f -g $ref.fa.fai -dz | awk -F'\t' 'BEGIN {OFS = FS} {if ($3>='$n') {print $1,$2,$2+1}}' | bedtools merge > $f.${n}reads.bed;
  # report bp covered by n reads
  awk '{sum+=$3-$2} END {print sum}' $f.${n}reads.bed
done; date
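
The loop above reports the number of bp covered; to get the actual fraction, divide by the genome size, which can be taken from the FastA index, e.g.:

# genome size = sum of the sequence lengths in the FastA index
gsize=$(awk '{sum+=$2} END {print sum}' $ref.fa.fai)
# fraction of the genome covered by at least n reads (for a given BAM's BED file)
awk -v gsize=$gsize '{sum+=$3-$2} END {printf "%.2f%%\n", 100*sum/gsize}' $f.${n}reads.bed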

Visualisation of repeats alongside NGS data

In one of my previous projects we were interested in looking for associations between deletions and repeats. At that time, I wrote a small program, repeats2sam.py, identifying perfect direct and inverted repeats in chromosomes. repeats2sam.py uses repeat-match from the great MUMmer package and reports SAM-formatted repeats, which can then be easily converted to BAM. It takes several minutes for small genomes (20Mb) and several hours for large ones (~8 hours for a 1.5Gb genome).

# identify direct & inverted repeats
ref=human
repeats2sam.py -v --inverted -i $ref.fa | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats
samtools index $ref.repeats.bam

Repeats are stored as paired-end reads, so they can be easily visualised alongside any NGS data in IGV:

  • direct & inverted repeats
  • inverted repeats + RNAseq
  • inverted repeats + RNAseq zoomed in

If you are interested in direct DNA repeats, either skip the --inverted parameter, or select only direct repeats from your BAM file. In addition, you may want to select only repeats fulfilling certain length and distance criteria:

# select only inverted repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.inverted.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.inverted.${i}bp && samtools index $ref.repeats.inverted.${i}bp.bam
# select only direct repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.direct.bam | awk '$8<5000 && length($10)>='$i | samtools view -SbuT $ref.fa - | samtools sort - $ref.repeats.direct.${i}bp && samtools index $ref.repeats.direct.${i}bp.bam

All of the above-mentioned programs can be found on GitHub.

Wicked-fast transcript quantification with Salmon

Traditionally, in order to quantify transcript abundance from RNA-Seq, one first has to align the reads onto the reference and then analyse these alignments. While widely accepted, this approach has several disadvantages:

  • the alignment step is slow, especially in splice-aware mode
  • spliced alignments are error-prone
  • huge intermediate files are produced (.bam)
  • transcript quantification from these intermediate files is also slow

At RECOMB 2015, Rob Patro presented two algorithms enabling rapid transcript quantification from RNA-Seq. Sailfish is an alignment-free method, while its successor, Salmon, performs light-weight alignment, identifying just the super-maximal exact matches (SMEMs).

On my data, Salmon is ~3 times faster (6-7 min) and uses ~20 times less memory (1.1GB) than STAR (~20 min / ~24GB). Note, the STAR results (.bam) need to be further analysed (e.g. with cufflinks) in order to quantify transcript abundances. This step takes another 10 min, thus we get transcript abundances after 6-7 min from Salmon vs 30 min from STAR + cufflinks. Just for comparison, a similar analysis done with tophat2 takes over 6 hours!

  1. Install dependencies & salmon:

    # install dependencies
    ## here it's important to install boost1.55 & remove any older version first
    sudo apt-get remove libboost-all-dev
    sudo apt-get install libbz2-dev libtbb-dev libboost1.55-all-dev
     
    # clone the salmon repo
    git clone git@github.com:COMBINE-lab/salmon.git
     
    # and build (note the trailing dot: configure in the current dir)
    cd salmon
    cmake -DBOOST_ROOT=/usr/include/boost -DTBB_INSTALL_DIR=/usr/include/tbb .
    make && make install && make test
     
    # add to .bashrc
    echo "# salmon" >> ~/.bashrc
    echo "export PATH=$PATH:"`pwd`"/bin" >> ~/.bashrc
     
    # open a new BASH window (Ctrl + Shift + T) or reload the environment variables
    source ~/.bashrc
    
  2. Index the transcriptome:

    salmon index -t transcripts.fa -i transcripts.index

  3. Quantify transcript abundances. You have to specify the library type correctly (-l):

    salmon quant -p 4 -l SF -i ref/transcripts.index -r <(zcat sample1.fq.gz) -o sample1

Note, here the reads are decompressed using process substitution, which is a very handy way of providing preprocessed data as input through Unix pipes.
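
The same trick works for combining or preprocessing inputs on the fly, for example (file names are hypothetical):

# feed several compressed runs of the same sample as a single stream
salmon quant -p 4 -l SF -i ref/transcripts.index -r <(zcat sample1_run*.fq.gz) -o sample1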

Detecting isoform switching from RNA-Seq

Alternative splicing produces an array of transcripts from an individual gene. Some of these alternative transcripts are expressed in a tissue-specific fashion and may play different roles in the cell.

Recently, I got interested in identifying isoform switching from RNA-Seq data, that is, detecting changes in the most expressed isoform between conditions. As I couldn't find any ready solution, I have written my own script for this task (fpkm2major.py). It simply parses the isoforms.fpkm_tracking output from cufflinks and reports genes with evidence of isoform switching into a tab-delimited output file.

For each gene, the number of transcripts, the cumulative expression and the major isoform in each condition are reported. Genes lowly expressed in a given condition are coded with `0` (FPKM below --minFPKM, 1), while genes without a clear major isoform are marked with `-1` (--frac 0.25).

fpkm2major.py -v -i sample1/isoforms.fpkm_tracking -o major_isoforms.txt
 
# combine data from multiple samples
paste sample1/isoforms.fpkm_tracking <(cut -f10- sample2/isoforms.fpkm_tracking) \
 | fpkm2major.py -v -o major_isoforms2.txt
 
# visualise using cummeRbund
R
library(cummeRbund)
cuff <- readCufflinks("sample1")
 
# expression plot for isoforms of given gene
geneid="ENSDARG00000013615"
gene<-getGene(cuff, geneid)
expressionPlot(isoforms(gene), showErrorbars=T)
dev.copy(svg, paste0(geneid, ".sample1.svg"), width=12, height=12); dev.off()