Malformed column reporting and joining in BASH with paste or awk

I’ve spent some hours trying to figure out why the heck my scripts using awk and paste were returning malformed output. Simply, lines were wrongly pasted together, some columns were missing and some were malformed… and in the case of awk, trying to print columns out of order (i.e. column #3 before column #2, awk '{print $3,$2}') was producing malformed output.
After some time, I realised it was due to the Windows-style line ending \r\n, instead of the standard Linux \n (of course, I got this file from a third party using Windows…).

Below, you can find more details.

# first, let's create dummy files containing 3 lines and 5 columns, each line ending with \r\n
python -c "with open('wrong.tsv','w') as out: out.write(''.join('line%s\t%s\r\n'%(i, '\t'.join('column%s'%j for j in range(1,5))) for i in range(1,4)))"
# and ending just with \n
python -c "with open('correct.tsv','w') as out: out.write(''.join('line%s\t%s\n'%(i, '\t'.join('column%s'%j for j in range(1,5))) for i in range(1,4)))"

# now let's paste wrong and correct files
paste wrong.tsv wrong.tsv
line1	line1n1	column1	column2	column3	column4
line2	line2n1	column1	column2	column3	column4
line3	line3n1	column1	column2	column3	column4

paste correct.tsv correct.tsv
line1	column1	column2	column3	column4	line1	column1	column2	column3	column4
line2	column1	column2	column3	column4	line2	column1	column2	column3	column4
line3	column1	column2	column3	column4	line3	column1	column2	column3	column4

# can you see the difference?

Simply, \r is interpreted as a return to the beginning of the line: when such a line is printed to a terminal, whatever follows the \r overwrites its beginning, and tools like paste and awk treat the \r as part of the last column. That's why pasting lines containing this character produces mangled output.
In order to convert files containing \r\n into Unix-style \n, simply execute one of:

# replaces file and creates backup: inputfile.bak
sed -i.bak 's/\r$//' inputfile

# creates outputfile with correct formatting
tr -d '\r' < inputfile > outputfile
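If you're not sure which line endings a file uses, you can check first; `cat -A` marks CRLF line ends as `^M$`, and file(1) reports them explicitly:

# show non-printing characters; CRLF lines end with ^M$
cat -A wrong.tsv | head -1

# or ask file(1)
file wrong.tsv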

You can read more on newline characters at Wikipedia.

Running Jupyter as a public service

Some time ago, I wrote about setting up IPython as a public service. Today, I’ll write about setting up Jupyter, the descendant of IPython that, besides Python, supports tons of other languages and frameworks.

The Jupyter notebook will be running as a separate user, so your personal files are safe, but not as a system service, so you will need to restart it upon system reboot. I recommend running it inside a screen session (see the snippet below), so you can easily log into the server and check Jupyter’s state.

  1. Install & set up Jupyter

    # install dependencies & Jupyter itself
    sudo apt-get install build-essential python-dev
    sudo pip install jupyter
    
    # create new user
    sudo adduser jupyter
     
    # login as new user
    su jupyter
    
    # make sure to add `unset XDG_RUNTIME_DIR` to ~/.bashrc
    # otherwise you'll encounter: OSError: [Errno 13] Permission denied: '/run/user/1003/jupyter'
    echo 'unset XDG_RUNTIME_DIR' >> ~/.bashrc
    source ~/.bashrc
    
    # generate ssl certificates
    mkdir ~/.ssl
    openssl req -x509 -nodes -days 999 -newkey rsa:1024 -keyout ~/.ssl/mykey.key -out ~/.ssl/mycert.pem
    
    # generate config
    jupyter notebook --generate-config
    
    # generate pass and checksum
    ipython -c "from IPython.lib import passwd; passwd()"
    # enter your password twice, save it and copy password hash
    ## Out[1]: 'sha1:[your hashed password here]'
     
    # add to ~/.jupyter/jupyter_notebook_config.py
    c.NotebookApp.ip = '*'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = 8881
    c.NotebookApp.password = u'sha1:[your hashed password here]'
    c.NotebookApp.certfile = u'/home/jupyter/.ssl/mycert.pem'
    c.NotebookApp.keyfile = u'/home/jupyter/.ssl/mykey.key'
    
    # create some directory for notebook files, e.g. ~/Public/jupyter
    mkdir -p ~/Public/jupyter && cd ~/Public/jupyter
     
    # start notebook server
    jupyter notebook
    
  2. Add kernels

    You can add multiple kernels to Jupyter. Here I’ll cover installation of some:

    • Python

      sudo pip install ipykernel
      
      # if you wish to use matplotlib, make sure to add to 
      # ~/.ipython/profile_default/ipython_kernel_config.py
      c.InteractiveShellApp.matplotlib = 'inline'
      
    • BASH kernel

      sudo pip install bash_kernel
      sudo python -m bash_kernel.install
      
    • Perl

      This didn’t work for me :/

      sudo cpan Devel::IPerl

    • IRkernel

      Follow this tutorial.

    • Haskell

      sudo apt-get install cabal-install
      git clone http://www.github.com/gibiansky/IHaskell
      cd IHaskell
      ./ubuntu-install.sh
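      
      # verify that Jupyter picked up the installed kernels
      jupyter kernelspec list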
      

Then, just navigate to https://YOURDOMAIN.COM:8881/, accept the self-signed certificate and enjoy!
Alternatively, you can obtain a certificate from Let’s Encrypt.
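One more tip: as mentioned above, I keep the notebook server running inside a screen session (the session name `jupyter` below is arbitrary):

# start a detached screen session running the notebook server
screen -dmS jupyter jupyter notebook

# re-attach later to check its state (press Ctrl-a d to detach again)
screen -r jupyter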

Using existing domain encryption aka Apache proxy
If your domain is already served over HTTPS, you may consider running Jupyter on localhost and redirecting all incoming (already encrypted) traffic to a particular port on localhost (as suggested by @shebang).

# enable Apache mods
sudo a2enmod proxy proxy_http proxy_wstunnel && sudo service apache2 restart

# add to your Apache config
    <Location "/jupyter" >
        ProxyPass http://localhost:8881/jupyter
        ProxyPassReverse http://localhost:8881/jupyter
    </Location>
    <Location "/jupyter/api/kernels/" >
        ProxyPass        ws://localhost:8881/jupyter/api/kernels/
        ProxyPassReverse ws://localhost:8881/jupyter/api/kernels/
    </Location>
    <Location "/jupyter/api/kernels/">
        ProxyPass        ws://localhost:8881/jupyter/api/kernels/
        ProxyPassReverse ws://localhost:8881/jupyter/api/kernels/
    </Location>

# update your Jupyter config (~/.jupyter/jupyter_notebook_config.py)
c.NotebookApp.ip = 'localhost'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8881
c.NotebookApp.base_url = '/jupyter'
c.NotebookApp.password = u'sha1:[your hashed password here]'
c.NotebookApp.allow_origin = '*'

Note, it’s crucial to add the Apache proxy for the kernels (/jupyter/api/kernels/); otherwise you won’t be able to use terminals, which fail with: Error during WebSocket handshake: Unexpected response code: 400.
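To quickly check that the proxy responds (YOURDOMAIN.COM being a placeholder; Jupyter should answer with its login page or a redirect to it):

curl -skI https://YOURDOMAIN.COM/jupyter/ | head -1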

Working with large binary files in git

Git is great, there is no doubt about that. Being able to revert any changes and recover lost data is simply priceless. But recently, I have become concerned about the size of some of my repositories. Some, especially those containing changing binary files, were really large!!!
You can check the size of your repository with a simple command:

git count-objects -vH

Here, git Large File Storage (LFS) comes into action. Below, I’ll describe how to install it and mark large binary files, so that only lightweight text pointers are committed to the repository, while the binary contents are stored on a dedicated LFS server.

  1. Installation of git-lfs

    # add packagecloud repo
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
    
    # install git-lfs
    sudo apt-get install git-lfs 
    
    # and enable it
    git lfs install
    
  2. Marking and committing a binary file

    # mark large binary file
    git lfs track some.file
    
    # add, commit & push changes
    git add some.file
    git commit -m "some.file as LSF"
    git push origin master
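    
    # verify which files are handled by LFS
    git lfs ls-files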
    

Using Docker for application development

I find Docker super useful, but going through its manual is quite time-consuming. Here is a very stripped-down guide to creating your first image and pushing it online 🙂

# install docker
wget -qO- https://get.docker.com/ | sh
 
# add your user to docker group
sudo usermod -aG docker $USER
 
# check if it's working
docker run docker/whalesay cowsay "hello world!"
 
# create an account on https://hub.docker.com
# and login
docker login -u $USER --email=EMAIL
 
# run image
docker run -it ubuntu
 
# make some changes e.g. create a user, install needed software etc.
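 
# in a new terminal, you can find the SESSIONID by listing running containers
docker ps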
 
# finally, in the new terminal, commit the changes (SESSIONID is the container ID, which equals the container's HOSTNAME)
docker commit SESSIONID $USER/image:version
 
# mount local directory `pwd`/test as /test in read/write mode
docker run -it -v `pwd`/test:/test:rw $USER/image:version some command with arguments
 
# push image
docker push $USER/image:version

From now on, you can get your image on any other machine connected to the Internet by executing:

docker run -it $USER/image:version
# e.g. the redundans image
docker run -it -w /root/src/redundans lpryszcz/redundans:v0.11b ./redundans.py -v -i test/{600,5000}_{1,2}.fq.gz -f test/contigs.fa -o test/run1
 
# you can create the alias `latest`, then the version can be skipped when running
docker tag lpryszcz/redundans:v0.11b lpryszcz/redundans:latest
docker push lpryszcz/redundans:latest
 
docker run -it lpryszcz/redundans

You can add info about your repository at https://hub.docker.com/r/$USER/image/

Working efficiently with millions of files

Working with millions of intermediate files can be very challenging, especially if you need to store them on a distributed / network file system (NFS). Listing and navigating such directories takes ages… and removing the files is very time-consuming.
While building the metaPhOrs DB, I needed to store ~7.5 million intermediate files that were subsequently processed on an HPC cluster. Storing this amount of files on the NFS would seriously affect not only me, but also overall system performance.
One could store the files in an archive, but then retrieving the data would mean parsing rather huge archives (tens to hundreds of GB) in order to pull out rather small portions of data.
I realised that TAR archives are natively supported in Python and can be indexed (see `tar_indexer`), which provides easy integration into existing code and random access. If you work with text data, you can even zlib.compress the data stored inside your archives!
Below, I’m providing the relevant parts of my code:
BASH

# index content of multiple tar archives
tar2index.py -v -i db_*/*.tar -d archives.db3
 
# search for some_file in multiple archives
tar2index.py -v -f some_file -d archives.db3

Python

import sqlite3, time
import tarfile, zlib, cStringIO
 
###
# lookup function
def tar_lookup(dbpath, file_name):
    """Return file name inside tar, tar file name, offset and file size."""
    cur = sqlite3.connect(dbpath).cursor()
    cur.execute("""SELECT o.file_name, f.file_name, offset, file_size
                FROM offset_data as o JOIN file_data as f ON o.file_id=f.file_id
                WHERE o.file_name like ?""", (file_name,))
    return cur.fetchall()
 
###
# saving to archive
    # open tarfile
    tar = tarfile.open(tarpath, "w")
    # save files to tar
    for fname, txt in files_generator:
        # compress file content (optionally)
        gztxt = zlib.compress(txt)
        # get tarinfo
        ti = tarfile.TarInfo(fname)
        ti.size  = len(gztxt)
        ti.mtime = time.time()
        # add to tar
        tar.addfile(ti, cStringIO.StringIO(gztxt))
    # close the archive to flush all data & headers
    tar.close()
 
###
# reading from indexed archive(s)
# NOTE: you need to run tar2index.py on your archives first
    tarfnames = tar_lookup(index_path, file_name)
    for i, (name, tarfn, offset, file_size) in enumerate(tarfnames, 1):
        tarf = open(tarfn, "rb")
        # move pointer to right archive place
        tarf.seek(offset)
        # read tar fragment & uncompress
        txt = zlib.decompress(tarf.read(file_size))
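The same offset & size can be used from the shell; a rough sketch, where OFFSET and FILE_SIZE are placeholder values looked up with tar2index.py and the member was zlib-compressed as above:

# extract a single member directly, without parsing the whole archive
dd if=some.tar bs=1 skip=OFFSET count=FILE_SIZE 2> /dev/null | python -c "import sys, zlib; sys.stdout.write(zlib.decompress(sys.stdin.read()))"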

Conflicting config for htop on machines sharing the same /home directory

My friend spotted a problem with the htop configuration: when htop was executed on two different Ubuntu releases (10.04 and 14.04) sharing the same /home, the config kept being reset.
After some interrogation, we spotted that 10.04 stores the htop config in ~/.htoprc, while 14.04 uses ~/.config/htop/htoprc. It was enough to remove one of them and symlink the other, as below:

rm .htoprc
ln -s .config/htop/htoprc .htoprc
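Afterwards, both distros read and write the very same file:

# the old location is now just a symlink
ls -l ~/.htoprc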

Connecting to MySQL without a password prompt

If you are (like me) annoyed by having to provide a password at every mysql login, you can skip it. This also makes programmatic access to any MySQL DB easier, as no password prompt is necessary 🙂
Create a `~/.my.cnf` file:

[client]
user=username
password="pass"
 
[mysql]
user=username
password="pass"

And login without the `-p` parameter:

mysql -h host -u username dbname

If you want to use the `~/.my.cnf` file with MySQLdb, just connect like this:

import MySQLdb
cnx = MySQLdb.connect(host=host, port=port, read_default_file="~/.my.cnf")

Batch conversion of .xlsx (Microsoft Office) to .tsv (tab-delimited) files

I had to retrieve data from multiple .xlsx files with multiple sheets. This can be done manually, but it would be a rather time-consuming task; plus Office quotes text fields, which is not very convenient for downstream analysis…
I found a handy script, xlsx2tsv.py, that does the job, but it reports only one sheet at a time. Thus, I have rewritten xlsx2tsv.py a little to save all sheets from a given .xlsx file into a separate folder. In addition, multiple .xlsx files can be processed at once. My version can be found on GitHub.

xlsx2tsv.py *.xlsx

Pushing to multiple github repositories

Today I faced a problem with syncing two GitHub repositories. Yes, I know, I shouldn’t keep two, but sometimes it’s difficult to avoid. Anyway, the problem is super easy to solve: it’s enough to edit `.git/config`, adding another url to the remote:

[remote "Origin"]
    url = git@github.com:user1/repo1.git
    url = git@github.com:user2/repo2.git
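The same can be achieved without editing the file by hand (an equivalent alternative to the snippet above):

git remote set-url --add Origin git@github.com:user2/repo2.git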

Of course, more than two repos can be added. From then on, every push will sync all the listed repositories.

git push Origin master

Everything up-to-date

Counting objects: 61, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (61/61), done.
Writing objects: 100% (61/61), 5.73 KiB | 0 bytes/s, done.
Total 61 (delta 41), reused 0 (delta 0)
To git@github.com:user2/repo2.git
   8b97528..8aed8c2  master -> master

Inspired by ruiabreu.

Easy citation in LibreOffice / OpenOffice with Mendeley

Creating a reference list is always a nightmare. Mendeley and its handy LibreOffice / OpenOffice plugin may be of great help to many; it was for me. Below, I’ll describe how to get it working.

# get & install mendeley from https://www.mendeley.com/download-mendeley-desktop/

# check version of your mendeley
#  Help > About Mendeley Desktop

# clone repo and build plugin
git clone git@github.com:Mendeley/openoffice-plugin.git
cd openoffice-plugin/
python build.py 1.15.2 false

# add to LibreOffice
#  Tools > Extension Manager > Add...
#   and look for `Mendeley-1.15.2.oxt`

After an OpenOffice / LibreOffice restart, you should see a new toolbar. Note that in order for the plugin to work, Mendeley has to be running.

What’s great about this plugin is that you can adjust the citation style with just a few clicks via `Choose Citation Style`. There is a quite extensive database of predefined citation styles, so adjusting the reference style to your favourite journal takes just a few seconds 🙂
More info about the plugin on GitHub.