All Posts

Hello world!

Welcome to my new blog! I’ll gradually describe all my adventures with self-hosting various services, including a blog/website, files and photos.

I started my website/blog in 2014. It was served using WordPress set up on a virtual machine (VM) in Google Cloud.

For a long time I felt that running a dedicated VM for a blog was overkill. There was a huge overhead in maintaining both the VM (regular system updates and backups) and WordPress itself (updates and database backups).

Read more ...


Monitoring GPU usage

If you (like me) happen to be a performance freak, you are most likely well aware of process viewers like htop. Since I started working with GPU computing, I have missed an htop-like tool tailored to monitoring GPU usage. This becomes more of an issue if you work with multi-GPU setups.

You can use nvidia-smi, which ships with the NVIDIA drivers, but it’s not very interactive.

gpustat provides a nice, interactive view of the running processes and the resources used across your GPUs, but you’ll need to switch between windows if you also want to monitor CPU usage.

nvidia-smi output
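If you want the same numbers in a script rather than on screen, a minimal sketch could parse the CSV output of nvidia-smi. The query fields and `--format` flags below are standard nvidia-smi options, while the function names and the sample line are purely illustrative assumptions.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def to_int(value):
    """Return an int, or None when nvidia-smi reports e.g. [N/A]."""
    value = value.strip()
    return int(value) if value.isdigit() else None


def parse_gpu_csv(text):
    """Parse output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, name, util, used, total = record
        gpus.append({
            "index": int(index),
            "name": name,
            "utilization_pct": to_int(util),
            "memory_used_mib": to_int(used),
            "memory_total_mib": to_int(total),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi (requires NVIDIA drivers) and return one dict per GPU."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    done = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_gpu_csv(done.stdout)


# Hypothetical sample line, only used to illustrate the parser.
SAMPLE = "0, Tesla V100-SXM2-16GB, 35, 1024, 16160\n1, Tesla V100-SXM2-16GB, [N/A], 0, 16160\n"
```

Calling `parse_gpu_csv(SAMPLE)` works without a GPU, which makes it easy to try the parser before pointing `query_gpus()` at a real machine.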

Read more ...


Python code profiling and accelerating your calculations with numba

You wrote up your excellent idea as a Python program/module, but you are unsatisfied with its performance. Chances are that most of us have been there at least once. I was there last week.

I found an excellent method for outlier detection (Extended Isolation Forest). eIF was initially written in Python and later optimised in Cython (using C++). The C++ version is ~40x faster than the vanilla Python version, but it lacks the possibility to save the model (which is crucial for my project). Since adding model saving to the C++ version is a rather complicated business, I decided to optimise the Python code. Initially I hoped for a ~5-10x speed improvement. The final effect surprised me: the rewritten Python code was ~40x faster than the initial version, matching the performance of the C++ version!

How is it possible? Speeding up your code isn’t trivial. First you need to find which parts of your code are slow (so-called code profiling). Once you know that, you can start tinkering with the code itself (code optimisation).

line_profiler output
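The profile-then-optimise loop can be sketched roughly as follows. Note the assumptions: this uses cProfile from the standard library (function-level, not the per-line view that line_profiler gives), `sum_of_squares` is a hypothetical stand-in for a slow loop rather than any eIF code, and numba is treated as optional, falling back to the plain function when it is not installed.

```python
import cProfile
import io
import pstats

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        """Fallback: leave the function untouched when numba is missing."""
        return func


def sum_of_squares(n):
    """Hypothetical hot loop: a pure-Python loop that numba can compile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Compiled variant; the first call pays the one-off compilation cost.
fast_sum_of_squares = njit(sum_of_squares)


def profile(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # five most expensive entries
    return result, stream.getvalue()


if __name__ == "__main__":
    result, report = profile(sum_of_squares, 1_000_000)
    print(report)
    assert fast_sum_of_squares(1_000_000) == result
```

When timing the compiled variant, call it once beforehand so that compilation time is not counted as run time.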

Read more ...


Create badges for workshop / conference attendees

Previously, I wrote about how to easily create a book of abstracts. Today, another friend asked for help with generating badges for attendees of a conference he is co-organising. We have automated the process for #NGSchool. You can find templates and code in my GitHub repo.

First of all, for our courses we need to create user accounts for all participants on remote machines. Therefore we decided to print user data (username and auto-generated password) on the back of every badge (easy to fold). And since user data are auto-generated, we can easily create user accounts on remote machines using newusers. But for a conference, this can be skipped.
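As a rough illustration of the account part, here is a sketch that generates a password per participant and formats the colon-separated lines that newusers reads (name:password:uid:gid:gecos:home:shell). The function names, the username scheme and the default shell are my assumptions here, not the code from the repo.

```python
import secrets
import string

# Letters and digits only, so a password never contains the ':' separator.
ALPHABET = string.ascii_letters + string.digits


def make_password(length=12):
    """Generate a random password using a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def newusers_line(username, password, full_name="", shell="/bin/bash"):
    """Format one newusers entry; empty uid/gid let newusers pick free ids."""
    if ":" in full_name:
        raise ValueError("full name must not contain ':'")
    return f"{username}:{password}:::{full_name}:/home/{username}:{shell}"


def accounts_for(participants):
    """Yield (username, password, newusers line) for each participant name."""
    for number, full_name in enumerate(participants, start=1):
        username = f"user{number:02d}"  # hypothetical naming scheme
        password = make_password()
        yield username, password, newusers_line(username, password, full_name)


if __name__ == "__main__":
    for username, password, line in accounts_for(["Ada Lovelace", "Alan Turing"]):
        print(line)
```

The same username and password pair can then be printed on the badge and written to the file passed to newusers.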

All you need to have to start is

Read more ...


Create book of abstracts from spreadsheet / google forms

Lately, a friend of mine complained about the interoperability of abstract submissions from numerous applicants.

Having a Book of Abstracts is crucial, and we faced a similar problem when organising #NGSchool events.

Note, you’ll need to be somewhat familiar with LaTeX in order to edit the main.tex file to your liking. If you are not afraid of that, the way to proceed is as follows:

A twitter post from @sj_capella: It might sound silly but it is worrying. We made available an abstracts template for a conference ... I have seen like 20~30 different formats (out of 75 submissions) which made me wonder about #Interoperability
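To give an idea of the spreadsheet-to-LaTeX step, here is a minimal sketch that turns CSV rows into LaTeX fragments which a main.tex could include. The column names (`title`, `authors`, `abstract`) and the LaTeX layout are assumptions for illustration only; the actual templates may look quite different.

```python
import csv
import io

# Characters that need escaping in LaTeX text, replaced in a single pass.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_ESCAPES.get(char, char) for char in text)


def abstract_to_latex(row):
    """Render one spreadsheet row (hypothetical columns) as a LaTeX fragment."""
    return "\n".join([
        r"\section*{" + escape_latex(row["title"].strip()) + "}",
        r"\textit{" + escape_latex(row["authors"].strip()) + "}",
        "",
        escape_latex(row["abstract"].strip()),
        "",
    ])


def spreadsheet_to_latex(csv_text):
    """Convert CSV text exported from a spreadsheet or form into LaTeX."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(abstract_to_latex(row) for row in reader)


# Hypothetical submission used only to show the escaping.
SAMPLE_CSV = 'title,authors,abstract\n"RNA & DNA","A. Author","Covers 100% of cases."\n'
```

Collecting submissions through a form means every abstract arrives in the same columns, so the formatting is applied once in code instead of 75 times by hand.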

Read more ...


Connecting to MySQL without password prompt

If you are (like me) annoyed by providing a password at every mysql login, you can skip it. It also makes programmatic access to any MySQL database easier, as no password prompt is needed :)

Create a ~/.my.cnf file:

And log in without the -p parameter:
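For scripted setups, a sketch like the one below could write such an option file with a `[client]` section and restrictive permissions. The function names and the example credentials are placeholders of mine; the demo writes to a temporary directory instead of the real ~/.my.cnf.

```python
import os
import stat
import tempfile
from pathlib import Path


def write_client_config(path, user, password):
    """Write a MySQL option file with a [client] section, readable by owner only."""
    # The password is double-quoted so characters like '#' are not read as
    # comments. Assumption: the password itself contains no double quote.
    content = f'[client]\nuser={user}\npassword="{password}"\n'
    # Create the file with mode 0600 from the start, since it stores a password.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(path, 0o600)  # enforce the mode even if the file already existed
    return Path(path)


def demo():
    """Write a config into a temporary directory and return (content, mode)."""
    with tempfile.TemporaryDirectory() as folder:
        path = write_client_config(os.path.join(folder, "my.cnf"), "alice", "s3cret#1")
        mode = stat.S_IMODE(path.stat().st_mode)
        return path.read_text(), mode


if __name__ == "__main__":
    content, mode = demo()
    print(content)
    print(oct(mode))
```

With the file saved as ~/.my.cnf, the mysql client picks up the credentials on its own, so a script can run something like `subprocess.run(["mysql", "-e", "SELECT 1"])` without any prompt.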

Read more ...


Apache2 reading from sshfs share

Today, I ran into problems trying to read data from an sshfs share in apache2. I was getting a 403 Forbidden error. It turned out you need to enable the allow_other option in sshfs, so that users other than the one mounting the share can access the data, as apache2 runs as the www-data user.
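A small sketch of the mount step, assuming the sshfs option in question is `allow_other` and that non-root users additionally need `user_allow_other` enabled in /etc/fuse.conf. The helper names, the remote and the mount point are made up for illustration.

```python
import subprocess


def fuse_allows_other(fuse_conf_text):
    """Check whether user_allow_other is enabled (uncommented) in fuse.conf text."""
    return any(line.strip() == "user_allow_other"
               for line in fuse_conf_text.splitlines())


def sshfs_command(remote, mountpoint):
    """Build an sshfs command that lets other users (e.g. www-data) read the mount."""
    return ["sshfs", remote, mountpoint, "-o", "allow_other"]


def mount(remote, mountpoint, fuse_conf="/etc/fuse.conf"):
    """Mount the share; needs sshfs installed and a reachable remote host."""
    with open(fuse_conf) as handle:
        if not fuse_allows_other(handle.read()):
            raise RuntimeError("enable user_allow_other in " + fuse_conf)
    subprocess.run(sshfs_command(remote, mountpoint), check=True)


# Hypothetical remote and mount point, shown only to illustrate the command.
EXAMPLE = sshfs_command("user@example.org:/data", "/var/www/data")
```

Only `mount()` touches the system; the other two helpers are plain functions and can be tried without sshfs installed.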

Inspired by serverfault and unix.stackexchange.

Read more ...


Encrypted swapfile

Sometimes it’s worth encrypting swap space, especially if you process privacy-sensitive data.

Inspired by AskUbuntu.

Read more ...


Multiprocessing in Python and garbage collection

Working with multiple processes in Python often leads to high RAM consumption. Unfortunately, automatic garbage collection in child processes isn’t working well. But there are two alternatives:

When using Pool(), you can specify the number of tasks after which a child will be restarted, resulting in memory release.

If you use Process(), you can simply delete unwanted objects and call gc.collect() inside the child. Note, this may slow down your child process substantially!
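Both alternatives can be sketched as below. `maxtasksperchild` is the Pool() argument that controls worker restarts; the worker functions and the sizes are hypothetical examples, not code from a real project.

```python
import gc
import multiprocessing as mp


def summarise(n):
    """Pool worker: builds a temporary list and returns its sum."""
    data = list(range(n))
    return sum(data)


def process_chunk(n):
    """Build a large temporary object, then release it explicitly."""
    data = [i * i for i in range(n)]
    result = sum(data)
    del data        # drop the reference to the big object
    gc.collect()    # force a collection; this can be slow in tight loops
    return result


def child_task(n, results):
    """Process target: compute in the child and send the result back."""
    results.put(process_chunk(n))


def run_with_pool(sizes):
    """Alternative 1: each worker is replaced after 10 tasks, freeing its memory."""
    with mp.Pool(processes=2, maxtasksperchild=10) as pool:
        # chunksize=1 so that one input item counts as one task
        return pool.map(summarise, sizes, chunksize=1)


def run_with_process(n):
    """Alternative 2: a dedicated Process that cleans up after itself."""
    results = mp.Queue()
    child = mp.Process(target=child_task, args=(n, results))
    child.start()
    value = results.get()  # read before join() to avoid blocking on a full pipe
    child.join()
    return value


if __name__ == "__main__":
    print(run_with_pool([10, 100, 1000]))
    print(run_with_process(1000))
```

A low `maxtasksperchild` frees memory more often but pays the cost of starting new workers, so it is worth tuning to the size of your tasks.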

Read more ...


Progress of long processes in BASH

You can view the progress of a running process in UNIX using pv or bar. With pv, you can even report the progress of multiple stages of your pipeline.

This is very useful for tracking the progress of a large database dump/restore:
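As a rough Python analogue of what pv does in the middle of a pipeline, the sketch below copies a stream in chunks and reports how much has passed through. It is an illustration of the idea only, not a replacement for pv, and all names in it are my own.

```python
import io
import sys


def copy_with_progress(source, sink, total_bytes, chunk_size=64 * 1024,
                       report=sys.stderr):
    """Copy source to sink in chunks, printing bytes copied and a percentage."""
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # end of stream
        sink.write(chunk)
        copied += len(chunk)
        percent = 100 * copied / total_bytes if total_bytes else 100.0
        # '\r' rewrites the same terminal line, like a progress bar
        print(f"\r{copied}/{total_bytes} bytes ({percent:.0f}%)",
              end="", file=report)
    print(file=report)  # finish the progress line
    return copied


if __name__ == "__main__":
    payload = b"x" * 1_000_000  # stand-in for a database dump
    copy_with_progress(io.BytesIO(payload), io.BytesIO(), len(payload))
```

Progress goes to stderr by default, so stdout stays free for the data itself, which is the same separation pv relies on.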

Read more ...


TAR random access

I was often challenged with accessing thousands or millions of files on a network file system (NFS). As I update some of the stored files once in a while, I decided to store these files in multiple TAR archives. The data complexity was therefore reduced. But still, there was an issue with random access to the files within each archive.

First, I had a look at tar indexer. Its simplicity is brilliant. Yet, it stores the index in a raw text file and can handle only a single tar file. Therefore, I ended up writing my own tar_indexer tool, which uses sqlite3 to store the index and allows indexing of multiple tar archives. This can be easily incorporated into any Python project.

Note, only raw (uncompressed) tar files are accepted, as tar.gz archives do not support random access. But you can compress each file using zlib before adding it to the tar. At least, this is what I do.
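The general idea can be sketched with the standard library alone: store each member's data offset and size in sqlite3, then seek straight to it. This is a simplified illustration under my own table layout and function names, not the actual tar_indexer code.

```python
import io
import os
import sqlite3
import tarfile
import tempfile
import zlib


def add_compressed(tar, name, payload):
    """Compress one file with zlib and add it to an uncompressed tar."""
    data = zlib.compress(payload)
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def index_tar(conn, tar_path):
    """Record the data offset and size of every member of a raw tar archive."""
    conn.execute("CREATE TABLE IF NOT EXISTS members "
                 "(name TEXT PRIMARY KEY, tar_path TEXT, offset INTEGER, size INTEGER)")
    with tarfile.open(tar_path, "r:") as tar:  # "r:" accepts uncompressed tar only
        rows = [(m.name, str(tar_path), m.offset_data, m.size)
                for m in tar if m.isfile()]
    conn.executemany("INSERT OR REPLACE INTO members VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def read_member(conn, name):
    """Jump straight to a member's bytes using the index, then decompress."""
    row = conn.execute("SELECT tar_path, offset, size FROM members WHERE name = ?",
                       (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    tar_path, offset, size = row
    with open(tar_path, "rb") as handle:
        handle.seek(offset)
        return zlib.decompress(handle.read(size))


def demo():
    """Build a small archive in a temporary directory and read one member back."""
    with tempfile.TemporaryDirectory() as folder:
        tar_path = os.path.join(folder, "batch1.tar")
        with tarfile.open(tar_path, "w") as tar:
            add_compressed(tar, "a.txt", b"first file")
            add_compressed(tar, "b.txt", b"second file")
        conn = sqlite3.connect(":memory:")
        try:
            index_tar(conn, tar_path)
            return read_member(conn, "b.txt")
        finally:
            conn.close()
```

Because `index_tar` can be called once per archive on the same connection, a single index covers as many tar files as needed.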

Read more ...