Posted in 2014

Multiprocessing in Python and garbage collection

Working with multiple processes in Python often leads to high RAM consumption. Unfortunately, automatic garbage collection in the child processes does not work well. But there are two workarounds:

When using Pool(), you can specify the number of tasks after which each child is restarted (the maxtasksperchild argument), which releases its memory.

If you use Process(), you can simply delete unwanted objects and call gc.collect() inside the child. Note that this may slow down the child process substantially! Both approaches are sketched below.
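
A minimal sketch of both options; the task function and the numbers are made up for illustration, but maxtasksperchild is the actual Pool argument, and the manual cleanup in child() is the gc.collect() trick described above:

    import gc
    from multiprocessing import Pool, Process

    def work(n):
        data = [x * x for x in range(n)]   # stand-in for a memory-hungry task
        return sum(data)

    def child(n):
        data = [x * x for x in range(n)]
        print(sum(data))
        del data       # drop the large object explicitly ...
        gc.collect()   # ... and force a collection before doing more work

    if __name__ == "__main__":
        # Option 1: Pool -- each worker is restarted after 5 tasks,
        # releasing whatever memory it has accumulated
        pool = Pool(processes=4, maxtasksperchild=5)
        print(pool.map(work, [1_000_000] * 20))
        pool.close()
        pool.join()

        # Option 2: Process -- clean up manually inside the child
        p = Process(target=child, args=(1_000_000,))
        p.start()
        p.join()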

Read more ...


Progress of long processes in BASH

You can view the progress of a running process on UNIX using pv or bar. With pv, you can even report the progress of several stages of your pipeline at once.

This is very useful for tracing the progress of a large database dump or restore.
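
A sketch of how this might look; mydb and the 2g size estimate are placeholders, and mysqldump/mysql stand in for whatever dump and restore tools you use:

    # dump: mysqldump streams with unknown length, so give pv a size estimate
    mysqldump mydb | pv --size 2g > mydb.sql

    # restore: pv knows the size of the dump file, no estimate needed
    pv mydb.sql | mysql mydb

    # several stages in one pipeline: -c keeps the meters apart, -N names them
    pv -cN raw mydb.sql | gzip | pv -cN gzip > mydb.sql.gz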

Read more ...


TAR random access

I was often challenged with accessing thousands or millions of files on a network file system (NFS). Since I update some of the stored files only once in a while, I decided to store them in multiple TAR archives, which reduced the data complexity. But random access to the files within each archive was still an issue.

First, I had a look at tar indexer. Its simplicity is brilliant, yet it stores the index in a raw text file and can handle only a single tar file. Therefore, I ended up writing my own tar_indexer tool, which uses sqlite3 to store the index and can index multiple tar archives. It can easily be incorporated into any Python project.

Note that only raw (uncompressed) tar files are accepted, as native tar.gz cannot be accessed randomly. But you can compress each file using zlib before adding it to the tar; at least, that is what I do. Both ideas are sketched below.
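
A minimal sketch of the idea, not the actual tar_indexer code: the table layout and function names are illustrative, while member.offset_data is the (undocumented but long-standing) tarfile attribute that points at the start of a member's data:

    import io
    import sqlite3
    import tarfile
    import zlib

    def add_compressed(tar, name, payload):
        # compress the payload with zlib before adding it, so the tar itself stays raw
        blob = zlib.compress(payload)
        info = tarfile.TarInfo(name)
        info.size = len(blob)
        tar.addfile(info, io.BytesIO(blob))

    def build_index(db_path, tar_paths):
        # one sqlite3 index covering any number of uncompressed tar archives
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS files "
                   "(name TEXT PRIMARY KEY, archive TEXT, offset INTEGER, size INTEGER)")
        for tar_path in tar_paths:
            with tarfile.open(tar_path, "r:") as tar:   # "r:" refuses compressed archives
                for member in tar:
                    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                               (member.name, tar_path, member.offset_data, member.size))
        db.commit()
        db.close()

    def read_file(db_path, name):
        # random access: seek straight to the stored offset, no archive scan
        db = sqlite3.connect(db_path)
        archive, offset, size = db.execute(
            "SELECT archive, offset, size FROM files WHERE name = ?", (name,)).fetchone()
        db.close()
        with open(archive, "rb") as f:
            f.seek(offset)
            return zlib.decompress(f.read(size))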

Read more ...