Working efficiently with millions of files

Working with millions of intermediate files can be very challenging, especially if you need to store them on a distributed / network file system (NFS). Listing and navigating such directories takes ages, and removing the files is very time-consuming.
While building the metaPhOrs DB, I needed to store some ~7.5 million intermediate files that were subsequently processed on an HPC cluster. Saving that many files on the NFS would seriously affect not only my own work, but also overall system performance.
One could store the files in an archive, but then retrieving the data would require parsing rather huge archives (tens to hundreds of GB) in order to recover rather small portions of data.
I realised that TAR archives are natively supported in Python and can be indexed (see `tar_indexer`), which provides easy integration into existing code and random access to archive members. If you work with text data, you can even zlib.compress the data stored inside your archives!
Below are the relevant parts of my code:
BASH

# index content of multiple tar archives
tar2index.py -v -i db_*/*.tar -d archives.db3
 
# search for some_file in multiple archives
tar2index.py -v -f some_file -d archives.db3

Python

import sqlite3, time
import tarfile, zlib, cStringIO
 
###
# lookup function
def tar_lookup(dbpath, file_name):
    """Return file name inside tar, tar file name, offset and file size."""
    cur = sqlite3.connect(dbpath).cursor()
    cur.execute("""SELECT o.file_name, f.file_name, offset, file_size
                FROM offset_data as o JOIN file_data as f ON o.file_id=f.file_id
                WHERE o.file_name like ?""", (file_name,))
    return cur.fetchall()
 
###
# saving to archive
    # open tarfile
    tar = tarfile.open(tarpath, "w")
    # save files to tar
    for fname, txt in files_generator:
        # compress file content (optionally)
        gztxt = zlib.compress(txt)
        # get tarinfo
        ti = tarfile.TarInfo(fname)
        ti.size  = len(gztxt)
        ti.mtime = time.time()
        # add to tar
        tar.addfile(ti, cStringIO.StringIO(gztxt))
    # close the archive so the end-of-archive record is written
    tar.close()
 
###
# reading from indexed archive(s)
# NOTE: you need to run tar2index.py on your archives beforehand
    tarfnames = tar_lookup(index_path, file_name)
    for i, (name, tarfn, offset, file_size) in enumerate(tarfnames, 1):
        tarf = open(tarfn, "rb")  # binary mode: we read raw compressed bytes
        # move pointer to right archive place
        tarf.seek(offset)
        # read tar fragment & uncompress
        txt = zlib.decompress(tarf.read(file_size))
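
For convenience, the lookup and random-access steps above can be combined into a single helper. Below is a minimal sketch under the same assumptions as the snippets above (archives indexed with tar2index.py, members zlib-compressed); the function name `tar_retrieve` is mine, not part of tar_indexer:

###
# example helper (hypothetical) combining index lookup and random access
def tar_retrieve(dbpath, file_name):
    """Yield (file name, archive name, uncompressed content) for every match."""
    for name, tarfn, offset, file_size in tar_lookup(dbpath, file_name):
        # open the archive in binary mode and jump straight to the member's data
        with open(tarfn, "rb") as tarf:
            tarf.seek(offset)
            yield name, tarfn, zlib.decompress(tarf.read(file_size))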

Speeding up TAR.GZ compression with PIGZ

Most of you have probably noticed that TAR.GZ compression isn't very fast. Recently, during a routine system backup, I realised that creating a TAR.GZ is limited not by disk read/write, but by GZIP compression (~98% of the computation).

time sudo tar cpfz backup/ubuntu1404.tgz --one-file-system /

real 6m20.999s
user 6m1.800s
sys  0m19.043s

GZIP in its standard implementation is bound to a single CPU core, while most modern computers can run 4-8 threads concurrently. But there are also multi-threaded implementations of GZIP, e.g. PIGZ. I decided to install PIGZ and plug it into TAR as follows:

sudo apt-get install lbzip2 pigz

time sudo tar cpf backup/ubuntu1404.pigz.tgz --one-file-system --use-compress-program=pigz /

real 1m43.693s
user 8m34.168s
sys  0m20.243s

As you can see, TAR.GZ compression using PIGZ on a 4-core i7-4770K (8 threads) is roughly 3.7 times faster than with standard GZIP (1m44s vs 6m21s of wall time), and you still get a standard TAR.GZ archive as output :)
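
If you want to limit how many cores PIGZ grabs (for example on a shared machine), one option is to pipe TAR through PIGZ explicitly; a sketch, relying on pigz's -p option to set the number of compression threads:

# pipe tar through pigz, limiting compression to 4 threads
sudo tar cpf - --one-file-system / | pigz -p 4 > backup/ubuntu1404.pigz.tgz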

The same applies to BZIP2 compression using LBZIP2 (see the example below).
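
For example, a parallel BZIP2 backup could look like the sketch below (same idea as above, just swapping in lbzip2; the file name is illustrative). The result is a standard TAR.BZ2, so a recent GNU tar should unpack it without any special options.

# bzip2-compressed backup using all cores via lbzip2
time sudo tar cpf backup/ubuntu1404.tbz2 --one-file-system --use-compress-program=lbzip2 /

# extraction: a standard tar.bz2, auto-detected by GNU tar
tar xpf backup/ubuntu1404.tbz2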