Investigate & reduce the size of Drupal sqlite3 database

Today while performing regular Drupal update and backup, I’ve realised Drupal sqlite3 database sites/default/files/.ht.sqliteis over 440 Mb! I found it peculiar, as our website isn’t storing that much information and the size grew significantly since last time I’ve looked it up couple of months ago. I’ve decided to investigate what’s eating up so much DB space.

Investigate what’s eating up space within your sqlite3 db

There is super useful program called sqlite3_analyzer. This program analyses your database file and reports what’s actually taking your disk space. You can download it from here (download precompiled sqlite3-tools). Note, under Linux you’ll likely need to install 32bit-libraries ie. under Ubuntu/Debian execute

sudo apt install libc6-i386 lib32stdc++6 lib32gcc1 lib32ncurses5 lib32z1  

Once you have the program, simply execute sqlite3_analyzer DB_NAME | less and the program will produce detailed report about your DB space consumption. For me it looked like that:

Can you spot how much space the actual data is taking? Yes, only 4.7% (20k pages). And what’s taking most of the space? Freelist.

Quick googling taught me, that freelist is simply empty space left after deletes or data moving. You may ask, why isn’t it cleaned up later? You see, having entire database with all tables in one file is very handy, but troublesome. Every time given table is edited, the space that is freed isn’t used, but rather marked as freelist. And those regions get cleaned up only when vacuumcommand is issued. This should happen automatically from time-to-time if auto vacuum is enabled. I couldn’t know why isn’t it working by default with Drupal…

Reduce the size of sqlite3 DB file

Nevertheless, I’ve decided to perform vacuummanually. Of course I’ve backed-up the db, just in case (you should always do that!). But sqlite3 .ht.sqlite vacuum returned Error: no such collation sequence: NOCASE_UTF8. At this point, I though maybe simple DB dump and recovery would solve my problem – after all that’s more or less what happens under the hood when you perform vacuum.

sqlite3 .ht.sqlite.bck .dump > db.sql
sqlite3 .ht.sqlite < db.sql

DB recovered after dump was indeed smaller (16 Mb), but it was missing some tables (sqlite3 .ht.sqlite .tables). Interestingly, when I’ve investigated the schema of the missing tables (sqlite3 .ht.sqlite.bck .schema block_content), I’ve realised that all of those contain NOCASE_UTF8 in table schema. I found that really peculiar! After further googling and rather lengthy reading, I’ve realised NOCASE_UTF8 is invalid in sqlite3, but it can be replaced simply with NOCASE.

Replace DB schema directly on sqlite3 db

In the brave (and firstly stupid I though) attempt, I’ve decided just to replace wrong statements directly on the DB file using sed (sed 's/NOCASE_UTF8/NOCASE/g' .ht.sqlite.bck > .ht.sqlite). As expected, the database file got corrupted. This is because all tables location are stored internally in the same file, so truncating some text from the DB file isn’t the wisest idea as I’ve expected. Then, I’ve decided to replace NOCASE_UTF8, but keeping the same size of the statement after replacement using white spaces. To my surprise it worked & allowed me to reduce the size of DB from 440 to 30 Mb 🙂

sed 's/NOCASE_UTF8/NOCASE     /g' .ht.sqlite.bck > .ht.sqlite
sqlite3 .ht.sqlite vacuum
-rw-rw-r--  1 lpryszcz www-data  32638976 Feb 28 13:57 .ht.sqlite
-rw-rw-r-- 1 lpryszcz www-data 451850240 Feb 28 13:45 .ht.sqlite.bck

Finally, to make sure, that there is no data missing between old and new, reduced DB, you can use sqldiff .ht.sqlite .ht.sqlite.bck. It’ll simply report all SQL command that will transform one DB into another and nothing if DB contain identical information.

Hopefully replacing NOCASE_UTF8 with NOCASE will allow auto vacuum to proceed as expected on the Drupal DB in the future!

On handy docker images

Motivated by successful stripping problematic dependencies from Redundans, I have decided to generate smaller Docker image, starting with Alpine Linux image (2Mb / 5Mb after downloading) instead of Ubuntu (49Mb / 122Mb). Previously, I couldn’t really rely on Alpine Linux, because it was impossible to make these problematic dependencies running… But now it’s whole new world of possibilities 😉

There are very few dependencies left, so I have started… (You can find all the commands below).

  1. First, I have check what can be installed from package manager.
    Only Python and Perl.

  2. Then I have checked if any of binaries are working.
    For example, GapCloser is provided as binary. It took me some time to find source code…
    Anyway, none of the binaries worked out of the box. It was expected, as Alpine Linux is super stripped…

  3. I have installed build-base in order to be able to build things.
    Additionally, BWA need zlib-dev.

  4. Alpine Linux doesn’t use standard glibc library, but musl-libc (you can read more about differences between the two), so some programmes (ie. BWA) may be quite reluctant to compile.
    After some hours of trying & thanks to the help of mp15, I have found a solution, not so complicated 🙂

  5. I have realised, that Dockerfile doesn’t like standard BASH brace expansion, that is working otherwise in Docker Alpine console…
    so ls *.{c,h} should be ls *.c *.h

  6. After that, LAST and GapCloser compilation were easy, relatively 😉

Below, you can find the code from Docker file (without RUN commands).

apk add --update --no-cache python perl bash wget build-base zlib-dev
mkdir -p /root/src && cd /root/src && wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.15.tar.bz2 && tar xpfj bwa-0.7.15.tar.bz2 && ln -s bwa-0.7.15 bwa && cd bwa && \
cp kthread.c kthread.c.org && echo "#include <stdint.h>" > kthread.c && cat kthread.c.org >> kthread.c && \
sed -ibak 's/u_int32_t/uint32_t/g' `grep -l u_int32_t *.c *.h` && make && cp bwa /bin/ && \
cd /root/src && wget http://liquidtelecom.dl.sourceforge.net/project/soapdenovo2/GapCloser/src/r6/GapCloser-src-v1.12-r6.tgz && tar xpfz GapCloser-src-v1.12-r6.tgz && ln -s v1.12-r6/ GapCloser && cd GapCloser && make && cp bin/GapCloser /bin/ && \
cd /root/src && wget http://last.cbrc.jp/last-744.zip && unzip last-744.zip && ln -s last-744 last && cd last && make && make install && \
cd /root/src && rm -r last* bwa* GapCloser* v* 

# SSPACE && redundans in /root/srt
cd /root/src && wget -q http://www.baseclear.com/base/download/41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && tar xpfz 41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && ln -s SSPACE-STANDARD-3.0_linux-x86_64 SSPACE && wget -O- -q http://cpansearch.perl.org/src/GBARR/perl5.005_03/lib/getopts.pl > SSPACE/dotlib/getopts.pl && \
wget --no-check-certificate -q -O redundans.tgz https://github.com/lpryszcz/redundans/archive/master.tar.gz && tar xpfz redundans.tgz && mv redundans-master redundans && ln -s /root/src/redundans /redundans && rm *gz

apk del wget build-base zlib-dev 
apk add libstdc++

After building & pushing, I have noticed that Alpine-based image is slightly smaller (99Mb), than the one based on Ubuntu (127Mb). Surprisingly, Alpine-based image is larger (273Mb) than Ubuntu-based (244Mb) after downloading. So, I’m afraid all of these hours didn’t really bring any substantial reduction in the image size.

Conclusion?
I was very motivated to build my application on Alpine Linux and expected substantial size reduction. But I’d say that relying on Alpine Linux image doesn’t always pay off in terms of smaller image size, forget about production time… And this I know from my own experience.
But maybe I didn’t something wrong? I’d be really glad for some advices/comments!

Nevertheless, stripping a few dependencies from my application (namely Biopython, numpy & scipy), resulted in much more compact image even using Ubuntu-based image (127Mb vs 191Mb; and 244Mb vs 440Mb after downloading). So I think this is the way to go 🙂