Lately, to make Redundans more user friendly, I have simplified it’s dependencies, by replacing Biopython, numpy, scipy and SQLite with some (relatively) simple functions or modules.
Here, I will just focus on replacing Biopython, particularly SeqIO.index_db with FastaIndex. You may ask yourself, why I have invested time in reinventing the wheel. I’m big fan of Biopython, yet it’s huge project and some solutions are not optimal or require problematic dependencies. This is the case with SeqIO.db_index, that relies on SQLite3. Here again, I’m a big fan of SQLite, yet building Biopython with SQLite enabled proved not to be very straightforward for non-standard systems or less experience users. Beside, on some NFS settings, the SQLite3 db cannot be created at all.
Ok, let’s start from the basics. SeqIO.index_db allows random access to sequence files, so for example you can rapidly retrieve any entry from very large file. This is achieved by storing the ID and position of each entry from particular file in database, SQLite3 db. Then, if you want to retrieve particular record, SeqIO.index_db looks up if this record is present in SQLite3 db, retrieves record position in the file and reads only small chunk of this file instead of parsing entire file every time you want to get some record(s).
Similar feature is offered by
samtools faidx, but in this case, the coordinates of each entry are stored in tab-delimited file
.fai (more info about .fai). This format can be easily read & write by any programme, so I have decided to use it. In addition, I have realised, that
samtools faidx is flexible enough, so you can add additional columns to the
.fai without interrupting its functionality, but about that later…
In Redundans, I’ve been using SeqIO.index_db during assembly reduction (fasta2homozygous.py). Additionally, beside storing index, I’ve been also generating statistics for every FastA file, like number of contigs, cumulative size, N50, N90, GC and so on. I have realised, that these two can be easily combined, by extending
.fai with four additional columns, storing number of occurencies for A, C, G & T in every sequence. Such
.fai is compatible with
samtools faidx and provides very easy way of calculating bunch of statistics about this file.
All of these, I’ve implemented in FastaIndex. Beside being dependency-free & very handy indexer, it can be used also as alternative to samtools faidx to retrieve sequences from large FastA files.
# retrieve bases from 20 to 60 from NODE_2
./FastaIndex.py -i test/run1/contigs.fa -r NODE_2_length_7674_cov_46.7841_ID_3:20-60
#Time elapsed: 0:00:00.014243
samtools faidx test/run1/contigs.fa NODE_2_length_7674_cov_46.7841_ID_3:20-60