Github push fails due to large files

Lately, I have had lots of problems with pushing large files to github. I am maintaining compilation of materials and software deposited by other people, so cannot control the size of files… and this makes push to fail often.

git push
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: error: Trace: 6f0f7f66995a394598595375954732db
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File chip_seq/reads/sox2_chip.fastq.gz is 109.69 MB; this exceeds GitHub's file size limit of 100.00 MB

To remove large files from commit, execute

git filter-branch -f --index-filter 'git rm --cached --ignore-unmatch chip_seq/reads/sox2_chip.fastq.gz'
git push

To add large files using git-lfs, execute

# tract by git lfs files larger than 50MB, skipping those in .git folder
find . -type f -size +50M ! -iwholename "*.git*" | rev | cut -f1 -d'/' | rev | xargs git lfs track
# 
git add --all . && git commit -m "final" && git push origin

Make sure that your file are smaller than 2GB, otherwise your push will fail again 😉

Then, to before pull in another machine, make sure to install git-lfs

git lfs install
git pull

On simplifying dependencies

Lately, to make Redundans more user friendly, I have simplified it’s dependencies, by replacing Biopython, numpy, scipy and SQLite with some (relatively) simple functions or modules.

Here, I will just focus on replacing Biopython, particularly SeqIO.index_db with FastaIndex. You may ask yourself, why I have invested time in reinventing the wheel. I’m big fan of Biopython, yet it’s huge project and some solutions are not optimal or require problematic dependencies. This is the case with SeqIO.db_index, that relies on SQLite3. Here again, I’m a big fan of SQLite, yet building Biopython with SQLite enabled proved not to be very straightforward for non-standard systems or less experience users. Beside, on some NFS settings, the SQLite3 db cannot be created at all.

Ok, let’s start from the basics. SeqIO.index_db allows random access to sequence files, so for example you can rapidly retrieve any entry from very large file. This is achieved by storing the ID and position of each entry from particular file in database, SQLite3 db. Then, if you want to retrieve particular record, SeqIO.index_db looks up if this record is present in SQLite3 db, retrieves record position in the file and reads only small chunk of this file instead of parsing entire file every time you want to get some record(s).
Similar feature is offered by samtools faidx, but in this case, the coordinates of each entry are stored in tab-delimited file .fai (more info about .fai). This format can be easily read & write by any programme, so I have decided to use it. In addition, I have realised, that samtools faidx is flexible enough, so you can add additional columns to the .fai without interrupting its functionality, but about that later…

In Redundans, I’ve been using SeqIO.index_db during assembly reduction (fasta2homozygous.py). Additionally, beside storing index, I’ve been also generating statistics for every FastA file, like number of contigs, cumulative size, N50, N90, GC and so on. I have realised, that these two can be easily combined, by extending .fai with four additional columns, storing number of occurencies for A, C, G & T in every sequence. Such .fai is compatible with samtools faidx and provides very easy way of calculating bunch of statistics about this file.
All of these, I’ve implemented in FastaIndex. Beside being dependency-free & very handy indexer, it can be used also as alternative to samtools faidx to retrieve sequences from large FastA files.

# retrieve bases from 20 to 60 from NODE_2
./FastaIndex.py -i test/run1/contigs.fa -r NODE_2_length_7674_cov_46.7841_ID_3:20-60
>NODE_2_length_7674_cov_46.7841_ID_3
CATAGAACGACTGGTATAAGCCAAACATGACCCATTGTTGC
#Time elapsed: 0:00:00.014243

samtools faidx test/run1/contigs.fa NODE_2_length_7674_cov_46.7841_ID_3:20-60
>NODE_2_length_7674_cov_46.7841_ID_3:20-60
CATAGAACGACTGGTATAAGCCAAACATGACCCATTGTTGC

Pushing to multiple github repositories

Today I’ve faced problem with syncing two github repositories. Yes, I know, I shouldn’t keep two, but sometimes it’s difficult to avoid. Anyway, the problem is super easy to solve. It’s enough to edit `.git/config` by adding new remote:

[remote "Origin"]
    url = git@github.com:user1/repo1.git
    url = git@github.com:user2/repo2.git

Of course, more than two repos can be added. Then, after next push all repositories will be synced.

git push Origin master

Everything up-to-date

Counting objects: 61, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (61/61), done.
Writing objects: 100% (61/61), 5.73 KiB | 0 bytes/s, done.
Total 61 (delta 41), reused 0 (delta 0)
To git@github.com:user2/repo2.git
   8b97528..8aed8c2  master -> master

Inspired by ruiabreu.

Easy citation in LibreOffice / OpenOffice with Mendeley

Creating reference list is always a nightmare. Mendeley and its handy LibreOffice / OpenOffice plugin may be of great help to many. It was for me. Below, I’ll describe how to make it working.

# get & install mendeley from https://www.mendeley.com/download-mendeley-desktop/

# check version of your mendeley
#  Help > About Mendeley Desktop

# clone repo and build plugin
git clone git@github.com:Mendeley/openoffice-plugin.git
cd openoffice-plugin/
python build.py 1.15.2 false

# add to LibreOffice
#  Tools > Extension Manager > Add...
#   and look for `Mendeley-1.15.2.oxt`

After OpenOffice / LibreOffice restart, you should see new bar. Note, in order for the plugin to work, Mendeley has to be running.

What’s great about this plugin, you can adjust citation style by just a few clicks by clicking on `Choose Citation Style`. There is quite extensive database of predefined citation styles, so adjusting the reference style to your favourite journal will take just a few seconds 🙂
More info about the plugin on github.