I’ve been playing with Python, trying to randomise the order of paired-end (PE) reads in FastQ files. After a very unsuccessful afternoon (my Python implementation took 10 minutes (!) to randomise 1M PE reads), I decided to try BASH.
The BASH-based solution is simple and efficient (12 seconds for 1M PE reads):
paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "random.1.fq"; print $2,$4,$6,$8 > "random.2.fq";}'
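To convince yourself that pairing survives the shuffle, you can compare the header lines of the two output files after stripping the mate suffix. This is a sketch, not part of the original one-liner: it generates synthetic reads with hypothetical /1 and /2 read-name suffixes, then runs the shuffle from above.

```shell
# Synthetic paired FastQ input: 8 read pairs with /1 and /2 mate suffixes (assumed naming)
for i in 1 2; do
  for n in $(seq 1 8); do
    printf '@read%s/%s\nACGT\n+\nIIII\n' "$n" "$i"
  done | gzip > test.$i.fq.gz
done

# The shuffle from above: interleave mates, group 4 lines per read, shuffle, split back out
paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | paste - - - - | shuf \
  | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "random.1.fq"; print $2,$4,$6,$8 > "random.2.fq";}'

# Compare headers with mate suffixes stripped; diff prints nothing if pairing is intact
diff <(awk 'NR%4==1' random.1.fq | sed 's|/1$||') \
     <(awk 'NR%4==1' random.2.fq | sed 's|/2$||')
```

If diff stays silent, the nth read in random.1.fq is still the mate of the nth read in random.2.fq.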
If you are interested in a random subset of your FastQ file(s), e.g. 100K read pairs, you can specify it with shuf -n 100000.
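For instance, here is a sketch of the same pipeline keeping only 5 of 8 synthetic read pairs (swap in shuf -n 100000 on real data; the test.*.fq.gz names follow the example above, and the input data here is made up):

```shell
# Synthetic paired FastQ input: 8 read pairs, gzipped (hypothetical example data)
for i in 1 2; do
  for n in $(seq 1 8); do
    printf '@read%s/%s\nACGT\n+\nIIII\n' "$n" "$i"
  done | gzip > test.$i.fq.gz
done

# shuf -n 5 keeps 5 random read pairs; use -n 100000 for a 100K subset
paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | paste - - - - | shuf -n 5 \
  | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "subset.1.fq"; print $2,$4,$6,$8 > "subset.2.fq";}'
```

Each sampled pair contributes 4 lines per file, so subset.1.fq and subset.2.fq end up with 20 lines each.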
For large FastQ files it’s good to follow the progress of the randomisation. This can be done by plugging pv into the pipeline. Additionally, the output files can be gzipped on the fly, saving a lot of disk I/O. Finally, reads can be sampled/randomised from more than one library (reads1_1/2 and reads2_1/2), as follows:
pv -cN zcat reads1_1.fastq.gz reads2_1.fastq.gz | zcat | paste - <(zcat reads1_2.fastq.gz reads2_2.fastq.gz) | paste - - - - | pv -cN shuf | shuf | pv -cN awk | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 | "gzip > random_1.fq.gz"; print $2,$4,$6,$8 | "gzip > random_2.fq.gz";}'