Batch convert of .xlsx (Microsoft Office) to .tsv (tab-delimited) files

I had to retrieve data from multiple .xlsx files with multiple sheets. This can be done manually, but it will be rather time-consuming tasks, plus Office quotes text fields, which is not very convenient for downstream analysis…
I have found handy script, xlsx2tsv.py, that does the job, but it reports only one sheet at the time. Thus, I have rewritten xlsx2tsv.py a little to save all sheets from given .xlsx file into separate folder. In addition, multiple .xlsx files can be process at once. My version can be found on github.

xlsx2tsv.py *.xlsx

Batch conversion of images using ImageMagic

Today I needed to convert multiple .pdf files into .tiff images with specific DPI and LZW compression. I found it’s very simple using ImageMagic.

# install
sudo apt-get install imagemagick

# convert .pdf to lzw compressed .tiff changing dpi to 300
mkdir tiffs
for f in *.pdf; do 
  echo `date` $f; 
  convert -density 300 -compress lzw $f tiffs/$f.tiff; 
done
date

For more options, have a look at ImageMagic site.

Fetching from MySQL in batches

Today, I have encountered problems during fetching large MySQL table. Simply, the server was getting out of memory due to large table size.

I knew, you can fetch subset of data from MySQL table using LIMIT. Standard `LIMIT N` returns first N rows of your table.

However there is second way of calling LIMIT, `LIMIT S, N`; where S is the starting row and N is total number of rows to be returned. Thus, you can easily plug it into FOR LOOP to fetch from large MySQL table in for example 1 million rows batches:

step=1000000
t=large_table_name
user=mysql_username
pass=mysql_pass
db=database_name
#get row count
c=`mysql -u$user -p$pass $db -N -A -e"select count(*) from $t"`
echo `date` $t "with $c rows";
#write table header
echo -e $header | gzip > $t.txt.gz
#get table in batch
for i in `seq 1 $step $c`; do
  cmd="select * from $t limit $i, $step"
  echo -en "r$i / $c"
  mysql -u$user -p$pass $db -A -N -e"$cmd" | gzip >> $t.txt.gz
done
echo ""

Hope, someone will find it useful:)