I spent several hours trying to figure out why my scripts using awk and paste were returning malformed output. Lines were pasted together wrongly, and some columns were missing or garbled. With awk, printing columns out of order (i.e. column 3 before column 2, awk '{print $3,$2}') also produced malformed output.
After some time, I realised the cause was Windows-style line endings (\r\n) instead of the standard Linux-style \n (of course, I got this file from a third party using Windows…).
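A quick way to check whether a file carries such line endings is to look for carriage returns directly. The following is only a minimal sketch: the file name sample_crlf.tsv is made up for illustration, and the check simply counts lines that end with \r.

```shell
# Create a small sample file with Windows (CRLF) line endings.
# sample_crlf.tsv is a hypothetical name used only for this illustration.
printf 'line1\tcolumn1\r\nline2\tcolumn1\r\n' > sample_crlf.tsv

# Hold a literal carriage return in a variable.
cr=$(printf '\r')

# Count the lines that end with a carriage return; 0 means no CRLF endings.
crlf_lines=$(grep -c "${cr}\$" sample_crlf.tsv || true)
echo "lines ending with CR: $crlf_lines"

# Dump the raw bytes; a CRLF ending shows up as \r followed by \n.
od -c sample_crlf.tsv | head -n 3
```

If the count is greater than zero, the file most likely came from a Windows tool and needs converting before it goes through paste or awk.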
Below, you can find more details.
# first, let's create dummy files containing 3 lines and 5 columns, each line ending with \r\n
python -c "with open('wrong.tsv','w') as out: out.write(''.join('line%s\t%s\r\n'%(i, '\t'.join('column%s'%j for j in range(1,5))) for i in range(1,4)))"

# and ending just with \n
python -c "with open('correct.tsv','w') as out: out.write(''.join('line%s\t%s\n'%(i, '\t'.join('column%s'%j for j in range(1,5))) for i in range(1,4)))"

# now let's paste wrong and correct files
paste wrong.tsv wrong.tsv
line1 line1n1 column1 column2 column3 column4
line2 line2n1 column1 column2 column3 column4
line3 line3n1 column1 column2 column3 column4

paste correct.tsv correct.tsv
line1 column1 column2 column3 column4 line1 column1 column2 column3 column4
line2 column1 column2 column3 column4 line2 column1 column2 column3 column4
line3 column1 column2 column3 column4 line3 column1 column2 column3 column4

# can you see the difference?
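Simply put, \r (carriage return) moves the cursor back to the beginning of the line when it is printed in a Unix terminal. paste and awk treat it as an ordinary character at the end of the last column, so whatever is printed after it overwrites the start of the line and the output looks malformed.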
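The awk symptom from the introduction can be reproduced the same way. This is a sketch under a few assumptions: the file name record_demo.tsv is hypothetical, and awk is told explicitly to split on tabs with -F'\t' (the original command used awk's default splitting).

```shell
# A single tab-separated record with a Windows (CRLF) line ending.
# record_demo.tsv is a hypothetical name used only for this illustration.
printf 'a\tb\tc\r\n' > record_demo.tsv

# The carriage return stays attached to the last field, so $3 is "c" plus CR.
last_len=$(awk -F'\t' '{print length($3)}' record_demo.tsv)
echo "length of \$3 with CR: $last_len"

# Printing $3 before $2 puts the CR in the middle of the output line;
# od -c makes the otherwise invisible character visible.
awk -F'\t' '{print $3, $2}' record_demo.tsv | od -c

# Stripping the trailing CR from the record inside awk restores the field.
fixed_len=$(awk -F'\t' '{sub(/\r$/, ""); print length($3)}' record_demo.tsv)
echo "length of \$3 after stripping CR: $fixed_len"
```

Because the carriage return sits in the middle of the printed line, the terminal jumps back to the first column and draws the second field over the first one, which is why reordered columns look broken.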
To convert files containing \r\n into Unix-style \n, simply execute:
# replaces file and creates backup: inputfile.bak
sed -i.bak 's/\r$//' inputfile

# creates outputfile with correct formatting
tr -d '\r' < inputfile > outputfile
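The two commands are not quite equivalent: the sed expression removes a carriage return only at the end of a line, while tr deletes every carriage return in the file. The sketch below illustrates the difference on a made-up file (mixed_demo.tsv and the output names are hypothetical). It passes sed a literal carriage return instead of the \r escape, on the assumption that not every sed implementation understands that escape.

```shell
# Hypothetical demo file: a CR in the middle of line 1, plus CRLF endings.
printf 'a\rb\tc\r\nd\te\r\n' > mixed_demo.tsv

# Hold a literal carriage return, so sed needs no \r escape support.
cr=$(printf '\r')

# Variant 1: delete only a CR at the end of each line (same idea as s/\r$//).
sed "s/${cr}\$//" mixed_demo.tsv > end_only.tsv

# Variant 2: delete every CR, wherever it occurs.
tr -d '\r' < mixed_demo.tsv > all_removed.tsv

# Count the lines that still contain a CR in each result.
end_only_cr=$(grep -c "$cr" end_only.tsv || true)
all_removed_cr=$(grep -c "$cr" all_removed.tsv || true)
echo "sed (end of line only): $end_only_cr line(s) with CR left"
echo "tr (all occurrences):   $all_removed_cr line(s) with CR left"
```

For ordinary tab-separated files exported from Windows, both variants should give the same result; the difference only matters when a carriage return appears inside a line.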
You can read more about newline characters on Wikipedia.