Malformed column reporting and joining in BASH by paste or awk

I spent some hours trying to figure out why my scripts using awk and paste were returning malformed output. Lines were pasted together wrongly, some columns were missing and others were garbled. With awk, printing columns out of order (i.e. column #3 before column #2, awk '{print $3,$2}') also produced malformed output.
Eventually I realised it was due to Windows-style line endings, \r\n, instead of the standard Unix-style \n (of course, I got this file from a third party using Windows…).

Below, you can find more details.

# first, let's create dummy files containing 3 lines and 5 columns, each line ending with \r\n
python -c "with open('wrong.tsv','w') as out: out.write(''.join('line%s\t%s\r\n'%(i, '\t'.join('column%s'%j for j in range(1,5))) for i in range(1,4)))"
# and ending just with \n
python -c "with open('correct.tsv','w') as out: out.write(''.join('line%s\t%s\n'%(i, '\t'.join('column%s'%j for j in range(1,5))) for i in range(1,4)))"

# now let's paste wrong and correct files
paste wrong.tsv wrong.tsv
line1	line1n1	column1	column2	column3	column4
line2	line2n1	column1	column2	column3	column4
line3	line3n1	column1	column2	column3	column4

paste correct.tsv correct.tsv
line1	column1	column2	column3	column4	line1	column1	column2	column3	column4
line2	column1	column2	column3	column4	line2	column1	column2	column3	column4
line3	column1	column2	column3	column4	line3	column1	column2	column3	column4

# can you see the difference?
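
The awk problem mentioned at the top can be reproduced the same way. Below is a minimal sketch, assuming a hypothetical one-line file crlf.tsv in a temporary directory (the file name and its contents are made up for illustration). It shows that the last column carries the \r along with it, and that removing the \r inside awk with sub() gives clean output.

```shell
# hypothetical demo file: one tab-separated line ending with \r\n
tmpdir=$(mktemp -d)
printf 'a\tb\tc\r\n' > "$tmpdir/crlf.tsv"

# column 3 is really "c\r", so printing it before column 2 puts a
# carriage return in the middle of the output line
broken=$(awk -F'\t' '{print $3, $2}' "$tmpdir/crlf.tsv")
printf '%s\n' "$broken" | od -c   # od makes the hidden \r visible

# strip the trailing \r from the record first; changing $0 makes awk
# split the fields again, so column 3 is clean
fixed=$(awk -F'\t' '{sub(/\r$/, ""); print $3, $2}' "$tmpdir/crlf.tsv")
printf '%s\n' "$fixed"            # prints: c b

rm -rf "$tmpdir"
```

This fixes the columns on the fly and leaves the input file untouched, which can be handy when the file is not yours to modify.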

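Simply, \r (carriage return) stays in the data, and the terminal interprets it as "return to the beginning of the line", so whatever is printed after it overwrites what was already there. paste and awk do exactly what they are told, but the \r ends up in the middle of the joined line (or inside the last column), and the output looks malformed.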
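
Since \r is invisible in normal terminal output, it helps to check for it explicitly before blaming paste or awk. Here is a small sketch, again assuming a hypothetical demo file (crlf.tsv in a temporary directory); it counts the lines containing a carriage return and dumps the bytes with od.

```shell
# hypothetical demo file: two tab-separated lines ending with \r\n
tmpdir=$(mktemp -d)
printf 'line1\tcolumn1\r\nline2\tcolumn1\r\n' > "$tmpdir/crlf.tsv"

# store a literal carriage return in a variable (works in plain sh too)
cr=$(printf '\r')

# count the lines that contain a carriage return; 0 would mean a clean file
crlf_lines=$(grep -c "$cr" "$tmpdir/crlf.tsv" || true)
echo "lines with CR: $crlf_lines"

# dump the file byte by byte; \r and \n are shown as visible escapes
od -c "$tmpdir/crlf.tsv" | head -n 3

rm -rf "$tmpdir"
```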
In order to convert files containing \r\n into Unix style \n, simply execute:

# replaces file and creates backup: inputfile.bak
# (GNU sed; BSD/macOS sed does not understand \r in the pattern)
sed -i.bak 's/\r$//' inputfile

# creates outputfile with Unix line endings (removes every \r, not only those at line ends)
tr -d '\r' < inputfile > outputfile
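
After converting, it is worth confirming that the file is really clean. The sketch below assumes a hypothetical two-line inputfile created in a temporary directory; it runs the tr conversion from above, checks that no carriage returns remain, and counts the columns that paste now produces.

```shell
# hypothetical demo input: two lines, two columns each, ending with \r\n
tmpdir=$(mktemp -d)
printf 'line1\tcolumn1\r\nline2\tcolumn1\r\n' > "$tmpdir/inputfile"

# same conversion as above
tr -d '\r' < "$tmpdir/inputfile" > "$tmpdir/outputfile"

# no line should contain a carriage return any more
cr=$(printf '\r')
remaining=$(grep -c "$cr" "$tmpdir/outputfile" || true)
echo "lines still containing CR: $remaining"

# pasting the cleaned file with itself gives 2 + 2 tab-separated columns
fields=$(paste "$tmpdir/outputfile" "$tmpdir/outputfile" | awk -F'\t' 'NR==1 {print NF}')
echo "columns per pasted line: $fields"

rm -rf "$tmpdir"
```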

You can read more about newline characters on Wikipedia.
