Thursday, August 13, 2015

Line Endings in a Mixed Environment Application

If you have to operate on text strings and files in an application that can be used interchangeably on Windows and other environments, line endings can be a bit confusing. Below is what I found (all on Python 3.4).

When reading a file into a Python string:

File contents      | Windows        | Others
'A' \x0D \x0A 'B'  | 'A\nB' (len=3) | 'A\r\nB' (len=4)
'A' \x0A 'B'       | 'A\nB' (len=3) | 'A\nB' (len=3)
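
To check which bytes a file really contains, independent of any translation done while reading, one option is to open it in binary mode. The following is a minimal sketch; the helper name `raw_bytes` and the temporary sample file are made up for illustration.

```python
import os
import tempfile

def raw_bytes(path):
    """Return the file's bytes exactly as stored, with no newline translation."""
    with open(path, "rb") as f:
        return f.read()

# Create a small CRLF sample in binary mode so nothing is translated on the way out.
# (The file name and location are assumptions for this sketch.)
sample_path = os.path.join(tempfile.mkdtemp(), "crlf_sample.txt")
with open(sample_path, "wb") as f:
    f.write(b"A\r\nB")

data = raw_bytes(sample_path)
print(data)       # b'A\r\nB'
print(len(data))  # 4
```

Binary mode behaves the same on every platform, so it is a convenient way to confirm what is in the first column of the table before looking at the resulting string.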

When writing a Python string to a file, this is the file content:

String    | Windows                 | Others
'A\nB'    | 'A' \x0D \x0A 'B'       | 'A' \x0A 'B'
'A\r\nB'  | 'A' \x0D \x0D \x0A 'B'  | 'A' \x0D \x0A 'B'

If you copy a file from a non-Windows system to a Windows system, the file will not contain CR characters, but a Python app on Windows will still read it correctly. If you write it out again, however, the new file will have CRLF line endings, which differ from the original's.
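
If the rewritten file should keep LF-only endings on every platform, one possible approach is to pass `newline="\n"` to `open()` when writing, which turns off the translation of `\n` on output. This is a sketch under that assumption; `write_lf` and the temporary path are hypothetical names, not part of any library.

```python
import os
import tempfile

def write_lf(path, text):
    """Write text so that '\\n' stays a single LF byte on any platform."""
    # newline="\n" disables newline translation when writing.
    with open(path, "w", newline="\n", encoding="utf-8") as f:
        f.write(text)

# Temporary location chosen only for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "lf_only.txt")
write_lf(out_path, "A\nB")

# Read the result back in binary mode to see the bytes actually written.
with open(out_path, "rb") as f:
    written = f.read()
print(written)  # b'A\nB'
```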

If you copy a file from Windows to a non-Windows system, the non-Windows Python app reading it will end up with strings containing extra \r characters, which should be stripped before the strings are actually used.

So I guess the conclusion is: to ensure that things work as expected across all environments, always make sure your Python strings contain no \r characters at all.
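
A minimal sketch of that rule, assuming it is acceptable to treat a lone \r as a line break too (that choice is mine, not something tested above). The function name `normalize_newlines` is hypothetical.

```python
def normalize_newlines(text):
    """Return text with every CRLF and every lone CR replaced by LF."""
    # Replace CRLF first so a pair does not turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_newlines("A\r\nB")))  # 'A\nB'
print(repr(normalize_newlines("A\nB")))    # 'A\nB'
```

Running every string through such a helper right after reading means the rest of the application only ever sees \n.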

A complication arises when you are processing HTTP streams, where header lines are terminated with CRLF while the body may use only LF, so each part has to be handled accordingly.
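
One way to handle this is to work on the raw bytes, split the headers from the body at the first blank line (CRLF CRLF), and only then split the header block on CRLF, leaving the body untouched. The sketch below assumes the whole message is already in memory; `split_http_message` and the sample message are made up for illustration.

```python
def split_http_message(raw):
    """Split raw HTTP bytes into (list of header lines, body bytes)."""
    # Headers end at the first empty line, i.e. two CRLF pairs in a row.
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line terminating the headers")
    # Header lines are CRLF-terminated; the body is left exactly as received.
    header_lines = head.decode("iso-8859-1").split("\r\n")
    return header_lines, body

# Hypothetical sample: CRLF in the headers, LF only in the body.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nline one\nline two\n"
headers, body = split_http_message(raw)
print(headers)  # ['HTTP/1.1 200 OK', 'Content-Type: text/plain']
print(body)     # b'line one\nline two\n'
```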

Note: for the beginner, in files, CR (carriage return) is \x0D or \r, and LF (line feed) is \x0A or \n.