What's with Ruby's ZipInputStream screwing up my line endings? - ruby

I'd be happy with ZipInputStream taking indecent liberties with the line endings stored in a file if it would at least get them right for the platform I'm storing the file on. Unfortunately, when I pull a text file (.txt, .cpp, etc.) out of a zip, each \n (0x0A) gets replaced with \r\n (0x0D 0x0A) and, as you can imagine, this is causing me a great deal of trouble.
Is there a flag I can set to tell it either to avoid changing the line endings altogether or to use one of my choosing?
Thanks.
(I've checked the zip file, my creation of it, etc. I've extracted it using other zip tools and verified that it is archived properly. I've stepped through my project with rdebug and seen that the ZipInputStream call to read() is returning \r\n for line endings.)

If you have an open(filename) or open(filename, "r") call in your code, try replacing it with open(filename, "rb"). On Windows, text mode translates line endings, while binary mode leaves the bytes untouched.
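A minimal sketch of the binary-mode idea, assuming the extracted data ends up being written with plain File calls. The helper name write_verbatim and the file name extracted.txt are hypothetical; the point is only that "wb" and File.binread skip any newline translation.

```ruby
require "tmpdir"

# Hypothetical helper: write data exactly as given, with no newline translation.
def write_verbatim(path, data)
  File.open(path, "wb") { |f| f.write(data) }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "extracted.txt") # hypothetical file name
  write_verbatim(path, "line one\nline two\n")

  # Read the raw bytes back and check whether any CR sneaked in.
  raw = File.binread(path)
  puts raw.include?("\r") ? "CR present" : "LF only"
end
```

The same "b" flag applies on the reading side: open(filename, "rb") returns the bytes as stored.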

Related

Why would an auto conversion of LF to CRLF by Xerces result in CRCRLF?

From the Xerces documentation on setNewLine, “However, Xerces-C++ always uses LF when this property is set to null since otherwise automatic translation of LF to CR-LF on Windows for text files would result in such files containing CR-CR-LF. If you need Windows-style end of line sequences in your output, consider writing to a file opened in text mode or explicitly set this property to CR-LF.” That statement makes no sense to me.
https://xerces.apache.org/xerces-c/apiDocs-3/classDOMLSSerializer.html#a56882d2fe0b4a0ecb1b3968febbcf4a3
Why an auto conversion of line endings results in a duplicate CR is beyond me. I do not understand why that would ever be reasonable. I have tried changing the code to explicitly set the line ending to CR-LF as described in the documentation, and that does not work. I still end up with XML files that have CRCRLF as the line ending, and then I have to manually remove the duplicate CR with a text editor such as Notepad++.
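One way to read the quoted statement: if the serializer already emits CR-LF and the result is then written through a text-mode stream that turns every LF into CR-LF, the existing CR is left alone and a second one is added in front of the LF. The sketch below only simulates that translation in Ruby; text_mode_translate is a hypothetical stand-in for what a Windows text-mode stream does, not Xerces code.

```ruby
# Hypothetical stand-in for a Windows text-mode stream: every LF becomes CR LF.
def text_mode_translate(data)
  data.gsub("\n", "\r\n")
end

lf_output   = "<root/>\n"    # serializer emits LF
crlf_output = "<root/>\r\n"  # serializer emits CR-LF explicitly

p text_mode_translate(lf_output)   # => "<root/>\r\n"
p text_mode_translate(crlf_output) # => "<root/>\r\r\n"
```

Under that reading, setting the property to CR-LF only gives clean output when the file is written in binary mode.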

Weird txt behavior

I have a CentOS server. I cloned a GitHub repository, and that repository contains a .txt file with a single line. For some reason it behaves like this:
[root#0-0-0-0 Some]# cat some.txt
some text[root#0-0-0-0 Some]#
Also, while read i; do echo "$i"; done < some.txt doesn't see that line. What could cause that, and how can I avoid it? If I edit the file with vim, adding a new line and then deleting it (so the file still contains only one line), it starts to work properly.
The text file has no newline character at the end of it. Some programs will treat it as a valid text file whose last line doesn't happen to end in a newline. Others (apparently including bash's built-in read command, at least by default) will treat it as invalid, and perhaps ignore the last line (which isn't considered a "line" because it's not marked as one).
vim's default behavior is to quietly add a newline to the end of a file if you modify and save it.
You can add a newline to a file that lacks one by editing it with vim (or another editor that behaves similarly), or by adding it from the shell:
echo '' >> some.txt
In general, it's a good idea to ensure that text files end in a newline character in the first place, at least if they're intended to be used on UNIX-like systems.
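The same fix can be sketched in Ruby, for cases where the file is prepared by a script rather than by hand. ensure_trailing_newline is a hypothetical helper name; it appends a single LF only when the last byte is not already one.

```ruby
# Hypothetical helper: append "\n" if the file is non-empty and lacks a final newline.
# Returns true when a newline was added, false otherwise.
def ensure_trailing_newline(path)
  return false if File.zero?(path)

  last_byte = File.open(path, "rb") do |f|
    f.seek(-1, IO::SEEK_END) # jump to the final byte
    f.read(1)
  end
  return false if last_byte == "\n"

  File.open(path, "ab") { |f| f.write("\n") }
  true
end
```

Calling it twice on the same file is harmless, since the second call finds the newline and does nothing.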

InstallScript GetLine() can not read text file contains result from command prompt

My Installation needs to check the result of a command from cmd.exe. Thus, I redirect the result of the command to a text file and then try to read the file to get the result as follows:
// send command to cmd to execute and redirect the result to a text file
// try to read the file
szDir = "D:\\";
szFileName = "MyFile.txt";
if Is(FILEEXISTS, szDir ^ szFileName) then
    listID = ListCreate(STRINGLIST);
    if listID != LIST_NULL then
        if OpenFileMode(FILE_MODE_NORMAL) = 0 then
            if OpenFile(nFileHandle, szDir, szFileName) = 0 then
                // I run into problem here
                while (GetLine(nFileHandle, szCurLine) = 0)
                    ListAddString(listID, szCurLine, AFTER);
                endwhile;
                CloseFile(nFileHandle);
            endif;
        endif;
    endif;
endif;
The problem is that right after the command is executed and its result is redirected to MyFile.txt, I can set the open file mode and open the file, but I cannot read any text into my list. ListReadFromFile() does not help either. If I open the file, edit it, and save it manually, my script works.
After debugging, I figured that GetLine() returns an error code (-1) which means the file pointer must be at the end of file or other errors. However, FILE_MODE_NORMAL sets the file as read only and SET THE FILE POINTER AT THE BEGINNING OF THE FILE.
What did I possibly do wrong? Is this something to do with read/write access of the file? I tried this command without result:
icacls D:\MyFile.txt /grant Administrator:(R,W)
I am using InstallShield 2018 and Windows 10 64-bit, by the way. Your help is much appreciated.
EDIT 1: I suspected the encoding and tried a few things:
1. After running "wslconfig /l", Notepad++ shows the content of MyFile.txt as having no encoding, but it still appears normal and readable. I tried to convert the content to UTF-8, but that did not work.
2. If I add something to the file (echo This line is appended >> MyFile.txt), the encoding changes to UTF-8, but the content from step 1 changes as well. A NULL (\0) is added between every character and even replaces the newline character. Maybe this is why GetLine() failed to read the file.
Workaround: after step 1, I run "find "my_desired_content" MyFile.txt" > TempFile.txt and read TempFile.txt (which is encoded in UTF-8).
My ultimate goal is to check whether "my_desired_content" appears in the result of "wslconfig /l", so this is fine. However, what I don't understand is why MyFile.txt and TempFile.txt are both created from cmd commands yet are encoded differently.
The problem is due to the contents of the file. Assuming this is the file generated by your linked question, you can examine its contents in a hex editor to find out the following facts:
Its contents are encoded in UTF-16 (LE) without a BOM
Its newlines are encoded as CR or CR CR instead of CR LF
I thought the newlines would be more important than the text encoding, but it turns out I had it backwards. If I change each of these things independently, GetLine seems to function correctly for either CR, CR CR, or CR LF, but only handles UTF-16 when the BOM is present. (That is, in a hex editor, the file starts with FF FE 57 00 instead of 57 00 for a file starting with the character W.)
I'm at a bit of a loss for the best way to address this. If you're up for a challenge, you could read the file with FILE_MODE_BINARYREADONLY, and can use your extra knowledge about what should be in the file to ensure you interpret its encoding correctly. Note that for most of UTF-16, you can create a single code unit by combining two bytes in the following manner:
szResult[i] = (nHigh << 8) + nLow;
where nHigh and nLow are probably values like szBuffer[2*i + 1] and szBuffer[2*i], assuming you filled a STRING szBuffer by calling ReadBytes.
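To make the byte-pair arithmetic concrete, here is a small Ruby sketch of the same combination, not InstallScript. decode_utf16le_bmp is a hypothetical name, and like the formula above it only covers code units that stand alone (no surrogate pairs). The sample bytes are made up, not real wslconfig output.

```ruby
# Combine little-endian byte pairs into code units: (high << 8) + low.
# Assumption: input is BOM-less UTF-16LE containing only BMP characters.
def decode_utf16le_bmp(raw)
  units = raw.bytes.each_slice(2).map { |low, high| ((high || 0) << 8) + low }
  units.pack("U*") # build a UTF-8 string from the code points
end

raw = "W\x00S\x00L\x00\r\x00\n\x00".b # made-up sample bytes

manual  = decode_utf16le_bmp(raw)
builtin = raw.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)

p manual            # => "WSL\r\n"
p manual == builtin # => true
```

The comparison against Ruby's own transcoding is there only as a sanity check on the manual arithmetic.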
Other unproven ideas include editing it in binary to ensure the BOM (FF FE) is present, figuring out ways to ensure the file is originally created with the BOM, figuring out ways to create it in an alternate encoding, finding another command you can invoke to "fix" the file, or lodging a request with the vendor (my employer) and hoping the development team changes something to better handle this case.
Here's an easier workaround. If you can safely assume that the command will append UTF-16 characters without a signature, you can append this output to a file that has just a signature. How do you get such a file?
You could create a file with just the BOM in your development environment, and add it to your Support Files. If you need to use it multiple times, copy it around first.
You could create it with code. Just call the following (error checking omitted for clarity)
OpenFileMode(FILE_MODE_APPEND_UNICODE);
CreateFile(nFileHandle, szDir, szFileName);
CloseFile(nFileHandle);
and if szDir ^ szFileName didn't exist, it will now be a file with just the UTF-16 signature.
Assuming this file is called sig.txt, you can then invoke the command
wslconfig /l >> sig.txt to write to that file. Note the doubled >> for append. The resulting file will include the Unicode signature you created ahead of time, plus the Unicode data output from wslconfig, and GetLine should interpret things correctly.
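The resulting file layout can be illustrated outside InstallShield. The Ruby sketch below writes a two-byte signature (FF FE), appends BOM-less UTF-16LE text the way the redirected command would, and reads it back. read_utf16le is a hypothetical helper and the appended text is invented; this shows the byte layout only, not how GetLine behaves.

```ruby
require "tmpdir"

BOM_UTF16LE = "\xFF\xFE".b

# Hypothetical reader: strip the signature if present, then decode UTF-16LE.
def read_utf16le(path)
  raw = File.binread(path)
  raw = raw.byteslice(2, raw.bytesize - 2) if raw.start_with?(BOM_UTF16LE)
  raw.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "sig.txt")

  # Step 1: a file containing only the signature.
  File.binwrite(path, BOM_UTF16LE)

  # Step 2: stand-in for `wslconfig /l >> sig.txt` (invented content).
  appended = "Ubuntu\r\n".encode(Encoding::UTF_16LE).b
  File.open(path, "ab") { |f| f.write(appended) }

  p File.binread(path).bytes.first(4) # => [255, 254, 85, 0]
  p read_utf16le(path)                # => "Ubuntu\r\n"
end
```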
The biggest problem here is that this hardcodes around the behavior of wslconfig, and that behavior may change at any point. This is why Christopher alludes to recommending an API, and I agree completely. In the meantime, you could try to make this more robust by invoking it in a cmd /U (but my understanding of what that does or guarantees is fuzzy at best), or by trying the original way and then with the BOM.
This whole WSL thing is pretty new. I don't see any APIs for it, but rather than screen-scraping command output you might want to look at this registry key:
HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Lxss
It seems to have the list of installed distros that come from the store. Coming from the store probably explains why this is HKCU and not HKLM.
A brave new world.... sigh.

«Inconsistent Line Ending» in Visual Studio when editing from outside VS

I have written a script that checks out a file, changes a value in one line of it, and checks the code back in. But after that, when I open the file, I get this popup:
Inconsistent line Ending
The Line endings in the following file are
not consistent. Do you want to normalize the line ending.
Is there a way to avoid this? When I compare I do not see any difference. Would it cause any issues for compiling the program?
The problem is a mismatch in line-ending conventions. I bet the script you wrote inserts a line ending like \n. That is the «*nix» convention, usually called «LF». The Windows convention for a newline requires two characters for some (I guess historic) reason and is called «CR/LF». That is, your script needs to insert \r\n rather than just \n. For your interest, there is also a bare «CR» convention, i.e. \r, which was used on older Macs.
The message complains that the file now has mixed line endings. Most likely every line was «CR/LF» before, and now one line uses a different convention. A file should use the same convention throughout, regardless of whether it is the «Unix», «Mac», or «Windows» one.
When I compare I do not see any difference.
These are non-printable characters, and text-diff utilities usually do not show them.
Would it cause any issues for compiling the program?
It is unlikely to cause any compile problems. Anyway, now you know what the problem is and how to fix it.
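A short Ruby sketch of how such a mix could be detected and normalized before check-in. Both helper names are hypothetical, and the choice of CR/LF as the target simply follows the Windows convention discussed above.

```ruby
# Count each line-ending style in a string.
def line_ending_counts(text)
  {
    crlf: text.scan(/\r\n/).size,
    lf:   text.scan(/(?<!\r)\n/).size, # LF not preceded by CR
    cr:   text.scan(/\r(?!\n)/).size   # CR not followed by LF
  }
end

# Rewrite every line ending as CR LF. "\r\n" is tried first so it is not doubled.
def normalize_to_crlf(text)
  text.gsub(/\r\n|\r|\n/, "\r\n")
end

mixed = "first\r\nedited by script\nlast\r\n"
p line_ending_counts(mixed)
p normalize_to_crlf(mixed) # => "first\r\nedited by script\r\nlast\r\n"
```

When applying this to a real file, reading and writing in binary mode ("rb" / "wb") avoids a second layer of translation on Windows.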
This is the code I used in PowerShell to save the file and normalize its line endings:
$enc = New-Object System.Text.UTF8Encoding( $false ) # required to save the file with UTF8 Without BOM
$wrt = New-Object System.XML.XMLTextWriter( $phyicalPath, $enc )
$wrt.Formatting = 'Indented'
$webconfig.Save($wrt)
$wrt.Close()
(Get-Content $phyicalPath)|Set-Content -Path $phyicalPath -Force # normalize line ending

Ruby: cannot parse Excel file exported as CSV in OS X

I'm using Ruby's CSV library to parse some CSV. I have a seemingly well-formed CSV file that I created by exporting an Excel file as CSV.
However CSV.open(filename, 'r') causes a CSV::IllegalFormatError.
There are no rogue commas or quotation marks in the file, nor anything else that I can see that might cause problems.
I suspect the problem could be to do with line endings. I am able to parse data entered manually via a text editor (Aquamacs). It is just when I try with data exported from Excel (for OS X) that problems occur. When I open up the exported CSV in vim, all the text appears on one line, with ^M appearing between lines.
From the docs, it seems that you can provide open with a row separator; however I am unsure what it should be in this case.
Try: CSV.open('filename', 'r', ?,, ?\r)
As cantlin notes, for Ruby 2 the separators are passed as options:
CSV.open('file.csv', 'r', :col_sep => ?,, :row_sep => ?\r)
I'm pretty sure these will DTRT for you. You can also "fix" the file itself (in which case keep the old open) with the following vim command: :%s/\r/\r/g
Yes, I know that command looks like a total no-op, but it will work.
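For reference, a small self-contained sketch of the row_sep option with Ruby's standard csv library, using in-memory strings instead of a file. The sample data is invented; only CSV.parse and its row_sep option are relied on.

```ruby
require "csv"

old_mac = "name,qty\rapple,3\rpear,5\r"       # CR-only line endings
windows = "name,qty\r\napple,3\r\npear,5\r\n" # CR LF line endings

rows_cr   = CSV.parse(old_mac, row_sep: "\r")
rows_crlf = CSV.parse(windows, row_sep: "\r\n")

p rows_cr             # => [["name", "qty"], ["apple", "3"], ["pear", "5"]]
p rows_cr == rows_crlf # => true
```

The same option is accepted by CSV.open and CSV.foreach when reading from a file.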
Stripping \r characters seemed to work for me
CSV.parse(File.read('filename').gsub(/\r/, '')) do |row|
...
end
Another option is to open the CSV file or the original spreadsheet in Excel and save it as "Windows Comma Separated" rather than "Comma Separated Values". This will output the file with line endings that FasterCSV is able to understand.
"""
When I open up the exported CSV in vim, all the text appears on one line, with ^M appearing between lines.
From the docs, it seems that you can provide open with a row separator; however I am unsure what it should be in this case.
"""
Read back a sentence ... ^M means keyboard Ctrl-M aka '\x0D' (M is the 13th letter of the ASCII alphabet; 0x0D == 13) aka ASCII CR (carriage return) aka '\r' ... IOW what Macs used to use as a line terminator before OS X.
It seems newer versions of the CSV parser and/or the components it uses read these line endings without issues. Mac OS X's stock Ruby (I'm not sure which version) was not cutting it. I installed Ruby 2.0.0 and it parsed the file just fine, without the special arguments.
I had similar problem. I got an error:
"error_message"=>"Illegal quoting in line 1.", "error_class"=>"CSV::MalformedCSVError"
The problem was that the file had Windows line endings, which of course differ from Unix ones. What helped me was defining row_sep: "\r\n":
CSV.open(path, 'w', headers: :first_row, col_sep: ';', quote_char: '"', row_sep: "\r\n")

Resources