Ruby: cannot parse Excel file exported as CSV in OS X - ruby

I'm using Ruby's CSV library to parse some CSV. I have a seemingly well-formed CSV file that I created by exporting an Excel file as CSV.
However CSV.open(filename, 'r') causes a CSV::IllegalFormatError.
There are no rogue commas or quotation marks in the file, nor anything else that I can see that might cause problems.
I suspect the problem could be to do with line endings. I am able to parse data entered manually via a text editor (Aquamacs). It is just when I try with data exported from Excel (for OS X) that problems occur. When I open up the exported CSV in vim, all the text appears on one line, with ^M appearing between lines.
From the docs, it seems that you can provide open with a row separator; however I am unsure what it should be in this case.

Try: CSV.open('filename', 'r', ?,, ?\r)
As cantlin notes, for Ruby 2 it's:
CSV.new('file.csv', 'r', :col_sep => ?,, :row_sep => ?\r)
I'm pretty sure these will DTRT for you. You can also "fix" the file itself (in which case keep the old open) with the following vim command: :%s/\r/\r/g
Yes, I know that command looks like a total no-op, but it will work.

Stripping \r characters seemed to work for me
CSV.parse(File.read('filename').gsub(/\r/, '')) do |row|
...
end

Another option is to open the CSV file or the original spreadsheet in Excel and save it as "Windows Comma Separated" rather than "Comma Separated Values". This will output the file with line endings that FasterCSV is able to understand.

"""
When I open up the exported CSV in vim, all the text appears on one line, with ^M appearing between lines.
From the docs, it seems that you can provide open with a row separator; however I am unsure what it should be in this case.
"""
Read back a sentence ... ^M means keyboard Ctrl-M aka '\x0D' (M is the 13th letter of the ASCII alphabet; 0x0D == 13) aka ASCII CR (carriage return) aka '\r' ... IOW what Macs used to use as a line terminator before OS X.

It seems newer versions of the CSV parser and/or any component it uses read DOS/Windows line endings without issues. Mac OS X's stock one (not sure the version) was not cutting it, installed Ruby 2.0.0 and it parsed the file just fine, without the special arguments...

I had similar problem. I got an error:
"error_message"=>"Illegal quoting in line 1.", "error_class"=>"CSV::MalformedCSVError"
The problem was the file had Windows line endings, which are of course other than Unix. What helped me was defining row_sep: "\r\n":
CSV.open(path, 'w', headers: :first_row, col_sep: ';', quote_char: '"', row_sep: "\r\n")

Related

Ruby Write txt file DOS/Windows

Maybe this is a beginner question, but I could not find the problem yet.
I need to write a text file with Ruby.
I can write and create the file to export, but the time I export the file and it is read in other software, it tells me it is a UNIX file and the program requires it to be DOS / Windows.
How can I do this with Ruby?
I use Rails 4 in the project.
Example of how I am writing.
File.open(filePath, "w+"){ |file| file.write("blablabla\n")}
Use \r\n instead:
File.open(filePath, "w+"){ |file| file.write("blablabla\r\n")}
Using \n (0x0a) only is 'unix style'.
Using \r\n (0x0d 0x0a) is 'windows style'.
Although most software should be able to handle both.
It isn't very clearly documented but File.open also accepts these String#encode options:
File.open('a.txt', 'w+', crlf_newline: true){ |file| file.write("blablabla\n")}
and
File.open('a.txt', 'w+', newline: :crlf){ |file| file.write("blablabla\n")}
Either will force Ruby to write CRLF instead of LF (CR is \r and LF is \n).

Illegal Quoting error with Ruby CSV parsing

I know there's lots of similar questions but I haven't found a solution yet. I'm trying to use the CSV parsing library with Ruby 1.9.1 but I keep getting:
/usr/lib/ruby/1.9.1/csv.rb:1925:in `block (2 levels) in shift': Illegal quoting in line 1. (CSV::MalformedCSVError)
My CSV files were created in Windows 7 but it's Ubuntu 12.04 that I'm using to run the Ruby script, which looks like this:
require 'csv'
CSV.foreach('out.csv', :col_sep => ';') do |row|
puts row
end
Nothing complicated, just a test, so I assumed it must be the Windows control characters causing problems. Vim shows up this:
"Part 1";;;;^M
;;;;^M
;;;;^M
Failure to Lodge Income Tax Return(s);;;;^M
NAME;ADDRESS;OCCUPATION;"NO OF CHARGES";"FINE/PENALTY £"^M
some name;"some,address";Bookkeeper;3;1,250.00^M
some name;"some,address";Haulier;1;600.00^M
some name;"some,address";Scaffolding Hire;1;250.00^M
some name;"some,address";Farmer;2;500.00^M
some name;"some,address";Builder;2;3000.00
I've tried removing those control characters for carraige returns that Windows added (^M), but %s/^V^M//g and %s/^M//g result in no pattern found. If I run %s/\r//g then the ^M characters are removed, but the same error still persists when I run the Ruby script. I've also tried running set ffs=unix,dos but it has no effect. Thanks.
Update:
If I remove the double quotes around the Part 1 on the first line, then the script prints out what it should and then throws a new error: Unquoted fields do not allow \r or \n (line 10). If I then remove the \r characters, the script runs fine.
I understand that I would have to remove the \r characters, but why will it only work if I unquote the first value?
The problem causing the Illegal quoting error was due to a Byte-Order-Mark (BOM) at the very beginning of the file. It didn't show up in editors, but the Ruby CSV lib was choking on it unless :encoding => 'bom|utf-8' was set.
Once that was fixed, I still needed to remove all the '^M' characters by running %s/\r//g in vim. And everything was working fine after that.

Excel saves tab delimited files without newline (UNIX/Mac os X)

This is a common issue I have and my solution is a bit brash. So I'm looking for a quick fix and explanation of the problem.
The problem is that when I decide to save a spreadsheet in excel (mac 2011) as a tab delimited file it seems to do it perfectly fine. Until I try to parse the file line by line using Perl. For some reason it slurps the whole document in one line.
My brutish solution is to open the file in a web browser and copy and paste the information into the tab delimited file in TextEdit (I never use rich text format). I tried introducing a newline in the end of the file before doing this fix and it does not resolve the issue.
What's going on here? An explanation would be appreciated.
~Thanks!~
The problem is the actual character codes that define new lines on different systems. Windows systems commonly use a CarriageReturn+LineFeed (CRLF) and *NIX systems use only a LineFeed (LF).
These characters can be represented in RegEx as \r\n or \n (respectively).
Sometimes, to hash through a text file, you need to parse New Line characters. Try this for DOS-to-UNIX in perl:
perl -pi -e 's/\r\n/\n/g' input.file
or, for UNIX-to-DOS using sed:
$ sed 's/$'"/`echo \\\r`/" input.txt > output.txt
or, for DOS-to-UNIX using sed:
$ sed 's/^M$//' input.txt > output.txt
Found a pretty simple solution to this. Copy data from Excel to clipboard, paste it into a google spreadsheet. Download google spreadsheet file as a 'tab-separated values .tsv'. This gets around the problem and you have tab delimiters with an end of line for each line.
Yet another solution ...
for a tab-delimited file, save the document as a Windows Formatted Text (.txt) file type
for a comma-separated file, save the document as a `Windows Comma Separated (.csv)' file type
Perl has a useful regex pattern \R which will match any common line ending. It actually matches any vertical whitespace -- the same as \v -- or the CR LF combination, so it's the same as \r\n|\v
This is useful here because you can slurp your entire file into a single scalar and then split /\R/, which will give you a list of file records, already chomped (if you want to keep the line terminators you can split /\R\K/ instead
Another option is the PerlIO::eol module. It provides a new Perl IO layer that will normalize line endings no matter what the contents of the file are
Once you have loaded the module with use PerlIO::eol you can use it in an open statement
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
or you can use the open pragma to set it as the default layer for all input file handles
use open IN => ':raw:eol(LF)';
which will work fine with an input file from any platform

Ruby / CSV - Convert Dos to Unix

I currently use: http://emacswiki.org/emacs/DosToUnix to manually convert DOS CSVs to UNIX. Just wondering if there's a ruby function for the CSV library that I'm missing? And / or if it's possible build a quick script / Monkey Patch.
Yes. The CSV docs say:
The String appended to the end of each row. This can be set to the special :auto setting, which requests that CSV automatically discover this from the data. Auto-discovery reads ahead in the data looking for the next "\r\n", "\n", or "\r" sequence.
:auto is the default, so you should be able to feed your DOS CSV to Ruby unmodified.
However, if you want to convert to UNIX line endings:
str.gsub(/\r\n/, "\n")

Ruby on a Mac -- Regular Expression Spanning Two Lines of Text

On the PC, the following Ruby regular expression matches data. However, when run on the Mac against the same input text file, no matches occur. Am I matching line returns in a way that should work cross-platform?
data = nil
File.open(ARGV[0], "r") do |file|
data = file.readlines.join("").scan(/^Name: (.*?)[\r\n]+Email: (.*?)$/)
end
Versions
PC: ruby 1.9.2p135
Mac: ruby 1.8.6
Thank you,
Ben
The problem was the ^ and $ pattern characters! Ruby doesn't consider \r (a.k.a. ^M) a line boundary. If I modified my pattern, replacing both ^ and $ with "\r", the pattern matched as desired.
data = file.readlines.join.scan(/\rName: (.*?)\rEmail: (.*?)\r/)
Instead of modifying the pattern, I opted to do a gsub on the text, replacing \r with \n before calling scan.
data = file.readlines.join.gsub(/\r/, "\n").scan(/^Name: (.*?)\nEmail: (.*?)$/)
Thank you each for your responses to my question.
When going from Windows -> Unix based (MAC) I've had this issue: ^M =? \r\n. The Carriage return gets rendered as a Control-M which may or may not be interpreted correctly by your regexp~
On Unix (OS X is a Unix), end of lines are \n, not \r\n. Putting simply [\n] will work on Mac.
To have a cross-platform script, may be you could first replace each \r\n sequence by a \n character?

Resources