Ruby / CSV - Convert Dos to Unix - ruby

I currently use http://emacswiki.org/emacs/DosToUnix to manually convert DOS CSVs to UNIX. Is there a Ruby function in the CSV library that I'm missing? And/or is it possible to build a quick script or monkey patch?

Yes. The CSV docs say:
The String appended to the end of each row. This can be set to the special :auto setting, which requests that CSV automatically discover this from the data. Auto-discovery reads ahead in the data looking for the next "\r\n", "\n", or "\r" sequence.
:auto is the default, so you should be able to feed your DOS CSV to Ruby unmodified.
However, if you want to convert to UNIX line endings:
str.gsub(/\r\n/, "\n")
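As a minimal sketch of both points above, the snippet below parses a DOS-style string with the default row separator and then converts it to UNIX line endings. The sample data and variable names are made up for illustration.

```ruby
require 'csv'

# Hypothetical sample data with DOS (CRLF) line endings.
dos_csv = "name,city\r\nAnn,Oslo\r\nBob,Rome\r\n"

# The default row_sep (:auto) detects the CRLF terminator,
# so the fields come back without a trailing "\r".
rows = CSV.parse(dos_csv)

# Convert to UNIX line endings if the text itself must change.
unix_csv = dos_csv.gsub(/\r\n/, "\n")
```

To convert a file on disk, the same gsub can be applied to the result of File.read before writing it back out.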

Related

std::endl equivalent on Ruby?

I just can't find it.
Found you can remove them with chomp, but not how to create them.
There is a global variable $/ which represents the input record separator (it defaults to newline, "\n").
>> $/
=> "\n"
Methods like Kernel#gets use this to determine input boundary.
As long as you work with files in text mode (the default), Ruby itself does the translation of the operating system's end-of-line character sequences to "\n" in Ruby:
When reading from a file in text mode, all line endings will appear as "\n".
When writing to a file in text mode, all newline characters "\n" will be written as the operating system's end-of-line character sequence.
So for all practical purposes when dealing with files in text mode, you can use "\n" as a constant to mean the OS-specific line ending, like std::endl.
Source: How to make your Ruby code work on Windows PCs, section "Get your file modes right".
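A small sketch of the idea, using StringIO in place of a real file so it runs anywhere (the sample strings are made up): gets splits on the default record separator, chomp removes it, and puts appends "\n".

```ruby
require 'stringio'

# Stand-in for a file opened in text mode.
input = StringIO.new("first\nsecond\n")

lines = []
while (line = input.gets)   # gets splits on the record separator ("\n" by default)
  lines << line.chomp       # chomp strips the trailing separator
end

output = StringIO.new
output.puts "done"          # puts appends "\n" when the string lacks one
```

With a real file in text mode on Windows, the "\n" written by puts would be translated to the platform's line ending as described above.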

Windows Perl --> Unix not working after port, possible encoding issue

I've got a Perl program that I wrote on Windows. It starts with:
$unused_header = <STDIN>;
my @header_fields = split('\|\^\|', $unused_header, -1);
Which should split input that consists of a very large file of:
The|^|Quick|^|Brown|^|Fox|!|
Into:
{The, Quick, Brown, Fox|!|}
Note: This line handles the header alone; there's another one like it for the repetitive data lines.
It worked great on windows, but on linux it fails. However, if I define a string with the same contents within Perl, and run the split on that, it works fine.
I think it's a UTF-16 encoding handling issue, but I'm not sure how to handle it. Does anyone know how I can get perl to understand the UTF-16 being piped into STDIN?
I found: http://www.haboogo.com/matching_patterns/2009/01/utf-16-processing-issue-in-perl.html but I'm not sure what to do with it.
If STDIN is UTF-16, use one of the following
binmode(STDIN, ':encoding(UTF-16le)'); # Byte order used by Windows.
binmode(STDIN, ':encoding(UTF-16be)'); # The other byte order.
binmode(STDIN, ':encoding(UTF-16)'); # Use BOM to determine byte order.
Tom has written a lengthy answer about Perl and Unicode. It contains some boilerplate code to properly and fully support UTF-8, but you can substitute UTF-16 as needed.
I doubt it's a UTF-xx encoding issue, as neither Windows Perl nor Unix Perl will try to read data with those encodings unless you tell it to.
If the Unix script is reading the exact same file as the Windows script but behaves differently, maybe it's a line-ending issue. The dos2unix command on most Unix-y systems can change the line endings in a file, or you can strip off the line endings yourself in the Perl script:
$unused_header = <STDIN>;
$unused_header =~ s/\r?\n$//; # chop \r\n (Windows) or \n (Unix)

Is it possible to specify newline type while reading a file in ruby

I frequently deal with UTF-16LE files encoded on windows which have a \r\n carriage return. There is no problem converting the file to UTF-8 by using:
File.new(filepath, 'r:utf-16le:utf-8')
But this of course does not get rid of the \r. The way I currently get rid of them is with
str.gsub("\r", "")
But it would be nice to take care of it while reading the file in. String#encode has :cr_newline, :crlf_newline, and :universal_newline options which convert all newlines to a desired kind of newline. Is there a way to apply these or similar options while reading in a file?
The method IO#gets takes an optional argument that allows you to pass a string to define how to separate the lines:
file = File.new(filepath, 'r:utf-16le:utf-8')
while (line = file.gets("\r\n"))
...
end
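Note that each line returned by gets("\r\n") still ends with the separator unless it is chomped. Another possible approach, sketched here on an in-memory string rather than a file, is to apply the :universal_newline option mentioned in the question during the UTF-16LE to UTF-8 conversion. The sample bytes are made up for illustration.

```ruby
# Hypothetical stand-in for the raw bytes of a UTF-16LE file with CRLF endings.
utf16_bytes = "one\r\ntwo\r\n".encode('UTF-16LE').b

# Tag the bytes as UTF-16LE, then transcode to UTF-8 while
# normalising CRLF (and bare CR) to LF in the same step.
text = utf16_bytes.force_encoding('UTF-16LE')
                  .encode('UTF-8', universal_newline: true)

lines = text.lines.map(&:chomp)
```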

ruby mechanize: how read downloaded binary csv file

I'm not very familiar with using ruby on binary data. I'm using mechanize to download a large number of csv files to my local disk. I then need to search these files for specific strings.
I use the save_as method in mechanize to save the file (which saves the file as binary). The content type of the file (according to mechanize) is:
application/vnd.ms-excel;charset=x-UTF-16LE-BOM
From here, I'm not sure how to read the file. I've tried reading it in as a normal file in ruby, but I just get the binary data. I've also tried just using standard unix tools (strings/grep) to try and search without any luck.
When I run the 'file' command on one of the files, I get:
foo.csv: Little-endian UTF-16 Unicode Pascal program text, with very long lines, with CRLF, CR, LF line terminators
I can see the data just fine with cat or vi. With vi I also see some control characters.
I've also tried both the csv and fastercsv ruby libraries, but I get 'IllegalFormatError' exception for these. I've also tried this solution without any luck.
Any help would be greatly appreciated. Thanks.
You can use the 'iconv' command to convert to UTF-8:
# iconv -f 'UTF-16LE' -t 'UTF-8' bad_file.csv > good_file.csv
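There is also a wrapper for iconv in the standard library (the Iconv library, which was removed from the standard library in Ruby 2.0; String#encode covers the same conversion); you could use that to convert the file after reading it into your program.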
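A rough sketch of doing the conversion inside Ruby with String#encode, assuming the downloaded file really is UTF-16LE with a leading BOM as the content type suggests. The bytes are built in memory here so the example is self-contained; with a real file they would come from File.binread. The sample data and the search term are made up.

```ruby
require 'csv'

# Hypothetical stand-in for File.binread('foo.csv'): UTF-16LE with a BOM.
raw = "\uFEFFname,city\r\nAnn,Oslo\r\n".encode('UTF-16LE').b

# Tag the bytes with their real encoding, transcode to UTF-8,
# then drop the byte order mark if one is present.
text = raw.force_encoding('UTF-16LE').encode('UTF-8')
text = text.sub(/\A\uFEFF/, '')

rows = CSV.parse(text)

# Search the parsed rows for a specific string.
matches = rows.select { |row| row.include?('Oslo') }
```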

Ruby: cannot parse Excel file exported as CSV in OS X

I'm using Ruby's CSV library to parse some CSV. I have a seemingly well-formed CSV file that I created by exporting an Excel file as CSV.
However CSV.open(filename, 'r') causes a CSV::IllegalFormatError.
There are no rogue commas or quotation marks in the file, nor anything else that I can see that might cause problems.
I suspect the problem could be to do with line endings. I am able to parse data entered manually via a text editor (Aquamacs). It is just when I try with data exported from Excel (for OS X) that problems occur. When I open up the exported CSV in vim, all the text appears on one line, with ^M appearing between lines.
From the docs, it seems that you can provide open with a row separator; however I am unsure what it should be in this case.
Try: CSV.open('filename', 'r', ?,, ?\r)
As cantlin notes, for Ruby 2 it's:
CSV.open('file.csv', 'r', :col_sep => ?,, :row_sep => ?\r)
I'm pretty sure these will do the right thing for you. You can also "fix" the file itself (in which case keep your original open call) with the following vim command: :%s/\r/\r/g
Yes, I know that command looks like a total no-op, but it will work.
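For illustration, here is a minimal sketch of the row separator option on an in-memory string with CR-only line endings, using the keyword-argument form of the CSV API. The sample data is made up.

```ruby
require 'csv'

# Hypothetical sample in the old Mac style: rows terminated by a bare "\r".
mac_csv = "name,city\rAnn,Oslo\rBob,Rome\r"

# Tell the parser explicitly that rows end with a carriage return.
rows = CSV.parse(mac_csv, row_sep: "\r")
```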
Stripping \r characters seemed to work for me
CSV.parse(File.read('filename').gsub(/\r/, '')) do |row|
...
end
Another option is to open the CSV file or the original spreadsheet in Excel and save it as "Windows Comma Separated" rather than "Comma Separated Values". This will output the file with line endings that FasterCSV is able to understand.
"""
When I open up the exported CSV in vim, all the text appears on one line, with ^M appearing between lines.
From the docs, it seems that you can provide open with a row separator; however I am unsure what it should be in this case.
"""
Read back a sentence ... ^M means keyboard Ctrl-M aka '\x0D' (M is the 13th letter of the ASCII alphabet; 0x0D == 13) aka ASCII CR (carriage return) aka '\r' ... IOW what Macs used to use as a line terminator before OS X.
It seems newer versions of the CSV parser and/or the components it uses read DOS/Windows line endings without issues. Mac OS X's stock Ruby (not sure of the version) was not cutting it; after I installed Ruby 2.0.0, it parsed the file just fine without the special arguments.
I had a similar problem. I got an error:
"error_message"=>"Illegal quoting in line 1.", "error_class"=>"CSV::MalformedCSVError"
The problem was that the file had Windows line endings, which of course differ from Unix ones. What helped me was defining row_sep: "\r\n":
CSV.open(path, 'w', headers: :first_row, col_sep: ';', quote_char: '"', row_sep: "\r\n")
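A short round-trip sketch of the same options on in-memory data (the values are made up): generate a semicolon-separated CSV with Windows line endings, then parse it back with a matching row_sep.

```ruby
require 'csv'

# Write two rows with ';' as the column separator and CRLF as the row separator.
data = CSV.generate(col_sep: ';', row_sep: "\r\n") do |csv|
  csv << ["name", "city"]
  csv << ["Ann", "Oslo"]
end

# Parse it back using the same separators.
rows = CSV.parse(data, col_sep: ';', row_sep: "\r\n")
```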
