reading header from a UTF8 CSV file using Ruby - ruby

I am trying to read a CSV file in Ruby 1.9.3 (I am not using Rails.)
sessions = CSV.read("c:/scripts/ruby/testcsvencoding.csv", :headers => true,
:encoding => "UTF-8")
sessions.each do | session |
p session['col1'] <-- does not work
p session[0] <--- works
end
The file contains:
col1, col2
a,1
b,2
I saw what seems like "Avoding “Invalid byte sequence in UTF-8″ with Ruby and CSV files", but it may not be the same problem as mine.
When I try the workaround there I get an error.
Is there any way to solve this? Is this a known problem?
This is on Windows

That error means there's a bad utf-8 byte sequence in your data. If that bothers you, fix the data. Otherwise try ascii-8bit.

Related

Stripping byte order mark from zip stream in ruby

I know that File supposedly takes encoding: 'bom|utf-8' but afaict there's no equivalent for streams. My server is getting a zip file containing one csv that has the bom. It seems silly to save the csv as a file vs just using CSV.new(Zip::InputStream::open(zip_file).get_next_entry.get_input_stream), but afaict none of those can detect and strip the byte order mark (bom) and CSV fails trying to parse the header if the bom is there.
I see that CSV.new takes encoding as an option, but, in 2.3.0 at least, it doesn't recognize bom (ArgumentError: unknown encoding name - bom)
Looks like handling the BOM is implemented in IO - maybe you can wrap your zip stream around an IO object?
https://ruby-doc.org/core-2.3.1/IO.html#method-c-new-label-Open+Mode
Since you can rewind streams, the answer is to get the first chars, see if they're the bom, if they are, consume them; otherwise, rewind the stream.
BYTE_ORDER_MARKS_LENGTHS =
{"\xEF".bytes.first => 2, "\xFE".bytes.first => 1, "\xFF".bytes.first => 1}
# checks if input_stream starts with a byte order mark and if so skips over it
def skip_bom(input_stream)
entry = BYTE_ORDER_MARKS_LENGTHS[input_stream.read(1).bytes.first]
if entry
input_stream.read(entry)
else
input_stream.rewind
end
end
My situation was similar, but I also needed to remove extra double-quotes:
Zip::File.open(zipfolder) do |zipfile|
zipfile.each do |zip_entry|
zip_entry.get_input_stream.each_line do |line|
line_without_bom_or_quotes = line.force_encoding('UTF-8').gsub('"', '')
row = CSV.parse_line(line_without_bom_or_quotes)
puts "DETAIL: #{row.inspect}"
end
end
end

Why is Ruby failing to convert CP-1252 to UTF-8?

I have a CSV files saved from Excel which is CP-1252/Windows-1252. I tried the following, but it still comes out corrupted. Why?
csv_text = File.read(arg[:file], encoding: 'cp1252').encode('utf-8')
# csv_text = File.read(arg[:file], encoding: 'cp1252')
csv = CSV.parse csv_text, :headers => true
csv.each do |row|
# create model
p model
The result
>rake import:csv["../file.csv"] | grep Brien
... name: "Oâ?TBrien ...
However it works in the console
> "O\x92Brien".force_encoding("cp1252").encode("utf-8")
=> "O'Brien"
I can open the CSV file in Notepad++, Encoding > Character Sets > Western European > Windows-1252, see the correct characters, then Encoding > Convert to UTF-8. However, there are many files an I want Ruby to handle this.
Similar: How to change the encoding during CSV parsing in Rails. But this doesn't explain why this is failing.
Ruby 2.4, Reference: https://ruby-doc.org/core-2.4.3/IO.html#method-c-read
Wow, it was caused by the shitty grep in DevKit.
>rake import:csv["../file.csv"]
... name: "O'Brien ...
>where grep
C:\DevKit2\bin\grep.exe
I also did not need the .encode('utf-8').
Let that be a lesson kids. Never take anything for granted. Trust no one!

Problems with Ruby encoding in Windows

I wrote a simple code that reads an email from MS-Outlook, using 'win32ole', and then save its subjects to an CSV file. Everything goes well except the encoding system. When I open my CSV file the words such as "André" are printed as "Andr\x82". I want my output format to be equal to my input.
# encoding: 'CP850'
require 'win32ole'
require 'CSV'
Encoding.default_external = 'CP850'
ol = WIN32OLE.new('Outlook.Application')
inbox = ol.GetNamespace("MAPI").GetDefaultFolder(6)
email_subjecs = []
inbox.Items.each do |m|
email_subjects << m.Subject
end
CSV.open('MyFile.csv',"w") do |csv|
csv << email_subjects
end
O.S: Windows 7 64bit
Encoding.default_external -> CP850
Languadge -> PT
ruby -v -> 1.9.2p290 (2011-07-09) [i386-mingw32]
It seems a simple problem related to external windows encoding and I tryied many solution posted here but I realy can't solve this.
1) Your file name is missing a closing quote.
2) The default open mode for CSV.open() is 'rb', so you can't possibly write to a file with the code you posted.
3) You didn't post the encoding of the text you are trying to write to the file.
4) You didn't post the encoding that you want the the data to be written in.
5)
When I open my CSV file the words such as "é" are printed as "\x82"
Tell your viewing device not to do that.
The magic comment only sets the encoding the current (.rb) file should be read as. It does not set default_external. Try set RUBYOPT=-E utf-8, open your file with CSV.open('MyFile.csv', encoding: 'UTF-8'), or set Encoding.default_external at the top of your file (discouraged).

Illegal quoting in line 1. (CSV::MalformedCSVError)

I'm getting a Illegal quoting in line 1. (CSV::MalformedCSVError) when I try to read a CSV that I download using Selenium WebDriver.
CSV.foreach( "foo.csv" ) do |row|
# anger :(
end
but when I copy the contents and paste it into a new file and save it again, it works just fine:
CSV.foreach( "bar.csv" ) do |row|
# works fine
end
Here's the first 5 lines of the CSV in question, in case it helps...
"Name","W","L","ERA","GS","G","SV","IP","H","ER","HR","SO","BB","WHIP","K/9","BB/9","FIP","WAR","playerid"
"Craig Kimbrel","5","1","1.79","0","65","35","65.0","42","13","4","95","19","0.95","13.16","2.65","1.84","1.7","6655"
"Aroldis Chapman","2","1","1.93","0","30","27","30.0","18","6","2","47","12","0.99","14.24","3.56","2.22","0.6","10233"
"Greg Holland","5","2","2.39","0","65","34","65.0","47","17","5","83","21","1.05","11.53","2.95","2.48","1.3","7196"
"Kenley Jansen","5","2","2.16","0","65","32","65.0","46","16","6","86","19","1.00","11.97","2.64","2.51","0.9","3096"
I haven't been able to find or come up with a way to get my raw, selenium-downloaded CSV to be read correctly. Anyone run into this or have any ideas on what could be wrong with my data, or how I can fix this programmatically?
Thank you!
It's very likely that your file has a byte-order mark U+FEFF at the very beginning. You are probably losing it when you copy and paste again.
The proper solution is:
CSV.foreach("foo.csv", "r:bom|utf-8") { ... }

Read binary file as string in Ruby

I need an easy way to take a tar file and convert it into a string (and vice versa). Is there a way to do this in Ruby? My best attempt was this:
file = File.open("path-to-file.tar.gz")
contents = ""
file.each {|line|
contents << line
}
I thought that would be enough to convert it to a string, but then when I try to write it back out like this...
newFile = File.open("test.tar.gz", "w")
newFile.write(contents)
It isn't the same file. Doing ls -l shows the files are of different sizes, although they are pretty close (and opening the file reveals most of the contents intact). Is there a small mistake I'm making or an entirely different (but workable) way to accomplish this?
First, you should open the file as a binary file. Then you can read the entire file in, in one command.
file = File.open("path-to-file.tar.gz", "rb")
contents = file.read
That will get you the entire file in a string.
After that, you probably want to file.close. If you don’t do that, file won’t be closed until it is garbage-collected, so it would be a slight waste of system resources while it is open.
If you need binary mode, you'll need to do it the hard way:
s = File.open(filename, 'rb') { |f| f.read }
If not, shorter and sweeter is:
s = IO.read(filename)
To avoid leaving the file open, it is best to pass a block to File.open. This way, the file will be closed after the block executes.
contents = File.open('path-to-file.tar.gz', 'rb') { |f| f.read }
how about some open/close safety.
string = File.open('file.txt', 'rb') { |file| file.read }
Ruby have binary reading
data = IO.binread(path/filaname)
or if less than Ruby 1.9.2
data = IO.read(path/file)
on os x these are the same for me... could this maybe be extra "\r" in windows?
in any case you may be better of with:
contents = File.read("e.tgz")
newFile = File.open("ee.tgz", "w")
newFile.write(contents)
You can probably encode the tar file in Base64. Base 64 will give you a pure ASCII representation of the file that you can store in a plain text file. Then you can retrieve the tar file by decoding the text back.
You do something like:
require 'base64'
file_contents = Base64.encode64(tar_file_data)
Have look at the Base64 Rubydocs to get a better idea.
Ruby 1.9+ has IO.binread (see #bardzo's answer) and also supports passing the encoding as an option to IO.read:
Ruby 1.9
data = File.read(name, {:encoding => 'BINARY'})
Ruby 2+
data = File.read(name, encoding: 'BINARY')
(Note in both cases that 'BINARY' is an alias for 'ASCII-8BIT'.)
If you can encode the tar file by Base64 (and storing it in a plain text file) you can use
File.open("my_tar.txt").each {|line| puts line}
or
File.new("name_file.txt", "r").each {|line| puts line}
to print each (text) line in the cmd.

Resources