Ruby csv error : invalid byte sequence in UTF-8 - ruby

[sagar#BL-53 RcTools]$ irb
1.9.3p0 :001 > require 'csv'
=> true
1.9.3p0 :002 > master = CSV.read("./public/jobs/in/Appexchange_Applications_Companies_487.csv")
ArgumentError: invalid byte sequence in UTF-8
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1855:in `sub!'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1855:in `block in shift'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1849:in `loop'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1849:in `shift'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1791:in `each'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1805:in `to_a'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1805:in `read'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1411:in `block in read'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1354:in `open'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1411:in `read'
from (irb):2
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
1.9.3p0 :003 >
But when i do
1.9.3p0 :003 > master = CSV.open("./public/jobs/in/Appexchange_Applications_Companies_487.csv","r")
=> <#CSV io_type:File io_path:"./public/jobs/in/Appexchange_Applications_Companies_487.csv" encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\r\n" quote_char:"\"">
1.9.3p0 :004 >
I just want to know why this is happening and what is solution. And i want to read the csv because it returns an array of that csv.
So if i read file in first way like
master = CSV.read("./public/jobs/in/Appexchange_Applications_Companies_487.csv")
It returns me an array
1.9.3p0 :008 > master.class
=> Array
But in second case, class is CSV.
What is solution to read csv in first way.

Regarding the error: First, make sure that you are using the correct character encodings. If you do, then you have probably invalid data in your csv file. You can probably fix it using iconv (see link posted by Chetan Muneshwar).
Regarding the second part of your question: CSV.open just opens the file for reading but does no reading yet. CSV.read will open the file, read its contents, and close it again. So to just get the data out of the file use CSV.read.

Related

How do I parse a tab-delimited line that contains a quote?

I'm using Ruby 2.4. How do I parse a tab-delimited line that contains a quote character? This is what's happening to me now ...
2.4.0 :003 > line = "11\tDave\tO\"malley"
=> "11\tDave\tO\"malley"
2.4.0 :004 > CSV.parse(line, col_sep: "\t")
CSV::MalformedCSVError: Illegal quoting in line 1.
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1912:in `block (2 levels) in shift'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `each'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `block in shift'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `loop'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `shift'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1770:in `each'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `to_a'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `read'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1324:in `parse'
from (irb):4
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
from bin/rails:4:in `require'
from bin/rails:4:in `<main>'
Although teh example illustrates my point, I can't easily control the input coming in. So, although an answer coudl be< "Remove all quotes from teh string before parsing," I want to preserve the data as closely as possible.
That's a malformed document if you're trying to adhere to the CSV standard. Instad you might just brute-force it and pray there's no tabs in the data itself:
line.split(/\t/)
The CSV parsing library comes in handy when you're dealing with data like this:
"1\t2\t\"3a\t3b\"\t4"
Update: If you're prepared to abuse the CSV library a little then you can do this:
CSV.parse("11\tDave\tO\"malley", col_sep: "\t", quote_char: "\0")
That basically kills quote detection, so if there is other data that depends on that being processed correctly this may not work out.
"11\tDave\tO\"malley" is not valid CSV data. Strangely enough, the answer is to use two double-quotes, and to double quote each element
2.3.1 :001 > require 'csv'
=> true
2.3.1 :002 > line = "\"11\"\t\"Dave\"\t\"O\"\"malley\""
=> "\"11\"\t\"Dave\"\t\"O\"\"malley\""
2.3.1 :003 > puts line # for clarity
"11" "Dave" "O""malley"
=> nil
2.3.1 :004 > CSV.parse(line, col_sep: "\t")
=> [["11", "Dave", "O\"malley"]]

ruby - zlib header error when uncompressing tar.gz

I wanted to write a ruby program to unpackage tar.gz files, and I ran into some issues. After reading the documentation on the ruby-doc site, Zlib::GzipReader and Zlib::Inflate. Then I found this Module someone wrote on GitHub, and that didn't work either. So, using the examples from the Ruby page for Zlib::GzipReader, I these were able to run successfully.
irb(main):027:0* File.open('zlib_inflate.tar.gz') do |f|
irb(main):028:1* gz = Zlib::GzipReader.new(f)
irb(main):029:1> print gz.read
irb(main):030:1> gz.close
irb(main):031:1> end
#
#
#
irb(main):023:0* Zlib::GzipReader.open('zlib_inflate.tar.gz') { |gz|
irb(main):024:1* print gz.read
irb(main):025:1> }
Then, when trying to use the Zlib::Inflate options, I kept running into incorrect header check errors.
irb(main):047:0* zstream = Zlib::Inflate.new
=> #<Zlib::Inflate:0x00000002b15790 #dictionaries={}>
irb(main):048:0> buf = zstream.inflate('zlib_inflate.tar.gz')
Zlib::DataError: incorrect header check
from (irb):48:in `inflate'
from (irb):48
from C:/ruby/Ruby200p353/Ruby200-x64/bin/irb:12:in `<main>'
#
#
#
irb(main):049:0> cf = File.open("zlib_inflate.tar.gz")
=> #<File:zlib_inflate.tar.gz>
irb(main):050:0> ucf = File.open("ruby_inflate.rb", "w+")
=> #<File:ruby_inflate.rb>
irb(main):051:0> zi = Zlib::Inflate.new(Zlib::MAX_WBITS)
=> #<Zlib::Inflate:0x00000002c097f0 #dictionaries={}>
irb(main):052:0> ucf << zi.inflate(cf.read)
Zlib::DataError: incorrect header check
from (irb):52:in `inflate'
from (irb):52
from C:/ruby/Ruby200p353/Ruby200-x64/bin/irb:12:in `<main>'
#
#
#
irb(main):057:0* File.open("zlib_inflate.tar.gz") {|cf|
irb(main):058:1* zi = Zlib::Inflate.new
irb(main):059:1> File.open("zlib_inflate.rb", "w+") {|ucf|
irb(main):060:2* ucf << zi.inflate(cf.read)
irb(main):061:2> }
irb(main):062:1> zi.close
irb(main):063:1> }
Zlib::DataError: incorrect header check
from (irb):60:in `inflate'
from (irb):60:in `block (2 levels) in irb_binding'
from (irb):59:in `open'
from (irb):59:in `block in irb_binding'
from (irb):57:in `open'
from (irb):57
from C:/ruby/Ruby200p353/Ruby200-x64/bin/irb:12:in `<main>'
How would I go about taking a tar.gz file, and extract the contents so the files are not null? I did read this other post about Ruby Zlib from here, but that did not work for me as well. What am I doing wrong?

Ruby to_json issue with error "illegal/malformed utf-8"

I got an error JSON::GeneratorError: source sequence is illegal/malformed utf-8 when trying to convert a hash into json string. I am wondering if this has anything to do with encoding, and how can I make to_json just treat \xAE as it is?
$ irb
2.0.0-p247 :001 > require 'json'
=> true
2.0.0-p247 :002 > a = {"description"=> "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
2.0.0-p247 :003 > a.to_json
JSON::GeneratorError: source sequence is illegal/malformed utf-8
from (irb):3:in `to_json'
from (irb):3
from /Users/cchen21/.rvm/rubies/ruby-2.0.0-p247/bin/irb:16:in `<main>'
\xAE is not a valid character in UTF-8, you have to use \u00AE instead:
"iPhone\u00AE"
#=> "iPhone®"
Or convert it accordingly:
"iPhone\xAE".force_encoding("ISO-8859-1").encode("UTF-8")
#=> "iPhone®"
Every string in Ruby has a underlaying encoding. Depending on your LANG and LC_ALL environment variables, the interactive shell might be executing and interpreting your strings in a given encoding.
$ irb
1.9.3p392 :008 > __ENCODING__
=> #<Encoding:UTF-8>
(ignore that I’m using Ruby 1.9 instead of 2.0, the ideas are still the same).
__ENCODING__ returns the current source encoding. Yours will probably also say UTF-8.
When you create literal strings and use byte escapes (the \xAE) in your code, Ruby is trying to interpret that according to the string encoding:
1.9.3p392 :003 > a = {"description" => "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
1.9.3p392 :004 > a["description"].encoding
=> #<Encoding:UTF-8>
So, the byte \xAE at the end of your literal string will be tried to be treated as a UTF-8 stream byte, but it is invalid. See what happens when I try to print it:
1.9.3-p392 :001 > puts "iPhone\xAE"
iPhone�
=> nil
You either need to provide the registered mark character in a valid UTF-8 encoding (either using the real character, or providing the two UTF-8 bytes):
1.9.3-p392 :002 > a = {"description1" => "iPhone®", "description2" => "iPhone\xc2\xae"}
=> {"description1"=>"iPhone®", "description2"=>"iPhone®"}
1.9.3-p392 :005 > a.to_json
=> "{\"description1\":\"iPhone®\",\"description2\":\"iPhone®\"}"
Or, if your input is ISO-8859-1 (Latin 1) and you know it for sure, you can tell Ruby to interpret your string as another encoding:
1.9.3-p392 :006 > a = {"description1" => "iPhone\xAE".force_encoding('ISO-8859-1') }
=> {"description1"=>"iPhone\xAE"}
1.9.3-p392 :007 > a.to_json
=> "{\"description1\":\"iPhone®\"}"
Hope it helps.

Why is YAML throwing float ArgumentErrors on strings?

I have some nested strings in a complex hash that triggers "ArgumentError" exceptions. What's the best practiced way in dealing with this?
require 'yaml'
{
a: 'hello',
b: [{f:'hello',g:Hash.new,i:{a:'hello'}}],
c: {e:"+."}
}.to_yaml #=> `Float': invalid value for Float(): "+" (ArgumentError)
Full error dump:
/Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/scalar_scanner.rb:99:in `Float': invalid value for Float(): "+" (ArgumentError)
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/scalar_scanner.rb:99:in `tokenize'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:272:in `visit_String'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:128:in `accept'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:324:in `block in visit_Hash'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:322:in `each'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:322:in `visit_Hash'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:128:in `accept'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:324:in `block in visit_Hash'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:322:in `each'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:322:in `visit_Hash'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:128:in `accept'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:92:in `push'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych.rb:244:in `dump'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/core_ext.rb:14:in `psych_to_yaml'
This appears to be a bug in the bundled psych. Patching ~/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/scalar_scanner.rb at line 99 from:
Float(string.gsub(/[,_]|\.$/, ''))
to:
Float(string.gsub(/[,_]|\.$/, '')) rescue ArgumentError
fixes the issue. This is essentially what's in the psych gem as well as the Ruby 1.9 bundled version.
If you'd rather not patch your Ruby, using the psych-1.3.4 gem is another option; just be sure to require 'psych' rather than 'yaml':
gem 'psych', '=1.3.4'
require 'psych'
{a: 'hello', b: [{f:'hello',g:Hash.new,i:{a:'hello'}}], c: {e:"0+."}}.to_yaml
# => "---\n:a: hello\n:b:\n- :f: hello\n :g: {}\n :i:\n :a: hello\n:c:\n :e: 0+.\n"
This can be reproduced with a simpler example:
"+.".to_yaml
This appears to be a bug in the version of psych bundled with ruby 2.0.0 (and other versions, I'm sure):
when FLOAT
if string == '.'
#string_cache[string] = true
string
else
Float(string.gsub(/[,_]|\.$/, ''))
end
The problem is that that "+." looks like a valid floating point number like +.5.
This is fixed in Ruby 2.2.1 (or probably an earlier version), which specifically checks for the case where there may be a leading sign (+ or -):
when FLOAT
if string =~ /\A[-+]?\.\Z/
#string_cache[string] = true
string
else
Float(string.gsub(/[,_]|\.$/, ''))
end

Ruby - IO#ioctl throws range error in 1.9.3 but not 1.9.1 for 32bit

ruby-1.9.1-p243 :008 > a,b = IO.pipe
=> [#<IO:0x2010670>, #<IO:0x2010654>]
ruby-1.9.1-p243 :009 > a.ioctl(0x80000000, "\x00\x00")
Errno::ENOTTY: Inappropriate ioctl for device
from (irb):9:in `ioctl'
from (irb):9
from /Users/dlampa/.rvm/rubies/ruby-1.9.1-p243/bin/irb:16:in `<main>'
This would work if the IO object were an appropriate device and the command was valid.
ruby-1.9.3-p0 :022 > a,b = IO.pipe
=> [#<IO:fd 9>, #<IO:fd 10>]
ruby-1.9.3-p0 :023 > a.ioctl(0x80000000, "\x00\x00")
RangeError: bignum too big to convert into `long'
from (irb):23:in `ioctl'
from (irb):23
from /Users/dlampa/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
On MacOSX 10.6.8
I'm assuming that this has something to do with bit signing (maybe), and have packed and unpacked things as signed and unsigned, but nothing seems to prevent 1.9.3 from throwing that range error. Of course, there's no problem in 64-bit implementations. I'm totally confused. Can someone explain why this is happening and perhaps provide a workaround?
Edit:
The command also works in 1.9.2-p290, so there was definitely a change from 1.9.2 to 1.9.3
ruby-1.9.2-p290 :001 > a,b = IO.pipe
=> [#<IO:fd 3>, #<IO:fd 4>]
ruby-1.9.2-p290 :002 > a.ioctl(0x80000000, "\x00\x00")
Errno::ENOTTY: Inappropriate ioctl for device
from (irb):2:in `ioctl'
from (irb):2
from /Users/dlampa/.rvm/rubies/ruby-1.9.2-p290/bin/irb:16:in `<main>'

Resources