How do I parse a tab-delimited line that contains a quote? - ruby

I'm using Ruby 2.4. How do I parse a tab-delimited line that contains a quote character? This is what's happening to me now ...
2.4.0 :003 > line = "11\tDave\tO\"malley"
=> "11\tDave\tO\"malley"
2.4.0 :004 > CSV.parse(line, col_sep: "\t")
CSV::MalformedCSVError: Illegal quoting in line 1.
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1912:in `block (2 levels) in shift'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `each'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `block in shift'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `loop'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `shift'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1770:in `each'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `to_a'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `read'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1324:in `parse'
from (irb):4
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
from bin/rails:4:in `require'
from bin/rails:4:in `<main>'
Although the example illustrates my point, I can't easily control the input coming in. So, although one answer could be "Remove all quotes from the string before parsing," I want to preserve the data as closely as possible.

That's a malformed document if you're trying to adhere to the CSV standard. Instead you might just brute-force it and pray there are no tabs in the data itself:
line.split(/\t/)
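One detail worth knowing if you go the split route: by default String#split drops trailing empty fields, which can silently change the column count. A minimal sketch, assuming a made-up input line with an empty last column (the trailing tab is not in the original question):

```ruby
# Hypothetical input: same data as the question, plus an empty trailing column.
line = "11\tDave\tO\"malley\t"

# Default split discards trailing empty strings.
default_fields = line.split("\t")
# => ["11", "Dave", "O\"malley"]

# A negative limit keeps every field, including trailing empty ones.
all_fields = line.split("\t", -1)
# => ["11", "Dave", "O\"malley", ""]
```

Passing -1 is usually the safer choice when the number of columns matters.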
The CSV parsing library comes in handy when you're dealing with data like this:
"1\t2\t\"3a\t3b\"\t4"
Update: If you're prepared to abuse the CSV library a little then you can do this:
CSV.parse("11\tDave\tO\"malley", col_sep: "\t", quote_char: "\0")
That basically kills quote detection, so if there is other data that depends on that being processed correctly this may not work out.
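Since the question mentions Ruby 2.4, the liberal_parsing option that the CSV library gained in that release may be a gentler alternative than replacing the quote character. A minimal sketch, assuming the input looks like the line in the question; how other malformed inputs are handled is not covered here:

```ruby
require 'csv'

line = "11\tDave\tO\"malley"

# liberal_parsing asks the parser to tolerate a stray quote inside an
# unquoted field instead of raising CSV::MalformedCSVError.
rows = CSV.parse(line, col_sep: "\t", liberal_parsing: true)
# => [["11", "Dave", "O\"malley"]]
```

Unlike the quote_char trick, properly quoted fields elsewhere in the data should still be recognised as quoted.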

"11\tDave\tO\"malley" is not valid CSV data. Strangely enough, the answer is to use two double-quotes, and to double quote each element
2.3.1 :001 > require 'csv'
=> true
2.3.1 :002 > line = "\"11\"\t\"Dave\"\t\"O\"\"malley\""
=> "\"11\"\t\"Dave\"\t\"O\"\"malley\""
2.3.1 :003 > puts line # for clarity
"11" "Dave" "O""malley"
=> nil
2.3.1 :004 > CSV.parse(line, col_sep: "\t")
=> [["11", "Dave", "O\"malley"]]
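If you control the side that produces the data, you don't have to do this quoting by hand. A small sketch using CSV.generate_line, which quotes a field and doubles the embedded quote when needed (the field values are taken from the example above):

```ruby
require 'csv'

fields = ["11", "Dave", "O\"malley"]

# Only fields that need it are quoted; the embedded quote is doubled.
line = CSV.generate_line(fields, col_sep: "\t")

# The generated line parses back to the original values.
parsed = CSV.parse(line, col_sep: "\t")
```

Here parsed should equal [fields], so the round trip preserves the quote character.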

Related

How to match a string with `\xXXXX` and `\uXXXX`

I have a string that contains \xXXXX and \uXXXX:
str = "\nDefault\nRouterRandom=\x9Db\u0012\xD3,\x92r\xFC o\u007F\x9B+\u0005I`\nWebInit=1\n"
I want to delete the content:
"RouterRandom=\x9Db\u0012\xD3,\x92r\xFC o\u007F\x9B+\u0005I`\n"
How can I match the string or delete it? I tried:
content = str.sub(/RouterRandom=.*WebInit/, "")
It returns errors:
E:/Automation/experiment/ruby_test/string_test.rb:119:in `sub': invalid byte sequence in UTF-8 (ArgumentError)
from E:/Automation/experiment/ruby_test/string_test.rb:119:in `block in <top (required)>'
from E:/Automation/experiment/ruby_test/string_test.rb:110:in `open'
from E:/Automation/experiment/ruby_test/string_test.rb:110:in `<top (required)>'
from -e:1:in `load'
from -e:1:in `<main>'
You are getting the invalid byte sequence error because your string contains bytes that are not valid UTF-8. You can replace the invalid bytes first:
content = str.encode( 'UTF-8', invalid: :replace )
Then split by newlines:
content = content.split( "\n" )
Delete the offending element in the array by index:
content.delete_at( 2 )
And then finally join the array back together into a newline delimited string:
new_str = content.join( "\n" )
# => "\nDefault\nWebInit=1"
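On Ruby 2.1 or later, String#scrub is another way to replace the invalid bytes, and once the string is valid the regex approach from the question can be used directly instead of deleting by index. A minimal sketch, using a shortened stand-in for the string in the question (the exact bytes are an assumption, chosen only to be invalid UTF-8):

```ruby
# Stand-in for the question's string: the \x bytes are not valid UTF-8.
str = "\nDefault\nRouterRandom=\x9Db\xD3,\x92r\xFC o\x9B+I`\nWebInit=1\n"

was_valid = str.valid_encoding?   # false for this input

# Replace each invalid byte with "?" so regex methods stop raising.
clean = str.scrub("?")

# Remove the whole RouterRandom line, including its newline.
content = clean.sub(/^RouterRandom=.*\n/, "")
# => "\nDefault\nWebInit=1\n"
```

Matching on the key name avoids depending on the line's position in the file.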

Why is YAML throwing float ArgumentErrors on strings?

I have some nested strings in a complex hash that trigger "ArgumentError" exceptions. What's the best practice for dealing with this?
require 'yaml'
{
a: 'hello',
b: [{f:'hello',g:Hash.new,i:{a:'hello'}}],
c: {e:"+."}
}.to_yaml #=> `Float': invalid value for Float(): "+" (ArgumentError)
Full error dump:
/Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/scalar_scanner.rb:99:in `Float': invalid value for Float(): "+" (ArgumentError)
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/scalar_scanner.rb:99:in `tokenize'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:272:in `visit_String'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:128:in `accept'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:324:in `block in visit_Hash'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:322:in `each'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:322:in `visit_Hash'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:128:in `accept'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:324:in `block in visit_Hash'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:322:in `each'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:322:in `visit_Hash'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:128:in `accept'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/visitors/yaml_tree.rb:92:in `push'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych.rb:244:in `dump'
from /Users/XXX/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/core_ext.rb:14:in `psych_to_yaml'
This appears to be a bug in the bundled psych. Patching ~/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/2.0.0/psych/scalar_scanner.rb at line 99 from:
Float(string.gsub(/[,_]|\.$/, ''))
to:
Float(string.gsub(/[,_]|\.$/, '')) rescue ArgumentError
fixes the issue. This is essentially what's in the psych gem as well as the Ruby 1.9 bundled version.
If you'd rather not patch your Ruby, using the psych-1.3.4 gem is another option; just be sure to require 'psych' rather than 'yaml':
gem 'psych', '=1.3.4'
require 'psych'
{a: 'hello', b: [{f:'hello',g:Hash.new,i:{a:'hello'}}], c: {e:"0+."}}.to_yaml
# => "---\n:a: hello\n:b:\n- :f: hello\n :g: {}\n :i:\n :a: hello\n:c:\n :e: 0+.\n"
This can be reproduced with a simpler example:
"+.".to_yaml
This appears to be a bug in the version of psych bundled with ruby 2.0.0 (and other versions, I'm sure):
when FLOAT
if string == '.'
@string_cache[string] = true
string
else
Float(string.gsub(/[,_]|\.$/, ''))
end
The problem is that "+." looks like a valid floating point number such as +.5.
This is fixed in Ruby 2.2.1 (or probably an earlier version), which specifically checks for the case where there may be a leading sign (+ or -):
when FLOAT
if string =~ /\A[-+]?\.\Z/
@string_cache[string] = true
string
else
Float(string.gsub(/[,_]|\.$/, ''))
end
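The failure and the fix can be reproduced without psych. The sketch below copies only the two expressions from the snippets above; the lambda name tokenize_float_like is made up for illustration and is not part of psych:

```ruby
# Old behaviour: the trailing "." is stripped, leaving "+", which Float() rejects.
stripped = "+.".gsub(/[,_]|\.$/, '')   # => "+"
old_result =
  begin
    Float(stripped)
    :parsed
  rescue ArgumentError
    :argument_error
  end

# Fixed behaviour: a bare dot with an optional sign is returned as a string.
guard = /\A[-+]?\.\Z/
tokenize_float_like = lambda do |string|
  next string if string =~ guard
  Float(string.gsub(/[,_]|\.$/, ''))
end

new_result = tokenize_float_like.call("+.")       # => "+."
number     = tokenize_float_like.call("1,000.5")  # => 1000.5
```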

Ruby csv error : invalid byte sequence in UTF-8

[sagar@BL-53 RcTools]$ irb
1.9.3p0 :001 > require 'csv'
=> true
1.9.3p0 :002 > master = CSV.read("./public/jobs/in/Appexchange_Applications_Companies_487.csv")
ArgumentError: invalid byte sequence in UTF-8
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1855:in `sub!'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1855:in `block in shift'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1849:in `loop'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1849:in `shift'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1791:in `each'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1805:in `to_a'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1805:in `read'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1411:in `block in read'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1354:in `open'
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/csv.rb:1411:in `read'
from (irb):2
from /home/sagar/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
1.9.3p0 :003 >
But when I do
1.9.3p0 :003 > master = CSV.open("./public/jobs/in/Appexchange_Applications_Companies_487.csv","r")
=> <#CSV io_type:File io_path:"./public/jobs/in/Appexchange_Applications_Companies_487.csv" encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\r\n" quote_char:"\"">
1.9.3p0 :004 >
I just want to know why this is happening and what the solution is. I want to use CSV.read because it returns the contents of the CSV as an array.
So if I read the file the first way, like
master = CSV.read("./public/jobs/in/Appexchange_Applications_Companies_487.csv")
it returns an array:
1.9.3p0 :008 > master.class
=> Array
But in the second case, the class is CSV.
How can I read the CSV the first way?
Regarding the error: First, make sure that you are using the correct character encoding. If you are, then you probably have invalid data in your CSV file. You can probably fix it using iconv (see link posted by Chetan Muneshwar).
Regarding the second part of your question: CSV.open just opens the file for reading but does no reading yet. CSV.read will open the file, read its contents, and close it again. So to just get the data out of the file use CSV.read.
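If the file turns out to be in a different encoding, CSV.read accepts an encoding option that transcodes while reading. A minimal sketch, assuming the data is ISO-8859-1 (a guess; the real encoding of the file in the question is unknown) and using a temporary file with made-up contents:

```ruby
require 'csv'
require 'tempfile'

rows = Tempfile.create(['latin1', '.csv']) do |file|
  # Write Latin-1 bytes: \xE9 is "é" and \xE1 is "á" in ISO-8859-1.
  file.binmode
  file.write("name,city\nJos\xE9,M\xE1laga\n".b)
  file.flush

  # "external:internal" reads the file as ISO-8859-1 and returns UTF-8 strings.
  CSV.read(file.path, encoding: 'ISO-8859-1:UTF-8')
end
# rows is a plain Array of Arrays, as with any CSV.read call.
```

Reading the same bytes without the encoding option would treat them as UTF-8, which is where the invalid byte sequence error comes from.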

How do I open local image file, encode it, and post via URI? (posting via Tumblr API)

I'm trying to read a local image file, properly encode it, and post it to Tumblr. According to the Tumblr API, I can pass a parameter data, which is Array (URL-encoded binary contents) Limit: 5 MB
I've tested my code with http://api.tumblr.com/v2/blog/#{BLOG}/info request. It is working. But I can't post a photo. Here is my code:
require 'oauth'
require 'oauth/consumer'
require 'open-uri'
require 'active_support'
CONSUMER = 'foo'
SECRET = 'foo'
TOKEN = 'foo'
TOKEN_SECRET = 'foo'
BLOG = 'foo'
consumer=OAuth::Consumer.new(CONSUMER, SECRET, {:site=>"http://tumblr.com"})
access_token = OAuth::AccessToken.new(consumer, TOKEN, TOKEN_SECRET)
# Here I tried one of two lines:
# data = Base64.encode64(IO.binread('./resized')) #first try
data = URI::encode(IO.binread('./resized')) #second try
# response = access_token.get "http://api.tumblr.com/v2/blog/#{BLOG}/info?api_key=#{CONSUMER}"
# puts response
response=access_token.post "http://api.tumblr.com/v2/blog/#{BLOG}/post?api_key=#{CONSUMER}&type=photo&data=#{data}&link=http://ya.ru&"
puts response
1st try:
% ruby ./w_oauth.rb
/usr/lib/ruby/1.9.1/uri/common.rb:176:in `split': bad URI(is not URI?): http://api.tumblr.com/v2/blog/foo/post?api_key=foo&type=photo&data=/9j/4AAQSkZJRgABAQEASABIAAD//gA7Q1JFQVRPUjogZ2QtanBlZyB2MS4w (URI::InvalidURIError)
ICh1c2luZyBJSkcgSlBFRyB2NjIpLCBxdWFsaXR5ID0gODAK/9sAQwAGBAUG
BQQGBgUGBwcGCAoQCgoJCQoUDg8MEBcUGBgXFBYWGh0lHxobIxwWFiAsICMm
(!!!long piece of image data skipped!!!)
FI/16HfTbyHPWurqdE+TGH4wx2js5SKQb+6b4bIj3aurqCrEtcXrf/4yf/dS
DLet/wCzEB6sa6uoomxJN2eaQj5mkYuerQj611dQM7Fx/wDLF/8AbXV1dTA/
/9k=
&link=http://ya.ru&
from /usr/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
from /usr/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/access_token.rb:7:in `request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/access_token.rb:47:in `post'
from ./w_oauth.rb:23:in `<main>'
2nd try:
% ruby ./w_oauth.rb
/var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:14:in `force_encoding': can't modify frozen String (RuntimeError)
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:14:in `rescue in escape'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:12:in `escape'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:43:in `block (2 levels) in normalize'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:42:in `collect'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:42:in `block in normalize'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:37:in `map'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:37:in `normalize'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/request_proxy/base.rb:98:in `normalized_parameters'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/request_proxy/base.rb:113:in `signature_base_string'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/signature/base.rb:77:in `signature_base_string'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/signature/hmac/base.rb:12:in `digest'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/signature/base.rb:65:in `signature'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/signature.rb:23:in `sign'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/client/helper.rb:45:in `signature'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/client/helper.rb:75:in `header'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/client/net_http.rb:91:in `set_oauth_header'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/client/net_http.rb:30:in `oauth!'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/consumer.rb:224:in `sign!'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/consumer.rb:188:in `create_signed_request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/consumer.rb:159:in `request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/consumer_token.rb:25:in `request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/access_token.rb:12:in `request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/access_token.rb:47:in `post'
from ./w_oauth.rb:23:in `<main>'
UPD: ./resized is a proper JPEG file:
% file ./resized
./resized: JPEG image data, JFIF standard 1.01, comment: "CREATOR: gd-jpeg v1.0 (using IJG JPEG v62), quality = 80"
URI encoding is not enough. You also need to encode: , / ? : @ & = + $ #.
Try:
URI.escape(IO.binread('./resized'), Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))
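Note that URI.escape was removed in Ruby 3.0. On newer Rubies, URI.encode_www_form_component percent-encodes every byte outside a small safe set, which covers the reserved characters as well. A minimal sketch with a few made-up bytes standing in for the image data; whether the Tumblr API accepts the result is not verified here:

```ruby
require 'uri'

# Stand-in for IO.binread('./resized'): JPEG-like bytes plus reserved characters.
binary = "\xFF\xD8\xFF\xE0 /+=&".b

# Everything except letters, digits and * - . _ is escaped; a space becomes "+".
encoded = URI.encode_www_form_component(binary)
# => "%FF%D8%FF%E0+%2F%2B%3D%26"

# Decoding as binary gives back the original bytes.
decoded = URI.decode_www_form_component(encoded, Encoding::BINARY)
```

For payloads of this size it is usually better to send the encoded value in the POST body rather than in the URL.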

Ruby 1.9 CSV: selectively ignoring conversions for a column

I have the following CSV data:
10,11,12.34
I can parse this using CSV from the standard library, and have the values converted from strings to numbers:
require 'csv'
CSV.parse( "10,11,12.34" )
=> [["10", "11", "12.34"]]
CSV.parse( "10,11,12.34", {:converters => [:integer,:integer,:float]} )
=> [[10, 11, 12.34]]
I don't want to convert column 1, I'd just like that left as a string. My guess was I could omit a value from the converters array, but that didn't work:
CSV.parse( "10,11,12.34", {:converters => [nil,:integer,:float]} )
NoMethodError: undefined method `arity' for nil:NilClass
from /home/ian/.rvm/rubies/jruby-1.6.6/lib/ruby/1.9/csv.rb:2188:in `convert_fields'
from org/jruby/RubyArray.java:1614:in `each'
from /home/ian/.rvm/rubies/jruby-1.6.6/lib/ruby/1.9/csv.rb:2187:in `convert_fields'
from org/jruby/RubyArray.java:2332:in `collect'
from org/jruby/RubyEnumerator.java:190:in `each'
from org/jruby/RubyEnumerator.java:404:in `with_index'
from /home/ian/.rvm/rubies/jruby-1.6.6/lib/ruby/1.9/csv.rb:2186:in `convert_fields'
from /home/ian/.rvm/rubies/jruby-1.6.6/lib/ruby/1.9/csv.rb:1923:in `shift'
from org/jruby/RubyKernel.java:1408:in `loop'
from /home/ian/.rvm/rubies/jruby-1.6.6/lib/ruby/1.9/csv.rb:1825:in `shift'
from /home/ian/.rvm/rubies/jruby-1.6.6/lib/ruby/1.9/csv.rb:1767:in `each'
from org/jruby/RubyEnumerable.java:391:in `to_a'
from /home/ian/.rvm/rubies/jruby-1.6.6/lib/ruby/1.9/csv.rb:1778:in `read'
from /home/ian/.rvm/rubies/jruby-1.6.6/lib/ruby/1.9/csv.rb:1365:in `parse'
from (irb):25:in `evaluate'
In fact I haven't been able to find any way of specifying that I'd like the first column to be left unconverted. Any suggestions?
Update
I think I misunderstood the design intention for :converters. It's not a 1:1 mapping by column, but a list of converters to be applied (I think) to all values. I'm not sure, the docs aren't too clear. So the more general question is: How do I convert some columns in my CSV, and not others?
The documentation says these options aren't specified per column, but are instead a list of converters that will be applied to all columns.
Example:
CSV.parse("10,11,13,12.34", { :converters => [lambda{|s|s.to_s + 'x'}] })
# => [["10x", "11x", "13x", "12.34x"]]
Since the CSV module is eager to convert everything it can, you may as well shift back any columns you want using .to_s or use the :unconverted_fields option to save the original values and allow access to them.
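Another option is a custom converter that takes two arguments: CSV then passes a field-info object whose index tells you which column the value came from. A minimal sketch, where the variable names and the choice of column 0 are illustrative assumptions:

```ruby
require 'csv'

# Columns (zero-based) that should stay as strings; hypothetical choice.
keep_as_string = [0]

selective = lambda do |field, info|
  # Leave the listed columns untouched.
  next field if keep_as_string.include?(info.index)

  # Otherwise try Integer, then Float, and fall back to the original value.
  begin
    Integer(field)
  rescue ArgumentError, TypeError
    begin
      Float(field)
    rescue ArgumentError, TypeError
      field
    end
  end
end

rows = CSV.parse("10,11,12.34", converters: [selective])
# => [["10", 11, 12.34]]
```

The converter is still called for every field; it simply returns the field unchanged for the columns you want to skip.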
