Encoding::UndefinedConversionError: U+00A0 from UTF-8 to US-ASCII - ruby

I'm trying to scrap the 52 between the anchor links:
<div class="zg_usedPrice">
52 new
</div>
With this code:
def self.parse_products
product_hash = {}
product = #data.css('#zg_centerListWrapper')
product.css('.zg_itemImmersion').each do | product |
product_name = product.css('.zg_title a').text
product_used_price_status = product.css('.zg_usedPrice > a').text[/(\D+)/]
product_hash[:product] ||= []
product_hash[:product] << { :name => product_name,
:used_status => product_used_price_status }
end
product_hash
end
But I think the http://www.amazon.com/gp/offer-listing/B000O3GCFU/ref=zg_bs_baby-products_price?ie=UTF8&condition=new part in the URL is producing the following error:
Encoding::UndefinedConversionError:
U+00A0 from UTF-8 to US-ASCII
# ./parser_spec.rb:175:in `block (2 levels) in <top (required)>'
I tried what they suggested in "Ruby error UTF-8 to ASCII", but I'm still getting the same problem. Is there any workaround for that?
Full error trace:
1) Product (Baby) should return correct keys
Failure/Error: expect(product_hash[:product]["Pet Supplies"].keys).to eq(["Birds", "Cats", "Dogs", "Fish & Aquatic Pets", "Horses", "Insects", "Reptiles & Amphibians", "Small Animals"])
TypeError:
can't convert String into Integer
# ./parser_spec.rb:179:in `[]'
# ./parser_spec.rb:179:in `block (2 levels) in <top (required)>'
2) Product (Baby) should return correct values
Failure/Error: expect(product_hash[:product]["Pet Supplies"].values).to eq([16281, 245512, 513926, 46811, 14805, 364, 5816, 19769])
TypeError:
can't convert String into Integer
# ./parser_spec.rb:183:in `[]'
# ./parser_spec.rb:183:in `block (2 levels) in <top (required)>'
3) Product (Baby) should return correct hash
Failure/Error: expect(product_hash[:product]).to eq({"Pet Supplies"=>{"Birds"=>16281, "Cats"=>245512, "Dogs"=>513926, "Fish & Aquatic Pets"=>46811, "Horses"=>14805, "Insects"=>364, "Reptiles & Amphibians"=>5816, "Small Animals"=>19769}})
Encoding::UndefinedConversionError:
U+00A0 from UTF-8 to US-ASCII
# ./parser_spec.rb:187:in `block (2 levels) in <top (required)>'

Your HTML sample doesn't match the code you're showing, plus the URL you gave doesn't exist any more, so it's difficult to help you.
Here's a start:
require 'nokogiri'
html = '<div class="zg_usedPrice">
52 new
</div>
'
doc = Nokogiri::HTML(html)
text = doc.at('div.zg_usedPrice a').text # => "52\u00A0new"
text.gsub(/\u00A0/, ' ') # => "52 new"

Related

encoding error when calling FileUtils.copy_entry in Ruby

I'm writing code for copy and paste recursively.
But I got an encoding error when calling FileUtils.copy_entry
Error Messages :
C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1535:in `join': incompatible character encodings: CP949 and UTF-8 (Encoding::CompatibilityError)
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1535:in `join'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1218:in `path'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:463:in `block in copy_entry'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1485:in `call'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1485:in `wrap_traverse'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1488:in `block in wrap_traverse'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1487:in `each'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1487:in `wrap_traverse'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1488:in `block in wrap_traverse'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1487:in `each'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1487:in `wrap_traverse'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:460:in `copy_entry'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:435:in `block in cp_r'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1558:in `block in fu_each_src_dest'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1574:in `fu_each_src_dest0'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:1556:in `fu_each_src_dest'
from C:/Ruby200/lib/ruby/2.0.0/fileutils.rb:434:in `cp_r'
from copy_test.rb:11:in `copy_files'
from copy_test.rb:81:in `block in <main>'
from copy_test.rb:76:in `each'
from copy_test.rb:76:in `<main>'
I'm calling copy_entry like this.
def copy_files(src, dst)
FileUtils.mkdir_p(File.dirname(dst))
FileUtils.copy_entry(src, dst)
end
There are some sub-folders and files named with my local language in src.
So I think (Encoding::CompatibilityError) occurs because of these sub-folders and files with local language (not English).
When I tested with only Enlgish folders and files, It worked.
But, I need non-English folders and files too.
How can I solve this problem?
Should I define a new method replacing copy_entry?
My Codes ADDED:
# -*- encoding: cp949 -*-
require 'fileutils'
LIST_FILE = "backup_target_list.txt"
DEST_FILE = "backup_dest.txt"
def copy_files(src, dst)
FileUtils.mkdir_p(File.dirname(dst))
FileUtils.copy_entry(src, dst)
end
def get_dst_base_name()
cur_date = Time.now.to_s[0..9]
return "backup_#{cur_date}"
end
def get_backup_list(list_file)
if !File.exist?(list_file) then
return nil
end
path_arr = []
File.open(list_file, "r") do |f|
f.each_line { |line|
path_arr.push(make_path(line.gsub("\n", "")))
}
end
return path_arr
end
def get_dest(dest_file)
if !File.exist?(dest_file) then
return nil
end
return File.open(dest_file, "r").readline
end
def make_path(*str)
path_new = nil
str.each do |item|
if item.class == Array then
path_new = (path_new == nil ? File.join(item) : File.join(path_new, item))
else
if item.include?(File::ALT_SEPARATOR) then
path_new = (path_new == nil ? File.join(item.split(File::ALT_SEPARATOR)) : File.join(path_new, item.split(File::ALT_SEPARATOR)))
else
path_new = (path_new == nil ? File.join(item) : File.join(path_new, item))
end
end
end
return path_new
end
get_backup_list(LIST_FILE).each do |path|
src = path
tmp = src.split(File::SEPARATOR)
dst = make_path(get_dest(DEST_FILE), get_dst_base_name, tmp[1..tmp.length])
print "src: #{src}\n=> dst: #{dst}\n"
copy_files(src, dst)
end
I saw you are using cp949 encoding, and the raised error tells CP949 and UTF-8 are incompatible, so why you just using UTF-8 encoding? So replace the shebang with
# coding: UTF-8
And to ensure all the characters read from LIST_FILE are utf-8 encoded, add the following:
line.force_encoding('utf-8')

How do I parse a tab-delimited line that contains a quote?

I'm using Ruby 2.4. How do I parse a tab-delimited line that contains a quote character? This is what's happening to me now ...
2.4.0 :003 > line = "11\tDave\tO\"malley"
=> "11\tDave\tO\"malley"
2.4.0 :004 > CSV.parse(line, col_sep: "\t")
CSV::MalformedCSVError: Illegal quoting in line 1.
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1912:in `block (2 levels) in shift'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `each'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `block in shift'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `loop'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `shift'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1770:in `each'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `to_a'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `read'
from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1324:in `parse'
from (irb):4
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
from /Users/davea/.rvm/gems/ruby-2.4.0#global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
from bin/rails:4:in `require'
from bin/rails:4:in `<main>'
Although teh example illustrates my point, I can't easily control the input coming in. So, although an answer coudl be< "Remove all quotes from teh string before parsing," I want to preserve the data as closely as possible.
That's a malformed document if you're trying to adhere to the CSV standard. Instad you might just brute-force it and pray there's no tabs in the data itself:
line.split(/\t/)
The CSV parsing library comes in handy when you're dealing with data like this:
"1\t2\t\"3a\t3b\"\t4"
Update: If you're prepared to abuse the CSV library a little then you can do this:
CSV.parse("11\tDave\tO\"malley", col_sep: "\t", quote_char: "\0")
That basically kills quote detection, so if there is other data that depends on that being processed correctly this may not work out.
"11\tDave\tO\"malley" is not valid CSV data. Strangely enough, the answer is to use two double-quotes, and to double quote each element
2.3.1 :001 > require 'csv'
=> true
2.3.1 :002 > line = "\"11\"\t\"Dave\"\t\"O\"\"malley\""
=> "\"11\"\t\"Dave\"\t\"O\"\"malley\""
2.3.1 :003 > puts line # for clarity
"11" "Dave" "O""malley"
=> nil
2.3.1 :004 > CSV.parse(line, col_sep: "\t")
=> [["11", "Dave", "O\"malley"]]

How to match a string with `\xXXXX` and `\uXXXX`

I have a string that contains \xXXXX and \uXXXX:
str = "\nDefault\nRouterRandom=\x9Db\u0012\xD3,\x92r\xFC o\u007F\x9B+\u0005I`\nWebInit=1\n"
I want to delete the content:
"RouterRandom=\x9Db\u0012\xD3,\x92r\xFC o\u007F\x9B+\u0005I`\n"
How can I match the string or delete it? I tried:
content = str.sub(/RouterRandom=.*WebInit/, "")
It returns errors:
E:/Automation/experiment/ruby_test/string_test.rb:119:in `sub': invalid byte sequence in UTF-8 (ArgumentError)
from E:/Automation/experiment/ruby_test/string_test.rb:119:in `block in <top (required)>'
from E:/Automation/experiment/ruby_test/string_test.rb:110:in `open'
from E:/Automation/experiment/ruby_test/string_test.rb:110:in `<top (required)>'
from -e:1:in `load'
from -e:1:in `<main>'
You are getting invalid byte sequence because of invalid characters in your string. You can replace invalid characters in your string first:
content = str.encode( 'UTF-8', invalid: :replace )
Then split by newlines:
content = content.split( "\n" )
Delete the offending element in the array by index:
content.delete_at( 2 )
And then finally join the array back together into a newline delimited string:
new_str = content.join( "\n" )
# => "\nDefault\nWebInit=1"

ruby - zlib header error when uncompressing tar.gz

I wanted to write a ruby program to unpackage tar.gz files, and I ran into some issues. After reading the documentation on the ruby-doc site, Zlib::GzipReader and Zlib::Inflate. Then I found this Module someone wrote on GitHub, and that didn't work either. So, using the examples from the Ruby page for Zlib::GzipReader, I these were able to run successfully.
irb(main):027:0* File.open('zlib_inflate.tar.gz') do |f|
irb(main):028:1* gz = Zlib::GzipReader.new(f)
irb(main):029:1> print gz.read
irb(main):030:1> gz.close
irb(main):031:1> end
#
#
#
irb(main):023:0* Zlib::GzipReader.open('zlib_inflate.tar.gz') { |gz|
irb(main):024:1* print gz.read
irb(main):025:1> }
Then, when trying to use the Zlib::Inflate options, I kept running into incorrect header check errors.
irb(main):047:0* zstream = Zlib::Inflate.new
=> #<Zlib::Inflate:0x00000002b15790 #dictionaries={}>
irb(main):048:0> buf = zstream.inflate('zlib_inflate.tar.gz')
Zlib::DataError: incorrect header check
from (irb):48:in `inflate'
from (irb):48
from C:/ruby/Ruby200p353/Ruby200-x64/bin/irb:12:in `<main>'
#
#
#
irb(main):049:0> cf = File.open("zlib_inflate.tar.gz")
=> #<File:zlib_inflate.tar.gz>
irb(main):050:0> ucf = File.open("ruby_inflate.rb", "w+")
=> #<File:ruby_inflate.rb>
irb(main):051:0> zi = Zlib::Inflate.new(Zlib::MAX_WBITS)
=> #<Zlib::Inflate:0x00000002c097f0 #dictionaries={}>
irb(main):052:0> ucf << zi.inflate(cf.read)
Zlib::DataError: incorrect header check
from (irb):52:in `inflate'
from (irb):52
from C:/ruby/Ruby200p353/Ruby200-x64/bin/irb:12:in `<main>'
#
#
#
irb(main):057:0* File.open("zlib_inflate.tar.gz") {|cf|
irb(main):058:1* zi = Zlib::Inflate.new
irb(main):059:1> File.open("zlib_inflate.rb", "w+") {|ucf|
irb(main):060:2* ucf << zi.inflate(cf.read)
irb(main):061:2> }
irb(main):062:1> zi.close
irb(main):063:1> }
Zlib::DataError: incorrect header check
from (irb):60:in `inflate'
from (irb):60:in `block (2 levels) in irb_binding'
from (irb):59:in `open'
from (irb):59:in `block in irb_binding'
from (irb):57:in `open'
from (irb):57
from C:/ruby/Ruby200p353/Ruby200-x64/bin/irb:12:in `<main>'
How would I go about taking a tar.gz file, and extract the contents so the files are not null? I did read this other post about Ruby Zlib from here, but that did not work for me as well. What am I doing wrong?

How do I open local image file, encode it, and post via URI? (posting via Tumblr API)

I'm trying to read local image file, properly encode it and post to Tumbrl. According to the Tumblr API I can pass a parameter data which is Array (URL-encoded binary contents) Limit: 5 MB
I've tested my code with http://api.tumblr.com/v2/blog/#{BLOG}/info request. It is working. But I can't post a photo. Here is my code:
require 'oauth'
require 'oauth/consumer'
require 'open-uri'
require 'active_support'
CONSUMER = 'foo'
SECRET = 'foo'
TOKEN = 'foo'
TOKEN_SECRET = 'foo'
BLOG = 'foo'
consumer=OAuth::Consumer.new(CONSUMER, SECRET, {:site=>"http://tumblr.com"})
access_token = OAuth::AccessToken.new(consumer, TOKEN, TOKEN_SECRET)
# Here I tried one of two lines:
# data = Base64.encode64(IO.binread('./resized')) #first try
data = URI::encode(IO.binread('./resized')) #second try
# response = access_token.get "http://api.tumblr.com/v2/blog/#{BLOG}/info?api_key=#{CONSUMER}"
# puts response
response=access_token.post "http://api.tumblr.com/v2/blog/#{BLOG}/post?api_key=#{CONSUMER}&type=photo&data=#{data}&link=http://ya.ru&"
puts response
1st try:
% ruby ./w_oauth.rb
/usr/lib/ruby/1.9.1/uri/common.rb:176:in `split': bad URI(is not URI?): http://api.tumblr.com/v2/blog/foo/post?api_key=foo&type=photo&data=/9j/4AAQSkZJRgABAQEASABIAAD//gA7Q1JFQVRPUjogZ2QtanBlZyB2MS4w (URI::InvalidURIError)
ICh1c2luZyBJSkcgSlBFRyB2NjIpLCBxdWFsaXR5ID0gODAK/9sAQwAGBAUG
BQQGBgUGBwcGCAoQCgoJCQoUDg8MEBcUGBgXFBYWGh0lHxobIxwWFiAsICMm
(!!!long piece of image data skipped!!!)
FI/16HfTbyHPWurqdE+TGH4wx2js5SKQb+6b4bIj3aurqCrEtcXrf/4yf/dS
DLet/wCzEB6sa6uoomxJN2eaQj5mkYuerQj611dQM7Fx/wDLF/8AbXV1dTA/
/9k=
&link=http://ya.ru&
from /usr/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
from /usr/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/access_token.rb:7:in `request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/access_token.rb:47:in `post'
from ./w_oauth.rb:23:in `<main>'
2nd try:
% ruby ./w_oauth.rb
/var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:14:in `force_encoding': can't modify frozen String (RuntimeError)
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:14:in `rescue in escape'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:12:in `escape'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:43:in `block (2 levels) in normalize'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:42:in `collect'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:42:in `block in normalize'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:37:in `map'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/helper.rb:37:in `normalize'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/request_proxy/base.rb:98:in `normalized_parameters'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/request_proxy/base.rb:113:in `signature_base_string'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/signature/base.rb:77:in `signature_base_string'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/signature/hmac/base.rb:12:in `digest'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/signature/base.rb:65:in `signature'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/signature.rb:23:in `sign'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/client/helper.rb:45:in `signature'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/client/helper.rb:75:in `header'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/client/net_http.rb:91:in `set_oauth_header'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/client/net_http.rb:30:in `oauth!'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/consumer.rb:224:in `sign!'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/consumer.rb:188:in `create_signed_request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/consumer.rb:159:in `request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/consumer_token.rb:25:in `request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/access_token.rb:12:in `request'
from /var/lib/gems/1.9.1/gems/oauth-0.4.6/lib/oauth/tokens/access_token.rb:47:in `post'
from ./w_oauth.rb:23:in `<main>'
UPD: ./resized is a proper JPEG file:
% file ./resized
./resized: JPEG image data, JFIF standard 1.01, comment: "CREATOR: gd-jpeg v1.0 (using IJG JPEG v62), quality = 80"
URI encoding is not enough. You also need to encode: , / ? : # & = + $ #.
Try:
URI.escape(IO.binread('./resized'), Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))

Resources