How to search in page body and force an encoding conversion - ruby

I have pretty simple code:
require 'rubygems'
require 'mechanize'
URL = 'http://yandex.ru'
agent = Mechanize.new
page = agent.get(URL)
# page.encoding => UTF-8
# page.body.encoding => ASCII-8BIT
page.body.include?("Карты")
The last line of that code raises an error:
in `include?': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
Solutions from "How to get Mechanize to auto-convert body to UTF8?" don't help. What should I do to fix it?

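The body comes back tagged as ASCII-8BIT (binary) even though the page itself is UTF-8. You can relabel it with the force_encoding method before searching, like this: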
agent.page.body.force_encoding('utf-8')
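To see the effect without hitting the network, here is a minimal sketch that simulates a binary-tagged body. The helper name utf8_body is made up for illustration, and it assumes the bytes really are UTF-8:

```ruby
# Hypothetical helper: relabel a binary-tagged string as UTF-8.
# force_encoding only changes the encoding tag; the bytes are not transcoded.
def utf8_body(raw)
  # dup so the caller's string keeps its original encoding tag
  raw.dup.force_encoding(Encoding::UTF_8)
end

# String#b returns an ASCII-8BIT copy, which mimics the body from the question.
raw = "Яндекс Карты".b
body = utf8_body(raw)

puts body.encoding          # UTF-8
puts body.valid_encoding?   # true
puts body.include?("Карты") # true
```

If valid_encoding? comes back false, the bytes were probably not UTF-8 in the first place, and relabeling alone will not help.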

Related

I am getting (eval):1: invalid Unicode codepoint error while trying to scrape instagram

I am trying to scrape data from instagram. Here is my code
require 'open-uri'
require 'nokogiri'
require 'json'
require "unicode/emoji"
def get_html
url = 'https://www.instagram.com/muriithi_kabogo/'
html = open(url)
end
def pass_data
html = get_html
doc = Nokogiri::HTML(html)
end
def get_data
profiles = []
body = pass_data.at('body')
script = body.at('script').text
myText = script
json_object_data = eval(myText)
end
get_data()
When I try to change the text into json format, I get an error:
(eval):1: invalid Unicode codepoint (SyntaxError)
usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr
How do I move past this error?
JSON, like JavaScript, escapes characters outside the Basic Multilingual Plane as UTF-16 surrogate pairs, which Ruby's own \u string escapes reject.
Do not use eval. For one thing, Ruby will flag \ud83d and \ude0a as invalid codepoints, as it should; for another, running eval on scraped text is a security hole; and lastly, it slows down your code.
Use JSON.parse, which is safer, faster, and knows how to combine surrogate pairs:
require 'json'
json_str = '"usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr"'
JSON.parse(json_str)
# => "usinessmen #beautiful #smile😊 #teambringit #shebr"
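The same applies when the script text is a whole JSON object rather than a bare string. A small sketch follows; the payload shape and the helper name parse_caption are invented for illustration, only the escaped emoji comes from the question:

```ruby
require 'json'

# Hypothetical helper: parse a JSON object and return its "caption" field.
def parse_caption(json_str)
  JSON.parse(json_str).fetch("caption")
end

# Single quotes keep \ud83d\ude0a as literal backslash escapes,
# so it is JSON.parse, not Ruby, that decodes the surrogate pair.
json_str = '{"caption": "#smile\ud83d\ude0a", "likes": 3}'
caption = parse_caption(json_str)

puts caption.encoding                  # UTF-8
puts caption.codepoints.last.to_s(16)  # 1f60a
```

The two escapes end up as the single codepoint U+1F60A, which is what eval could not do.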

incompatible character encodings: ASCII-8BIT and UTF-8 in Oga gem

I am using an XML/HTML parser called Oga.
I am attempting to crawl this URL: http://www.johnvanderlyn.com and parse the body for text, like so:
def get_page
body = Net::HTTP.get(URI.parse(@url))
document = Oga.parse_html(body)
end
document = get_page
words = document.css('body').text
Then I get this error:
/gems/oga-2.7/lib/oga/xml/node_set.rb:276:in `block in text': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
That is related to this bit of code here.
What could be causing this and how can I fix it? Is there a way for me to fix it locally, or do I have to fork the gem, fix that method and then use my fork?
Thoughts?
The bit of code you linked has nothing to do with the problem. The issue is that body is being interpreted in the wrong encoding. Try calling force_encoding 'UTF-8' on the body before parsing the document:
def get_page
body = Net::HTTP.get(URI.parse(@url)).force_encoding 'UTF-8'
document = Oga.parse_html(body)
end
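Since force_encoding only relabels the bytes, it can be worth checking that they really are valid UTF-8 before handing them to the parser. A rough sketch, with a made-up helper name (to_utf8) and the assumption that replacing bad bytes with "?" is acceptable for your use case:

```ruby
# Hypothetical helper: tag a binary body as UTF-8 and repair it if needed.
def to_utf8(body)
  text = body.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?
  # String#scrub replaces byte sequences that are invalid in the tagged encoding.
  text.scrub("?")
end

puts to_utf8("caf\xC3\xA9".b)  # café  (valid UTF-8 bytes, just relabeled)
puts to_utf8("caf\xE9".b)      # caf?  (a Latin-1 byte, not valid UTF-8)
```

If scrub ends up replacing a lot of text, the page is likely in some other encoding and should be transcoded with encode instead.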

hpricot-invalid byte sequence in UTF-8

I have already searched, but nothing I found solves this peculiar, unexpected problem.
Just look at the code below:
require 'open-uri'
require 'hpricot'
doc = Hpricot(open("http://www.baidu.com/")) #this web page's encoding is GB2312
I don't know what's going on here. You can try this in irb to see whether you get the same problem.
It just fails with "ArgumentError: invalid byte sequence in UTF-8"
I have tried converting the original HTML to UTF-8 with Iconv, but it still doesn't work.
I really don't know what to do now. Please help.
Hpricot - UTF-8 issues
invalid byte sequence in UTF-8 (ArgumentError)
require 'hpricot'
require 'open-uri'
doc = open('http://www.amazon.co.jp/') {|f| Hpricot(f.read) }
puts doc.to_html
open('http://www.amazon.co.jp/') {|f| Hpricot(f.read.encode("UTF-8")) }
I know how it could work with Net::HTTP (Ruby 1.9.2):
require 'net/http'
require 'uri'
url = URI.parse('http://www.baidu.com')
res = Net::HTTP.start(url.host, url.port) {|http|
http.get('/')
}
str = res.body.force_encoding('GB2312')
puts str
puts str.encoding.name # => GB2312
Does that help?
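If you then need the text as UTF-8 (for example to pass it to a parser that expects UTF-8), you can transcode after tagging. A minimal sketch, assuming the bytes really are GB2312 and that your Ruby build ships the GB2312 converter; the helper name is invented:

```ruby
# Hypothetical helper: tag raw bytes as GB2312, then transcode to UTF-8.
def gb2312_to_utf8(raw)
  # force_encoding says what the bytes are; encode actually converts them.
  raw.dup.force_encoding("GB2312").encode("UTF-8")
end

bytes = "\xC4\xE3\xBA\xC3".b  # the two characters of "你好" as GB2312 bytes
text = gb2312_to_utf8(bytes)
puts text.encoding
puts text
```

Note the difference from the snippet above: force_encoding alone leaves the string in GB2312, which is fine for printing on a GB2312 terminal but not for mixing with UTF-8 strings.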

Nokogiri and Mechanize problem

I am working through one of the examples on the Mechanize doc site and I want to parse the results using Nokogiri.
My problem is that when the following line gets executed:
doc = Nokogiri::HTML(search_results, 'UTF-8' )
the following error occurs:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html/document.rb:71:in `parse': undefined method `name' for "UTF-8":String (NoMethodError)
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html.rb:13:in `HTML'
from mechanize_test.rb:16:in `<main>'
I have installed Ruby 1.9 on a Windows Vista machine.
The results returned by Mechanize are non-Latin (UTF-8).
The code sample follows.
# encoding: UTF-8
require 'rubygems'
require 'mechanize'
require 'nokogiri'
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("http://www.google.com/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = "invitations"
search_results = agent.submit(search_form)
puts search_results.body
doc = Nokogiri::HTML(search_results, 'UTF-8')
@Douglas Drouillard
Thanks for looking into this. I found out I made a mistake. The call to Nokogiri should have been:
doc = Nokogiri::HTML(search_results.body, 'UTF-8')
Note that search_results is different from search_results.body.
search_results is the page object that comes straight out of Mechanize,
while search_results.body contains the UTF-8 HTML that Nokogiri can parse with no problem.
This appears to be an issue with what Nokogiri expects as parameters to the parse method that is being called. The first issue I see is that you are passing the encoding option in the wrong parameter slot.
Here is a parsing example from the Nokogiri project page that specifies an encoding:
Nokogiri.XML('<foo><bar /><foo>', nil, 'EUC-JP')
Notice the encoding is the third parameter, not the second. But that still does not fully explain the behavior you are seeing, as the encoding should simply be ignored.
Per the Nokogiri documentation a call to Nokogiri::HTML() is a convenience method for the parse method.
Code for Nokogiri::HTML::parse
def parse thing, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML, &block
document.parse(thing, url, encoding, options, &block)
end
The source for the Nokogiri::HTML::Document parse method is a bit long, but here is the relevant part:
if string_or_io.respond_to?(:encoding)
unless string_or_io.encoding.name == "ASCII-8BIT"
encoding ||= string_or_io.encoding.name
end
end
Notice string_or_io.encoding.name; this matches the error you saw: undefined method `name' for "UTF-8":String (NoMethodError).
Does your search_results object have an encoding attribute whose value is the String 'UTF-8'? It appears Nokogiri expects encoding to return an object that in turn has a name of 'UTF-8', not a bare String.
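To make the failure mode concrete, here is a small stand-alone sketch that does not use Nokogiri or Mechanize at all. FakePage and encoding_name are hypothetical stand-ins: the first for any object whose encoding method returns a String, the second for the string_or_io.encoding.name call quoted above:

```ruby
# Hypothetical stand-in for a page-like object whose #encoding is a String.
FakePage = Struct.new(:body, :encoding)

# Mirrors the call chain from the excerpt: encoding, then name.
def encoding_name(string_or_io)
  string_or_io.encoding.name
end

html = "<html></html>"
puts encoding_name(html)  # a real String works: #encoding is an Encoding object

page = FakePage.new(html, "UTF-8")
begin
  encoding_name(page)
rescue NoMethodError => e
  # "UTF-8" is a plain String here, and String has no #name method.
  puts e.message
end
```

Passing the body string instead of the page object avoids the problem, because String#encoding returns an Encoding.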

How to convert with Ruby accented characters in HTML special entities

How can I do this in Ruby?
puts some_method("ò")
# => "&ograve;"
In other words, convert an accented character like ò to its HTML entity: &ograve;
I tried like this:
# coding: utf-8
require 'rubygems'
require 'htmlentities'
require 'unicode'
coder = HTMLEntities.new
string = "Scròfina"
puts coder.encode(string, :named)
but this is what I get (using http://htmlentities.rubyforge.org/):
/Library/Ruby/Gems/1.8/gems/htmlentities-4.2.0/lib/htmlentities/encoder.rb:85:in `unpack': malformed UTF-8 character (expected 2 bytes, given 1 bytes) (ArgumentError)
from /Library/Ruby/Gems/1.8/gems/htmlentities-4.2.0/lib/htmlentities/encoder.rb:85:in `encode_decimal'
from (eval):2:in `encode_extended'
from /Library/Ruby/Gems/1.8/gems/htmlentities-4.2.0/lib/htmlentities/encoder.rb:18:in `encode'
from /Library/Ruby/Gems/1.8/gems/htmlentities-4.2.0/lib/htmlentities/encoder.rb:18:in `gsub!'
from /Library/Ruby/Gems/1.8/gems/htmlentities-4.2.0/lib/htmlentities/encoder.rb:18:in `encode'
from /Library/Ruby/Gems/1.8/gems/htmlentities-4.2.0/lib/htmlentities.rb:74:in `encode'
from unicode_pleasure.rb:8
Thank you for your time!
Leonardo
I had to set $KCODE explicitly to make your example work. Also, make sure your source file is actually encoded as UTF-8!
# coding: utf-8
require 'rubygems'
require 'htmlentities'
require 'unicode'
$KCODE = 'UTF-8'
coder = HTMLEntities.new
string = "Scròfina"
puts coder.encode(string, :named)
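On Ruby 1.9 and later, $KCODE no longer has any effect, since strings carry their own encoding. If you only need numeric entities and want to avoid the gem, a rough stdlib-only sketch looks like this. The method name is made up, and it produces decimal entities rather than named ones like the :named option:

```ruby
# Hypothetical helper: replace every non-ASCII character with a decimal entity.
# Assumes the input is a valid UTF-8 string.
def to_numeric_entities(text)
  text.gsub(/[^\x00-\x7F]/) { |ch| "&##{ch.ord};" }
end

puts to_numeric_entities("Scròfina")  # Scr&#242;fina
```

Browsers render &#242; the same way as the named entity, so this is often enough for output that must stay pure ASCII.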
