Nokogiri and Mechanize problem - Ruby

I am working through one of the examples on the Mechanize documentation site and I want to parse the results using Nokogiri.
My problem is that when the following line gets executed:
doc = Nokogiri::HTML(search_results, 'UTF-8' )
the following error occurs:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html/document.rb:71:in `parse': undefined method `name' for "UTF-8":String (NoMethodError)
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html.rb:13:in `HTML'
from mechanize_test.rb:16:in `<main>'
I have installed Ruby 1.9 on a Windows Vista machine.
The results returned by Mechanize are non-Latin (UTF-8).
The code sample follows.
# encoding: UTF-8
require 'rubygems'
require 'mechanize'
require 'nokogiri'
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("http://www.google.com/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = "invitations"
search_results = agent.submit(search_form)
puts search_results.body
doc = Nokogiri::HTML(search_results, 'UTF-8')

@Douglas Drouillard
Thanks for looking into this. I found out I made a mistake. The call to Nokogiri should have been:
doc = Nokogiri::HTML(search_results.body, 'UTF-8')
Note that search_results is different from search_results.body.
search_results is the page object that comes straight out of Mechanize, while search_results.body contains the UTF-8 HTML that Nokogiri can parse with no problem.

This appears to be an issue with what Nokogiri expects as parameters to the parse method being called. The first issue I see is that you are passing the encoding in the wrong parameter slot.
Here is a parsing example from the Nokogiri project page that specifies an encoding:
Nokogiri.XML('<foo><bar /><foo>', nil, 'EUC-JP')
Notice the encoding is the third parameter, not the second. But that still does not fully explain the behavior you are seeing, as the encoding should simply be ignored.
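Applied to the code in the question, keeping the encoding in its own slot would look like this (just a sketch following the signature above, with nil filling the unused url slot):
# HTML string first, url slot left nil, encoding in the third slot
doc = Nokogiri::HTML(search_results.body, nil, 'UTF-8')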
Per the Nokogiri documentation, a call to Nokogiri::HTML() is a convenience method for the parse method.
Here is the code for Nokogiri::HTML.parse:
def parse thing, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML, &block
  document.parse(thing, url, encoding, options, &block)
end
The source for the Nokogiri::HTML::Document parse method is a bit long, but here is the relevant part:
if string_or_io.respond_to?(:encoding)
  unless string_or_io.encoding.name == "ASCII-8BIT"
    encoding ||= string_or_io.encoding.name
  end
end
Notice string_or_io.encoding.name; this matches the error you saw, undefined method 'name' for "UTF-8":String (NoMethodError).
Does your search_results object have an attribute with a key/value pair of {:encoding => 'UTF-8'}? It appears Nokogiri expects the encoding slot to hold an object that in turn has a name attribute of 'UTF-8'.
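A quick way to see the distinction the questioner later points out, with the values I would expect (not taken from the questioner's machine):
search_results.class                        # => Mechanize::Page
search_results.body.class                   # => String
search_results.body.respond_to?(:encoding)  # => true, so the check above can read its encoding name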

Related

I am getting an (eval):1: invalid Unicode codepoint error while trying to scrape Instagram

I am trying to scrape data from Instagram. Here is my code:
require 'open-uri'
require 'nokogiri'
require 'json'
require "unicode/emoji"

def get_html
  url = 'https://www.instagram.com/muriithi_kabogo/'
  html = open(url)
end

def pass_data
  html = get_html
  doc = Nokogiri::HTML(html)
end

def get_data
  profiles = []
  body = pass_data.at('body')
  script = body.at('script').text
  myText = script
  json_object_data = eval(myText)
end

get_data()
When I try to convert the text into JSON, I get an error:
(eval):1: invalid Unicode codepoint (SyntaxError)
usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr
How do I move past this error?
JSON, like JavaScript, uses UCS-2 encoding, which Ruby chokes on.
Do not use eval. For one thing, Ruby will detect \ud83d\ude0a as invalid codepoints, as it should; for another, eval is a security hole; and lastly, it slows down your code.
Use JSON.parse, which is safer, faster, and knows how to deal with UCS-2:
require 'json'
json_str = '"usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr"'
JSON.parse(json_str)
# => "usinessmen #beautiful #smile😊 #teambringit #shebr"

Want to get Taobao's list of product URLs on a search result page without the Taobao API

I want to get Taobao's list of product URLs on a search result page without using the Taobao API.
I tried the following Ruby script:
require "open-uri"
require "rubygems"
require "nokogiri"
url='https://world.taobao.com/search/search.htm?_ksTS=1517338530524_300&spm=a21bp.7806943.20151106.1&search_type=0&_input_charset=utf-8&navigator=all&json=on&q=%E6%99%BA%E8%83%BD%E6%89%8B%E8%A1%A8&cna=htqfEgp0pnwCATyQWEDB%2FRCE&callback=__jsonp_cb&abtest=_AB-LR517-LR854-LR895-PR517-PR854-PR895'
charset = nil
html = open(url) do |f|
  charset = f.charset
  f.read
end
doc = Nokogiri::HTML.parse(html, nil, charset)
p doc.xpath('//*[@id="list-itemList"]/div/div/ul/li[1]/div/div[1]/div/a/@href').each { |i| puts i.text }
# => 0
I want to get a list of URLs like https://click.simba.taobao.com/cc_im?p=%D6%C7%C4%DC%CA%D6%B1%ED&s=328917633&k=525&e=lDs3%2BStGrhmNjUyxd8vQgTvfT37ERKUkJtUYVk0Fu%2FVZc0vyfhbmm9J7EYm6FR5sh%2BLS%2FyzVVWDh7%2FfsE6tfNMMXhI%2B0UDC%2FWUl0TVvvELm1aVClOoSyIIt8ABsLj0Cfp5je%2FwbwaEz8tmCoZFXvwyPz%2F%2ByQnqo1aHsxssXTFVCsSHkx4WMF4kAJ56h9nOp2im5c3WXYS4sLWfJKNVUNrw%2BpEPOoEyjgc%2Fum8LOuDJdaryOqOtghPVQXDFcIJ70E1c5A%2F3bFCO7mlhhsIlyS%2F6JgcI%2BCdFFR%2BwwAwPq4J5149i5fG90xFC36H%2B6u9EBPvn2ws%2F3%2BHHXRqztKxB9a0FyA0nyd%2BlQX%2FeDu0eNS7syyliXsttpfoRv3qrkLwaIIuERgjVDODL9nFyPftrSrn0UKrE5HoJxUtEjsZNeQxqovgnMsw6Jeaosp7zbesM2QBfpp6NMvKM5e5s1buUV%2F1AkICwRxH7wrUN4%2BFn%2FJ0%2FIDJa4fQd4KNO7J5gQRFseQ9Z1SEPDHzgw%3D, however I am getting 0.
What should I do?
I don't know taobao.com, but the page seems to be running a lot of JavaScript, so the content probably cannot be retrieved by a client without JavaScript capabilities. Instead of open-uri, you could try the selenium-webdriver gem:
https://rubygems.org/gems/selenium-webdriver/versions/2.53.4
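A minimal sketch of that approach (the XPath is only loosely based on the one in the question, Firefox is an arbitrary browser choice, and the fixed sleep is a crude wait; adjust all of these to the real page):
require 'selenium-webdriver'

url = 'https://world.taobao.com/search/search.htm?...' # the search URL from the question (shortened here)

driver = Selenium::WebDriver.for :firefox
driver.get(url)
sleep 5 # crude wait for the JavaScript to render the result list

# Collect the product links from the rendered page
driver.find_elements(:xpath, '//*[@id="list-itemList"]//a').each do |link|
  puts link.attribute('href')
end

driver.quit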

How do I print an XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[#id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is the example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
1. The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'

doc = Nokogiri::HTML(open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
2. The correct chain of methods is agent.page.xpath, as sketched below.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.
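For the second point, a minimal sketch of going through agent.page (the XPath reuses the ID from the question, which per the first point will need to be replaced with one that actually exists):
require 'mechanize'

agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")

# Mechanize itself has no #xpath; the parsed document is reached through agent.page
puts agent.page.xpath("//*[@id='item_52b3985a70d58']/div[4]")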

Get website headline with Nokogiri

I'm trying to get a website's headline (in Vietnamese) using Nokogiri:
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://vnexpress.net"))
list = page.css("a[class='link-topnews']")
puts list[0].text
but it's giving the error:
undefined method `text' for nil:NilClass (NoMethodError)
The weird thing is, with the exact same code, sometimes it does work and gives the correct result:
Triều Tiên dọa hành động với máy bay B-52 của Mỹ
Even when trying to get the title it's giving the same error:
page = Nokogiri::HTML(open("http://vnexpress.net/"))
list = page.css("title")
puts list[0].text
Why does it behave like that? What did I do wrong?
It seems that their server refuses to serve content when you use just open-uri/Nokogiri. I suppose they are checking some headers. You can add headers (see the sketch after the Mechanize example below) or use the Mechanize gem:
require 'mechanize'
agent = Mechanize.new
page = agent.get "http://vnexpress.net"
page.search("a.link-topnews").first.text
=> "Triều Tiên dọa hành động với máy bay B-52 của Mỹ"

How to get Mechanize to auto-convert body to UTF8?

I found some solutions using post_connect_hook and pre_connect_hook, but it seems like they don't work. I'm using the latest Mechanize version (2.1). There are no [:response] fields in this version, and I don't know where to find them now.
https://gist.github.com/search?q=pre_connect_hooks
https://gist.github.com/search?q=post_connect_hooks
Is it possible to make Mechanize return a UTF8 encoded version, instead of having to convert it manually using iconv?
Since Mechanize 2.0, the arguments of pre_connect_hooks() and post_connect_hooks() have changed.
See the Mechanize documentation:
pre_connect_hooks()
A list of hooks to call before making a request. Hooks are called with the agent and the request to be performed.

post_connect_hooks()
A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.
This means you cannot change the internal response body from a hook, because the body is passed as a plain argument rather than in a mutable container. So the next best way is to replace the internal parser with your own:
class MyParser
  def self.parse(thing, url = nil, encoding = nil, options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
    # Insert your conversion code here. For example:
    # thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/, "utf-8") # you need to rewrite the content charset if it exists
    Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
  end
end

agent = Mechanize.new
agent.html_parser = MyParser
page = agent.get('http://somewhere.com/')
...
I found a solution that works pretty well:
class HtmlParser
  def self.parse(body, url, encoding)
    body.encode!('UTF-8', encoding, invalid: :replace, undef: :replace, replace: '')
    Nokogiri::HTML::Document.parse(body, url, 'UTF-8')
  end
end

Mechanize.new.tap do |web|
  web.html_parser = HtmlParser
end
No issues have come up so far.
In your script, just set: page.encoding = 'utf-8'
However, depending on your scenario, you may instead need to set the source encoding (the encoding of the website Mechanize is working with). To find it, open Firefox, load the website you want Mechanize to work with, select Tools in the menu bar, and open Page Info; it shows what the page is encoded in.
Using that information, you would set the page's actual encoding instead (such as page.encoding = 'windows-1252'), as in the snippet below.
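A minimal sketch of where that assignment goes (the URL and the encoding value here are placeholders):
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com/')  # substitute the site you are scraping

# Tell Mechanize what the page is actually encoded in (from Firefox's Page Info)
page.encoding = 'windows-1252'
puts page.search('title').text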
How about something like this:
class Mechanize
  alias_method :original_get, :get

  def get(*args)
    doc = original_get(*args)
    doc.encoding = 'utf-8'
    doc
  end
end
