Get website headline with Nokogiri - ruby

I'm trying to get a website's headline (in Vietnamese) using Nokogiri:
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://vnexpress.net"))
list = page.css("a[class='link-topnews']")
puts list[0].text
but it's giving the error:
undefined method `text' for nil:NilClass (NoMethodError)
The weird thing is, with the exact same code, sometimes it does work and gives the correct result:
Triều Tiên dọa hành động với máy bay B-52 của Mỹ
Even when trying to get the title it's giving the same error:
page = Nokogiri::HTML(open("http://vnexpress.net/"))
list = page.css("title")
puts list[0].text
Why does it behave like that? What did I do wrong?

It seems that their server refuses to serve content when you use just Nokogiri with open-uri. I suppose they are checking some request headers. You can add headers yourself or use the Mechanize gem:
require 'mechanize'
agent = Mechanize.new
page = agent.get "http://vnexpress.net"
page.search("a.link-topnews").first.text
=> "Triều Tiên dọa hành động với máy bay B-52 của Mỹ"
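For the "add headers" route, here is a minimal sketch using only Ruby's standard library. The User-Agent string and the helper names build_request and fetch_body are made up for illustration, and it is only an assumption that this particular server keys on the User-Agent header.

```ruby
require 'net/http'
require 'uri'

# Hypothetical browser-like User-Agent; any realistic value could be used.
BROWSER_UA = 'Mozilla/5.0 (compatible; ExampleScraper/1.0)'

# Build a GET request that carries explicit headers.
def build_request(url, user_agent = BROWSER_UA)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent
  request['Accept'] = 'text/html'
  request
end

# Send the request and return the response body as a String.
def fetch_body(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_request(url)).body
  end
end
```

The returned body can then be handed to Nokogiri, for example Nokogiri::HTML(fetch_body("http://vnexpress.net")). It is also worth checking that page.css(...) actually returned a node before calling .text on the first one, since list[0] is nil for an empty result.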

Related

want to get taobao's list of URL of products on search result page without taobao API

I want to get taobao's list of URL of products on search result page without taobao API.
I tried the following Ruby script.
require "open-uri"
require "rubygems"
require "nokogiri"
url='https://world.taobao.com/search/search.htm?_ksTS=1517338530524_300&spm=a21bp.7806943.20151106.1&search_type=0&_input_charset=utf-8&navigator=all&json=on&q=%E6%99%BA%E8%83%BD%E6%89%8B%E8%A1%A8&cna=htqfEgp0pnwCATyQWEDB%2FRCE&callback=__jsonp_cb&abtest=_AB-LR517-LR854-LR895-PR517-PR854-PR895'
charset = nil
html = open(url) do |f|
charset = f.charset
f.read
end
doc = Nokogiri::HTML.parse(html, nil, charset)
p doc.xpath('//*[@id="list-itemList"]/div/div/ul/li[1]/div/div[1]/div/a/@href').each{|i| puts i.text}
# => 0
I want to get list of URL like https://click.simba.taobao.com/cc_im?p=%D6%C7%C4%DC%CA%D6%B1%ED&s=328917633&k=525&e=lDs3%2BStGrhmNjUyxd8vQgTvfT37ERKUkJtUYVk0Fu%2FVZc0vyfhbmm9J7EYm6FR5sh%2BLS%2FyzVVWDh7%2FfsE6tfNMMXhI%2B0UDC%2FWUl0TVvvELm1aVClOoSyIIt8ABsLj0Cfp5je%2FwbwaEz8tmCoZFXvwyPz%2F%2ByQnqo1aHsxssXTFVCsSHkx4WMF4kAJ56h9nOp2im5c3WXYS4sLWfJKNVUNrw%2BpEPOoEyjgc%2Fum8LOuDJdaryOqOtghPVQXDFcIJ70E1c5A%2F3bFCO7mlhhsIlyS%2F6JgcI%2BCdFFR%2BwwAwPq4J5149i5fG90xFC36H%2B6u9EBPvn2ws%2F3%2BHHXRqztKxB9a0FyA0nyd%2BlQX%2FeDu0eNS7syyliXsttpfoRv3qrkLwaIIuERgjVDODL9nFyPftrSrn0UKrE5HoJxUtEjsZNeQxqovgnMsw6Jeaosp7zbesM2QBfpp6NMvKM5e5s1buUV%2F1AkICwRxH7wrUN4%2BFn%2FJ0%2FIDJa4fQd4KNO7J5gQRFseQ9Z1SEPDHzgw%3D however I am getting 0
What should I do?
I don't know taobao.com, but the page seems to be running lots of JavaScript, so the content may not be retrievable with a client that cannot execute JavaScript. Instead of open-uri, you could try the selenium-webdriver gem:
https://rubygems.org/gems/selenium-webdriver/versions/2.53.4

Web Scraping with Nokogiri and Mechanize

I am parsing prada.com and would like to scrape data in the div class "nextItem" and get its name and price. Here is my code:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
html_doc = Nokogiri::HTML(page)
page = html_doc.xpath("//ol[@class='nextItem']")
page.each do {|i| fp.write(i.text + "\n")}
end
I get an error and no output. What I think I am doing is instantiating a mechanize object and calling it agent.
Then creating a page variable and assigning it the url provided.
Then creating a variable that is a nokogiri object with the mechanize url passed in
Then searching the url for all class references that are titled nextItem
Then printing all the data contained there
Can someone show me where I might have gone wrong?
Since Prada's website dynamically loads its content via JavaScript, it will be hard to scrape its content. See "Scraping dynamic content in a website" for more information.
Generally speaking, with Mechanize, after you get a page:
page = agent.get(page_url)
you can easily search items with CSS selectors and scrape for data:
next_items = page.search(".fooClass")
next_items.each do |item|
price = item.search(".fooPrice").text
end
Then simply handle the strings or generate hashes as you desire.
Here are the wrong parts:
Check the block syntax again: use {} or do/end, but not both at the same time.
Mechanize#get returns a Mechanize::Page, which acts like a Nokogiri document; at the very least it has search, xpath and css. Use those instead of trying to coerce the page into a Nokogiri::HTML object.
There is no need to require 'open-uri' or 'nokogiri' when you are not using them directly.
Finally, it may be worth brushing up on Ruby basics before continuing with web scraping.
Here is the code with fixes:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
page = page.search("//ol[@class='nextItem']").each do |i|
fp.write(i.text + "\n")
end
fp.close
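To make the block syntax point concrete, here is a small sketch of the two valid forms side by side. It writes to any IO-like object, so a StringIO can stand in for the file; the helper names are made up for illustration.

```ruby
require 'stringio'

# Brace form: usually preferred for one-line blocks.
def write_with_braces(lines, io)
  lines.each { |line| io.write(line + "\n") }
  io
end

# do/end form: usually preferred for multi-line blocks.
def write_with_do_end(lines, io)
  lines.each do |line|
    io.write(line + "\n")
  end
  io
end
```

Both forms behave the same here. When writing to a real file, File.open('prada_prices', 'w') { |fp| write_with_braces(lines, fp) } closes the file automatically when the block ends, so the explicit fp.close is not needed.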

Using mechanize with watir + phantomjs

I'm trying to insert the html generated from phantom js into a mechanize object so that I can easily search it. I've tried the following to no avail...
b = Watir::Browser.new :phantomjs
url = "www.google.com"
b.goto url
agent = Mechanize.new
#Following is not executed at same time...
#Error 1: lots of errors
page = agent.get(b.html)
#Error 2: `parse': wrong number of arguments (1 for 3) (ArgumentError)
page = agent.parse(b.html)
#Error 3 last ditch effort: undefined method `agent'
page = agent(b.html)
As I think it through I'm beginning to wonder if I can mechanize an existing html object... I initially got onto it via: http://shane.in/2014/01/headless-web-scraping/ & http://watirmelon.com/2013/02/05/watir-webdriver-with-ghostdriver-on-osx-headless-browser-testing/
I was in the same situation. I had written a lot of code with Mechanize and did not want to move to Nokogiri when using Watir. The code below is how I did it.
require 'watir'
require 'mechanize'
b = Watir::Browser.new
b.goto(url)
html = b.html
a = Mechanize.new
page = Mechanize::Page.new(nil, {'content-type'=>'text/html'}, html, nil, a)
You could use page to search for elements.
require 'watir'
require 'nokogiri'
b = Watir::Browser.new :phantomjs
url = "http://google.com"
b.goto url
p Nokogiri::HTML(b.html)
You are probably better off just using Nokogiri for this [that is, if you only need to search for some data in source].

How do I print XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is some example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
doc = Nokogiri::HTML(open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.

Nokogiri and Mechanize problem

I am doing one of the examples from the Mechanize documentation site and I want to parse the results using Nokogiri.
My problem is that when the following line gets executed:
doc = Nokogiri::HTML(search_results, 'UTF-8' )
the following error occurs:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html/document.rb:71:in `parse': undefined method `name' for "UTF-8":String (NoMethodError)
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html.rb:13:in `HTML'
from mechanize_test.rb:16:in `<main>'
I have installed Ruby 1.9 on a Windows Vista machine.
The results returned by Mechanize are non-Latin (UTF-8).
The code sample follows.
# encoding: UTF-8
require 'rubygems'
require 'mechanize'
require 'nokogiri'
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("http://www.google.com/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = "invitations"
search_results = agent.submit(search_form)
puts search_results.body
doc = Nokogiri::HTML(search_results, 'UTF-8')
@Douglas Drouillard
Thanks for looking into this. I found out I made a mistake. The call to Nokogiri should have been:
doc = Nokogiri::HTML(search_results.body, 'UTF-8')
Note that search_results is different from search_results.body.
search_results is the page object coming straight out of Mechanize,
while search_results.body contains the UTF-8 HTML string, which Nokogiri can parse with no problem.
This appears to be an issue with what Nokogiri expects as parameters to the parse method that is being called. The first issue I see is that you are passing the encoding option in the wrong parameter slot.
Here is a parsing example from the Nokogiri project page that specifies an encoding:
Nokogiri.XML('<foo><bar /><foo>', nil, 'EUC-JP')
Notice the encoding is the third parameter, not the second. But that still does not fully explain the behavior you are seeing, as the encoding should simply be ignored.
Per the Nokogiri documentation a call to Nokogiri::HTML() is a convenience method for the parse method.
Code for Nokogiri::HTML::parse
def parse thing, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML, &block
document.parse(thing, url, encoding, options, &block)
end
The source for the Nokogiri::HTML::Document parse method is a bit long, but here is the relevant part though:
if string_or_io.respond_to?(:encoding)
unless string_or_io.encoding.name == "ASCII-8BIT"
encoding ||= string_or_io.encoding.name
end
end
Notice string_or_io.encoding.name; this matches the error you saw, undefined method 'name' for "UTF-8":String (NoMethodError).
Does your search_results object have an attribute with a key-value pair of {:encoding => 'UTF-8'}? It appears Nokogiri expects encoding to return an object that in turn has a name attribute of 'UTF-8'.
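The failure mode can be reproduced in plain Ruby without Nokogiri or Mechanize. This is only a sketch: PageLike is a hypothetical stand-in for an object whose encoding method returns a plain String, which is what the error message suggests the Mechanize result does, and encoding_name mimics the string_or_io.encoding.name call quoted above.

```ruby
# Hypothetical stand-in for a page-like object whose #encoding is a String.
PageLike = Struct.new(:body, :encoding)

# Mimics the check in the quoted parse method.
def encoding_name(thing)
  thing.encoding.name
end

# True when the object's #encoding returns something that responds to #name.
def encoding_check_passes?(thing)
  encoding_name(thing)
  true
rescue NoMethodError
  false
end
```

A real String returns an Encoding object from encoding, so the check passes; the stand-in returns the String "UTF-8", which has no name method. That is consistent with passing search_results.body instead of search_results making the error go away.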
