Parse HTML nodes using XPath with Ruby/Nokogiri

Run the following; it's supposed to return the sequence. The XPath was copied using Chrome's "Copy XPath" feature, but in Nokogiri it just returns an empty result.
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("https://pt.wiktionary.org/wiki/fazer"))
p sequence = doc.xpath('//*[@id="NavFrame1"]/div[2]/table[2]/tbody/tr[12]')

I have just tried it with Capybara and Poltergeist; it worked fine. I tried your code as well, but the div with id="NavFrame1" does not exist in the parsed document, so there might be a parsing problem...
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'
Capybara.register_driver :poltergeist_debug do |app|
  Capybara::Poltergeist::Driver.new(app, inspector: true)
end
Capybara.javascript_driver = :poltergeist_debug
Capybara.current_driver = :poltergeist_debug
visit("https://pt.wiktionary.org/wiki/fazer")
doc = Nokogiri::HTML.parse(page.html)
p sequence = doc.xpath('//*[@id="NavFrame1"]/div[2]/table[2]/tbody/tr[12]')

The issue is not with parsing, as suggested by @shota.
The actual issue is that the div element you are trying to parse is not part of the initial response; it gets added later by JavaScript.
If you view the page source of
https://pt.wiktionary.org/wiki/fazer
i.e. view-source:https://pt.wiktionary.org/wiki/fazer
you won't find any element with id NavFrame1.
You can also verify this using a JavaScript-disabling extension such as Quick Javascript Switcher.
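For example, you can confirm this from Ruby with plain Nokogiri (no JavaScript is executed), which is essentially what the original snippet does:
require 'open-uri'
require 'nokogiri'

# open-uri/Nokogiri only see the server's HTML; no JavaScript runs here
doc = Nokogiri::HTML(open("https://pt.wiktionary.org/wiki/fazer"))
p doc.at_css('#NavFrame1')   # => nil, because the element is added client-side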

Related

How to scrape pages which have lazy loading

Here is the code I used for parsing the web page. I ran it in the Rails console, but I am not getting any output. The site I want to scrape uses lazy loading.
require 'nokogiri'
require 'open-uri'
page = 1
while true
  url = "http://www.justdial.com/functions"+"/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits"+"&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"
  doc = Nokogiri::HTML(open(url))
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  d = doc.css(".rslwrp")
  d.each do |t|
    puts t.css(".jrcw").text
    puts t.css("span.jcn").text
    puts t.css(".jaid").text
    puts t.css(".estd").text
    page += 1
  end
end
You have two options here:
The first option is to switch from pure HTTP scraping to a tool that supports JavaScript evaluation, such as Capybara (with a proper driver selected). This can be slow, since you're running a headless browser under the hood, and you'll have to set some timeouts or find another way to make sure the blocks of text you're interested in have loaded before you start scraping (see the sketch after these options).
The second option is to use the Web Developer console to figure out how those blocks of text are loaded (which AJAX calls, their parameters, and so on) and implement the same calls in your scraper. This is a more advanced approach, but more performant, since you skip the extra work described in option 1.
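As a rough illustration of the first option (this is only a sketch: the listing URL is a placeholder you'd have to replace, and the selectors are taken from your snippet above):
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'
require 'nokogiri'

include Capybara::DSL
Capybara.register_driver(:poltergeist) { |app| Capybara::Poltergeist::Driver.new(app) }
Capybara.current_driver = :poltergeist

# Hypothetical listing page -- substitute the page you actually want to scrape
visit('http://www.justdial.com/Delhi/Pandits')

# has_css? waits (up to Capybara's configured timeout) for the lazily
# loaded blocks to appear before we hand the HTML over to Nokogiri
if page.has_css?('.rslwrp')
  doc = Nokogiri::HTML(page.html)
  doc.css('.rslwrp span.jcn').each { |t| puts t.text }
end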
Have a nice day!
UPDATE:
Your code above doesn't work because the response is HTML wrapped in a JSON object, while you're trying to parse it as raw HTML. It looks like this:
{
  "error": 0,
  "msg": "request successful",
  "paidDocIds": "some ids here",
  "itemStartIndex": 20,
  "lastPageNum": 50,
  "markup": "LOTS AND LOTS AND LOTS OF MARKUP"
}
What you need is to unwrap the JSON and then parse the markup as HTML:
require 'json'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
I'd also advise you against using open-uri, since your code may become vulnerable if you use dynamic URLs because of the way open-uri works (read the linked article for the details); prefer more full-featured libraries such as HTTParty or RestClient.
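For instance, the same JSON fetch could be done with HTTParty along these lines (just a sketch, not tested against the site):
require 'httparty'
require 'json'
require 'nokogiri'

# Same ajxsearch.php endpoint as above
url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'

response = HTTParty.get(url)          # HTTParty doesn't raise on HTTP errors by default; check response.code
json     = JSON.parse(response.body)  # the same error/empty-field checks apply here
doc      = Nokogiri::HTML(json['markup'])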
UPDATE 2: Minimal working script for me:
require 'json'
require 'open-uri'
require 'nokogiri'
url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi

How do I use Nokogiri to parse a bit.ly stats page?

I'm trying to parse the Twitter usernames from a bit.ly stats page using Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://bitly.com/U026ue+/global'))
twitter_accounts = []
shares = doc.xpath('//*[@id="tweets"]/li')
shares.map do |tweet|
  twitter_accounts << tweet.at_css('.conv.tweet.a')
end
puts twitter_accounts
My understanding is that Nokogiri will save shares in some form of tree structure, which I can use to drill down into, but my mileage is varying.
That data is coming in from an Ajax request with a JSON response. It's pretty easy to get at though:
require 'json'
require 'open-uri'
url = 'http://search.twitter.com/search.json?_usragnt=Bitly&include_entities=true&rpp=100&q=nowness.com%2Fday%2F2012%2F12%2F6%2F2643'
hash = JSON.parse open(url).read
puts hash['results'].map{|x| x['from_user']}
I got that URL by loading the page in Chrome and then looking at the Network panel; I also removed the timestamp and callback parameters just to clean things up a bit.
Actually, Eric Walker was onto something. If you look at doc, the section where the tweets are supposed to be looks like:
<h2>Tweets</h2>
<ul id="tweets"></ul>
</div>
This is likely because they're generated by a JavaScript call that Nokogiri isn't executing. One possible solution is to use Watir to navigate to the page, let the JavaScript run, and then grab the resulting HTML.
Here is a script that accomplishes just that. Note that you had some issues with your XPath arguments, which I've solved here, and that Watir will open a new browser every time you run this script:
require 'watir'
require 'nokogiri'
browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'
doc = Nokogiri::HTML.parse(browser.html)
twitter_accounts = []
shares = doc.xpath('//li[contains(@class, "tweet")]/a')
shares.each do |tweet|
  twitter_accounts << tweet.attr('title')
end
puts twitter_accounts
browser.close
You can also use headless to prevent a window from opening.
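If you go that route, the headless gem (a wrapper around Xvfb, so this assumes a Linux box with Xvfb installed) can wrap the same Watir script, roughly like this:
require 'headless'
require 'watir'
require 'nokogiri'

headless = Headless.new   # spins up a virtual X display via Xvfb
headless.start

browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'
doc = Nokogiri::HTML.parse(browser.html)
puts doc.xpath('//li[contains(@class, "tweet")]/a').map { |a| a.attr('title') }

browser.close
headless.destroy          # tears the virtual display down again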

How do I scrape data from a page that loads specific data after the main page load?

I have been using Ruby and Nokogiri to pull data from a URL similar to this one from the hollister website: http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358
My script looks like this right now:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358"))
puts page.css("h3[data-property=GLB_ORDERNUMBERSYMBOL]")[0].text
My problem is that the Hollister page loads some data asynchronously, so when my script checks the area of the page with order-specific data, the element doesn't exist yet. That is, the <h3> with data-property=GLB_ORDERNUMBERSYMBOL doesn't exist yet, but in the browser, if you let the page load for another ten seconds, the DOM and HTML change to reflect the specific order details.
What is the best way to capture this data that loads after the fact? I have tried using watir-webdriver, but I'm not sure what I would need to do to make that work either.
Try installing capybara-webkit (make sure you have QtWebKit installed, otherwise the gem install will fail). This will give you a headless solution. Then try this:
require 'capybara-webkit'
require 'capybara/dsl'
require 'nokogiri'
require 'open-uri'
url = 'http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358'
# Change the Capybara config to use the DSL and the webkit driver
include Capybara::DSL
Capybara.current_driver = :webkit
visit(url)
doc = Nokogiri::HTML.parse(body)
Then parse the body as you would normally. To get rid of all those error messages, try this:
Capybara.register_driver :webkit do |app|
  Capybara::Driver::Webkit.new(app, :stdout => nil)
end
I am not sure how to do it with open-uri, but if you want to use watir-webdriver, the following works:
require 'watir-webdriver'
b = Watir::Browser.new
b.goto('http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358')
puts b.h3(:class, 'order-num').when_present.text
Note that when_present() is called on the h3 tag. This means the script will wait for the h3 to appear before trying to get its text. If you know there are parts of the page that take time to load, adding an explicit wait usually solves the problem.
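If you need to wait on something more complex than a single element, Watir's generic wait helper works too; a minimal sketch against the same page:
require 'watir-webdriver'

b = Watir::Browser.new
b.goto('http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358')

# Block until the order number heading is present, then read it
Watir::Wait.until { b.h3(:class, 'order-num').present? }
puts b.h3(:class, 'order-num').text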
Following @benaneesh's answer, I had to make slight modifications to get it to work in my Ruby script and not show the unknown URL messages...
require 'capybara-webkit'
require 'capybara/dsl'
require 'nokogiri'
require 'open-uri'
include Capybara::DSL
Capybara.current_driver = :webkit
Capybara::Webkit.configure do |config|
  config.block_unknown_urls
  config.allow_url("*mysite.com")
end
#... rest of code

Nokogiri returns blank given correct XPath

Run the following; it's supposed to return the company name. The XPath works in Firefox and returns the company name, but in Nokogiri it just returns an empty string!
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.careerbuilder.com/JobSeeker/Jobs/JobDetails.aspx?IPath=QHKCV&ff=21&APath=2.21.0.0.0&job_did=J3G71D73BM9HCK1M84Z&cbRecursionCnt=1&cbsid=6d2aee1515ed404b8306d1a583592cd4-314600403-JQ-5'))
companyname = doc.xpath("/html[1]/body[1]/div[1]/div[1]/form[1]/div[1]/table[1]/tbody[1]/tr[2]/td[1]/div[1]/table[1]/tbody[1]/tr[1]/td[1]/div[1]/div[2]/table[1]/tbody[1]/tr[1]/td[2]").to_s
puts companyname
Your XPath is not correct :)
You should omit the tbody part; it is generated by the browser but is not present in what Nokogiri parses!
doc.xpath("/html[1]/body[1]/div[1]/div[1]/form[1]/div[1]/table[1]/tr[2]/td[1]/div[1]/table[1]/tr[1]/td[1]/div[1]/div[2]/table[1]/tr[1]/td[2]").to_s
NB: your XPath will also be more stable against changes to the HTML page if you use class or id attributes to select nodes, rather than the full path. For example you could use
doc.xpath("//div[#class='job_desc'][1]/table[1]/tr[1]/td[2]")
or even simpler, just use a CSS selector:
doc.css("div.job_desc td")[1]

Parsing an HTML table using Hpricot (Ruby)

I am trying to parse an HTML table using Hpricot, but I am stuck: I am not able to select the table element with the specified id from the page.
Here is my Ruby code:
require 'rubygems'
require 'mechanize'
require 'hpricot'
agent = WWW::Mechanize.new
page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')
form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])
doc = Hpricot(page.body)
puts doc.to_html # Here the doc contains the full HTML page
puts doc.search("//table[#id='gvw_offices']").first # This is NIL
Can anyone help me identify what's wrong with this?
Mechanize uses Hpricot internally (it's Mechanize's default parser). What's more, it passes search expressions straight through to the parser, so you don't have to invoke Hpricot yourself:
require 'rubygems'
require 'mechanize'
#You don't really need this if you don't use hpricot directly
require 'hpricot'
agent = WWW::Mechanize.new
page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')
form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])
puts page.parser.to_html # page.parser returns the hpricot parser
puts page.at("//table[#id='gvw_offices']") # This passes through to hpricot
Also note that page.search("foo").first is equivalent to page.at("foo").
Note that Mechanize no longer uses Hpricot by default in later versions (0.9.0 and up, where it uses Nokogiri), so you have to explicitly specify Hpricot to keep using it:
WWW::Mechanize.html_parser = Hpricot
Just like that, with no quotes or anything around Hpricot (there's probably a fully qualified module name you could use instead, because this won't work if you put the statement inside your own module declaration). Here's the best way to do it at the top of your file, before opening your module or class:
require 'mechanize'
require 'hpricot'
# Later versions of Mechanize no longer use Hpricot by default
# but have an attribute we can set to use it
begin
  WWW::Mechanize.html_parser = Hpricot
rescue NoMethodError
  # must be using an older version of Mechanize that doesn't
  # have the html_parser attribute - just ignore it, since
  # this older version will use Hpricot anyway
end
By using the rescue block you ensure that, if someone has an older version of Mechanize, the code won't barf on the nonexistent html_parser attribute. (Otherwise you would need to make your code depend on the latest version of Mechanize.)
Also, in the latest version, WWW::Mechanize::List was deprecated. Don't ask me why, because it totally breaks backward compatibility for statements like
page.forms.name('form1').first
which used to be a common idiom that worked because Page#forms returned a Mechanize List, which had a name method. Now it returns a plain array of Forms.
I found this out the hard way, but your usage will work because you're using find, which is an Array method.
But a better way to find the first form with a given name is Page#form, so your form-finding line becomes
form = page.form('form1')
This method works with both old and new versions.
