Nokogiri - Works with XML, not so much with HTML - ruby

I'm having an issue getting Nokogiri to work properly. I'm using version 1.4.4 with Ruby 1.9.2.
I have both libxml2 and libxslt installed and up to date. When I run a Ruby script against an XML file, it works great:
require 'nokogiri'

doc = Nokogiri::XML(File.open("test.xml"))
doc.css("name").each do |node|
  puts node.text
end
From the command line, running ruby test.rb returns:
Name 1
Name 2
Name 3
And the crowd goes wild.
I tweak a few things, make a few adjustments to the code...
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://domain.tld"))
doc.css("p").each do |node|
  puts node.text
end
Back at the command line, ruby test.rb returns... nothing! Just a new, empty line.
Is there any reason that it will work with an XML file, but not HTML?

To debug this sort of problem we need more information from you. Since you're not giving a working URL, and because we know that Nokogiri works fine for this sort of problem, the debugging falls on you.
Here's what I would do to test, in IRB (a condensed sketch of these steps follows below):
1. Do you get output when you do open('http://whateverURLyouarehiding.com').read?
2. If that returns a valid document, what do you get when you wrap the previous open statement in Nokogiri::HTML(...)? Keep the .read from the previous step, so Nokogiri receives the body of the page, NOT an IO stream.
3. Try #2 again, but remove the .read. That will tell you whether there's a problem with Nokogiri reading an IO stream, though I seriously doubt there is, since I use it that way all the time. At that point I'd suspect a problem on your system.
4. If you're getting a document in #2 and #3, then the problem could be in your accessor; I suspect what you're looking for doesn't exist.
5. If it does exist, then check the value of doc.errors after Nokogiri parses the document. Nokogiri could be finding errors in the document, and, if so, they'll be captured there.
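A rough IRB-style sketch of those checks; the URL is a placeholder since the question doesn't give the real one:
require 'nokogiri'
require 'open-uri'

url = 'http://example.com/'                 # placeholder URL

body = open(url).read                       # 1. does the raw fetch return anything?
puts body[0, 200]

doc_from_string = Nokogiri::HTML(body)      # 2. parse the body string
doc_from_io = Nokogiri::HTML(open(url))     # 3. parse the IO stream directly

puts doc_from_string.css('p').size          # 4. does the selector match anything?
puts doc_from_string.errors                 # 5. any parse errors captured?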

Related

How to get HTML of an element when using Poltergeist?

I'm using Capybara with the Poltergeist driver. My question is: how to get the HTML (string) of a node?
I've read that using the RackTest driver you can get it like this:
find("table").native #=> native Nokogiri element
find("table").native.to_html #=> "..."
But with Poltergeist calling #native on a node returns a Capybara::Poltergeist::Node, not a native Nokogiri element. And then calling #native again on the Capybara::Poltergeist::Node returns the same Capybara::Poltergeist::Node again (that is, it returns self).
It has become slightly irritating having to look at the HTML from the entire page to find what I'm looking for :P
I am adding this answer for others who land here. The solution is dead simple.
Following the example you provided, it would be:
find("table")['outerHTML']
I also find Poltergeist irritating. Here's what I did:
def nokogiri(selector)
  doc = Nokogiri::HTML(page.html)
  doc.css(selector)[0]
end
This takes a CSS selector and returns a native Nokogiri element, rather than Poltergeist's idiocy. You'll also have to require 'nokogiri', but that shouldn't be a problem since it's already a dependency of Poltergeist.
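Called from a step or spec, it might look like this (the selector here is only an illustration):
row = nokogiri("table tr")   # hypothetical selector
puts row.to_html if row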
It can be done like this.
Let's say that on google.co.in you want to fetch the text "India" from the logo.
In your step.rb file, inside your step definition, write:
x = page.find(:xpath, '//*[@id="hplogo"]/div', :visible => false).text
puts x
x will display "India".

Ruby RSS::Parser.to_s silently fails?

I'm using Ruby 1.8.7's RSS::Parser, part of stdlib. I'm new to Ruby.
I want to parse an RSS feed, make some changes to the data, then output it (as RSS).
The docs say I can use '#to_s', and it seems to work with some feeds, but not others.
This works:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://news.ycombinator.com/rss'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns expected output: XML text.
This fails:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://feeds.feedburner.com/devourfeed'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns nothing (empty quotes).
And yet, if I change the last line to:
p rss
I can see that the object is filled with all of the feed data. It's the to_s method that fails.
Why?
How can I get some kind of error output to debug a problem like this?
From what I can tell, the problem isn't in to_s, it's in the parser itself. Stepping way into the parser.rb code showed nothing being returned, so to_s returning an empty string is valid.
I'd recommend looking at something like Feedzirra.
Also, as an FYI, take a look at Ruby's OpenURI module for easy retrieval of web assets, like feeds. Open-URI is simple but adequate for most tasks. Net::HTTP is lower level, which will require you to write a lot more code to replace the functionality of Open-URI.
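For comparison, a minimal sketch of fetching the same feed both ways (feed URL taken from the question):
require 'open-uri'
require 'net/http'
require 'uri'

url = 'http://news.ycombinator.com/rss'

feed_via_open_uri = open(url).read                                # one call with open-uri
feed_via_net_http = Net::HTTP.get_response(URI.parse(url)).body   # lower-level Net::HTTP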
I had the same problem, so I started debugging the code. I think Ruby's RSS library has a few too many required elements. The channel needs to have title, link, and description; if one is missing, to_s will fail.
The second feed in the example above is missing the description, which makes to_s fail...
I believe this is a bug, but I don't really understand the code, and barely know Ruby, so who knows. It would seem natural to me that to_s would try its best even if some elements are missing.
Either way
rss.channel.description="something"
rss.to_s
will "work"
The problem lies in def have_required_elements?, or in self.class::MODELS.

nokogiri doc.xpath('head') returns nil

I am trying to get all the scripts declared in the head section of a given html, but no matter how I try, it always returns nil.
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.walmart.com.br/'))
puts doc.at('body')                           # returns nil
doc.xpath('//html/head').each { |n| puts n }  # this also never iterates
Any suggestions?
The page's DOCTYPE isn't valid, so Nokogiri parses the page improperly. A quick, inefficient fix to the problem:
require 'nokogiri'
require 'open-uri'
require 'pp'
# Request the HTML before parsing
html = open("http://www.walmart.com.br/").read
# Replace original DOCTYPE with a valid DOCTYPE
html = html.sub(/^<!DOCTYPE html(.*)$/, '<!DOCTYPE html>')
# Parse
doc = Nokogiri::HTML(html)
# Party.
pp doc.xpath("/html/head")
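Once the DOCTYPE is repaired and the parse succeeds, the head scripts the question was after should be reachable with an ordinary XPath; a sketch continuing from the code above:
# Scripts declared inside <head> only
doc.xpath('/html/head/script').each do |script|
  puts script['src'] || script.text[0, 80]
end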
Ok, when I tried it in script/console, I could indeed get something useful for:
doc.at('body')
so I'm not sure what's going wrong there for you.
As for the html head, I can't get the head element either: html works fine, but head doesn't, whichever way I try it.
I think there's something screwy with that walmart page. I tried doing the same thing for
Nokogiri::HTML(open('http://google.com/'))
and it worked just fine.
So unless you can figure out what they're doing to stop you from accessing parts of the page... then I don't know.
If you can deal with all scripts from the doc, I found that this one works just fine:
doc.xpath('//script')

How to use Scrubyt properly to grab the URL from the XML output

I am by no means a master with Ruby and am quite new to Scrubyt. I was just trying out some examples found on their wiki page. The example I was working on was getting the search results returned by Google when you search for 'ruby', and I had the idea of grabbing the URL of each result so I could go ahead and fetch that page as well. The problem is I don't know how to grab the URL appropriately. This is my code:
require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit

  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
end

google_data.to_xml.write($stdout, 1)
The code prints out the XML data appropriately (name and link), but how do I retrieve the link without the <link_url> tags that seem to get added to it? (I tried printing link_url and noticed the tags are printed as well.) Could I do something as simple as fetching link_url, or is there a way of extracting the text from the XML content held in link_url?
This is some of the content that gets printed by the google_data.to_xml.write():
<root>
  <link_title>
    Ruby Programming Language
    <link_url>http://ruby-lang.org/</link_url>
  </link_title>
  <link_title>
    Download Ruby
    <link_url>http://www.ruby-lang.org/en/downloads/</link_url>
  </link_title>
  <link_title>
    Ruby - The Inspirational Weight Loss Journey on the Style Network ...
    <link_url>http://www.mystyle.com/mystyle/shows/ruby/index.jsp</link_url>
  </link_title>
  <link_title>
    Ruby (programming language) - Wikipedia, the free encyclopedia
    <link_url>http://en.wikipedia.org/wiki/Ruby_(programming_language)</link_url>
  </link_title>
</root>
I'd think about alternatives. Scrubyt hasn't been updated in a while, and the forums have been shut down.
Mechanize can do what the Extractor does, Nokogiri can parse XML or HTML responses, and Builder can create XML (though it seems like you don't really want XML).
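For a flavor of that route, here's a rough sketch of the same search done with Mechanize and Nokogiri; the form field name and the a.l selector are assumptions carried over from the question's XPath and may need adjusting for the live page:
require 'rubygems'
require 'mechanize'

agent = Mechanize.new        # WWW::Mechanize.new on older versions
page  = agent.get('http://www.google.com/ncr')

form = page.forms.first      # assumed to be the search form
form['q'] = 'ruby'
results = agent.submit(form)

# results.parser is the Nokogiri document for the results page
results.parser.css('a.l').each do |link|
  puts "#{link.text} -> #{link['href']}"
end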

extract single string from HTML using Ruby/Mechanize (and Nokogiri)

I am extracting data from a forum. My script is working fine so far. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post, and I cannot get it to work. I used FireXPath to determine the xpath.
Sample code:
require 'rubygems'
require 'mechanize'
post_agent = WWW::Mechanize.new
post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.xpath('//*[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')
All my attempts end with an empty string or an error.
I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:
After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.
But what methods? Where can I read about them with samples and explained syntax? I did not find anything on Nokogiri's site either.
Radek. I'm going to show you how to fish.
When you call Mechanize::Page::parser, it's giving you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your xpaths. In general, start out with the most general xpath you can get to work, and then narrow it down. So, for example, instead of this:
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
start with this:
puts post_page.parser.xpath('//table').to_html
This gets any tables, anywhere, and then prints them as HTML. Examine the HTML to see what tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has the CSS class "userdata", then try this:
puts post_page.parser.xpath("//table[#class='userdata']").to_html
Any time you don't get back an array, you goofed up the xpath, so fix it before proceeding. Once you're getting the table you want, then try to get the rows:
puts post_page.parser.xpath("//table[#class='userdata']//tr").to_html
If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.
And that's how you do it.
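Putting those steps together for the original goal (pulling the post date out), a rough sketch; the "userdata" class is only the example name used above, so the real class has to come from inspecting the page's HTML:
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page  = agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
doc   = page.parser                                   # the Nokogiri document

puts doc.xpath("//table").length                      # step 1: widest net first

rows = doc.xpath("//table[@class='userdata']//tr")    # step 2: narrow by an attribute (class name assumed)
rows.each { |row| puts row.text.strip }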
I think you copied this from Firebug. Firebug gives you an extra tbody, which might not be there in the actual code, so my suggestion is to remove that tbody and try again.
If it still doesn't work... then follow Wayne Conrad's process; that's the best!
