I have a question about Nokogiri. I need to get the HTML elements from a page and get the XPath for each one, but I can't figure out how to do it with Nokogiri. The HTML is arbitrary, because I have to parse several pages from different websites.
If you are asking how to search for a node, you may use either CSS or XPath expressions, like so:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
# URI.open comes from open-uri; on Ruby 3.0+ plain open() no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://slashdot.com/"))
node_found_by_css = doc.css("h1").first
node_found_by_xpath = doc.xpath("/html/body//h1").first
If you are asking how, once you've found a node, you can retrieve the canonical XPath expression for it, you may use Node#path like so:
puts node_found_by_css.path # => "/html/body/div[3]/div[1]/div[1]/h1"
If you are asking how to get the XPath for each HTML element in a page, then the following should help. This will open and parse a page and then print out the XPath for each element.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://slashdot.com/"))
doc.traverse {|node| puts node.path }
Related
I'm experimenting with Ruby and Nokogiri.
I've figured out how to open a local html document and select nodes by classname:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(open("file"))
puts doc.css('a.target')
How do I then dump the document without the nodes I've selected?
Should be:
doc.css('a.target').remove
puts doc.at('html').to_s
Looking at this site here: http://www.grammy.com/nominees/search?artist=&title=&year=1958&genre=All
I can view all winners by year. How can I scrape just the names of the winners on each page (for each year) and store them in a simple database?
thanks!
This will get you the actual names. Cleaning them up a little and inserting them into a DB is left as an exercise for you:
require 'rubygems'
require 'hpricot'
require 'open-uri'
html = open("http://www.grammy.com/nominees/search?artist=&title=&year=1958&genre=All")
doc = Hpricot(html)
doc.search("td.views-field-field-nominee-extended-value").each do |winner|
puts winner.inner_html.strip
end
I am trying to parse a table using the Mechanize gem, but I don't know how to iterate over the table.
Mechanize uses Nokogiri for parsing HTML, so you should look up the documentation there. In particular, take a look at the xpath method.
Here's an example, parsing the current page:
require 'open-uri'
require 'nokogiri'
# URI.open comes from open-uri; on Ruby 3.0+ plain open() no longer fetches URLs
doc = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/4265745/how-to-get-all-text-inside-td-tags-from-table-tag-on-html-page-using-mechaniz'))
table = doc.xpath('//table').first # getting the first table on the page
table.xpath('tr/td').count # getting all the td nodes right below table/tr and counting them
#=> 4
My first question here, would be awesome to find an answer. I am new to using nokogiri.
Here is my problem. I have something like this in the HTML head on a target site (here a techcrunch post):
<meta content="During my time at TechCrunch I've seen thousands of startups and written about hundreds of them. I sure as hell don't know all ..." name="description"/>
I would now like to have a script to run through the meta tags, locate the one with the name attribute "description" and get what is in the content attribute.
I have tried something like this
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.techcrunch.com/2009/10/11/the-underutilized-power-of-the-video-demo-to-explain-what-the-hell-you-actually-do/"
doc = Nokogiri::HTML(open(url))
posts = doc.xpath("//meta")
posts.each do |link|
a = link.attributes['name']
b = link.attributes['content']
end
after which I could select the link where the attribute name is equal to description - but this code returns nil for a and b.
I played around with
posts = doc.xpath("//meta"), posts = doc.xpath("//meta/*"), etc. but still nil.
The problem is not with the XPath; it seems the document does not parse completely. You can check that with puts doc: the output does not contain the full input. It seems to be a problem with parsing comments (I suspect either invalid HTML or a bug in libxml2).
In your case I would use a regular expression as a workaround. Given that <meta> tags are simple enough, that might work, e.g. /<meta content="([^"]*)" name="([^"]*)"/ (the attribute order in the pattern has to match the page, where content comes before name).
you should change
doc = Nokogiri::HTML(open(url))
to
doc = Nokogiri::HTML(open(url).read)
update: or maybe not :) actually your code works for me, using ruby 1.8.7 / nokogiri 1.4.0
How do I load a web page and search for a word in Ruby?
Here's a complete solution:
require 'open-uri'
# URI.open comes from open-uri; on Ruby 3.0+ plain open() no longer fetches URLs
if URI.open('http://example.com/').read =~ /searchword/
# do something
end
For something simple like this I would prefer to write a couple of lines of code instead of using a full-blown gem. Here is what I would do:
require 'net/http'
# let's take the url of this page
uri = 'http://stackoverflow.com/questions/1878891/how-to-load-a-web-page-and-search-for-a-word-in-ruby'
response = Net::HTTP.get_response(URI.parse(uri)) # => #<Net::HTTPOK 200 OK readbody=true>
# match the word Ruby
/Ruby/.match(response.body) # => #<MatchData "Ruby">
I would switch to a gem only if I needed to do more than this and would otherwise have to implement something that one of the gems already provides.
I suggest using Nokogiri or Hpricot to open and parse HTML documents. If you need something simple that doesn't require parsing the HTML, you can just use the open-uri library built into most Ruby distributions. If you need something more complex for posting forms (or logging in), you can elect to use Mechanize.
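One caveat with a bare regex such as /Ruby/ is that it also matches inside longer words. Here is a small standard-library sketch of a whole-word, case-insensitive check. The helper name contains_word? is my own, and the sample strings stand in for a fetched response body.

```ruby
# Hypothetical helper: true if +word+ occurs in +text+ as a whole word,
# ignoring case. Regexp.escape keeps special characters in the word literal.
def contains_word?(text, word)
  pattern = /\b#{Regexp.escape(word)}\b/i
  !!(text =~ pattern)
end

# Sample strings standing in for response.body.
puts contains_word?('Learning Ruby is fun', 'ruby') # => true
puts contains_word?('I am a Rubyist', 'ruby')       # => false
```

Whether you want whole-word matching depends on the search; drop the \b anchors if substring matches are fine.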
Nokogiri is probably the preferred solution post _why, but both are about as simple as this:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri(URI.open("http://www.example.com"))
if doc.inner_text.match(/someword/)
puts "got it"
end
Both also allow you to search using xpath-like queries or CSS selectors, which allows you to grab items out of all divs with class=foo, for example.
Fortunately, it's not that big of a leap to move between open-uri, nokogiri and mechanize, so use the first one that meets your needs, and revise your code once you realize you need the capabilities of one of the other libraries.
You can also use the mechanize gem, with something similar to this:
require 'rubygems'
require 'mechanize'
# Current Mechanize releases use the top-level Mechanize class; the old WWW:: namespace is gone
mech = Mechanize.new.get('http://example.com') do |page|
if page.body =~ /mysearchregex/
puts "found it"
end
end