Hpricot remove single element - ruby

I'm using Ruby's Hpricot gem to parse html. I'd like to remove a single node from the document for use elsewhere, but I can't find a way.
I see that I can remove an entire list of elements, using an instance of Hpricot::Elements (x = (doc/"div").remove), but I only want to remove the first instance of a given tag.
Poking around, I see the suggestion that I simply replace the element's inner text with a comment node or whitespace (x.inner_html = ''), but that prevents me making use of the node elsewhere.
What can I do?
Specs: Ruby 1.8.7, Hpricot 0.8.4

Try this!
x = (doc/"div").first
x.parent.children.delete(x) unless x.nil?

Related

Why can't Nokogiri wrap my image with a link?

I'm confused by how Nokogiri (1.6.6.2) behaves when I try to wrap an image with a link tag. Here is an example of my problem:
fragment = Nokogiri::HTML5.fragment("<p>Example</p><img src='test.jpg' class='test'><p>Example</p>")
Now I would like to wrap the image with a link:
fragment.search('img').wrap('<a href="http://www.google.com"></a>')
This unfortunately results in an error:
ArgumentError: Requires a Node, NodeSet or String argument, and cannot accept a NilClass.
(You probably want to select a node from the Document with at() or search(), or create a new Node via Node.new().)
Now the very strange thing is, it works with other tags:
fragment.search('img').wrap('<something href="http://www.google.com"></something>')
Why is Nokogiri doing that? Is it a bug?
The first problem is:
uninitialized constant Nokogiri::HTML5 (NameError)
You want Nokogiri::HTML instead.
Running this:
require 'nokogiri'
fragment = Nokogiri::HTML.fragment("<p>Example</p><img src='test.jpg' class='test'><p>Example</p>")
fragment.search('img').wrap('<a href="test">')
and looking at fragment afterwards:
puts fragment.to_html
# >> <p>Example</p><a href="test"><img src="test.jpg" class="test"></a><p>Example</p>
It appears to be working correctly. Adding the trailing </a> also works.
Perhaps you need to check your Nokogiri and libXML2 versions.

How to get HTML of an element when using Poltergeist?

I'm using Capybara with the Poltergeist driver. My question is: how to get the HTML (string) of a node?
I've read that using the RackTest driver you can get it like this:
find("table").native #=> native Nokogiri element
find("table").native.to_html #=> "..."
But with Poltergeist calling #native on a node returns a Capybara::Poltergeist::Node, not a native Nokogiri element. And then calling #native again on the Capybara::Poltergeist::Node returns the same Capybara::Poltergeist::Node again (that is, it returns self).
It has become slightly irritating having to look at the HTML from the entire page to find what I'm looking for :P
I am adding this answer for others who land here. The solution is dead simple.
Following the example you provided, it would be:
find("table")['outerHTML']
I also find Poltergeist irritating. Here's what I did:
def nokogiri(selector)
  # Parse the current page source and return the first node matching the CSS selector
  doc = Nokogiri::HTML(page.html)
  doc.css(selector)[0]
end
This takes a CSS selector and returns a native Nokogiri element, rather than Poltergeist's idiocy. You'll also have to require 'nokogiri', but it shouldn't be a problem since it's a dependency for Poltergeist.
It can be done like this.
Let's say you want to fetch "India" from google.co.in.
In your step.rb file, inside your step definition, write this line:
x = page.find(:xpath, '//*[@id="hplogo"]/div', :visible => false).text
puts x
x will contain "India", and puts prints it to the terminal.

What would be the best way to take a string of html, chop it up, and put each piece into an array?

I have a general idea of how I can do this, but can't pinpoint how exactly to get it done. I am sure it can be done with a regex of some sort. Wondering if anyone here can point me in the right direction.
If I have a string of html such as this
some_html = '<div><b>This is some BOLD text</b></div>'
I want to divide it into logical pieces and then put those pieces into an array, so that I end up with a result like this:
html_array = ["<div>", "<b>", "This is some BOLD text", "</b>","</div>" ]
Rather than use regex I'd use the nokogiri gem (a gem for parsing html written by Aaron Patterson - contributor to Rails and Ruby). Here's a sample of how to use it:
html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
You can then call html_doc.children to get a nodeset and work your way from there
html_doc.children # returns a nodeset
Use an HTML parser, for instance, Nokogiri. Using SAX you can add tags/elements to the array as events are triggered.
It's not a good idea to try to regex HTML, unless you're planning to treat only a small determined subset of it.
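The event-driven idea above can be sketched without any gems. This is only an illustration: it uses REXML's stream parser from Ruby's standard distribution as a stand-in for Nokogiri's SAX interface, it only works on well-formed markup such as the sample string, and PieceCollector is a hypothetical name made up for this example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: appends one array entry per parser event.
class PieceCollector
  include REXML::StreamListener

  attr_reader :pieces

  def initialize
    @pieces = []
  end

  # Called for every opening tag; attributes are ignored in this sketch.
  def tag_start(name, _attrs)
    @pieces << "<#{name}>"
  end

  # Called for every text run; whitespace-only runs are skipped.
  def text(content)
    @pieces << content unless content.strip.empty?
  end

  # Called for every closing tag.
  def tag_end(name)
    @pieces << "</#{name}>"
  end
end

some_html = '<div><b>This is some BOLD text</b></div>'
collector = PieceCollector.new
REXML::Document.parse_stream(some_html, collector)
p collector.pieces
```

Because the opening tags are rebuilt from the element name alone, attributes are lost; a real implementation would need to serialize them back.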
some_html.split(/(<[^>]*>)/).reject{|x| '' == x}
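To see what that one-liner does, here it is applied to the string from the question. The helper name html_pieces is made up for this sketch. The capture group in the pattern is what makes split keep the tags in the result, and the second call shows the kind of input where a regex gives up.

```ruby
# Hypothetical helper wrapping the one-liner: split on tags, keep them
# (because of the capture group), and drop the empty strings in between.
def html_pieces(html)
  html.split(/(<[^>]*>)/).reject(&:empty?)
end

some_html = '<div><b>This is some BOLD text</b></div>'
p html_pieces(some_html)
# => ["<div>", "<b>", "This is some BOLD text", "</b>", "</div>"]

# Caveat: a ">" inside an attribute value ends the "tag" too early,
# so this input is cut in the wrong place.
p html_pieces('<a title="a > b">x</a>')
```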

Safe $variable binding in Nokogiri

Supposing I want to query for the XPath //*[@id=$href]. How can I tell Nokogiri to safely bind a value for the $href variable?
This is similar to REXML's XPath.first( node, "//*[@id=$href]", nil, {"href"=>"linktohere"})
This feature has just (a half hour ago) been added to Nokogiri, so it should appear in the next version.
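For reference, here is a minimal sketch of the REXML call quoted in the question, since REXML ships with Ruby and can be tried without extra gems. The sample XML and the user_input variable are invented for the illustration. The point is that the value travels in the variables hash (fourth argument) instead of being spliced into the XPath string.

```ruby
require 'rexml/document'

# Made-up sample document with two candidate nodes.
xml = '<root><a id="other">first</a><a id="linktohere">second</a></root>'
doc = REXML::Document.new(xml)

# The untrusted value is bound to $href rather than interpolated into the path.
user_input = 'linktohere'
node = REXML::XPath.first(doc, '//*[@id=$href]', nil, { 'href' => user_input })
puts node.text

# A value containing quotes is compared as a plain string, so it matches nothing.
hostile = "x' or '1'='1"
miss = REXML::XPath.first(doc, '//*[@id=$href]', nil, { 'href' => hostile })
p miss
```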

extract single string from HTML using Ruby/Mechanize (and Nokogiri)

I am extracting data from a forum. My script is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post, but I cannot get it to work. I used FireXPath to determine the XPath.
Sample code:
require 'rubygems'
require 'mechanize'
post_agent = WWW::Mechanize.new
post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.xpath('//*[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')
All my attempts end with an empty string or an error.
I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:
After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.
But what methods? Where can I read about them with samples and explained syntax? I did not find anything on Nokogiri's site either.
Radek. I'm going to show you how to fish.
When you call Mechanize::Page#parser, it gives you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your XPaths. In general, start out with the most general XPath you can get to work, and then narrow it down. So, for example, instead of this:
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
start with this:
puts post_page.parser.xpath('//table').to_html
This gets all tables, anywhere in the document, and then prints them as HTML. Examine the HTML to see what tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has CSS class "userdata", then try this:
puts post_page.parser.xpath("//table[@class='userdata']").to_html
Any time you don't get back an array, you goofed up the XPath, so fix it before proceeding. Once you're getting the table you want, then try to get the rows:
puts post_page.parser.xpath("//table[@class='userdata']//tr").to_html
If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.
And that's how you do it.
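The same narrowing process can be tried offline. The sketch below is an illustration under assumptions: the markup is invented, and REXML (bundled with Ruby) stands in for the Nokogiri document that Mechanize hands you, so the calls are REXML::XPath.match rather than Nokogiri's xpath. The XPath expressions themselves are the ones from the steps above.

```ruby
require 'rexml/document'

# Invented page with two tables; only one has the class we care about.
page = <<~HTML
  <html><body>
    <table class="nav"><tr><td>Home</td></tr></table>
    <table class="userdata">
      <tr><td>21 Dec 2009, 20:39</td></tr>
      <tr><td>Radek</td></tr>
    </table>
  </body></html>
HTML
doc = REXML::Document.new(page)

# Step 1: the most general query. How many tables came back?
tables = REXML::XPath.match(doc, '//table')

# Step 2: narrow to the one table we want.
userdata = REXML::XPath.match(doc, "//table[@class='userdata']")

# Step 3: only now ask for its rows.
rows = REXML::XPath.match(doc, "//table[@class='userdata']//tr")

puts tables.size, userdata.size, rows.size
puts rows.first.elements['td'].text
```

Checking the size of the result at each step shows which narrowing step went wrong, instead of leaving you with one long expression that silently returns nothing.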
I think you copied this XPath from Firebug. Firebug shows you an extra tbody, which might not be there in the actual source, so my suggestion is to remove that tbody and try again.
If it still doesn't work, then follow Wayne Conrad's process. That's the best!
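A small sketch of why the extra tbody matters, assuming the served markup has no tbody element. The markup is invented, and REXML is used as a stand-in parser so the example runs without gems; the effect on the XPath is the same idea.

```ruby
require 'rexml/document'

# Invented markup as it might arrive from the server: no <tbody>.
raw = '<table id="post"><tr><td>21 Dec 2009, 20:39</td></tr></table>'
doc = REXML::Document.new(raw)

# Path as copied from a browser inspector, with the tbody step included.
with_tbody = REXML::XPath.match(doc, '//table/tbody/tr/td')

# Same path with the tbody step removed.
without_tbody = REXML::XPath.match(doc, '//table/tr/td')

puts with_tbody.size
puts without_tbody.first.text
```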
