I'm using Capybara with the Poltergeist driver. My question is: how to get the HTML (string) of a node?
I've read that using the RackTest driver you can get it like this:
find("table").native #=> native Nokogiri element
find("table").native.to_html #=> "..."
But with Poltergeist calling #native on a node returns a Capybara::Poltergeist::Node, not a native Nokogiri element. And then calling #native again on the Capybara::Poltergeist::Node returns the same Capybara::Poltergeist::Node again (that is, it returns self).
It has become slightly irritating having to look at the HTML from the entire page to find what I'm looking for :P
I am adding this answer for others who land here. The solution is dead simple.
Following the example you provided, it would be:
find("table")['outerHTML']
I also find Poltergeist irritating. Here's what I did:
def nokogiri(selector)
  # Parse the current page source and return the first node matching the CSS selector.
  Nokogiri::HTML(page.html).at_css(selector)
end
This takes a CSS selector and returns a native Nokogiri element, rather than Poltergeist's idiocy. You'll also have to require 'nokogiri', but that shouldn't be a problem since it's a dependency of Poltergeist.
It can be done like this.
Let's say that on google.co.in you want to fetch "India".
In your step.rb file, inside your step definition, write these lines:
x = page.find(:xpath, '//*[@id="hplogo"]/div', :visible => false).text
puts x
x will contain "India", which puts prints to the terminal.
I'm confused about the reaction of Nokogiri (1.6.6.2), when I try to wrap an image with a link tag. Here is an example of my problem:
fragment = Nokogiri::HTML5.fragment("<p>Example</p><img src='test.jpg' class='test'><p>Example</p>")
Now I would like to wrap the image with a link:
fragment.search('img').wrap('<a href="http://www.google.com"></a>')
This unfortunately results in an error:
ArgumentError: Requires a Node, NodeSet or String argument, and cannot accept a NilClass.
(You probably want to select a node from the Document with at() or search(), or create a new Node via Node.new().)
Now the very strange thing is, it works with other tags:
fragment.search('img').wrap('<something href="http://www.google.com"></something>')
Why is Nokogiri doing that? Is it a bug?
The first problem is:
uninitialized constant Nokogiri::HTML5 (NameError)
You want Nokogiri::HTML instead.
Running this:
require 'nokogiri'
fragment = Nokogiri::HTML.fragment("<p>Example</p><img src='test.jpg' class='test'><p>Example</p>")
fragment.search('img').wrap('<a href="test">')
and looking at fragment afterwards:
puts fragment.to_html
# >> <p>Example</p><a href="test"><img src="test.jpg" class="test"></a><p>Example</p>
It appears to be working correctly. Adding the trailing </a> also works.
Perhaps you need to check your Nokogiri and libXML2 versions.
I am new to Nokogiri, but it looks like the tool I would use to scrape a webpage. I am looking for specific words on a webpage. The words are "Valid", "Requirements Met", and "Requirements Not". I am using Watir to drive through the website. I currently have:
page = Nokogiri::HTML.parse(browser.html)
to get the html, but I am not sure where to go from here.
Thanks for the help!
If you are using Watir to drive the website, I would suggest using Watir to check for the text. You can get all the text on the page using:
ie.text #Where ie is a Watir::IE
You could then check whether those words are included (by comparing against a regex):
if ie.text =~ /Valid|Requirements Met|Requirements Not/
#Do something if the words are on the page
end
That said, if you are looking for specific bits of text, you can use Watir to look specifically for those elements (and avoid parsing text or HTML). If you can provide an HTML sample of what you are working on, we can help find a more robust solution.
I am not sure why you are using both. If you just want to check for text, you could fetch the page with 'net/http' or Mechanize. Anyway, you can check for text in Watir with browser.text.match 'Valid', and likewise in Nokogiri with page.text.match 'Valid'.
You should also be able to use the .text method from Justin's answer along with the standard Ruby String#match? method, which returns true or false. (String#include? only accepts a string, not a regex, and raises a TypeError if you pass it one.)
if browser.text.match?(/Valid|Requirements Met|Requirements Not/)
  #code to execute if text found
else
  #code to execute if text not found
end
This also makes it easy to have a single-line validation step, if that is what you are after.
If using RSpec/Cucumber:
browser.text.should match(/Valid|Requirements Met|Requirements Not/)
If using Test::Unit:
assert_match(/Valid|Requirements Met|Requirements Not/, browser.text)
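One thing to watch with regexes here: String#include? takes a string, not a Regexp. The plain-Ruby sketch below uses a made-up text value standing in for browser.text and compares the two approaches.

```ruby
# Made-up stand-in for browser.text.
text = "Status: Requirements Met"
pattern = /Valid|Requirements Met|Requirements Not/

# A regex needs match? (or =~).
regex_hit = text.match?(pattern)

# include? works with plain strings, one at a time.
words = ["Valid", "Requirements Met", "Requirements Not"]
substring_hit = words.any? { |word| text.include?(word) }

# Passing a Regexp to include? raises TypeError.
include_error =
  begin
    text.include?(pattern)
    nil
  rescue TypeError => e
    e.class
  end

puts regex_hit
puts substring_hit
puts include_error.inspect
```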
I am trying to check whether text is present using Selenium 2 and Firefox, but can't seem to find the method to use. I tried the method is_text_present, which everyone says works, but it will not work for me. I get this error:
NoMethodError: undefined method `is_text_present' for #<Selenium::WebDriver::Driver:0x1017232e0 browser=:firefox>
How do you check the page for text using Selenium 2 and Firefox?
When I tried the Stack Overflow answer "Finding text in page with selenium 2", it did not work for me. I believe that is because I am writing my tests in Ruby, not Java.
If you don't know which element contains your text, you can use:
driver.page_source.include? 'TEXT_TO_SEARCH'
Otherwise, as Llelong and Thomp suggested, you can use:
driver.find_element(:id=>"ELEMENT_ID").text.include? 'TEXT_TO_SEARCH'
driver.find_element(:class=>"ELEMENT_CLASS").text.include? 'TEXT_TO_SEARCH'
You can find the text in Java using the following code snippet.
Try the equivalent in Ruby:
driver.getPageSource().contains("TEXTTOSEARCH");
require 'selenium-webdriver'
br = Selenium::WebDriver.for :firefox
br.get "http://cnn.com"
br.find_element(:id=>"cnnMainPage")
br.find_element(:id=>"cnnMainPage").text.include? "Mideast"
I tested this in the irb, so you may run into timing problems (may have to wait for the element to be present first).
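For the timing problem, one option is to poll until the text shows up. The sketch below is plain Ruby: wait_for_text is a hypothetical helper (not a Selenium method) and FakeDriver is a stand-in so the code runs without a browser. The only thing it assumes about a real driver is that it responds to page_source.

```ruby
# Hypothetical helper (not part of Selenium): polls page_source until the text shows up
# or the timeout (in seconds) runs out.
def wait_for_text(driver, text, timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if driver.page_source.include?(text)
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Stand-in for a real driver: the text only "appears" on the third read of page_source.
class FakeDriver
  def initialize
    @reads = 0
  end

  def page_source
    @reads += 1
    @reads >= 3 ? "<html><body>Mideast</body></html>" : "<html><body></body></html>"
  end
end

found = wait_for_text(FakeDriver.new, "Mideast", timeout: 2, interval: 0.01)
missing = wait_for_text(FakeDriver.new, "Nowhere", timeout: 0.05, interval: 0.01)
puts found
puts missing
```

With a real session you would pass br instead of FakeDriver.new. The selenium-webdriver gem also ships Selenium::WebDriver::Wait, which covers the same need.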
I'm not sure how to do it in Ruby, but you should be able to call the getPageSource() method, and check to see if that contains the string of text you're looking for.
If you can't find the text in the source, you probably want to identify the exact element that contains the text and call the getText() method on that element. For example, these are some common identifiers for elements:
driver.findElement(By.xpath("xpathstring")).getText()
driver.findElement(By.className("className")).getText()
driver.findElement(By.name("elementname")).getText()
driver.findElement(By.id("idname")).getText()
There are several more element identifiers, you'll have to consult the documentation if these don't work for you.
this worked for me...
driver.page_source.should_not include 'Login failed'
I'm having an issue getting Nokogiri to work properly. I'm using version 1.4.4 with Ruby 1.9.2.
I have both libxml2 and libxslt installed and up to date. When I run a Ruby script with XML, it works great.
require 'nokogiri'
doc = Nokogiri::XML(File.open("test.xml"))
doc = doc.css("name").each do |node|
puts node.text
end
Enter into the CL, run ruby test.rb, returns
Name 1
Name 2
Name 3
And the crowd goes wild.
I tweak a few things, make a few adjustments to the code...
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://domain.tld"))
doc = doc.css("p").each do |node|
puts node.text
end
Back to CL, ruby test.rb, returns... nothing! Just a new, empty line.
Is there any reason that it will work with an XML file, but not HTML?
To debug this sort of problem we need more information from you. Since you're not giving a working URL, and because we know that Nokogiri works fine for this sort of problem, the debugging falls on you.
Here's what I would do to test:
In IRB:
1. Do you get output when you do: open('http://whateverURLyouarehiding.com').read
2. If that returns a valid document, what do you get when you wrap the previous open statement in Nokogiri::HTML(...)? Keep the .read from the previous step, so Nokogiri receives the body of the page, NOT an IO stream.
3. Try #2 above, but remove the .read. That will tell you whether there's a problem with Nokogiri reading an IO stream, though I seriously doubt it, since I use it that way all the time. At that point I'd suspect a problem on your system.
4. If you're getting a document in #2 and #3, then the problem could be in your accessor; I suspect what you're looking for doesn't exist.
5. If it does exist, then check the value of doc.errors after Nokogiri parses the document. It could be finding errors in the document, and, if so, they'll be captured there.
I am extracting data from a forum. My script is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post, but I cannot get it to work. I used FireXPath to determine the XPath.
Sample code:
require 'rubygems'
require 'mechanize'
post_agent = WWW::Mechanize.new
post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.xpath('//*[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')
All my attempts end with an empty string or an error.
I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:
After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.
But what methods? Where can I read about them with samples and explained syntax? I did not find anything on Nokogiri's site either.
Radek. I'm going to show you how to fish.
When you call Mechanize::Page::parser, it's giving you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your xpaths. In general, start out with the most general xpath you can get to work, and then narrow it down. So, for example, instead of this:
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
start with this:
puts post_page.parser.xpath('//table').to_html
This gets any tables, anywhere, and then prints them as HTML. Examine the HTML to see which tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has CSS class "userdata", then try this:
puts post_page.parser.xpath("//table[@class='userdata']").to_html
Any time you don't get back an array, you goofed up the xpath, so fix it before proceeding. Once you're getting the table you want, then try to get the rows:
puts post_page.parser.xpath("//table[@class='userdata']//tr").to_html
If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.
And that's how you do it.
I think you have copied this from Firebug. Firebug gives you an extra tbody, which might not be there in the actual source, so my suggestion is to remove that tbody and try again.
If it still doesn't work, then follow Wayne Conrad's process; that's the best!