XPath, Ruby, Nokogiri and substring-after

I have HTML with a node like this:
<span>Website: http://example.com</span>
I want to get the text http://example.com, and I can extract it with this XPath:
substring-after(//span[contains(.,'Website:')],'Website: ')
With Ruby and Nokogiri I can extract it with:
doc.xpath("substring-after(//span[contains(.,'Website: ')],'Website: ')")
But if I use
doc.at_xpath("substring-after(//span[contains(.,'Website: ')],'Website: ')")
then it raises:
NoMethodError: undefined method `first' for "http://example.com":String
I don't want to use doc.xpath; I want to use doc.at_xpath, because I don't want to rewrite my function.
How can I do that? Any suggestions?
Thank you.

Related

Why can't Nokogiri wrap my image with a link?

I'm confused by how Nokogiri (1.6.6.2) behaves when I try to wrap an image in a link tag. Here is an example of my problem:
fragment = Nokogiri::HTML5.fragment("<p>Example</p><img src='test.jpg' class='test'><p>Example</p>")
Now I would like to wrap the image with a link:
fragment.search('img').wrap('<a href="http://www.google.com"></a>')
This unfortunately results in an error:
ArgumentError: Requires a Node, NodeSet or String argument, and cannot accept a NilClass.
(You probably want to select a node from the Document with at() or search(), or create a new Node via Node.new().)
Now the very strange thing is, it works with other tags:
fragment.search('img').wrap('<something href="http://www.google.com"></something>')
Why is Nokogiri doing that? Is it a bug?
The first problem is:
uninitialized constant Nokogiri::HTML5 (NameError)
You want Nokogiri::HTML instead.
Running this:
require 'nokogiri'
fragment = Nokogiri::HTML.fragment("<p>Example</p><img src='test.jpg' class='test'><p>Example</p>")
fragment.search('img').wrap('<a href="test">')
and looking at fragment afterwards:
puts fragment.to_html
# >> <p>Example</p><a href="test"><img src="test.jpg" class="test"></a><p>Example</p>
It appears to be working correctly. Adding the trailing </a> also works.
Perhaps you need to check your Nokogiri and libXML2 versions.

Removing XML tags when parsing XML

Using Ruby with Nokogiri, is there an easy way to remove the tags around returned results? I can't find one in the docs.
Example from the Nokogiri site:
characters[0].to_s # => "<character>Al Bundy</character>"
I was hoping to get:
Al Bundy
Try using the text method:
characters[0].text
You can use the .inner_html method. Here is an example you can use from a basic xml sitemap:
parse_content.css("url").each do |x|
  location = x.css("loc").inner_html
  last_mod = x.css("lastmod").inner_html
end
You can read about sitemaps here: https://www.sitemaps.org/protocol.html

How to get HTML of an element when using Poltergeist?

I'm using Capybara with the Poltergeist driver. My question is: how do I get the HTML (as a string) of a node?
I've read that using the RackTest driver you can get it like this:
find("table").native #=> native Nokogiri element
find("table").native.to_html #=> "..."
But with Poltergeist calling #native on a node returns a Capybara::Poltergeist::Node, not a native Nokogiri element. And then calling #native again on the Capybara::Poltergeist::Node returns the same Capybara::Poltergeist::Node again (that is, it returns self).
It has become slightly irritating having to look at the HTML from the entire page to find what I'm looking for :P
I am adding this answer for others who land here. The solution is dead simple.
Following the example you provided, it would be:
find("table")['outerHTML']
I also find Poltergeist irritating. Here's what I did:
def nokogiri(selector)
  # Re-parse the current page source and return the first matching element.
  doc = Nokogiri::HTML(page.html)
  doc.css(selector)[0]
end
This takes a CSS selector and returns a native Nokogiri element, rather than Poltergeist's idiocy. You'll also have to require 'nokogiri', but that shouldn't be a problem since it's a dependency of Poltergeist.
It can be done like this.
Let's say that on google.co.in you want to fetch the text "India".
In your step.rb file, inside your step definition, write these lines:
x = page.find(:xpath, '//*[@id="hplogo"]/div', :visible => false).text
puts x
x will contain "India".

How do I get the input value from a Nokogiri::XML::NodeSet?

I am looking for my input element using Nokogiri's xpath method.
It's returning an object of class Nokogiri::XML::NodeSet:
[#<Nokogiri::XML::Element:0x3fcc0e07de14 name="input" attributes=[#<Nokogiri::XML::Attr:0x3fcc0e07dba8 name="type" value="text">, #<Nokogiri::XML::Attr:0x3fcc0e07db94 name="name" value="creditInstallmentAmount">, #<Nokogiri::XML::Attr:0x3fcc0e07db44 name="style" value="width:240px">, #<Nokogiri::XML::Attr:0x3fcc0e07dae0 name="value" value="94.8">, #<Nokogiri::XML::Attr:0x3fcc0e07da18 name="readonly" value="true">]>
Is there a faster and cleaner way to get the value of the input than converting it with to_s:
"<input type=\"text\" name=\"creditInstallmentAmount\" style=\"width:240px\" value=\"94.8\" readonly>"
and matching it with regular expressions?
A couple things will help:
Nokogiri has the at method, which is the equivalent of search(...).first, and, instead of returning a NodeSet, it returns the Node itself, making it easy to grab values from it:
require 'nokogiri'
doc = Nokogiri::HTML('<input type="text" name="creditInstallmentAmount" style="width:240px" value="94.8" readonly>')
doc.at('input')['value'] # => "94.8"
doc.at('input')['value'].to_f # => 94.8
Also, notice I'm using CSS notation instead of XPath. Nokogiri supports both, and a lot of the time the CSS is more obvious and easier to read. The at_css method is the CSS-specific variant of at.
Note that Nokogiri uses a little test in search and at to try to determine whether the selector is CSS or XPath, and then branches accordingly to the specific method. The test can be fooled, at which point you should use the specific CSS or XPath variant, or always use them if you're paranoid. In years of using Nokogiri I've only once encountered the situation where the code was confused.
If you want to be more explicit about which input you want, you can match on the attributes of the tag:
doc.at('input[@name="creditInstallmentAmount"]')['value'] # => "94.8"
Get familiar with the difference between search and at and their variants, and Nokogiri will really become useful to you. Learn how to access the attributes and text() nodes and you'll know 99% of what you need to know for parsing HTML and XML.
Ok, I found the answer:
.map{|node| node["value"]}.first
Ok, this works for me
require 'nokogiri'
require 'open-uri'
html = URI.open(ARGV[0]) # Kernel#open no longer opens URLs as of Ruby 3.0
doc = Nokogiri::HTML(html)
inputs = doc.search 'input'
inputs.map{|node| node['name']}
or all in one
inputs = Nokogiri::HTML(html).search('input').map{|node| node['name']}

What's the xpath syntax to get tag names?

I'm using Nokogiri to parse a large XML file. Say I've got the following structure:
<menagerie>
<penguin>Pablo</penguin>
<penguin>Mortimer</penguin>
<bull>Ferdinand</bull>
<aardvark>James Cornelius Madison Humphrey Zophar Handlebrush III</aardvark>
</menagerie>
I can count the non-penguins like this:
xml.xpath('//menagerie//*[not(self::penguin)]').length # => 2
But how do I get a list of the tags, like this? (The exact format isn't important; I just want to visually scan the non-penguins.)
bull
aardvark
Update
This gave me the list I wanted - thanks Oded and TMN and delnan!
xml.xpath('//menagerie/*[not(self::penguin)]').each do |node|
  puts node.name
end
You can use the name() or local-name() XPath function.
See the examples on zvon.
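I know it's a bit outdated, but you could use xml.xpath('//menagerie/*[not(self::penguin)]/name()') as the expression. Note the slash, not the dot: this is how you call a function on each selected node in XPath 2.0. Nokogiri evaluates XPath through libxml2, which implements only XPath 1.0, so with Nokogiri you need to iterate over the nodes in Ruby instead.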
