Get a specific tag in a node? - ruby

I'm using Ruby, XPath and Nokogiri and trying to retrieve d1 from the following XML:
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
This is my code in a loop:
rs = doc.xpath("//a/b1/c/d1").inner_text
puts rs
It returns nothing (No error).
I want to get the text in <d1>.

You don't ask for the text content in your xpath query:
rs = doc.xpath('//a/b1/c/d1/text()')

You're misusing XPath:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
EOT
doc.at('/a/b1/c/d1').text # => "01/11/2001"
doc.at('//d1').text # => "01/11/2001"
// in XPath-ese means start at the top and look anywhere in your document. Instead, if you're supplying an explicit/absolute selector, start at the top of the document and drill down using '/a/b1/c/d1'. Or, do the simple thing and let the parser search through the document for that particular node using //d1. You can do that if you know there's a single instance of that node.
In my code above, I used at instead of xpath. at returns the first matching node, which is similar to using xpath('//d1').first. xpath returns a NodeSet, which is like an array of nodes, whereas at returns a Node only. Using inner_text on a NodeSet is likely to not give you the results you want, which would be the text of a particular node, so be careful there.
doc.xpath('/a/b1/c/d1/text()').class # => Nokogiri::XML::NodeSet
doc.xpath('//c').inner_text # => "\n 01/11/2001\n 02/02/2004\n "
doc.xpath('/a/b1/c/d1').first.text # => "01/11/2001"
Look at the following lines. Instead of using XPath selectors, I used CSS, which tends to be more readable. Nokogiri supports both.
doc.at('d1').text # => "01/11/2001"
doc.at('a b1 c d1').text # => "01/11/2001"
Also, notice the type of data returned from these two lines:
doc.at('/a/b1/c/d1/text()').class # => Nokogiri::XML::Text
doc.at('/a/b1/c/d1').text.class # => String
While it might seem good/smart to tell the parser to locate the text() node inside <d1>, what will be returned isn't text, and will need to be accessed further to make it usable, so consider forgoing the use of text() unless you know exactly why you need it:
doc.at('/a/b1/c/d1/text()').text # => "01/11/2001"
Finally, Nokogiri has many methods used for locating nodes. As I said above, xpath returns a NodeSet and at returns a Node. xpath is really an XPath-specific version of Nokogiri's search method. search, css and xpath all return NodeSets. at, at_css and at_xpath all return Nodes. The CSS and XPath variants are useful when you have an ambiguous selector that you need to be used as CSS or XPath specifically. Most of the time Nokogiri can figure whether it's CSS or XPath on its own and will do the right thing, so it's OK to use the generic search and at for the majority of your coding. Use the specific versions when you have to specify one or the other.

Related

How to retrieve string using XPath without returning null errors

I'm trying to write "Private Equity Group; USA" to a file.
"Private Equity Group" prints fine, but I get an error for the "USA" portion
TypeError: null is not an object (evaluating 'style.display')"
HTML code:
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
The XPath for "USA" is:
//*[@id="addrDiv-Id"]/div/div[3]/text()
I get the error when I print the XPath or have it in an if statement:
if (internet.has_xpath?('//*[@id="addrDiv-Id"]/div/div[3]/text()')){
file.puts "#{internet.find(:xpath, '//*[@id="addrDiv-Id"]/div/div[3]/text()')}"
}
Capybara is not a general purpose xpath library - it is a library aimed at testing, and therefore is element centric. The xpaths used need to refer to elements, not text nodes.
if internet.has_xpath?('//*[@id="addrDiv-Id"]/div/div[3]')
  file.puts internet.find(:xpath, '//*[@id="addrDiv-Id"]/div/div[3]').text
end
That said, using XPath at all for this is a bad idea. Whenever possible, default to CSS; it's easier to read and faster for the browser to process. Something like:
if internet.has_css?('#addrDiv-Id > div > div:nth-of-type(3)')
  file.puts internet.find('#addrDiv-Id > div > div:nth-of-type(3)').text
end
or if the HTML allows it (I don't know without seeing more of the HTML)
if internet.has_css?('#addrDiv-Id .cl.profile-xsmall')
  file.puts internet.find('#addrDiv-Id .cl.profile-xsmall').text
end
or even cleaner if it works for your use case
file.puts internet.first('#addrDiv-Id .cl.profile-xsmall')&.text
Another way to do it :
xml = %{<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA</div>}
require 'rexml/document'
doc = REXML::Document.new xml
print(REXML::XPath.match(doc, 'normalize-space(string(//div[@class="cl profile-xsmall"]))'))
Output :
["Private Equity Group USA"]
I'd say the HTML isn't well-formed, using span would have been better, but this works:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
EOT
div = doc.at('.profile-small-bold')
[div.text.strip, div.next_sibling.text.strip].join(' ')
# => "Private Equity Group USA"
which can be reduced to:
[div, div.next_sibling].map { |n| n.text.strip }.join(' ')
# => "Private Equity Group USA"
The problem is that you have two nested divs, with "USA" trailing, so it's important to point to the inner node which has the main text you want. Then "USA" is in the following text node, which is accessible using next_sibling:
div.next_sibling.class # => Nokogiri::XML::Text
div.next_sibling # => #<Nokogiri::XML::Text:0x3c "\n USA\n">
Note, I'm using CSS selectors; they're easier to read, which is echoed by the Nokogiri documentation. I have no proof they're faster, and, because Nokogiri uses libxml to process both, there's probably no real difference worth worrying about, so use whatever makes more sense, and run benchmarks if you're curious.
You might be tempted to use text against the div class="cl profile-xsmall" node, but don't be sucked into that, as it's a trap:
doc.at('.profile-xsmall').text # => "\n Private Equity Group\n USA\n"
doc.at('.profile-xsmall').text.gsub(/\s+/, ' ').strip # => "Private Equity Group USA"
text will return a string of the text nodes after they're concatenated together. In this particular, rare case, it gives a somewhat usable result; usually, however, you'll get something like this:
doc = Nokogiri::HTML('<div><p>foo</p><p>bar</p></div>')
doc.at('div').text # => "foobar"
doc.search('p').text # => "foobar"
Once those text nodes have been concatenated it's really difficult to take them apart again. Nokogiri's documentation talks about this:
Note: This joins the text of all Node objects in the NodeSet:
doc = Nokogiri::XML('<xml><a><d>foo</d><d>bar</d></a></xml>')
doc.css('d').text # => "foobar"
Instead, if you want to return the text of all nodes in the NodeSet:
doc.css('d').map(&:text) # => ["foo", "bar"]
The XPath for "USA" is:
//*[@id="addrDiv-Id"]/div/div[3]/text()
Um, no, not according to the HTML you gave us. But, let's pretend.
Using an absolute path to a node is a good way to write fragile selectors. It takes only a small change in the HTML to break your access to the node. Instead, find way-points to skip through the HTML to find the node you want, taking advantage of CSS and XPath to search downward through the DOM.
Typically, a selector like yours is generated by a browser, which isn't a good source to trust. Often browsers do fixups on malformed HTML, which changes it from what Nokogiri or a parser would see, resulting in a non-existing target, or the browser presents the HTML after JavaScript has had a chance to run, which can move nodes, hide them, add new ones, etc.
Instead of trusting the browser, use curl, wget or nokogiri at the command-line to dump the file and look at it using a text editor. Then you'll be seeing it just as Nokogiri sees it, prior to any fixups or mangling.

How to search two paths but get the results in order using Nokogiri

I am trying to search for elements with prefix w and also t or br using Nokogiri.
For example if this is the core of the doc returned from parsing:
<w:t></w:t><w:br></w:br><w:t></w:t>
This search
doc.search('.//w:t','.//w:br')
Results in:
['<w:t></w:t>','<w:t></w:t>','<w:br></w:br>']
Instead I want (the elements are in the order of the original doc):
['<w:t></w:t>','<w:br></w:br>','<w:t></w:t>']
Using CSS selectors you can do this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<t></t><br></br><t></t>
</xml>
EOT
doc.search('t, br')
# => [#<Nokogiri::XML::Element:0x3c name="t">, #<Nokogiri::XML::Element:0x50 name="br">, #<Nokogiri::XML::Element:0x64 name="t">]
doc.search('t, br').map(&:to_html)
# => ["<t></t>", "<br>", "<t></t>"]
CSS selectors are recommended by Nokogiri's authors because they're generally easier and less noisy.
Using XPath, this'd work:
doc.search('//t | //br')
# => [#<Nokogiri::XML::Element:0x3c name="t">, #<Nokogiri::XML::Element:0x50 name="br">, #<Nokogiri::XML::Element:0x64 name="t">]
doc.search('//t | //br').map(&:to_html)
# => ["<t></t>", "<br>", "<t></t>"]
However, your XML has namespaces, and you didn't show us the appropriate namespace declaration so that's left for you to figure out.
See Nokogiri's Namespaces documentation for more information.
Thanks to the Tin Man's response, the answer I was looking for is this
doc.search('.//w:t | .//w:br')

How to find an element's text in Capybara while ignoring inner element text

In the HTML example below I am trying to grab the $16.95 text in the outer span.sale element and exclude the text from the inner span.sale-text one.
<div class="price">
<span class="sale">
<span class="sale-text">"Low price!"</span>
"$16.95"
</span>
</div>
If I was using Nokogiri this wouldn't be too difficult.
price = doc.css('.sale')
price.search('.sale-text').remove
price.text
However, Capybara navigates rather than removes nodes. I knew something like price.text would grab text from all sub elements, so I tried to use XPath to be more specific: p.find(:xpath, "//span[@class='sale']", :match => :first).text. However, this grabs text from the inner element as well.
Finally, I tried looping through all spans to see if I could separate the results but I get an Ambiguous error.
p.find(:css, 'span').each { |result| puts result.text }
Capybara::Ambiguous: Ambiguous match, found 2 elements matching css "span"
I am using Capybara/Selenium as this is for a web scraping project with authentication complications.
There is no single-statement way to do this with Capybara since the DOM's concept of innerText doesn't really support what you want to do. Assuming p is the '.price' element, two ways you could get what you want are as follows:
Since you know the node you want to ignore just subtract that text from the whole text
p.find('span.sale').text.sub(p.find('span.sale-text').text, '')
Grab the innerHTML string and parse that with Nokogiri or Capybara.string (which just wraps Nokogiri elements in the Capybara DSL)
doc = Capybara.string(p['innerHTML'])
nokogiri_fragment = doc.native
#do whatever you want with the nokogiri fragment

Use Nokogiri to get all nodes in an element that contain a specific attribute name

I'd like to use Nokogiri to extract all nodes in an element that contain a specific attribute name.
e.g., I'd like to find the 2 nodes that contain the attribute "blah" in the document below.
@doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>
EOHTML
I found this suggestion (below) at this website: http://snippets.dzone.com/posts/show/7994, but it doesn't return the 2 nodes in the example above. It returns an empty array.
# get elements with attribute:
elements = @doc.xpath("//*[@*[blah]]")
Thoughts on how to do this?
Thanks!
I found this here
elements = @doc.xpath("//*[@*[blah]]")
This is not a useful XPath expression. It says to give you all elements that have attributes that have child elements named 'blah'. And since attributes can't have child elements, this XPath will never return anything.
The DZone snippet is confusing in that when they say
elements = @doc.xpath("//*[@*[attribute_name]]")
the inner square brackets are not literal... they're there to indicate that you put in the attribute name. Whereas the outer square brackets are literal. :-p
They also have an extra * in there, after the @.
What you want is
elements = @doc.xpath("//*[@blah]")
This will give you all the elements that have an attribute named 'blah'.
You can use CSS selectors:
elements = @doc.css "[blah]"

Why can't I wrap <span> around the following Nokogiri XPath?

doc = Nokogiri::HTML(open(url)).xpath("//*")
.xpath("//*[br]/text()[string-length(normalize-space()) != 0]")
.wrap("<span></span>")
puts doc
It just returns the text... I was expecting the full HTML source with <span> now wrapped around the specified XPath elements.
Try
doc = Nokogiri::HTML(open(url)).xpath("//*")
.xpath("//*[br and text()[string-length(normalize-space()) != 0]]")
.wrap("<span></span>")
puts doc
What your XPath does is it fetches the non-empty text nodes. Which by their very definition don't contain any markup.
In contrast, my XPath fetches any node that contains at least one <br> and at least one non-empty text node.
