Get attribute value from XML - ruby

I have this chunk of XML:
<show name="Are We There Yet?">
<sid>24588</sid>
<network>TBS</network>
<title>The Kwandanegaba Children's Fund Episode</title>
<ep>03x31</ep>
<link>
http://www.tvrage.com/shows/id-24588/episodes/1065228407
</link>
</show>
I am trying to get "Are we there yet?" via Nokogiri. It is effectively the 'name' attribute of 'show'. I'm struggling to figure out how to parse this.
xml.at_css('show').value was my best guess but doesn't work.

You can use the following:
xml.at('//show/#name').text
which is XPath expression that returns the name attribute from the show element.

Use:
require 'nokogiri'
xml =<<EOT
<show name="Are We There Yet?">
<sid>24588</sid>
<network>TBS</network>
<title>The Kwandanegaba Children's Fund Episode</title>
<ep>03x31</ep>
<link>
http://www.tvrage.com/shows/id-24588/episodes/1065228407
</link>
</show>
EOT
xml = Nokogiri::XML(xml)
puts xml.at('show')['name']
=> Are We There Yet?
at accepts either CSS or XPath expressions, so feel free to use it for both. Use at_css or at_xpath if you know you need to declare the expression as CSS or XPath, respectively. at returns a Node, so you can simply reference the parameters of the node like you would a hash.

Related

xpath remove an attribute from an dynamic attribute list

I want remove an xml attribute via xpath, but
the xml element could have more atrributes in the future.
html code:
<p class="red, blue, green">test/<p>
xpath:
<xpath expr="//p[contains(#class, 'green')]" position="attributes">
<attribute name="class">red, blue</attribute>
</xpath>
Is where a better way for fixtext "red, blue"?
In order to suppport possible new version of the html file like
"<p class="red, blue, green, brown">test</p>" in the future without need to change the xpath code again.
for instance actual attribute list as var + an xpath function
What about setting the #class to
concat(substring-before(#class, "green"), substring-after(#class, "green"))
You'll need to solve the abandoned commas, too, but as Björn Tantau commented, in real HTML the classes would be separated by spaces, so you can just wrap the result into normalize-space.

How to find the parent node by matching text using XPath

I have some XML:
<sys>
<lang>
<employee>
<name>Employee 1</name>
<code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>
</employee>
<employee>
<name>Employee 2</name>
<code>1d960bdc-0853-49af-bb83-18cf92493897</code>
</employee>
</lang>
</syz>
How can I search and get the employee node where name ="Employee 1"?
I tried this but it didn't work:
obj.xpath("//sys/lang[/employee/name = 'Employee 1']")
This XPath
/sys/lang/employee[name = 'Employee 1']
will select the employee element whose name is Employee 1.
Why might OP be getting an "Invalid expression" using the above XPath?
Transcription error.
Resolution: Use copy and paste.
Single quotes around single quotes.
Resolution: Use outer double quotes: "/sys/lang/employee[name = 'Employee 1']"
Smart quotes.
Resolution: Replace ‘ and ’ with single quote '.
Misinterpretation of error message.
Resolution: Carefully check any line number mentioned in error, or carve away surrounding code as much as possible, and see if error goes away.
If none of the above possibilities apply, post a MCVE (Minimal, Complete, and Verifiable Example, including the provided XPath and the calling code -- the complete in MCVE) that produces the invalid expression error, and someone will likely immediately spot the problem.
I'm a big fan of using CSS over XPath for readability reasons. Nokogiri implements a number of jQuery's extensions to make it easier to use CSS for things we'd usually use XPath for.
I'd do it this way:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<sys>
<lang>
<employee>
<name>Employee 1</name>
<code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>
</employee>
<employee>
<name>Employee 2</name>
<code>1d960bdc-0853-49af-bb83-18cf92493897</code>
</employee>
</lang>
</syz>
EOT
emp1 = doc.at('employee name:contains("Employee 1")') # => #<Nokogiri::XML::Element:0x3ffed05285b4 name="name" children=[#<Nokogiri::XML::Text:0x3ffed05283d4 "Employee 1">]>
emp1.to_xml # => "<name>Employee 1</name>"
emp1.parent.to_xml # => "<employee>\n <name>Employee 1</name>\n <code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>\n </employee>"
Also note, it's not good practice to define the full path in the selector for a node. If the HTML or XML changes the structure that selector will break. Instead, find useful landmarks and hop from one to the next. That way your selector is more likely to survive changes in the markup. I only care about finding the appropriate <employee>...<name> combination, not those two tags embedded under <sys> and <lang>.
Sometimes an alternate way of getting to the information you want is to use search and look at a particular index:
doc.search('employee').first.to_xml # => "<employee>\n <name>Employee 1</name>\n <code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>\n </employee>"
Or:
doc.at('employee').to_xml # => "<employee>\n <name>Employee 1</name>\n <code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>\n </employee>"
at('some selector') is equivalent to search('some selector').first.

Get element in particular index nokogiri

How can I get the element at index 2.
For example in following HTML I want to display the third element i.e a DIV:
<HTMl>
<DIV></DIV>
<OL></OL>
<DIV> </DIV>
</HTML>
I have been trying the following:
p1 = html_doc.css('body:nth-child(2)')
puts p1
I don't think you're understanding how we use a parser like Nokogiri, because it's a lot easier than you make it out to be.
I'd use:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<HTMl>
<DIV>1</DIV>
<OL></OL>
<DIV>2</DIV>
</HTML>
EOT
doc.at('//div[2]').to_html # => "<div>2</div>"
That's using at which returns the first Node that matches the selector. //div[2] is an XPath selector that will return the second <div> found. search could be used instead of at, but it returns a NodeSet, which is like an array, and would mean I'd need to extract that particular node.
Alternately, I could use CSS instead of XPath:
doc.search('div:nth-child(3)').to_html # => "<div>2</div>"
Which, to me, is not really an improvement over the XPath as far as readability.
Using search to find all occurrences of a particular tag, means I have to select the particular element from the returned NodeSet:
doc.search('div')[1].to_html # => "<div>2</div>"
Or:
doc.search('div').last.to_html # => "<div>2</div>"
The downside to using search this way, is it will be slower and needlessly memory intensive on big documents since search finds all occurrences of the nodes that match the selector in the document, and which are then thrown away after selecting only one. search, css and xpath all behave that way, so, if you only need the first matching node, use at or its at_css and at_xpath equivalents and provide a sufficiently definitive selector to find just the tag you want.
'body:nth-child(2)' doesn't work because you're not using it right, according to ":nth-child()" and how I understand it works. nth-child looks at the tag supplied, and finds the "nth" occurrence of it under its parent. So, you're asking for the third tag under body's "html" parent, which doesn't exist because a correctly formed HTML document would be:
<html>
<head></head>
<body></body
</html>
(How you tell Nokogiri to parse the document determines how the resulting DOM is structured.)
Instead, use: div:nth-child(3) which says, "find the third child of the parent of div, which is "body", and results in the second div tag.
Back to how Nokogiri can be told to parse a document; Meditate on the difference between these:
doc = Nokogiri::HTML(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>foo</p>
# >> </body></html>
and:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <p>foo</p>
If you can modify the HTML add id's and classes to target easily what you are looking for (also add the body tag).
If you can not modify the HTML keep your selector simple and access the second element of the array.
html_doc.css('div')[1]

Get a specific tag in a node?

I'm using Ruby, XPath and Nokogiri and trying to retrieve d1 from the following XML:
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
This is my code in a loop:
rs = doc.xpath("//a/b1/c/d1").inner_text
puts rs
It returns nothing (No error).
I want to get the text in <d1>.
You don't ask for the text content in your xpath query:
rs = doc.xpath('//a/b1/c/d1/text()')
You're misusing XPath:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
EOT
doc.at('/a/b1/c/d1').text # => "01/11/2001"
doc.at('//d1').text # => "01/11/2001"
// in XPath-ese means start at the top and look anywhere in your document. Instead, if you're supplying an explicit/absolute selector, start at the top of the document and drill down using '/a/b1/c/d1'. Or, do the simple thing and let the parser search through the document for that particular node using //d1. You can do that if you know there's a single instance of that node.
In my code above, I used at instead of xpath. at returns the first matching node, which is similar to using xpath('//d1').first. xpath returns a NodeSet, which is like an array of nodes, whereas at returns a Node only. Using inner_text on a NodeSet is likely to not give you the results you want, which would be the text of a particular node, so be careful there.
doc.xpath('/a/b1/c/d1/text()').class # => Nokogiri::XML::NodeSet
doc.xpath('//c').inner_text # => "\n 01/11/2001\n 02/02/2004\n "
doc.xpath('/a/b1/c/d1').first.text # => "01/11/2001"
Look at the following lines. Instead of using XPath selectors, I used CSS, which tends to be more readable. Nokogiri supports both.
doc.at('d1').text # => "01/11/2001"
doc.at('a b1 c d1').text # => "01/11/2001"
Also, notice the type of data returned from these two lines:
doc.at('/a/b1/c/d1/text()').class # => Nokogiri::XML::Text
doc.at('/a/b1/c/d1').text.class # => String
While it might seem good/smart to tell the parser to locate the text() node inside <d1>, what will be returned isn't text, and will need to be accessed further to make it usable, so consider forgoing the use of text() unless you know exactly why you need it:
doc.at('/a/b1/c/d1/text()').text # => "01/11/2001"
Finally, Nokogiri has many methods used for locating nodes. As I said above, xpath returns a NodeSet and at returns a Node. xpath is really an XPath-specific version of Nokogiri's search method. search, css and xpath all return NodeSets. at, at_css and at_xpath all return Nodes. The CSS and XPath variants are useful when you have an ambiguous selector that you need to be used as CSS or XPath specifically. Most of the time Nokogiri can figure whether it's CSS or XPath on its own and will do the right thing, so it's OK to use the generic search and at for the majority of your coding. Use the specific versions when you have to specify one or the other.

Get low level xpath from XML with Nokogiri

I'm trying to store in an array all the unique Xpaths of the low level elements in the XML below, but like I'm doing in array a is being stored all the XML, not only the Xpath themselves. The XML has different levels of Xpath. I mean, some child elements only have 2 ancestors and others more than one.
This is the code I have.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>Cake</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
<batter>Chocolate</batter>
<batter>Blueberry</batter>
<batter>Devil's Food</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Powdered Sugar</topping>
<topping>Chocolate with Sprinkles</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
<item>
<name>Raised</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
</items>
EOT
a = []
a = doc.xpath("//*")
puts a
I'd like to store in array "a" only the unique xpaths as below:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping
Maybe somebody could help me in how to do this.
Thanks for the help.
What you want to select is the "leaf" nodes. You can do it like so:
doc.xpath("//*[not(*)]")
This means "select all elements that don't contain elements".
If you want the XPaths, you'll need to call .path on each node. But the paths provided by Nokogiri have explicit positions (e.g. /items/item[2]/topping[4]), so you'll have to apply a regex to remove them, then remove duplicates with uniq:
doc.xpath("//*[not(*)]").map {|leaf| leaf.path.gsub(/\[.*?\]/, '') }.uniq
Output:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping

Resources