LIBXML-RUBY > Xpath context - ruby

Context: I'm parsing an XML file using the libxml-ruby gem. I need to query the XML document for a set of nodes using the XPath find method. I then need to process each node individually, querying them once again using the XPath find method.
Issue: When I attempt to query the returned nodes individually, the XPath find method is querying the entire document rather than just the node:
Code Example:
require 'xml'
string = %{<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>}
xml = XML::Parser.string(string, :encoding => XML::Encoding::ISO_8859_1).parse
books = xml.find("//book")
books.each do |book|
price = book.find("//price").first.content
puts price
end
This script returns 29.99 twice. I think this must have something to with setting the XPath context but I have not figured out how to accomplish that yet.

The first problem I see is book.find("//price").
//price means "start at the top of the document and look downward. That's most certainly NOT what you want to do. Instead I think you want to look inside book for the first price.
Using Nokogiri, I'd use CSS selectors because they're more easy on the eyes and can usually accomplish the same thing:
require 'nokogiri'
string = %{<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>}
xml = Nokogiri::XML(string)
books = xml.search("book")
books.each do |book|
price = book.at("price").content
puts price
end
After running that I get:
29.99
39.95

Related

Extracting text while keeping the structure

Assume I have the following document:
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
Berlin
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
Tokyo
</book>
</bookstore>
How can I get the following document using XPath?
<bookstore>
<book>
Berlin
</book>
<book>
Tokyo
</book>
</bookstore>
I tried /bookstore/book/text() but that obviously destroys the structure of the document.
AFAIK this is not possible with XPath.
With XPath you can select single node, not return complex document structure.

Parsing an XML file with Nokogiri to determine the path (Ruby)

My code is supposed to "guess" the path(s) that lies before the relevant text nodes in my XML file. Relevant in this case means: text nodes nested within the recurring product/person/something tag, but not text nodes that are used outside of it.
This code:
#doc, items = Nokogiri.XML(#file), []
path = []
#doc.traverse do |node|
if node.class.to_s == "Nokogiri::XML::Element"
is_path_element = false
node.children.each do |child|
is_path_element = true if child.class.to_s == "Nokogiri::XML::Element"
end
path.push(node.name) if is_path_element == true && !path.include?(node.name)
end
end
final_path = "/"+path.reverse.join("/")
works for simple XML files, for example:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
</channel>
</rss>
puts final_path # => "/rss/channel/item"
But when it gets more complicated, how should I then approach the challenge? For example with this one:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
</channel>
</rss>
If you are looking for a list of deepest "parent" paths in the XML, there is more than one way to view that.
Although I think your own code could be adjusted to achieve the same output, I was convinced the same thing could be achieved by using xpath. And my motivation is to get my XML skills unrusty (not used Nokogiri yet, but I will need to do so professionally soon). So here is how to get all parent paths that have just one child level beneath them, using xpath:
xml.xpath('//*[child::* and not(child::*/*)]').each { |node| puts node.path }
The output of this for your second example file is:
/rss/channel/item[1]/titles
/rss/channel/item[1]/brands
/rss/channel/item[2]/titles
/rss/channel/item[2]/brands
. . . if you took this list and gsub out the indexes, then make the array unique, then this looks a lot like the output of your loop . . .
paths = xml.xpath('//*[child::* and not(child::*/*)]').map { |node| node.path }
paths.map! { |path| path.gsub(/\[[0-9]+\]/,'') }.uniq!
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
Or in one line:
paths = xml.xpath('//*[* and not(*/*)]').map { |node| node.path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
I'm created a library to build xpath.
xpath = Jini.new
.add_path('parent')
.add_path('child')
.add_all('toys')
.add_attr('name', 'plane')
.to_s
puts xpath // -> /parent/child//toys[#name="plane"]

How to match xPath using RegexBuddy?

The expression /bookstore/book[1]/title should return <title lang="eng">Harry Potter</title> but instead I get "The regular expression does not match..."
Here is my XML that I am testing:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
While RB can handle regular expressions for XPath, but it doesn't handle XPath paths.
For constructing/checking what's selected etc of XPath paths you'd need something like XPath Explorer, Firefox with Firebug+Firepath, or similar.

Which version of xpath that Nokogiri support?

I can't find an official statement of the xpath version that Nokogiri supports. Anyone can help me with it? In fact I want to extract some elements that have an attribute start with specified sub string. For example, I want to get all Book elements that have a category attribute start with the character C. How to do this with nokogiri?
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited by XMLSpy?-->
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
I don't know which specific version of XPath Nokogiri supports. But, you can do this:
I want to get all book elements that have a category attribute start with the character C.
using XPath's starts-with:
doc = Nokogiri::XML(your_xml)
doc.search('//book[starts-with(#category, "C")]').each { |e| puts e['category'] }
# output is:
# COOKING
# CHILDREN
You could also use a CSS3 "begins with" selector:
doc = Nokogiri::XML(your_xml)
doc.search('book[category^=C]').each { |e| puts e['category'] }
# output is:
# COOKING
# CHILDREN

How to add a new node to XML

I have a simple XML file, items.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<items>
<item>
<name>mouse</name>
<manufacturer>Logicteh</manufacturer>
</item>
<item>
<name>keyboard</name>
<manufacturer>Logitech - Inc.</manufacturer>
</item>
<item>
<name>webcam</name>
<manufacturer>Logistech</manufacturer>
</item>
</items>
I am trying to insert a new node with the following code:
require 'rubygems'
require 'nokogiri'
f = File.open('items.xml')
#items = Nokogiri::XML(f)
f.close
price = Nokogiri::XML::Node.new "price", #items
price.content = "10"
#items.xpath('//items/item/manufacturer').each do |node|
node.add_next_sibling(price)
end
file = File.open("items_fixed.xml",'w')
file.puts #items.to_xml
file.close
However this code adds a new node only after the last <manufacturer> node, items_fixed.xml:
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>mouse</name>
<manufacturer>Logitech</manufacturer>
</item>
<item>
<name>keyboard</name>
<manufacturer>Logitech</manufacturer>
</item>
<item>
<name>webcam</name>
<manufacturer>Logitech</manufacturer><price>10</price>
</item>
</items>
Why?
It would be helpful to distinguish between a Node (a particular piece of structured XML data at a particular place in a tree), and a "node template" which is the structure of the data.
Nokogiri (and most other XML libraries) only allow you to specify Nodes, not node templates. So when you created price = Nokogiri::XML::Node.new "price", #items, you had a particular piece of data that belongs in a particular place, but hadn't defined the place yet.
When you added it to the first <item>, you defined its place. When you added it to the second <item>, you uprooted it from its place and put it in a new place. At that point this Node appeared only in the second <item>. This continues when you add the same Node to each item, until you reach the last <item>, which is where the node stays.
Nokogiri doesn't have any way to specify a node template. What you need to do is:
#items.xpath('//items/item/manufacturer').each do |node|
price = Nokogiri::XML::Node.new "price", #items
price.content = "10"
node.add_next_sibling(price)
end
I'd start with this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>mouse</name>
<manufacturer>Logitech</manufacturer>
</item>
<item>
<name>keyboard</name>
<manufacturer>Logitech - Inc.</manufacturer>
</item>
</items>
EOT
doc.search('manufacturer').each { |n| n.after('<price>10</price>') }
Which results in:
puts doc.to_xml
# >> <?xml version="1.0" encoding="UTF-8"?>
# >> <items>
# >> <item>
# >> <name>mouse</name>
# >> <manufacturer>Logitech</manufacturer><price>10</price>
# >> </item>
# >> <item>
# >> <name>keyboard</name>
# >> <manufacturer>Logitech - Inc.</manufacturer><price>10</price>
# >> </item>
# >> </items>
It's easy to build upon this to insert different values for the price.

Resources