How to match XPath using RegexBuddy? - xpath

The expression /bookstore/book[1]/title should return <title lang="eng">Harry Potter</title> but instead I get "The regular expression does not match..."
Here is my XML that I am testing:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>

RegexBuddy (RB) works with regular expressions; it does not evaluate XPath paths.
To construct XPath paths and check what they select, you'd need something like XPath Explorer, Firefox with Firebug and FirePath, or a similar tool.
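As a quick way to check the expression without a dedicated tool, here is a minimal sketch in Ruby. It assumes REXML (which ships with standard Ruby installs); the helper name first_match is made up for illustration, and the XML declaration from the question is omitted for brevity.

```ruby
require 'rexml/document'

# Evaluate an XPath expression against an XML string and return the
# first matching node, or nil when nothing matches.
# (first_match is a hypothetical helper, not part of any library.)
def first_match(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.first(doc, xpath)
end

xml = <<~XML
  <bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="eng">Learning XML</title>
      <price>39.95</price>
    </book>
  </bookstore>
XML

title = first_match(xml, '/bookstore/book[1]/title')
puts title.text                 # text content of the selected <title>
puts title.attributes['lang']   # value of its lang attribute
```

If the expression selects nothing, first_match returns nil, so a real script would want to check for that before calling text.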

Related

Extracting text while keeping the structure

Assume I have the following document:
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
Berlin
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
Tokyo
</book>
</bookstore>
How can I get the following document using XPath?
<bookstore>
<book>
Berlin
</book>
<book>
Tokyo
</book>
</bookstore>
I tried /bookstore/book/text() but that obviously destroys the structure of the document.
AFAIK this is not possible with XPath alone.
XPath can only select nodes that already exist in the document; it cannot construct a new document structure.
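Since XPath only selects, the restructuring has to happen in the host language. The following is a rough sketch of one way to do that in Ruby, assuming REXML; the method name keep_direct_text is hypothetical. It copies only the text that is a direct child of each book into a freshly built document.

```ruby
require 'rexml/document'

# Build a new <bookstore> document in which every <book> keeps only the
# text that was a direct child of the original <book>; child elements
# such as <title> and <price> are dropped.
# (keep_direct_text is a hypothetical name used for illustration.)
def keep_direct_text(xml)
  source = REXML::Document.new(xml)
  result = REXML::Document.new
  root = result.add_element('bookstore')

  REXML::XPath.each(source, '/bookstore/book') do |book|
    # book.texts returns only the direct text-node children.
    text = book.texts.map(&:value).join.strip
    root.add_element('book').add_text(text)
  end

  result
end

xml = <<~XML
  <bookstore>
    <book>
      <title lang="en">Harry Potter</title>
      <price>29.99</price>
      Berlin
    </book>
    <book>
      <title lang="en">Learning XML</title>
      <price>39.95</price>
      Tokyo
    </book>
  </bookstore>
XML

puts keep_direct_text(xml).to_s
```

An XSLT stylesheet would be another option for this kind of transformation; the sketch above just stays within Ruby.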

LIBXML-RUBY > Xpath context

Context: I'm parsing an XML file using the libxml-ruby gem. I need to query the XML document for a set of nodes using the XPath find method. I then need to process each node individually, querying them once again using the XPath find method.
Issue: When I attempt to query the returned nodes individually, the XPath find method is querying the entire document rather than just the node:
Code Example:
require 'xml'
string = %{<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>}
xml = XML::Parser.string(string, :encoding => XML::Encoding::ISO_8859_1).parse
books = xml.find("//book")
books.each do |book|
price = book.find("//price").first.content
puts price
end
This script returns 29.99 twice. I think this must have something to do with setting the XPath context, but I have not figured out how to accomplish that yet.
The first problem I see is book.find("//price").
//price means "start at the top of the document and look downward". That's almost certainly NOT what you want to do. Instead, I think you want to look inside book for the first price.
Using Nokogiri, I'd use CSS selectors because they're easier on the eyes and can usually accomplish the same thing:
require 'nokogiri'
string = %{<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>}
xml = Nokogiri::XML(string)
books = xml.search("book")
books.each do |book|
price = book.at("price").content
puts price
end
After running that I get:
29.99
39.95
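The underlying fix is to evaluate the inner query relative to each book instead of from the document root. A minimal sketch of that idea follows, assuming REXML rather than libxml-ruby or Nokogiri (it ships with standard Ruby installs); the method name prices_per_book is made up for illustration. With libxml-ruby, the analogous change would presumably be book.find("price") or book.find(".//price").

```ruby
require 'rexml/document'

# Return the price of each <book>, querying every book node on its own.
# (prices_per_book is a hypothetical helper name.)
def prices_per_book(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//book').map do |book|
    # 'price' has no leading slash, so it is resolved relative to this
    # particular book rather than from the top of the document.
    REXML::XPath.first(book, 'price').text
  end
end

xml = <<~XML
  <bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="eng">Learning XML</title>
      <price>39.95</price>
    </book>
  </bookstore>
XML

puts prices_per_book(xml)
```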

Parsing an XML file with Nokogiri to determine the path (Ruby)

My code is supposed to "guess" the path(s) leading to the relevant text nodes in my XML file. Relevant in this case means: text nodes nested within the recurring product/person/something tag, but not text nodes that are used outside of it.
This code:
@doc, items = Nokogiri.XML(@file), []
path = []
@doc.traverse do |node|
if node.class.to_s == "Nokogiri::XML::Element"
is_path_element = false
node.children.each do |child|
is_path_element = true if child.class.to_s == "Nokogiri::XML::Element"
end
path.push(node.name) if is_path_element == true && !path.include?(node.name)
end
end
final_path = "/"+path.reverse.join("/")
works for simple XML files, for example:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
</channel>
</rss>
puts final_path # => "/rss/channel/item"
But when it gets more complicated, how should I then approach the challenge? For example with this one:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
</channel>
</rss>
If you are looking for a list of deepest "parent" paths in the XML, there is more than one way to view that.
Although I think your own code could be adjusted to achieve the same output, I was convinced the same thing could be achieved by using xpath. And my motivation is to get my XML skills unrusty (not used Nokogiri yet, but I will need to do so professionally soon). So here is how to get all parent paths that have just one child level beneath them, using xpath:
xml.xpath('//*[child::* and not(child::*/*)]').each { |node| puts node.path }
The output of this for your second example file is:
/rss/channel/item[1]/titles
/rss/channel/item[1]/brands
/rss/channel/item[2]/titles
/rss/channel/item[2]/brands
If you take this list, gsub out the indexes, and make the array unique, it looks a lot like the output of your loop:
paths = xml.xpath('//*[child::* and not(child::*/*)]').map { |node| node.path }
paths.map! { |path| path.gsub(/\[[0-9]+\]/,'') }.uniq!
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
Or in one line:
paths = xml.xpath('//*[* and not(*/*)]').map { |node| node.path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
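For comparison, the same rule (an element that has child elements, none of which have child elements of their own) can be written as a plain recursive traversal. This is only a sketch under assumptions: it uses REXML instead of Nokogiri, and the method names collect_leaf_parents and leaf_parent_paths are hypothetical.

```ruby
require 'rexml/document'

# Walk the tree and record the path of every element that has child
# elements, none of which contain further child elements.
# (Both method names are hypothetical, chosen for illustration.)
def collect_leaf_parents(element, ancestors, paths)
  path = ancestors + [element.name]
  children = element.elements.to_a

  if children.any? && children.none?(&:has_elements?)
    paths << "/#{path.join('/')}"
  end

  children.each { |child| collect_leaf_parents(child, path, paths) }
  paths
end

# Return the unique, index-free paths for an XML string.
def leaf_parent_paths(xml)
  doc = REXML::Document.new(xml)
  collect_leaf_parents(doc.root, [], []).uniq
end

xml = <<~XML
  <rss version="2.0">
    <channel>
      <title>Some XML file title</title>
      <item>
        <titles><title>Some product title</title></titles>
        <brands><brand>Some product brand</brand></brands>
      </item>
      <item>
        <titles><title>Some product title</title></titles>
        <brands><brand>Some product brand</brand></brands>
      </item>
    </channel>
  </rss>
XML

puts leaf_parent_paths(xml)
```

Because the paths are built from element names only, no positional indexes appear and no gsub step is needed.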
I've created a library to build XPath expressions.
xpath = Jini.new
.add_path('parent')
.add_path('child')
.add_all('toys')
.add_attr('name', 'plane')
.to_s
puts xpath # -> /parent/child//toys[@name="plane"]

Which version of XPath does Nokogiri support?

I can't find an official statement of the XPath version that Nokogiri supports. Can anyone help me with this? What I actually want is to extract elements that have an attribute starting with a specified substring. For example, I want to get all book elements whose category attribute starts with the character C. How can I do this with Nokogiri?
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited by XMLSpy?-->
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
I don't know which specific version of XPath Nokogiri supports, but you can do what you describe:
"I want to get all book elements that have a category attribute that starts with the character C."
using XPath's starts-with:
doc = Nokogiri::XML(your_xml)
doc.search('//book[starts-with(@category, "C")]').each { |e| puts e['category'] }
# output is:
# COOKING
# CHILDREN
You could also use a CSS3 "begins with" selector:
doc = Nokogiri::XML(your_xml)
doc.search('book[category^=C]').each { |e| puts e['category'] }
# output is:
# COOKING
# CHILDREN

C# What query language to specify in rule to pull out the two book nodes

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
Hi,
Currently, I have a rules-based system where incoming XML messages are matched against a rule, and if the rule hits, the packet is processed. To get a match, I use XPath to select individual values in the XML, and I specify these in the rule as combined xpath:regexp expressions, something like this:
/bookstore/book[1]/title: (.+)
For example, the above would match against "Everyday Italian".
But I'm trying to find a query, or perhaps a different query language, that will let me select all the book nodes in the above classic MSDN book.xml sample, so that if I specify the query expression in the rule, I can extract it with a parser and run it directly against the XML file to pull out the book nodes.
I don't know if XPath can do it. I was looking at XQuery, but it seems wordy. XSLT could probably do it, but it's wordy too. Any ideas?
Is there a simple way of specifying an expression that would select all book nodes, or, say, the 1st and 2nd, or the 1st and 3rd?
Thanks.
Bob.
The following question, although framed differently, answers this indirectly:
XPath Node Select
Using XQuery and FLWOR:
nicholas@mordor:~/flwor/position$
nicholas@mordor:~/flwor/position$ basex position.xq
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>nicholas@mordor:~/flwor/position$
nicholas@mordor:~/flwor/position$
nicholas@mordor:~/flwor/position$ cat position.xq
xquery version "3.1";
let $doc := doc("bookstore.xml")
let $books := $doc/bookstore/book
for $book at $p in $books
where $p gt 1 and $p lt 4
return $book
nicholas@mordor:~/flwor/position$
nicholas@mordor:~/flwor/position$ cat bookstore.xml
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>nicholas@mordor:~/flwor/position$
The where clause filters on the position $p of each book node, so only the books at the chosen positions are returned.
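The same positional filter can be mirrored outside XQuery. The sketch below does it in Ruby, assuming REXML; the method name books_between is hypothetical. In plain XPath 1.0 the equivalent selection should be expressible as /bookstore/book[position() > 1 and position() < 4], and the 1st and 3rd books as /bookstore/book[position() = 1 or position() = 3].

```ruby
require 'rexml/document'

# Return the <book> elements whose 1-based position p satisfies
# low < p < high, mirroring "where $p gt 1 and $p lt 4" in the FLWOR
# expression. (books_between is a hypothetical helper name.)
def books_between(xml, low, high)
  doc = REXML::Document.new(xml)
  books = REXML::XPath.match(doc, '/bookstore/book')

  books.each_with_index
       .select { |_book, index| (index + 1) > low && (index + 1) < high }
       .map(&:first)
end

xml = <<~XML
  <bookstore>
    <book category="cooking"><title lang="en">Everyday Italian</title></book>
    <book category="children"><title lang="en">Harry Potter</title></book>
    <book category="web"><title lang="en">XQuery Kick Start</title></book>
    <book category="web"><title lang="en">Learning XML</title></book>
  </bookstore>
XML

books_between(xml, 1, 4).each do |book|
  puts book.elements['title'].text
end
```

The bounds are passed as parameters so that a rule could carry them alongside the path, which is one possible way to fit this into the rules-based system described in the question.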

Resources