Extracting text while keeping the structure

Assume I have the following document:
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
Berlin
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
Tokyo
</book>
</bookstore>
How can I get the following document using XPath?
<bookstore>
<book>
Berlin
</book>
<book>
Tokyo
</book>
</bookstore>
I tried /bookstore/book/text() but that obviously destroys the structure of the document.

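As far as I know, this is not possible with XPath alone.
XPath can select nodes (or compute values) from an existing document, but it cannot build a new document structure; that takes XSLT, XQuery, or code in a host language.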

Related

Having trouble determining if element exists

I have an xml document full of nested item nodes. In most cases, each item has a name element. I want to check if an item has a name element, and return a default name if one doesn't exist.
<item>
<name>Item 1</name>
</item>
<item>
<items>
<item>
<name>Child Item 1</name>
</item>
<item>
<name>Child Item 2</name>
</item>
</items>
</item>
When I ask node.at('name') for the node with no name element, it picks the next one from the children further down the tree. In the case above, if I ask at('name') on the second item, I get "Child Item 1".

The problem is you're using at(), which can accept either a CSS selector or an XPath expression, and tries to guess which you gave it. In this case it thinks that name is a CSS selector, which is a descendant selector, selecting name elements anywhere below the current node.
Instead, you want to use an XPath expression to find only child <name> elements. You can do this either by making it clearly an XPath expression:
node.at('./name')
or you can do it by using the at_xpath method to be clear:
node.at_xpath('name')
Here's a simple working example:
require 'nokogiri'
doc = Nokogiri.XML '<r>
<item id="a">
<name>Item 1</name>
</item>
<item id="b">
<items>
<item id="c">
<name>Child Item 1</name>
</item>
<item id="d">
<name>Child Item 2</name>
</item>
</items>
</item>
</r>'
doc.css('item').each do |item|
name = item.at_xpath('name')
name = name ? name.text : "DEFAULT"
puts "#{item['id']} -- #{name}"
end
#=> a -- Item 1
#=> b -- DEFAULT
#=> c -- Child Item 1
#=> d -- Child Item 2

LIBXML-RUBY > Xpath context

Context: I'm parsing an XML file using the libxml-ruby gem. I need to query the XML document for a set of nodes using the XPath find method. I then need to process each node individually, querying them once again using the XPath find method.
Issue: When I attempt to query the returned nodes individually, the XPath find method is querying the entire document rather than just the node:
Code Example:
require 'xml'
string = %{<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>}
xml = XML::Parser.string(string, :encoding => XML::Encoding::ISO_8859_1).parse
books = xml.find("//book")
books.each do |book|
price = book.find("//price").first.content
puts price
end
This script returns 29.99 twice. I think this must have something to do with setting the XPath context, but I have not figured out how to accomplish that yet.

The first problem I see is book.find("//price").
//price means "start at the top of the document and look downward". That's most certainly NOT what you want to do. Instead, I think you want to look inside book for the first price.
Using Nokogiri, I'd use CSS selectors because they're easier on the eyes and can usually accomplish the same thing:
require 'nokogiri'
string = %{<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>}
xml = Nokogiri::XML(string)
books = xml.search("book")
books.each do |book|
price = book.at("price").content
puts price
end
After running that I get:
29.99
39.95

How to match xPath using RegexBuddy?

The expression /bookstore/book[1]/title should return <title lang="eng">Harry Potter</title> but instead I get "The regular expression does not match..."
Here is my XML that I am testing:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>

RegexBuddy can handle regular expressions for XPath, but it doesn't evaluate XPath paths.
To construct XPath paths and check what they select, you'd need something like XPath Explorer, Firefox with Firebug+Firepath, or similar.

Which version of XPath does Nokogiri support?

I can't find an official statement of the XPath version that Nokogiri supports. Can anyone help me with it? In fact, I want to extract some elements that have an attribute starting with a specified substring. For example, I want to get all book elements that have a category attribute starting with the character C. How can I do this with Nokogiri?
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited by XMLSpy -->
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>

I don't know which specific version of XPath Nokogiri supports, but for what you describe:
"I want to get all book elements that have a category attribute start with the character C."
you can use XPath's starts-with:
doc = Nokogiri::XML(your_xml)
doc.search('//book[starts-with(@category, "C")]').each { |e| puts e['category'] }
# output is:
# COOKING
# CHILDREN
You could also use a CSS3 "begins with" selector:
doc = Nokogiri::XML(your_xml)
doc.search('book[category^=C]').each { |e| puts e['category'] }
# output is:
# COOKING
# CHILDREN

C# What query language to specify in rule to pull out the two book nodes

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
Hi,
Currently, I have a rules based system, where incoming xml messages are matched against a rule, and if the rule hits, the packet is processed. To get a match I use xpath to select individual values in the xml, and I specify these in the rule as combined xpath:regexp expressions, something like this.
/bookstore/book[1]/ title: (.+)
For example the above would match against the "Everyday Italian"
But I'm trying to find a query or perhaps a new query language expression which will allow me to select all the book nodes for the above classic msdn docs book.xml, such that if I specify the query expression in the rule, I can lift it using a parser, and use it directly against the xml file to pull out the books node.
I don't know if XPath can do it. I was looking at XQuery, but it seems wordy. XSLT could probably do it, but it's wordy too. Any ideas?
Is there a simple way of specifying an expression that would lift all book nodes, or say the 1st and 2nd, or the 1st and 3rd?
Thanks.
Bob.

The following question, although framed differently, answers it indirectly:
XPath Node Select

Using XQuery and a FLWOR expression:
nicholas@mordor:~/flwor/position$ basex position.xq
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
nicholas@mordor:~/flwor/position$ cat position.xq
xquery version "3.1";
let $doc := doc("bookstore.xml")
let $books := $doc/bookstore/book
for $book at $p in $books
where $p gt 1 and $p lt 4
return $book
nicholas@mordor:~/flwor/position$ cat bookstore.xml
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
The positional variable $p (bound by "at $p") holds the position of each book node, and the where clause uses it to return only specific nodes.
