How to use multiple conditions in XPath?

I'm new to XPath and am trying to use the XML Task in SSIS to load some values. I'm using Microsoft's sample XML inventory, shown below.
How can I load the first-name value in bookstore/book where style is 'novel' and award is 'Pulitzer'?
//book[@style='novel' and ./author/award/text()='Pulitzer'] is what I am trying. It returns the whole book element. What should I change to get just the first-name value?
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author>
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
<book style="textbook">
<author>
<first-name>Mary</first-name>
<last-name>Bob</last-name>
<publication>Selected Short Stories of
<first-name>Mary</first-name>
<last-name>Bob</last-name>
</publication>
</author>
<editor>
<first-name>Britney</first-name>
<last-name>Bob</last-name>
</editor>
<price>55</price>
</book>
<magazine style="glossy" frequency="monthly">
<price>2.50</price>
<subscription price="24" per="year"/>
</magazine>
<book style="novel" id="myfave">
<author>
<first-name>Toni</first-name>
<last-name>Bob</last-name>
<degree from="Trenton U">B.A.</degree>
<degree from="Harvard">Ph.D.</degree>
<award>Pulitzer</award>
<publication>Still in Trenton</publication>
<publication>Trenton Forever</publication>
</author>
<price intl="Canada" exchange="0.7">6.50</price>
<excerpt>
<p>It was a dark and stormy night.</p>
<p>But then all nights in Trenton seem dark and
stormy to someone who has gone through what
<emph>I</emph> have.</p>
<definition-list>
<term>Trenton</term>
<definition>misery</definition>
</definition-list>
</excerpt>
</book>
<my:book xmlns:my="uri:mynamespace" style="leather" price="29.50">
<my:title>Who's Who in Trenton</my:title>
<my:author>Robert Bob</my:author>
</my:book>
</bookstore>

I got an answer:
//book[@style='novel' and ./author/award/text()='Pulitzer']//first-name

Use:
/*/book[@style='novel']/author[award = 'Pulitzer']/first-name
This selects every first-name element whose parent author has an award child with the string value 'Pulitzer', where that author's parent is a book whose style attribute is "novel", and where that book's parent is the top element of the XML document.
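As a quick check of the expression above, here is a minimal Ruby sketch using REXML. The trimmed XML, the constant name and the helper name are assumptions made for illustration, and the novel's award is assumed to read 'Pulitzer' as in the question.

```ruby
require "rexml/document"

# Trimmed copy of the bookstore sample: only the elements the query touches.
# Assumption: the novel's award is 'Pulitzer', as the question implies.
BOOKSTORE_XML = <<~XML
  <bookstore specialty="novel">
    <book style="autobiography">
      <author>
        <first-name>Joe</first-name>
        <award>Trenton Literary Review Honorable Mention</award>
      </author>
    </book>
    <book style="novel" id="myfave">
      <author>
        <first-name>Toni</first-name>
        <award>Pulitzer</award>
      </author>
    </book>
  </bookstore>
XML

# Hypothetical helper: returns the text of every first-name element
# selected by the two-condition path.
def pulitzer_novelists(xml)
  doc = REXML::Document.new(xml)
  path = "/*/book[@style='novel']/author[award = 'Pulitzer']/first-name"
  REXML::XPath.match(doc, path).map(&:text)
end

puts pulitzer_novelists(BOOKSTORE_XML)
```

Each condition sits in the predicate of the step it constrains, so the final step only has to name first-name.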

A similar question in the same context: how can I do the reverse? Suppose I want to find the id of every book whose price is greater than 20. I know I am being a nudge, but I really want to clear up my understanding.
Here is the needed XPath:
//book/price[text() > 20]/..
This selects the matching book elements themselves. To get only their id attributes, use //book[price > 20]/@id.
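A small Ruby sketch of the price filter, using REXML. The inventory here is hypothetical (in the sample above only one book carries an id), and the helper name and threshold parameter are assumptions. It selects the books with a numeric predicate and then reads each id attribute in Ruby.

```ruby
require "rexml/document"

# Hypothetical inventory: every book has an id here, unlike the sample above.
PRICED_BOOKS_XML = <<~XML
  <bookstore>
    <book id="b1"><price>12</price></book>
    <book id="b2"><price>55</price></book>
    <book id="b3"><price>6.50</price></book>
    <book id="b4"><price>29.50</price></book>
  </bookstore>
XML

# Hypothetical helper: ids of all books whose price exceeds the threshold.
def expensive_book_ids(xml, threshold = 20)
  doc = REXML::Document.new(xml)
  # The predicate compares the numeric value of the price child.
  books = REXML::XPath.match(doc, "//book[price > #{threshold}]")
  books.map { |book| book.attributes["id"] }
end

puts expensive_book_ids(PRICED_BOOKS_XML)
```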

Related

I am using Pig version 0.8. How can I extract specific elements of XML by using XPath()? I tried multiple ways but couldn't get it to work. Please suggest an approach.

<CATALOG>
<BOOK>
<TITLE>Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
</CATALOG>
This is the XML I am using.
I want to extract only the TITLE and COMPANY elements. Is there any way to extract them by using regex or XPath()?
The first thing you need to do is format your XML like so:
<CATALOG>
<BOOK>
<TITLE>Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
</CATALOG>
Then you can extract those elements like so:
/CATALOG/BOOK/*[self::TITLE or self::COMPANY]
XPath name tests are case-sensitive, so they must match the uppercase element names in the XML. You can find more about axes here: http://www.w3schools.com/xsl/xpath_axes.asp
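To see the self axis filter in action outside Pig, here is a minimal Ruby sketch using REXML. The constant and helper names are assumptions; the XML is the catalog from the question. Because XPath name tests are case-sensitive, the sketch uses the uppercase element names.

```ruby
require "rexml/document"

# The catalog from the question.
CATALOG_XML = <<~XML
  <CATALOG>
    <BOOK>
      <TITLE>Hadoop Defnitive Guide</TITLE>
      <AUTHOR>Tom White</AUTHOR>
      <COUNTRY>US</COUNTRY>
      <COMPANY>CLOUDERA</COMPANY>
      <PRICE>24.90</PRICE>
      <YEAR>2012</YEAR>
    </BOOK>
  </CATALOG>
XML

# Hypothetical helper: [name, text] pairs for the children kept by the path.
def select_children(xml, path)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, path).map { |element| [element.name, element.text] }
end

# Uppercase name tests match the document's element names.
p select_children(CATALOG_XML, "/CATALOG/BOOK/*[self::TITLE or self::COMPANY]")
# Lowercase name tests select nothing, because names are case-sensitive.
p select_children(CATALOG_XML, "/CATALOG/BOOK/*[self::title or self::company]")
```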

Xpath - multiple tags with same name starting with a substring

I have this XML file and I need to get the name and the author(s) of each book where the name of at least one author starts with "E".
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE library SYSTEM "knihy.dtd">
<library>
<book isbn="0470114878">
<author hlavni="hlavni">David Hunter</author>
<author>Jeff Rafter</author>
<author>Joe Fawcett</author>
<author>Eric van der Vlist</author>
<author>Danny Ayers</author>
<name>Beginning XML</name>
<publisher>Wrox</publisher>
</book>
<book isbn="0596004206">
<author>Erik Ray</author>
<name>Learning XML</name>
<publisher>O'Reilly</publisher>
</book>
<book isbn="0764547593">
<author>Kay Ethier</author>
<author>Alan Houser</author>
<name>XML Weekend Crash Course</name>
<year>2001</year>
</book>
<book isbn="1590596765">
<author>Sas Jacobs</author>
<name>Beginning XML with DOM and Ajax</name>
<publisher>Apress</publisher>
<year>2006</year>
</book>
</library>
I tried this approach
for $book in /library/book[starts-with(author, "E")]
return $book
but it returns XPathException in invokeTransform: A sequence of more than one item is not allowed as the first argument of starts-with() ("David Hunter", "Jeff Rafter", ...). So how can I check this sequence?
As the error message suggests, use starts-with() in a predicate on the individual author elements instead of passing all author child elements to starts-with() at once:
for $book in /library/book[author[starts-with(., "E")]]
return $book
This returns every book where the name of at least one author starts with "E".
Output:
<book isbn="0470114878">
<author hlavni="hlavni">David Hunter</author>
<author>Jeff Rafter</author>
<author>Joe Fawcett</author>
<author>Eric van der Vlist</author>
<author>Danny Ayers</author>
<name>Beginning XML</name>
<publisher>Wrox</publisher>
</book>
<book isbn="0596004206">
<author>Erik Ray</author>
<name>Learning XML</name>
<publisher>O'Reilly</publisher>
</book>
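The same per-author predicate can be tried in Ruby with REXML. This is only a sketch: the library is trimmed to three books, and the constant name, helper name and prefix parameter are assumptions. It returns the name and authors of each matching book, which is what the question asked for.

```ruby
require "rexml/document"

# Trimmed copy of the library sample.
LIBRARY_XML = <<~XML
  <library>
    <book isbn="0470114878">
      <author>David Hunter</author>
      <author>Eric van der Vlist</author>
      <name>Beginning XML</name>
    </book>
    <book isbn="0596004206">
      <author>Erik Ray</author>
      <name>Learning XML</name>
    </book>
    <book isbn="0764547593">
      <author>Kay Ethier</author>
      <author>Alan Houser</author>
      <name>XML Weekend Crash Course</name>
    </book>
  </library>
XML

# Hypothetical helper. The prefix is interpolated into the path, so it must
# not contain a single quote.
def books_with_author_prefix(xml, prefix)
  doc = REXML::Document.new(xml)
  # starts-with() is applied to one author at a time inside the inner predicate.
  path = "/library/book[author[starts-with(., '#{prefix}')]]"
  REXML::XPath.match(doc, path).map do |book|
    {
      name: book.elements["name"].text,
      authors: book.get_elements("author").map(&:text)
    }
  end
end

books_with_author_prefix(LIBRARY_XML, "E").each do |book|
  puts "#{book[:name]}: #{book[:authors].join(', ')}"
end
```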

Choose First match from an XPath

The XPath (bookstore/book/title|bookstore/book/author) selects both title and author when both exist.
How can I select just the first match of these two, not both, and get that value for every 'book' node in the document?
(bookstore/book/title|bookstore/book/author)[1] limits the result to just the first 'title' in the first book, but I need results from the other book nodes in the document as well.
I'm assuming that by 'first' you mean 'first in document order', not 'first referenced in my XPath expression.'
In XPath 2.0, you can say
bookstore/book/((title|author)[1])
If you only have XPath 1.0, let us know and we can proceed from there. Also let us know something of the broader environment (XSLT? XQuery? Javascript?) because some of this may have to be done outside of XPath.
Update: I just tested this, using Simple Online XPath Tester with XPath 2.0. Given the following XML input:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
and the XPath expression
/bookstore/book/((title|author)[1])
I get the following output, which appears to be what you asked for:
<title lang="en">Everyday Italian</title>
<author>J K. Rowling</author>
<title lang="en">XQuery Kick Start</title>
<title lang="en">Learning XML</title>
For XPath 1.0
If you don't have XPath 2.0, as I suspect you don't, and you still want to do it all in XPath, here's what I would do:
/bookstore/book/title[1] | /bookstore/book[not(title)]/author[1]
What does this do? It gives the (first) title of each book that has a title, as well as the (first) author of each book that doesn't have a title.
This expression is not quite as general as what you asked for: it assumes that <title> comes before <author> when it exists, as it does in your sample data. If your data has author before title, then the above expression will still prefer title despite the order.
If you really need the first of the two regardless of whether it's author or title, try
/bookstore/book/title[1][not(preceding-sibling::author)] |
/bookstore/book/author[1][not(preceding-sibling::title)]
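As noted earlier, part of this can also be done outside of XPath. A minimal Ruby sketch with REXML, assuming a trimmed version of the sample input and hypothetical constant and helper names: XPath selects the book elements, and the host language picks the first title or author child of each book in document order.

```ruby
require "rexml/document"

# Trimmed copy of the sample input.
TITLES_XML = <<~XML
  <bookstore>
    <book category="COOKING">
      <title lang="en">Everyday Italian</title>
      <author>Giada De Laurentiis</author>
    </book>
    <book category="CHILDREN">
      <author>J K. Rowling</author>
    </book>
    <book category="WEB">
      <title lang="en">XQuery Kick Start</title>
      <author>James McGovern</author>
      <author>Per Bothner</author>
    </book>
  </bookstore>
XML

# Hypothetical helper: for each book, the first child that is a title or an
# author, as a [name, text] pair. Books with neither are skipped.
def first_title_or_author(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "/bookstore/book").map do |book|
    first = book.elements.to_a.find { |child| %w[title author].include?(child.name) }
    first && [first.name, first.text]
  end.compact
end

first_title_or_author(TITLES_XML).each { |name, text| puts "#{name}: #{text}" }
```

This does not depend on title coming before author, at the cost of moving the per-book choice out of the XPath expression.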

How to recursively delete empty child elements at a specific xpath location in an XML using Nokogiri?

I have the XML below, in which a few child elements have empty text.
doc = <<'XML'
<Book>
<BookId>BK45647</BookId>
<BookName>The Client by John Grisham</BookName>
<BookAuthenticationCode></BookAuthenticationCode>
<BookCategory>Suspense</BookCategory>
<BookSequence></BookSequence>
<BookPublisherInfo>
<PublisherId>PBBK12345</PublisherId>
<PublisherName>Mc.GrawHill</PublisherName>
<PublisherIndex></PublisherIndex>
<PublisherCategoryQuota></PublisherCategoryQuota>
</BookPublisherInfo>
<BookPurchaselist>
<Customer>
<FirstName>John</FirstName>
<LastName>Smith</LastName>
<MiddleName></MiddleName>
<NickName></NickName>
</Customer>
<Customer>
<FirstName>Winston</FirstName>
<LastName>Churchill</LastName>
<MiddleName></MiddleName>
<NickName></NickName>
</Customer>
</BookPurchaselist>
</Book>
XML
I tried the code below, but it is not working properly.
cust = doc.at_xpath("//Customer")
cust.each do |cust_obj|
if cust_obj.has_text? == false
cust_obj.delete
end
end
It gives the output below:
<Book>
<BookId>BK45647</BookId>
<BookName>The Client by John Grisham</BookName>
<BookAuthenticationCode></BookAuthenticationCode>
<BookCategory>Suspense</BookCategory>
<BookSequence></BookSequence>
<BookPublisherInfo>
<PublisherId>PBBK12345</PublisherId>
<PublisherName>Mc.GrawHill</PublisherName>
<PublisherIndex></PublisherIndex>
<PublisherCategoryQuota></PublisherCategoryQuota>
</BookPublisherInfo>
<BookPurchaselist>
<Customer>
<FirstName>John</FirstName>
<LastName>Smith</LastName>
<MiddleName></MiddleName>
</Customer>
<Customer>
<FirstName>Winston</FirstName>
<LastName>Churchill</LastName>
<NickName></NickName>
</Customer>
</BookPurchaselist>
</Book>
Some of the elements with empty text are getting deleted, while others remain. How can I recursively delete the empty elements at a specific XPath and rewrite the XML?
I got stuck here and need suggestions.
doc.xpath('//Customer/child::*[not(text())]').each do |node|
node.remove
end
Use not(node()) instead if you want to remove only the elements that have no child nodes at all (no text and no child elements).
EDIT: Full working example (using the same code as above)
require 'nokogiri'
xml = <<-XML
<Book>
<BookId>BK45647</BookId>
<BookName>The Client by John Grisham</BookName>
<BookAuthenticationCode></BookAuthenticationCode>
<BookCategory>Suspense</BookCategory>
<BookSequence></BookSequence>
<BookPublisherInfo>
<PublisherId>PBBK12345</PublisherId>
<PublisherName>Mc.GrawHill</PublisherName>
<PublisherIndex></PublisherIndex>
<PublisherCategoryQuota></PublisherCategoryQuota>
</BookPublisherInfo>
<BookPurchaselist>
<Customer>
<FirstName>John</FirstName>
<LastName>Smith</LastName>
<MiddleName></MiddleName>
</Customer>
<Customer>
<FirstName>Winston</FirstName>
<LastName>Churchill</LastName>
<NickName></NickName>
</Customer>
</BookPurchaselist>
</Book>
XML
doc = Nokogiri.parse(xml)
doc.xpath('//Customer/child::*[not(text())]').each do |node|
node.remove
end
puts doc.to_s
The output of this program is:
<?xml version="1.0"?>
<Book>
<BookId>BK45647</BookId>
<BookName>The Client by John Grisham</BookName>
<BookAuthenticationCode/>
<BookCategory>Suspense</BookCategory>
<BookSequence/>
<BookPublisherInfo>
<PublisherId>PBBK12345</PublisherId>
<PublisherName>Mc.GrawHill</PublisherName>
<PublisherIndex/>
<PublisherCategoryQuota/>
</BookPublisherInfo>
<BookPurchaselist>
<Customer>
<FirstName>John</FirstName>
<LastName>Smith</LastName>
</Customer>
<Customer>
<FirstName>Winston</FirstName>
<LastName>Churchill</LastName>
</Customer>
</BookPurchaselist>
</Book>
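For comparison, the same clean-up can be sketched with REXML from Ruby's standard distribution. This is a stand-in for the Nokogiri code rather than a replacement for it; the trimmed XML, the constant name and the helper name are assumptions. The matches are collected into an array first, so removing a node cannot cause a later one to be skipped.

```ruby
require "rexml/document"

# Trimmed copy of the purchase list from the question.
PURCHASES_XML = <<~XML
  <BookPurchaselist>
    <Customer>
      <FirstName>John</FirstName>
      <LastName>Smith</LastName>
      <MiddleName></MiddleName>
      <NickName></NickName>
    </Customer>
    <Customer>
      <FirstName>Winston</FirstName>
      <LastName>Churchill</LastName>
      <MiddleName></MiddleName>
      <NickName></NickName>
    </Customer>
  </BookPurchaselist>
XML

# Hypothetical helper: removes Customer children that have neither text nor
# child elements, and returns the modified document.
def prune_empty_customer_fields(xml)
  doc = REXML::Document.new(xml)
  # XPath.match returns a plain Array, so the removals below do not
  # disturb the collection being iterated.
  children = REXML::XPath.match(doc, "//Customer/*")
  empties = children.reject { |child| child.has_text? || child.has_elements? }
  empties.each(&:remove)
  doc
end

puts prune_empty_customer_fields(PURCHASES_XML)
```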

I want to reprint the modified XML after deleting an entire child node

<product>
<book>
<id>111</id>
<name>xxx</name>
</book>
<pen>
<id>222</id>
<name>yyy</name>
</pen>
<pencil>
<id>333</id>
<name>zzz</name>
</pencil>
</product>
I want to remove the "pencil" node and print the remaining XML using REXML (Ruby). Can anybody tell me how to do that?
Use one of the delete methods: http://rubydoc.info/stdlib/rexml/
require "rexml/document"
string = <<EOF
<product>
<book>
<id>111</id>
<name>xxx</name>
</book>
<pen>
<id>222</id>
<name>yyy</name>
</pen>
<pencil>
<id>333</id>
<name>zzz</name>
</pencil>
</product>
EOF
doc = REXML::Document.new(string)
doc.delete_element('//pencil')
puts doc
There is also a nice tutorial to get you started: http://www.germane-software.com/software/rexml/docs/tutorial.html
