libxml2 predicates in xpath expression are not always recognized - xpath

I am having trouble with the libxml2 library: it seems to ignore certain predicates in my XPath expressions.
Here is an example of xml file that I am trying to parse:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book title="Harry Potter" lang="eng" version="1">
<price>29.99</price>
</book>
<book title="Learning XML" lang="eng" version="2">
<price>38.95</price>
</book>
<book title="Learning C" lang="eng" version="2">
<price>39.95</price>
</book>
</bookstore>
Suppose I want to extract all the books whose language is English and whose version is the first edition.
If I'm not mistaken, the XPath expression for that is:
//book[@lang='eng' and @version='1']
and I use the following instructions in my code:
xmlChar * xpath_expression = "//book[@lang='eng' and @version='1']";
xmlXPathObjectPtr xpathRes = xmlXPathEvalExpression(xpath_expression, ctxt);
The problem is that the result is the list of all books, as if I had run the following query:
//book
I wonder if my version is buggy, given that I have the latest one available for my Debian squeeze (2.7.8.dfsg-2 + squeeze7)...

This is most certainly not a bug in libxml2. You probably made an error elsewhere. The following code only prints "Harry Potter":
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

int main(void)
{
    static const char xml[] =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
        "<bookstore>\n"
        " <book title=\"Harry Potter\" lang=\"eng\" version=\"1\">\n"
        " <price>29.99</price>\n"
        " </book>\n"
        " <book title=\"Learning XML\" lang=\"eng\" version=\"2\">\n"
        " <price>38.95</price>\n"
        " </book>\n"
        " <book title=\"Learning C\" lang=\"eng\" version=\"2\">\n"
        " <price>39.95</price>\n"
        " </book>\n"
        "</bookstore>\n";

    /* sizeof(xml) - 1 excludes the string's terminating NUL byte. */
    xmlDocPtr doc = xmlParseMemory(xml, sizeof(xml) - 1);
    xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
    xmlChar *expression = BAD_CAST "//book[@lang='eng' and @version='1']";
    xmlXPathObjectPtr res = xmlXPathEvalExpression(expression, ctxt);
    xmlNodeSetPtr nodeset = res->nodesetval;

    /* Print the title attribute of every node the expression matched. */
    for (int i = 0; i < nodeset->nodeNr; i++) {
        xmlNodePtr node = nodeset->nodeTab[i];
        xmlChar *title = xmlGetProp(node, BAD_CAST "title");
        printf("%s\n", title);
        xmlFree(title); /* xmlGetProp returns a copy the caller must free */
    }

    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctxt);
    xmlFreeDoc(doc);
    return 0;
}
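One way to rule out the expression itself is to evaluate it with a second XPath engine. The following is only a sketch in Ruby, using REXML from Ruby's standard distribution rather than libxml2; the constant BOOKSTORE_XML and the helper matching_titles are names invented for this illustration, and the XML declaration is left out for brevity.

```ruby
require "rexml/document"

# Same bookstore data as in the question (XML declaration omitted).
BOOKSTORE_XML = <<~XML
  <bookstore>
    <book title="Harry Potter" lang="eng" version="1"><price>29.99</price></book>
    <book title="Learning XML" lang="eng" version="2"><price>38.95</price></book>
    <book title="Learning C" lang="eng" version="2"><price>39.95</price></book>
  </bookstore>
XML

# Hypothetical helper: titles of all nodes matched by the XPath expression.
def matching_titles(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).map { |book| book.attributes["title"] }
end

puts matching_titles(BOOKSTORE_XML, "//book[@lang='eng' and @version='1']")
puts matching_titles(BOOKSTORE_XML, "//book").length
```

If the predicate version returns fewer titles than the plain //book query here as well, the expression is fine and the problem is more likely in the surrounding C code.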

Related

Ruby nokogiri attribute selector in XML file

this is the xml file:
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/">
<SOAP-ENV:Body>
<ns1:putResponse
xmlns:ns1="urn:DmsManagerClient">
<result xsi:type="xsd:string">
<?xml version="1.0" encoding="ISO-8859-1"?>
<MESSAGE ID="11c73b9e-687c-4300-baba-b743c26f7c83" TYPE="CUSDMS">
<DELIVERY>
<FROM>
<SENDER>0072000</SENDER>
<SERVICE>eService</SERVICE>
<DATE>2019-03-08T12:27:25</DATE>
</FROM>
<TO>
<DEALER DEALERCODE="0072000" MARKETCODE="1000"/>
</TO>
</DELIVERY>
<CONTENT>
<dms:ComplexResponse ErrorCode="430" ErrorDescription="null : PrivacyUE Mancante" Return="false"
xmlns:dms="http://dmsmanagerservice">
<dms:Element Name="DMSVERSION">2.7</dms:Element>
</dms:ComplexResponse>
</CONTENT>
</MESSAGE>
</result>
</ns1:putResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
I am coding in Ruby, and I used Nokogiri's xpath method to extract the "CONTENT" element of the file.
This is the code:
def extrapolate_error(xml)
doc = Nokogiri::XML(File.open(xml))
doc.xpath('//CONTENT')
end
and this is the result:
[#<Nokogiri::XML::Element:0x1c5ba78 name="CONTENT" children=[
#<Nokogiri::XML::Text:0x1c5b940 "\n">,
#<Nokogiri::XML::Element:0x1c5b8bc name="ComplexResponse" namespace=#<Nokogiri::XML::Namespace:0x1c5b88c prefix="dms" href="http://dmsmanagerservice">
attributes=[
#<Nokogiri::XML::Attr:0x1c5b874 name="ErrorCode" value="430">,
#<Nokogiri::XML::Attr:0x1c5b868 name="ErrorDescription" value="null : PrivacyUE Mancante">,
#<Nokogiri::XML::Attr:0x1c5b85c name="Return" value="false">]
children=[#<Nokogiri::XML::Text:0x1c5b118 "\n">,
#<Nokogiri::XML::Element:0x1c5b094 name="Element" namespace=#<Nokogiri::XML::Namespace:0x1c5b88c prefix="dms" href="http://dmsmanagerservice">
attributes=[#<Nokogiri::XML::Attr:0x1c5b058 name="Name" value="DMSVERSION">]
children=[#<Nokogiri::XML::Text:0x1c5abe4 "2.7">]>,
#<Nokogiri::XML::Text:0x1c5aaac "\n">]>,
#<Nokogiri::XML::Text:0x1c5a974 "\n">]>]
Now I need to drill into it and select some attributes.
Specifically, I need these:
name="ErrorCode" value="430"
name="ErrorDescription" value="null : PrivacyUE Mancante"
I do not know how to proceed. Can you help me?
The following should work for you, assuming the dms namespace is always the same:
doc.xpath('//CONTENT/dms:ComplexResponse', dms: 'http://dmsmanagerservice')
  .xpath('@ErrorCode | @ErrorDescription')
  .each_with_object({}) do |e, obj|
    obj[e.name] = e.text
  end
#=> {"ErrorCode"=>"430", "ErrorDescription"=>"null : PrivacyUE Mancante"}
You already understand how you got to //CONTENT, so from there we use dms:ComplexResponse to navigate deeper. Since this element is namespaced, we have to provide the namespace reference, e.g. dms: 'http://dmsmanagerservice'.
Then we select the attributes we are interested in: @ErrorCode and @ErrorDescription.
In XPath the pipe | means union: the result contains every node matched by either operand, so both attributes are selected.
Then we just build a Hash using the attribute name as the key and its text as the value.
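For comparison, here is a rough sketch of the same idea with REXML (part of Ruby's standard distribution) instead of Nokogiri. It passes a prefix-to-URI mapping for the namespaced step, then reads the two attributes straight off the matched element rather than using an XPath union. RESPONSE_XML is a trimmed copy of the CONTENT part of the response, and error_details and DMS_NS are names made up for this example.

```ruby
require "rexml/document"

# Trimmed copy of the relevant part of the response from the question.
RESPONSE_XML = <<~XML
  <MESSAGE>
    <CONTENT>
      <dms:ComplexResponse ErrorCode="430" ErrorDescription="null : PrivacyUE Mancante" Return="false"
                           xmlns:dms="http://dmsmanagerservice">
        <dms:Element Name="DMSVERSION">2.7</dms:Element>
      </dms:ComplexResponse>
    </CONTENT>
  </MESSAGE>
XML

# Prefix-to-URI mapping used to resolve "dms:" inside the XPath expression.
DMS_NS = { "dms" => "http://dmsmanagerservice" }

# Hypothetical helper: Hash of the error attributes that are present.
def error_details(xml)
  doc = REXML::Document.new(xml)
  response = REXML::XPath.first(doc, "//CONTENT/dms:ComplexResponse", DMS_NS)
  return {} if response.nil?

  %w[ErrorCode ErrorDescription].each_with_object({}) do |name, details|
    value = response.attributes[name]
    details[name] = value unless value.nil?
  end
end

p error_details(RESPONSE_XML)
```

The prefix used in the expression only has to match the key in the mapping; it is the namespace URI that must agree with the document.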
XPath Cheatsheet - Useful resource if you need additional reference
Update
You asked about conditionals so this is what I would propose
ndoc = Nokogiri::XML(doc)
namespaces = ndoc.collect_namespaces
response = ndoc.xpath("//CONTENT/dms:ComplexResponse", namespaces)
if response.xpath("self::node()[@ErrorCode != '' and @ErrorDescription != '']").any?
  response.xpath("@ErrorCode | @ErrorDescription")
    .each_with_object({}) do |e, obj|
      obj[e.name] = e.text
    end
else
  response.xpath('dms:Element/@Name | dms:Element/text()', namespaces)
    .each_slice(2)
    .map { |s| s.map(&:text) }.to_h
end
This checks whether both ErrorCode and ErrorDescription are present and non-empty; if so, it builds the Hash as originally proposed. If not, it returns all the dms:Element entries as a Hash, so {"DMSVERSION"=>"2.7"} in this case.

How to add the values for respective elements?

I have 3 XML structures as below:
a.xml
<Books>
<Book>
<Publisher>ABC Pvt Ltd</Publisher>
<Month>May</Month>
<Year>2016</Year>
<BooksReleased>4</BooksReleased>
</Book>
</Books>
b.xml
<Books>
<Book>
<Publisher>XYZ Pvt Ltd</Publisher>
<Month>April</Month>
<Year>2016</Year>
<BooksReleased>2</BooksReleased>
</Book>
</Books>
c.xml
<Books>
<Book>
<Publisher>ABC Pvt Ltd</Publisher>
<Month>June</Month>
<Year>2016</Year>
<BooksReleased>2</BooksReleased>
</Book>
</Books>
I would like to group these XML documents by publisher and calculate the total number of BooksReleased per publisher for a particular year.
required output format:
<TotalCalc>
<PublishedBook>
<Publisher>ABC Pvt Ltd</Publisher>
<no.of books>6</no.of books>
</PublishedBook>
<PublishedBook>
<Publisher>XYZ Pvt Ltd</Publisher>
<no.of books>2</no.of books>
</PublishedBook>
</TotalCalc>
Kindly help me. I tried the following, but it's not working:
typeswitch($Publisher)
case element (ABC Pvt Ltd)
return sum($doc/BooksReleases[$doc/$Publisher = 'ABC Pvt Ltd'])
default return 'unknnown'
It might be possible to use cts:value-tuples to pull up co-occurrences of Publisher and 'BooksReleased', which you can then iterate to aggregate by Publisher. That would scale much better. Something like:
let $aggregates := map:map()
let $_ :=
  for $tuple in cts:value-tuples((
    cts:element-reference(xs:QName("Publisher")),
    cts:element-reference(xs:QName("BooksReleased"))
  ))
  let $values := json:array-values($tuple)
  let $pub := $values[1]
  let $books as xs:int := $values[2]
  return map:put($aggregates, $pub, (map:get($aggregates, $pub), 0)[1] + $books)
return $aggregates
Note, though, that this requires indexes on Publisher and BooksReleased, and it is important that each document contains only one Publisher value to prevent cross-products.
I would also consider simply dropping (or ignoring) BooksReleased, and just making sure you save each book as a separate document. You can then use cts:values on Publisher and use cts:frequency on each publisher value to get the number of books for the publishers.
HTH!
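To make the aggregation step concrete outside MarkLogic, here is a small Ruby sketch that does the same per-publisher summing over the three sample documents. It only illustrates the grouping logic; it does not use cts:value-tuples or any index, and the names PUBLISHER_DOCS and books_released_by_publisher are invented for this example.

```ruby
require "rexml/document"

# The three sample documents from the question (a.xml, b.xml, c.xml).
PUBLISHER_DOCS = [
  "<Books><Book><Publisher>ABC Pvt Ltd</Publisher><Month>May</Month>" \
  "<Year>2016</Year><BooksReleased>4</BooksReleased></Book></Books>",
  "<Books><Book><Publisher>XYZ Pvt Ltd</Publisher><Month>April</Month>" \
  "<Year>2016</Year><BooksReleased>2</BooksReleased></Book></Books>",
  "<Books><Book><Publisher>ABC Pvt Ltd</Publisher><Month>June</Month>" \
  "<Year>2016</Year><BooksReleased>2</BooksReleased></Book></Books>"
]

# Hypothetical helper: sum BooksReleased per Publisher across all documents.
def books_released_by_publisher(xml_docs)
  totals = Hash.new(0) # missing publishers start at 0, like (map:get(...), 0)[1]
  xml_docs.each do |xml|
    REXML::Document.new(xml).elements.each("Books/Book") do |book|
      publisher = book.elements["Publisher"].text
      totals[publisher] += book.elements["BooksReleased"].text.to_i
    end
  end
  totals
end

p books_released_by_publisher(PUBLISHER_DOCS)
```

Restricting the totals to a particular year would only need an extra check on the Year element before adding to the total.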

Working with XML body from Soap call with Nokogiri in Ruby

I'm writing a Ruby script that makes a SOAP POST call (generated from Postman) and then uses Nokogiri to parse the XML response. When I take the full SOAP response from Postman, copy it into my editor, manually extract the XML body, and decode and format it online, the following Nokogiri script works:
doc = Nokogiri::XML(File.open("response.xml"))
property_ids = []
doc.css('Property').each do |property|
puts "Property ID: #{property['PropertyId']}"
property_ids << property['PropertyId']
end
property_ids.each_with_index do |property_id, index|
puts "index: #{index}"
puts "property id: #{property_id}"
end
Where I run into the problem is when I want to include in the script the Ruby snippet of the Postman call:
require 'nokogiri'
require 'uri'
require 'net/http'
require 'openssl'
url = URI("https://esite.thelyndco.com/AmsiWeb/eDexWeb/esite/leasing.asmx")
http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
request = Net::HTTP::Post.new(url)
request["content-type"] = 'application/soap+xml'
request["cache-control"] = 'no-cache'
request["postman-token"] = '916e3f3d-11ca-e8cf-2066-542b009a281d'
request.body = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<soap12:Envelope xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:soap12=\"http://www.w3.org/2003/05/soap-envelope\">\r\n <soap12:Body>\r\n <GetPropertyList xmlns=\"http://tempuri.org/\">\r\n <UserID>updater</UserID>\r\n <Password>[password]</Password>\r\n <PortfolioName>[portfolio name]</PortfolioName>\r\n <XMLData> \r\n</XMLData>\r\n </GetPropertyList>\r\n </soap12:Body>\r\n</soap12:Envelope>"
response = http.request(request)
doc = Nokogiri::XML(response.body)
# doc = Nokogiri::XML(File.open("full-response.xml"))
# doc.at('GetPropertyListResponse').text
What I want is to take the full SOAP response, envelope included, and process it in my script without cutting and pasting or manually decoding and formatting it with online XML formatters.
Commented out are a couple of lines that I tried from Stack Overflow. Is it possible to decode and format the XML body with Nokogiri or to parse out the SOAP envelope?
edit:
By decoding the XML I mean taking the entity-encoded result (in the raw response its angle brackets arrive escaped as &lt; and &gt;):
<GetPropertyListResult><Properties><Property PropertyId="11A" PropertyName1="1111 Austin Hwy" PropertyName2="" PropertyAddrLine1="The 1111" PropertyAddrLine2="1111 Austin Highway" PropertyAddrLine3="" PropertyAddrLine4="" PropertyAddrCity="San Antonio" PropertyAddrState="TX" PropertyAddrZipCode="78209" PropertyAddrCountry="" PropertyAddrEmail="" RemitToAddrLine1="The 1111" RemitToAddrLine2="1111 Austin Highway" RemitToAddrLine3="" RemitToAddrLine4="" RemitToAddrCity="San Antonio" RemitToAddrState="TX" RemitToAddrZipCode="78209" RemitToAddrCountry="" LiveDate="2013-12-04T00:00:00" MgrOffPhoneNo="210-804-1100" MgrFaxNo="" MgrSalutation="" MgrFirstName="" MgrMiName="" MgrLastName="" MonthEndInProcess="N"><Amenity PropertyId="11A"
and decoding it, with an online XML decoder, into:
<GetPropertyListResult><Properties><Property PropertyId="11A" PropertyName1="1111 Austin Hwy" PropertyName2="" PropertyAddrLine1="The 1111" PropertyAddrLine2="1111 Austin Highway" PropertyAddrLine3="" PropertyAddrLine4="" PropertyAddrCity="San Antonio" PropertyAddrState="TX" PropertyAddrZipCode="78209" PropertyAddrCountry="" PropertyAddrEmail="" RemitToAddrLine1="The 1111" RemitToAddrLine2="1111 Austin Highway" RemitToAddrLine3="" RemitToAddrLine4="" RemitToAddrCity="San Antonio" RemitToAddrState="TX" RemitToAddrZipCode="78209" RemitToAddrCountry="" LiveDate="2013-12-04T00:00:00" MgrOffPhoneNo="210-804-1100" MgrFaxNo="" MgrSalutation="" MgrFirstName="" MgrMiName="" MgrLastName="" MonthEndInProcess="N"><Amenity PropertyId="11A"
then running it through an XML formatter so that nested elements are indented for legibility.
You can use this code to decode and format the XML:
require "nokogiri"
XML_CHAR_ENTITIES = {
"lt" => "<",
"gt" => ">",
"amp" => "&",
"num" => "#",
"comma" => ","
}
xsl =<<XSL
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select="."/>
</xsl:template>
</xsl:stylesheet>
XSL
xml = '<GetPropertyListResult><Properties><Property PropertyId="11A" PropertyName1="1111 Austin Hwy" PropertyName2="" PropertyAddrLine1="The 1111" PropertyAddrLine2="1111 Austin Highway" PropertyAddrLine3="" PropertyAddrLine4="" PropertyAddrCity="San Antonio" PropertyAddrState="TX" PropertyAddrZipCode="78209" PropertyAddrCountry="" PropertyAddrEmail="" RemitToAddrLine1="The 1111" RemitToAddrLine2="1111 Austin Highway" RemitToAddrLine3="" RemitToAddrLine4="" RemitToAddrCity="San Antonio" RemitToAddrState="TX" RemitToAddrZipCode="78209" RemitToAddrCountry="" LiveDate="2013-12-04T00:00:00" MgrOffPhoneNo="210-804-1100" MgrFaxNo="" MgrSalutation="" MgrFirstName="" MgrMiName="" MgrLastName="" MonthEndInProcess="N"><Amenity PropertyId="11A"></GetPropertyListResult>'
xml = xml.gsub(/&(\w+);/) do |match|
char_entity = XML_CHAR_ENTITIES[$1]
char_entity ? char_entity : match
end
doc = Nokogiri::XML(xml)
xslt = Nokogiri::XSLT(xsl)
xml = xslt.transform(doc)
puts "#{xml}"
The XML provided was incomplete, so this terminating string was appended to allow it to be parsed: ></GetPropertyListResult>
The XML_CHAR_ENTITIES provides a hash of encoded strings to decoded strings, and can be easily extended to include other XML character entities, such as those documented at the W3 Character Entity Reference Chart.
XSL is an embedded stylesheet that is used to format the XML for output with Nokogiri.
Decoding the XML character entities is done with the String#gsub call using the block option. The XML is then successfully parsed by Nokogiri. Once the XML is parsed, it is formatted using Nokogiri XSLT transformation.
The output of this code is:
<?xml version="1.0" encoding="UTF-8"?>
<GetPropertyListResult>
<Properties>
<Property PropertyId="11A" PropertyName1="1111 Austin Hwy" PropertyName2="" PropertyAddrLine1="The 1111" PropertyAddrLine2="1111 Austin Highway" PropertyAddrLine3="" PropertyAddrLine4="" PropertyAddrCity="San Antonio" PropertyAddrState="TX" PropertyAddrZipCode="78209" PropertyAddrCountry="" PropertyAddrEmail="" RemitToAddrLine1="The 1111" RemitToAddrLine2="1111 Austin Highway" RemitToAddrLine3="" RemitToAddrLine4="" RemitToAddrCity="San Antonio" RemitToAddrState="TX" RemitToAddrZipCode="78209" RemitToAddrCountry="" LiveDate="2013-12-04T00:00:00" MgrOffPhoneNo="210-804-1100" MgrFaxNo="" MgrSalutation="" MgrFirstName="" MgrMiName="" MgrLastName="" MonthEndInProcess="N">
<Amenity PropertyId="11A"/>
</Property>
</Properties>
</GetPropertyListResult>
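An alternative worth considering, sketched here under assumptions: if the full SOAP envelope is parsed as XML first, the parser already decodes &lt; and &gt; in the text of the result element, so that text can be parsed as a second document without a manual entity table. The sketch uses REXML from Ruby's standard distribution rather than Nokogiri. SOAP_RESPONSE is a hypothetical, heavily shortened response; the http://tempuri.org/ namespace on the response element is assumed from the request, and embedded_document and pretty_xml are names invented for this example.

```ruby
require "rexml/document"
require "rexml/formatters/pretty"

# Hypothetical, shortened response: the inner XML is entity-encoded text.
SOAP_RESPONSE = <<~XML
  <soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
    <soap:Body>
      <GetPropertyListResponse xmlns="http://tempuri.org/">
        <GetPropertyListResult>&lt;Properties&gt;&lt;Property PropertyId="11A" PropertyAddrCity="San Antonio"&gt;&lt;Amenity PropertyId="11A"/&gt;&lt;/Property&gt;&lt;/Properties&gt;</GetPropertyListResult>
      </GetPropertyListResponse>
    </soap:Body>
  </soap:Envelope>
XML

# Assumed namespace of the response elements (taken from the request body).
TEMPURI_NS = { "t" => "http://tempuri.org/" }

# Parse the envelope, then parse the decoded text of the result element.
def embedded_document(soap_xml)
  envelope = REXML::Document.new(soap_xml)
  result = REXML::XPath.first(envelope, "//t:GetPropertyListResult", TEMPURI_NS)
  raise ArgumentError, "no GetPropertyListResult element found" if result.nil?

  # Element#text returns the text with entities such as &lt; already decoded.
  REXML::Document.new(result.text)
end

# Indent nested elements by two spaces for legibility.
def pretty_xml(document)
  output = String.new
  REXML::Formatters::Pretty.new(2).write(document.root, output)
  output
end

inner = embedded_document(SOAP_RESPONSE)
puts pretty_xml(inner)
REXML::XPath.each(inner, "//Property") { |property| puts property.attributes["PropertyId"] }
```

With a real response, the string passed to embedded_document would be response.body from the Net::HTTP call.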

trying to get content inside cdata tags in xml file using nokogiri

I have seen several things on this, but nothing has seemed to work so far. I am parsing an xml via a url using nokogiri on rails 3 ruby 1.9.2.
A snippet of the xml looks like this:
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
I am trying to parse this out to get the text associated with NewsLineText:
r = node.at_xpath('.//newslinetext') if node.at_xpath('.//newslinetext')
s = node.at_xpath('.//newslinetext').text if node.at_xpath('.//newslinetext')
t = node.at_xpath('.//newslinetext').content if node.at_xpath('.//newslinetext')
puts r
puts s.blank? ? 'NOTHING' : s
puts t.blank? ? 'NOTHING' : t
What I get in return is
<newslinetext></newslinetext>
NOTHING
NOTHING
So I know my tags are named/spelled correctly to get at the newslinetext data, but the cdata text never shows up.
What do I need to do with nokogiri to get this text?
You're trying to parse XML using Nokogiri's HTML parser. If node were from the XML parser, then r would be nil, since XML is case sensitive; your r is not nil, so you're using the HTML parser, which is case insensitive.
Use Nokogiri's XML parser and you will get things like this:
>> r = doc.at_xpath('.//NewsLineText')
=> #<Nokogiri::XML::Element:0x8066ad34 name="NewsLineText" children=[#<Nokogiri::XML::Text:0x8066aac8 "\n ">, #<Nokogiri::XML::CDATA:0x8066a9c4 "\n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n ">, #<Nokogiri::XML::Text:0x8066a8d4 "\n">]>
>> r.text
=> "\n \n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n \n"
and you'll be able to get at the CDATA through r.text or r.children.
Ah, I see. What @mu said is correct. But to get at the cdata directly, maybe:
xml =<<EOF
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
EOF
node = Nokogiri::XML xml
cdata = node.search('NewsLineText').children.find{|e| e.cdata?}
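Both points (case-sensitive element names and picking out the CDATA node) can be checked without Nokogiri. The sketch below uses REXML from Ruby's standard distribution as a stand-in; NEWS_XML repeats the snippet from the question, and cdata_text is a helper name invented for this example.

```ruby
require "rexml/document"

# The snippet from the question.
NEWS_XML = <<~XML
  <NewsLineText>
  <![CDATA[
  Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
  ]]>
  </NewsLineText>
XML

# Hypothetical helper: stripped content of the first CDATA child, or nil.
def cdata_text(xml, element_name)
  doc = REXML::Document.new(xml)
  element = REXML::XPath.first(doc, "//#{element_name}")
  return nil if element.nil?

  cdata = element.children.find { |child| child.is_a?(REXML::CData) }
  cdata && cdata.value.strip
end

puts cdata_text(NEWS_XML, "NewsLineText")
p cdata_text(NEWS_XML, "newslinetext") # XML names are case sensitive
```

An XML parser finds nothing for the lowercase name, which is the behavior the first answer uses to tell the two parsers apart.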

Distinct Result via xQuery

I'm trying to get reviewers who review one or more books published after 2010.
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return {$r/Reviewer}
The following are both XML files.
review.xml:
<Reviews>
<Review>
<ReviewID>R1</ReviewID>
<BookTitle>B1</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
<Review>
<ReviewID>R2</ReviewID>
<BookTitle>B1</BookTitle>
<Reviewer>BBB</Reviewer>
</Review>
<Review>
<ReviewID>R3</ReviewID>
<BookTitle>B2</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
<Review>
<ReviewID>R4</ReviewID>
<BookTitle>B3</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
</Reviews>
book.xml:
<Books>
<Book>
<Title>B1</Title>
<Year>2005</Year>
</Book>
<Book>
<Title>B2</Title>
<Year>2011</Year>
</Book>
<Book>
<Title>B3</Title>
<Year>2012</Year>
</Book>
</Books>
I get two AAA results from my XQuery code. I was wondering if I can get a distinct result, meaning only one AAA. I've tried distinct-values() but don't know how to use it properly. Thanks for your reply!
----My Updated Solution with XML format for xQuery 1.0----
<root>
{
for $x in distinct-values
(
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return {$r/Reviewer}
)
return <reviewer>{$x}</reviewer>
}
</root>
To preserve nodes, you can use the "group by" clause and select the first item of a group sequence:
for $r in doc("review.xml")//Review,
$b in doc("book.xml")//Book
let $n := $r/Reviewer
where $b/Title = $r/BookTitle
and $b/Year > 2010
group by $n
return $r[1]/Reviewer
The following query will give you all distinct reviewer names (note that the values are atomized, which means the element nodes are removed):
distinct-values(
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return $r/Reviewer
)
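For readers who want to see the join and the de-duplication as separate steps, here is a rough Ruby equivalent of the distinct-values() query, using REXML from Ruby's standard distribution. It is only an illustration of the logic, not XQuery; REVIEW_XML and BOOK_XML are condensed copies of the sample files, and reviewers_of_books_after is a name invented for this example.

```ruby
require "rexml/document"

# Condensed copies of review.xml and book.xml from the question.
REVIEW_XML = <<~XML
  <Reviews>
    <Review><ReviewID>R1</ReviewID><BookTitle>B1</BookTitle><Reviewer>AAA</Reviewer></Review>
    <Review><ReviewID>R2</ReviewID><BookTitle>B1</BookTitle><Reviewer>BBB</Reviewer></Review>
    <Review><ReviewID>R3</ReviewID><BookTitle>B2</BookTitle><Reviewer>AAA</Reviewer></Review>
    <Review><ReviewID>R4</ReviewID><BookTitle>B3</BookTitle><Reviewer>AAA</Reviewer></Review>
  </Reviews>
XML

BOOK_XML = <<~XML
  <Books>
    <Book><Title>B1</Title><Year>2005</Year></Book>
    <Book><Title>B2</Title><Year>2011</Year></Book>
    <Book><Title>B3</Title><Year>2012</Year></Book>
  </Books>
XML

# Hypothetical helper: distinct reviewers of books published after `year`.
def reviewers_of_books_after(review_xml, book_xml, year)
  # Step 1: titles of the books that satisfy the Year condition.
  recent_titles = []
  REXML::Document.new(book_xml).elements.each("Books/Book") do |book|
    recent_titles << book.elements["Title"].text if book.elements["Year"].text.to_i > year
  end

  # Step 2: join reviews to those titles and collect the reviewer names.
  names = []
  REXML::Document.new(review_xml).elements.each("Reviews/Review") do |review|
    names << review.elements["Reviewer"].text if recent_titles.include?(review.elements["BookTitle"].text)
  end

  # Step 3: drop duplicates, as distinct-values() does on the atomized values.
  names.uniq
end

p reviewers_of_books_after(REVIEW_XML, BOOK_XML, 2010)
```

Without the final uniq, the reviewer AAA would appear once per matching review, which is the duplicate result described in the question.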
