XMLPath Query for nested XML fragment - xpath

I'm trying to write a xpath query to pull data from an xml document. Unfortunately the document has a xml fragment embedded in it that seems to have lost its encoding (< has become &lt > has become &gt etc).
An example of the xml doc is:
<OrderData xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Id>1</Id>
<RawData><?xml version="1.0" encoding="UTF-16"?>
<Data xmlns="nnn-mmm-com">
<Order Action="Remove" >
<Instrument InstID="1"></Order><
/Data>
</RawData>
</OrderData>
I'm trying to extract the following values:
Id
Action
InstID
Getting the Id is no problem, but drilling into the fragment inside RawData is proving beyond me. Any pointers gratefully received
(I'm planning to execute the xpath query in Hive using Hive-XML-SerDe which is xpath 1.0)
Thanks

With XPath 3.1 you can parse the embedded XML document and turn it into a node tree, which you can then process using path expressions. So:
/OrderData/RawData/parse-xml(.)/*:Data/*:Instrument/#InstID
should get what you want.
You didn't say what version of XPath your library supports, which usually means that it only supports 1.0, so you may need to find a different library.

Related

Wildcard XPath nested element selection by example

I am being given XML that looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<singleton-list>
<fizzbuzz>
<amountId>123</amountId>
<countryId>456</countryId>
<action>Overwrite</action>
</fizzbuzz>
</singleton-list>
However, the outer-most element (in this case, singleton-list) will be different each time, and will be one of many different values. For instance I might get another XML message that looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<aggregateList>
<fizzbuzz>
<amountId>123</amountId>
<countryId>456</countryId>
<action>Overwrite</action>
</fizzbuzz>
</aggregateList>
I'm trying to write an XPath that selects the fizzbuzz out of each message I receive, regardless of what the outer-most element is (singleton-list, aggregateList, or otherwise).
I think I could do something involving a wildcard, but I'm not sure if asterisks have special meaning in XPath or if I'm going about this the wrong way.
My best attempt at an XPath to do this selection is:
/*/fizzbuzz
Is this correct or is there a better way to do this?
either you can use //fizzbuzz or /*/fizzbuzz.

Nokogiri not parsing XML in ruby - xmlns issue?

Given the following ruby code :
require 'nokogiri'
xml = "<?xml version='1.0' encoding='UTF-8'?>
<ProgramList xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:xsd='http://www.w3.org/2001/XMLSchema' xmlns='http://publisher.webservices.affili.net/'>
<TotalRecords>145</TotalRecords>
<Programs>
<ProgramSummary>
<ProgramID>6540</ProgramID>
<Title>Matalan</Title>
<Limitations>A bit of text
</Limitations>
<URL>http://www.matalan.co.uk</URL>
<ScreenshotURL>http://www.matalan.co.uk/</ScreenshotURL>
<LaunchDate>2009-11-02T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
<ProgramSummary>
<ProgramID>11787</ProgramID>
<Title>Club 18-30</Title>
<Limitations/>
<URL>http://www.club18-30.com/</URL>
<ScreenshotURL>http://www.club18-30.com</ScreenshotURL>
<LaunchDate>2013-05-16T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
</Programs>
</ProgramList>"
doc = Nokogiri::XML(xml)
p doc.xpath("//Programs")
gives :
=> []
Not what is expected.
On further investigation if I remove xmlns='http://publisher.webservices.affili.net/' from the initial <ProgramList> tag I get the expected output.
Indeed if I change xmlns='http://publisher.webservices.affili.net/' to xmlns:anything='http://publisher.webservices.affili.net/' I get the expected output.
So my question is what is going on here? Is this malformed XML? And what is the best strategy for dealing with it?
While it's hardcoded in this example the XML is (will be) coming from a web service.
Update
I realise I can use the remove_namespaces! method but the Nokogiri docs do say that it's "...probably is not a good thing in general" to do this. Also I'm interested in why it's happening and what the 'correct' XML should be.
The xmlns='http://publisher.webservices.affili.net/' indicates the default namespace for all elements under the one where it appears (including the element itself). That means that all elements that don’t otherwise have an explicit namespace fall under this namespace.
XPath queries don’t have default namespaces (at least in XPath 1.0), so any name that appears in one without a prefix refers to that element in no namespace.
In your code, you want to find Program elements in the http://publisher.webservices.affili.net/ namespace (since that is the default namespace), but are looking (in your XPath query) for Program elements in no namespace.
To explicitly specify the namespace in the query, you can do something like this:
doc.xpath("//pub:Programs", "pub" => "http://publisher.webservices.affili.net/")
Nokogiri makes this a little easier for namespaces declared on the root element (as in this case), declaring them for you with the same prefix. It will also declare the default namespace using the xmlns prefix, so you can also do:
doc.xpath("//xmlns:Programs")
which will give you the same result.

XPath expression using idref

I am reading about and testing XQuery and like test tools I use BaseX(www.basex.org) and saxon-HE 9.4.0.6N.
For the following simple XML file - no schema attached to the sample.xml:
<rootab>
<l1>
<items p="a">
<itema x1="10" id="abc">testa</itema>
<itemb x1="10" id="dfe">testb</itemb>
<itemc x1="10" id="jgh">testc</itemc>
</items>
</l1>
<l2>
<items p="b">
<itema x1="10" xidref="abc">testa</itema>
<itemc x1="10" xidref="jgh">testc</itemc>
<itemd x1="10" xidref="abc">testA101</itemd>
<iteme x1="10" xidref="jgh">testB202</iteme>
</items>
</l2>
</rootab>
In Basex_GUI if I enter the following XPath expression: //idref("abc")/..
the result is: <itema x1="10" xidref="abc">testa</itema>
In BaseX_GUI if I add the simple XQuery expression:
for $x in doc('sample.xml')//idref("abc")/..
return <aaa>{$x}</aaa>
the result is:
<aaa>
<itema x1="10" xidref="abc">testa</itema>
</aaa>
<aaa>
<itemd x1="10" xidref="abc">testA101</itemd>
</aaa>
q1) Why the XPath expression returned only one node? I expected two...
In Saxon, by using the below xql file:
<test>
{
doc('sample.xml')//idref("abc")/..
}
</test>
or the XQuery expression , I receive the same result by running the command query sample.xql:
<?xml version="1.0" encoding="UTF-8"?><test/>
q2)what is wrong in my Saxon test ?
thank you in advance for your help!
Basically, idref() is sensitive to DTD validation - it recognizes attributes declared as type IDREF in your DTD.
You haven't shown us your DTD, and more importantly, you haven't shown how the input to the queries is supplied. There are many ways of constructing input in which the "IDREF-ness" of an attribute is lost - for example, going via a DOM. Even when you use the doc() function in Saxon, the way the input tree is built depends on many factors including configuration options and your URIResolver.
I see you are using .NET. When Saxon uses the Microsoft XML parser on .NET, it doesn't know which attributes are IDs and IDREFs, so the id() and idref() functions don't work (the MS parser simply doesn't supply this information). You therefore need to use the JAXP parser (Xerces) that comes with the Saxon product. I think this is the default these days.
So not really an answer, but hopefully some background that explains some of the things that can go wrong.

XPath-REXML-Ruby: Selecting multiple siblings/ancestors/descendants

This is my first post here. I have just started working with Ruby and am using REXML for some XML handling. I present a small sample of my xml file here:
<record>
<header>
<identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
<datestamp>2004-08-13T15:32:50Z</datestamp>
<setSpec>gmd</setSpec>
</header>
<metadata>
<titleInfo>
<title>Meet-konstige vertoning van de grote en merk-waardige zons-verduistering</title>
</titleInfo>
</metadata>
</record>
My objective is to match the last numerical value in the tag with a list of values that I have from an array. I have achieved this with the following code snippet:
ids = XPath.match(xmldoc, "//identifier[text()='oai:lcoa1.loc.gov:loc.gmd/"+mapid+"']")
Having got a particular identifier that I wish to investigate, now I want to go back to and select and then select to get the value in the node for that particular identifier.
I have looked at the XPath tutorials and expressions and many of the related questions on this website as well and learnt about axes and the different concepts such as ancestor/following sibling etc. However, I am really confused and cannot figure this out easily.
I was wondering if I could get any help or if someone could point me towards an online resource "easy" to read.
Thank you.
UPDATE:
I have been trying various combinations of code such as:
idss = XPath.match(xmldoc, "//identifier[text()='oai:lcoa1.loc.gov:loc.gmd/"+mapid+"']/parent::header/following-sibling::metadata/child::mods/child::titleInfo/child::title")
The code compiles but does not output anything. I am wondering what I am doing so wrong.
Here's a way to accomplish it using XPath, then going up to the record, then XPath to get the title:
require 'rexml/document'
include REXML
xml=<<END
<record>
<header>
<identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
<datestamp>2004-08-13T15:32:50Z</datestamp>
<setSpec>gmd</setSpec>
</header>
<metadata>
<titleInfo>
<title>Meet-konstige</title>
</titleInfo>
</metadata>
</record>
END
doc=Document.new(xml)
mapid = "ct000379"
text = "oai:lcoa1.loc.gov:loc.gmd/g3195.#{mapid}"
identifier_nodes = XPath.match(doc, "//identifier[text()='#{text}']")
record_node = identifier_nodes.first.parent.parent
record_node.elements['metadata/titleInfo/title'].text
=> "Meet-konstig"

How to find all nodes of a specific type in XPath

Lets say i have the following form data instance in my view.xml:
<xhtml:html xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xxforms="http://orbeon.org/oxf/xml/xforms"
xmlns:exforms="http://www.exforms.org/exf/1-0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xhtml:head>
<xforms:instance id="instanceData">
<form xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<fruits>
<fruit>
<fruit-name>Mango</fruit-name>
</fruit>
<fruit>
<fruit-name>Apple</fruit-name>
</fruit>
<fruit>
<fruit-name>Banana</fruit-name>
</fruit>
</fruits>
</form>
</xforms:instance>
</xhtml:head>
I want to select all the fruit names from the above instance.
I tried the following ways but it always selects the first fruit.
instance('instanceData')/fruits/fruit[*]/fruit-name
instance('instanceData')/fruits/fruit/fruit-name
instance('instanceData')/fruits/fruit[position()>0]/fruit-name
Please provide a way to overcome this in XPATH
try this
"//fruit-name"
It shall find all fruitnames wherever they are in the document hierarchy.
If you want to select all the <fruit-name> from the instance instanceData (<xforms:instance id="instanceData">) that looks like the one you have in your question, the following should do it:
instance('instanceData')/fruits/fruit/fruit-name
If this doesn't work, one common reason is that you have a default namespace declaration in the document that contains your instance, like: xmlns="http://www.w3.org/1999/xhtml". If you have this, you need to undo that default namespace declaration on where you declare the instance, with:
<xforms:instance xmlns="" id="instanceData">
(And if this is the issue, my advice is not to use default namespace declarations. Ever. Instead declare xmlns:xhtml="http://www.w3.org/1999/xhtml" and use the xhtml prefix everywhere.)
First:
It may be a typo any way to point out you xml has wrong node ending
<service>
Second:
your XPATH is very much valid but when you parse it out you need to iterate over the result set as like its a sequence of node and not a single value.
e.g) in JDOM :< Element.selectObject Vs selectSingleNodes Vs selectAsArray kind.
In your XForms you need to iterate over the resultset to get the list of fruits.
if you want only fruit names then you could try
instance('instanceData')/fruits/fruit/fruit-name/text()

Resources