Is my understanding of XPath axes correct? - xpath

I have made an infographic depicting the various axes in XPath. However, I am not sure whether it is correct.
I get confused by following, following-sibling, preceding and preceding-sibling.
Is my diagram correct?
The original image is here: http://imgur.com/4ekJxca
(Taken from Pro XML Development with Java)
Here is my understanding of the nodes I get confused in:
descendant:: selects the nodes (element and text only) which are children and grandchildren of the context node.
following:: selects any node (text only) which was not selected by descendant.
following-sibling:: all the 'brothers' of the context node. That is, text and element nodes which are children of the same parent as the context node, after the context node.
preceding-sibling:: all the 'brothers' of the context node. That is, text and element nodes which are children of the same parent as the context node, before the context node.
preceding:: all the nodes (text only) that do not appear along the ancestor:: axis and are not nested in any element node. (I am sure I screwed this up)
XML
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns:journal="http://www.apress.com/catalog/journal" >
<journal:journal title="XML" publisher="IBM developerWorks">
<article journal:level="Intermediate"
date="February-2003">
<title>Design XML Schemas Using UML</title>
<author>Ayesha Malik</author>
</article>
</journal:journal>
<journal title="Java Technology" publisher="IBM developerWorks">
<article level="Advanced" date="January-2004">
<title>Design service-oriented architecture
frameworks with J2EE technology</title>
<author>Naveen Balani</author>
</article>
<article level="Advanced" date="October-2003">
<title>Advance DAO Programming</title>
<author>Sean Sullivan </author>
</article>
</journal>
</catalog>

The best way to gain accurate intuition about preceding and following axes is to imagine XML as a set of nested boxes or intervals, where each interval extends from the start tag to its matching end tag. In this picture you can see that any two distinct intervals a and b must be in exactly one of the following relationships:
a contains b (a/descendant::b);
a is contained by b (a/ancestor::b);
a is followed by b (a/following::b);
a is preceded by b (a/preceding::b).
If you keep to this model, you will never be in doubt about the semantics of the XPath axes.
Incidentally, this is why the tree model is bad for your intuition: it doesn't put the "nested boxes" paradigm to the forefront, so it's easy to get confused.
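The interval picture can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions, not an XPath engine: the sample document, the `id` attributes and the helper names `intervals` and `relation` are made up for the example, and only element nodes are considered.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the ids exist only to name the elements.
XML = """<catalog>
  <journal id="j1">
    <article id="a1"><title id="t1"/></article>
  </journal>
  <journal id="j2">
    <article id="a2"/>
    <article id="a3"/>
  </journal>
</catalog>"""


def intervals(root):
    """Map each element to (start, end): the positions of its start and end tags."""
    spans = {}
    counter = 0

    def visit(elem):
        nonlocal counter
        start = counter          # position of the start tag
        counter += 1
        for child in elem:
            visit(child)
        spans[elem] = (start, counter)  # counter is now the end tag position
        counter += 1

    visit(root)
    return spans


def relation(spans, a, b):
    """Name the axis on which element b lies relative to element a (a is not b)."""
    a_start, a_end = spans[a]
    b_start, b_end = spans[b]
    if a_start < b_start and b_end < a_end:
        return "descendant"      # a contains b
    if b_start < a_start and a_end < b_end:
        return "ancestor"        # a is contained by b
    if a_end < b_start:
        return "following"       # b starts after a has ended
    return "preceding"           # b ended before a started


root = ET.fromstring(XML)
spans = intervals(root)
by_id = {e.get("id"): e for e in root.iter() if e.get("id")}
a1 = by_id["a1"]
for name in ("t1", "j1", "j2"):
    print(name, relation(spans, a1, by_id[name]))
# t1 descendant
# j1 ancestor
# j2 following
```

Because every pair of distinct elements falls into exactly one branch of `relation`, the four axes never overlap, which is the point of the model above.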

Related

Only xpath for extracting text for multiple conditions in xml - no code possible

I have an example file with three conditions to be met... I also have no control over the xml file I get:
<?xml version="1.0" encoding="UTF-8"?>
<rootelement>
<Description>
<Note countries="AR,GB,US" >
<P countries="AR" >We want this one as it's AR.</P>
<P countries="US" >We don't want this one as it's not AR.</P>
<P countries="GB" >We don't want this either as it's not AR.</P>
</Note>
</Description>
<Description>
<Note countries="AR,GB,US" >
<P>Everyone in AR, GB and US gets to buy.</P>
<P>No restrictions for this product in these countries.</P>
</Note>
</Description>
<Description>
<Note>
<P>No country, that's because it will be treated as AR.</P>
</Note>
</Description>
</rootelement>
The task is threefold:
Extract text from <P> where countries="AR", other values are always ignored
Extract text from <P> where its parent element (in this example, but it's not always the case) contains AR in the countries attribute (countries="AR,GB,US" for example)
Extract text from the current element (<P> in this example, not always) when there is no countries attribute present on the current element or its ancestors
I hope that's clear, I tried to put three examples in the xml above and I need to extract these texts with my rule(s):
<P countries="AR" >We want this one as it's AR.</P>
<P>Everyone in AR, GB and US gets to buy.</P>
<P>No restrictions for this product in these countries.</P>
<P>No country, that's because it will be treated as AR.</P>
Ideally I want one rule. But I could use several as the rules are applied hierarchically.
If I use this in the application I'm feeding:
//*[contains(@countries,'AR')]/*
All good to get the first three, but I also get US and GB which I don't want. I can exclude them with this:
//*[contains(@countries,'AR')]/*[not(contains(@countries,'US')) and not(contains(@countries,'GB'))]
But the expression will become unmanageable in practice as there are many languages and I often need to change the ones I'm looking for. I cannot figure out how to say just exclude any that don't contain AR.
And then I still have the last problem of being able to extract if the countries attribute is missing altogether. This bit I'm at a complete loss to know how to resolve without affecting the previous results.
Here's an XPath 1 expression which I think captures the logic you've described:
//*[text()[normalize-space()]]
[
not(ancestor-or-self::*/@countries) or
contains(ancestor-or-self::*[@countries][1]/@countries, 'AR')
]
Any element which has a child text node which is not just white space, and
which has no countries attribute of its own or on any of its ancestor elements, or
has 'AR' either in its own countries attribute or the first countries attribute of any of its ancestors.
NB the ancestor-or-self axis is a 'reverse' axis, which means the expression ancestor-or-self::* will return the context node itself, then its parent, then its grandparent, etc., in that order, finishing at the root element of the document. The expression ancestor-or-self::*[@countries] will filter that list to include only the elements which have a countries attribute, and ancestor-or-self::*[@countries][1] will return the first element in that list. If the element that contains the text has a countries attribute, then it will be first in that list; otherwise the nearest ancestor with a countries attribute will be first. I think this "inheritance" is what you're wanting to achieve?
Results:
<P countries="AR">We want this one as it's AR.</P>
<P>Everyone in AR, GB and US gets to buy.</P>
<P>No restrictions for this product in these countries.</P>
<P>No country, that's because it will be treated as AR.</P>
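The "inheritance" rule can also be spelled out step by step in Python. This is only a sketch of the same logic, not a replacement for the XPath: the standard library's ElementTree does not support the ancestor-or-self axis, so the nearest countries attribute is found by hand, and the function name `select_for_country` and the shortened sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document from the question.
XML = """<rootelement>
  <Description>
    <Note countries="AR,GB,US">
      <P countries="AR">We want this one as it's AR.</P>
      <P countries="US">We don't want this one as it's not AR.</P>
    </Note>
  </Description>
  <Description>
    <Note countries="AR,GB,US">
      <P>Everyone in AR, GB and US gets to buy.</P>
    </Note>
  </Description>
  <Description>
    <Note>
      <P>No country, that's because it will be treated as AR.</P>
    </Note>
  </Description>
</rootelement>"""


def select_for_country(root, code):
    """Return elements with non-blank text whose nearest countries attribute
    (on the element itself or an ancestor) contains `code`, or that have none."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    selected = []
    for elem in root.iter():
        # Equivalent of text()[normalize-space()]: any non-blank child text node.
        texts = [elem.text] + [child.tail for child in elem]
        if not any(t and t.strip() for t in texts):
            continue
        # Equivalent of ancestor-or-self::*[@countries][1]: walk upwards.
        node = elem
        while node is not None and node.get("countries") is None:
            node = parent_of.get(node)
        # Substring test, like contains() in the XPath expression.
        if node is None or code in node.get("countries"):
            selected.append(elem)
    return selected


root = ET.fromstring(XML)
for elem in select_for_country(root, "AR"):
    print(elem.text)
```

With the sample above, this prints the three <P> texts that apply to AR and skips the one marked US.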

Select XML Node by position

I have the following XML structure
<Root>
<BundleItem>
<Item>1</Item>
<Item>2</Item>
<Item>3</Item>
</BundleItem>
<Item>4</Item>
<Item>5</Item>
<Item>6</Item>
<BundleItem>
<Item>7</Item>
<Item>8</Item>
<Item>9</Item>
</BundleItem>
</Root>
And by providing the following xPath
//Item[1]
I am selecting
<Item>1</Item>
<Item>4</Item>
<Item>7</Item>
My goal is to select only <Item>1</Item> or <Item>7</Item> regardless of the parent element where they are found and only depending on the position, which i am providing in the xPath.
Is it possible to do that only by using the position and without providing additional criterias in the xPath ?
//Item[1] selects every <Item/> element that is the first <Item/> child of its parent, whatever that parent is.
To get the two items you are looking for you could use //Item[text() = 1 or text() = 7].
A good tutorial can be found at w3schools.com and you can play with XPath expressions over your XML input here. (I am not affiliated with either of these resources but find them useful.)
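For selecting purely by position, one option in XPath 1.0 is to parenthesise the path: in (//Item)[1] the predicate applies to the whole node-set in document order rather than per parent, so (//Item)[1] and (//Item)[7] pick out items 1 and 7. The Python sketch below illustrates the difference with the standard library's ElementTree; since ElementTree does not accept the parenthesised form, list indexing stands in for it, which is an assumption of this example.

```python
import xml.etree.ElementTree as ET

XML = """<Root>
  <BundleItem><Item>1</Item><Item>2</Item><Item>3</Item></BundleItem>
  <Item>4</Item><Item>5</Item><Item>6</Item>
  <BundleItem><Item>7</Item><Item>8</Item><Item>9</Item></BundleItem>
</Root>"""

root = ET.fromstring(XML)

# Like //Item[1]: the position is counted among the Item children of each parent.
per_parent = [e.text for e in root.findall(".//Item[1]")]

# Like (//Item)[1] and (//Item)[7]: the position is counted over all Item
# elements in document order (Python lists are 0-based, XPath is 1-based).
all_items = root.findall(".//Item")
first_overall = all_items[0].text
seventh_overall = all_items[6].text

print(per_parent)                      # ['1', '4', '7']
print(first_overall, seventh_overall)  # 1 7
```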

Using XPATH previous:: more like an array

I've got XML like this
<root>
...
<a>
<a>
<a>
<c>
...
It's very flat, with LOTS of A elements and a few C elements. The A elements are sensor data and the last reading before each C is bogus; I need the one before it. So I'd like to use the C elements as markers and select the A element two positions before each C. So I'm trying out an XPath like:
/root/c/preceding-sibling::a
but I'm getting all previous A elements, I was hoping for something a bit more direct such as:
/root/c/preceding-sibling[-2]
which would just grab the 2nd sibling before C (no matter the type). I guess I'm asking for array-like functionality in XPath, so that whatever I match, I can ask for "the second element before that".
Is this possible?
You can "just grab the 2nd sibling before C (no matter the type)" with the XPath expression
/root/c/preceding-sibling::*[2]
Positions on the preceding-sibling:: axis are counted backwards, in reverse document order. The node with the index [1] is the node immediately before c, and the node with the index [2] is the node before that one, which is "the second element before that".
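The backwards counting can be mimicked in plain Python to check which nodes such an expression picks. This is a sketch under assumptions: the sample readings r1 to r5 and the helper name `nth_preceding_sibling` are invented, and list indexing stands in for the preceding-sibling axis, which the standard library's ElementTree does not support.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat document: sensor readings in <a>, markers in <c>.
XML = "<root><a>r1</a><a>r2</a><a>r3</a><c/><a>r4</a><a>r5</a><c/></root>"


def nth_preceding_sibling(parent, marker_tag, n):
    """For each child named marker_tag, return the sibling n positions before it."""
    children = list(parent)
    picked = []
    for index, child in enumerate(children):
        # Skip markers that do not have n siblings in front of them.
        if child.tag == marker_tag and index - n >= 0:
            picked.append(children[index - n])
    return picked


root = ET.fromstring(XML)
readings = [e.text for e in nth_preceding_sibling(root, "c", 2)]
print(readings)  # ['r2', 'r4']
```

Here r3 and r5 play the role of the bogus last readings, and the sibling two positions before each marker is returned instead.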

What is the difference between "element" and "//element" in XPath?

I am reading this XPath examples: https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx and I want to know the difference between these 2 expressions:
author
All <author> elements within the current context.
//author
All <author> elements in the document.
What is the difference between these two cases? If the "current context" is the root node, would that make the two equivalent?
For this simple XML file:
<root>
<author>
<first-name></first-name>
</author>
</root>
I tried it on this site https://www.freeformatter.com/xpath-tester.html
Why does author not return anything, when I expected it to (while //author works)?
The description you cite for the relative XPath expression, author,
All <author> elements within the current context.
is wrong [1]. It should instead say,
All <author> child elements of the current context node.
//author would indeed select all <author> elements in the document because // selects along the descendant-or-self axis.
The reason author doesn't select anything for your XML document is that with the context node set to the document root, you'd have to include root/author to select the <author> children of <root> or just root to select the <root> element itself.
[1] As of today, 2018-06-24, but I've submitted feedback that it should be corrected, so hopefully it will be fixed soon.
"element" selects all immediate children named "element" of the current node and is identical to "./element".
"//element" selects all "element" nodes at any depth, starting from the root (ignoring your current node).
And to complete the list:
".//element" would select "element" descendants of the current node, at any depth.
"/element" would search at the root level only (in your example, you would need "/root" to get anything).
And as for "author" not finding anything: you first need to be at the level of your root node. "/root/author" would get the node you wanted, or first select "/root" and from there you can select "author".
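The child-versus-descendant distinction is easy to try out in Python with the standard library's ElementTree. One caveat, stated as an assumption of this sketch: ElementTree evaluates paths relative to an element (here <root>), not relative to the document node used by the online tester, so the child step is demonstrated with first-name instead of author.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><author><first-name/></author></root>")
author = root.find("author")  # child step from <root>: finds <author>

# A bare name is a child step: <first-name> is a grandchild of <root>, so no match.
child_hits = root.findall("first-name")

# ".//name" searches all descendants of the context element.
descendant_hits = root.findall(".//first-name")

# From <author>, <first-name> is a direct child, so the bare name matches.
from_author = author.findall("first-name")

print(len(child_hits), len(descendant_hits), len(from_author))  # 0 1 1
```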

Xpath expression to find non-child elements by attribute

here's a nice puzzle. Suppose we have this bit of code:
<page n="1">
<line n="3">...</line>
</page>
It is really easy to locate the line element "n=3" within the page element "n=1" with a simple XPath expression: //page[@n='1']/line[@n='3']. Great, beautiful, elegant.
Now suppose what we have is this encoding (folks familiar with the TEI will know where this is coming from).
<pb n="1"/>
(arbitrary amounts of stuff)
<lb n="3"/>
We want to find the lb element with n="3", which follows the pb element with n="1". But note -- this lb element could be almost anywhere following the pb: it may not be (and most likely is not) a sibling, but could be a child of a sibling of the pb, or of the pb's parent, etc etc etc.
So my question: how would you search for this lb element with n="3", which follows the pb element with n="1", with XPath?
Thanks in advance
Peter
Use:
//pb[@n='1']/following::lb[@n='3']
|
//pb[@n='1']/descendant::lb[@n='3']
This selects any lb element that follows the specified pb in document order -- even if the wanted lb element is a descendant of the pb element.
Do note that the following expression doesn't in general select all wanted lb elements (it fails to select any of these that are descendants of the pb element):
//pb[@n='1']/following::lb[@n='3']
Explanation:
As defined in the W3C XPath specification, the following:: and descendant:: axes are non-overlapping:
"the following axis contains all nodes in the same document as the
context node that are after the context node in document order,
excluding any descendants and excluding attribute nodes and namespace nodes"
That would be
//pb[@n=1]/following::lb[@n=3]
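To see what the following:: axis covers for milestone elements like these, here is a small Python sketch using only the standard library. The sample markup and the helper name `following` are assumptions made for illustration; ElementTree has no following axis, so the helper rebuilds it for element nodes from document order, excluding the context element's own descendants as the specification describes.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment: the wanted <lb> is not a sibling of the <pb>.
XML = """<text>
  <div><p>start <pb n="1"/> more</p></div>
  <div><p><lb n="2"/> a <lb n="3"/> b</p></div>
</text>"""


def following(root, context):
    """Elements after `context` in document order, excluding its descendants."""
    ordered = list(root.iter())          # all elements in document order
    descendants = set(context.iter())    # context itself plus its descendants
    start = ordered.index(context)
    # Ancestors come earlier in document order, so they are excluded by the slice.
    return [e for e in ordered[start + 1:] if e not in descendants]


root = ET.fromstring(XML)
pb = root.find(".//pb[@n='1']")
hits = [e for e in following(root, pb) if e.tag == "lb" and e.get("n") == "3"]
print(len(hits), hits[0].attrib)  # 1 {'n': '3'}
```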

Resources