XPath: difference between two similar paths, and other questions

I have some exercises to do, but
I don't really understand the difference between two similar paths.
I have this tree:
<b>
<t></t>
<a>
<n></n>
<p></p>
<p></p>
</a>
<a>
<n></n>
<p></p>
</a>
<a></a>
</b>
We assume that each leaf element contains one text node.
I have to explain the difference between //a//text() and //a/text().
I see that //a//text() returns all the text nodes, which seems right,
but why does //a/text() return only the text node of the last a element?
Another question:
Why does //p[1] return the first p child of each a element?
I get two results:
<b>
<t></t>
<a>
<n></n>
**<p></p>**
<p></p>
</a>
<a>
<n></n>
**<p></p>**
</a>
<a></a>
</b>
Why isn't the answer the first p node of the whole document?
Thanks for all!

Difference between 1: //a//text() and 2: //a/text()
Let's break it down: //a selects all a elements, no matter where they are in the document. By contrast, /a would select only an a element that is the root element of the document.
If / comes after another step in an XPath expression, it selects nodes that are direct children of the nodes selected by the preceding step.
If // comes after another step, it selects all descendants of the nodes selected by the preceding step, no matter how deeply they are nested.
Applying to your two XPath expressions:
//a//text(): Select all a elements no matter where they are in the document, and for those elements select text() no matter where they are under the a elements selected.
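//a/text(): Select all a elements no matter where they are in the document, and for those elements select only the text() nodes that are their direct children. In your tree only the last a element has a text node of its own (leaving aside any whitespace-only text nodes between tags), which is why that is the one you get back.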
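To see the difference concretely, here is a minimal sketch in Python. It assumes the third-party lxml package is installed, and it fills the leaf elements of the question's tree with made-up text values (t1, n1, p1, ...) so that the results are visible; those values are not part of the original question.

```python
from lxml import etree  # third-party; assumed to be installed

# The tree from the question, with hypothetical text in each leaf element.
XML = (
    "<b><t>t1</t>"
    "<a><n>n1</n><p>p1</p><p>p2</p></a>"
    "<a><n>n2</n><p>p3</p></a>"
    "<a>a3</a></b>"
)
root = etree.fromstring(XML)

# Text nodes anywhere below an a element, at any depth.
all_text = root.xpath("//a//text()")

# Only text nodes that are direct children of an a element.
child_text = root.xpath("//a/text()")

print(all_text)    # ['n1', 'p1', 'p2', 'n2', 'p3', 'a3']
print(child_text)  # ['a3']
```

Only the last a element holds text directly, so it is the only one that contributes to //a/text().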
Why does //p[1] return, for each "a node", the first "p" child node?
Suppose you were to write //a/p[1]: this would select the first p child element of any a element anywhere in the document. By writing //p[1] you omit an explicit parent element, but the predicate is still evaluated per parent: it selects every p element that is the first p child of its own parent. To get the first p element of the whole document, write (//p)[1] instead.
In this case there are two parent a elements, for which the first p child element is selected.
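A short sketch of the per-parent behaviour, again assuming lxml is installed and using invented text values (p1, p2, p3) to tell the p elements apart:

```python
from lxml import etree  # third-party; assumed to be installed

# The tree from the question, with hypothetical text in each leaf element.
XML = (
    "<b><t>t1</t>"
    "<a><n>n1</n><p>p1</p><p>p2</p></a>"
    "<a><n>n2</n><p>p3</p></a>"
    "<a>a3</a></b>"
)
root = etree.fromstring(XML)

# [1] is applied per parent: every p that is the first p child of its parent.
per_parent = [p.text for p in root.xpath("//p[1]")]

# Parentheses build the whole node-set first, then [1] picks its first member.
first_in_doc = [p.text for p in root.xpath("(//p)[1]")]

print(per_parent)    # ['p1', 'p3']
print(first_in_doc)  # ['p1']
```

The parenthesised form is the usual way to ask for the first match in document order.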
It's worth looking for a good introduction to XPath with your favorite search engine. I've always found the one from w3schools.com to be a good one.

Related

How to find direct children which contain nodes with specified text with xpath?

I need to extract all children which have nodes with some text. Html structure might be the following:
<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
</div>
I need to extract child nodes which have "A" text. It should return div and p nodes
I tried the following xpaths:
./*/*[contains(text(), 'A')]
./*/*[./*[contains(text(), 'A')]]
but the first one returns only the div with "A" text and the second one returns only the p with "A" text.
Is it possible to construct an XPath which will return both children?
The node containing the "A" text might be at any level within the child node.
If you need an XPath that returns both child nodes, try:
./*/*[contains(., "A")]
I suspect contains() is wrong here, unless you really want to select a node whose value is "HAT" as well as one whose value is "A".
Try
*/*[normalize-space(.)='A']
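The expressions from this thread can be compared side by side. This is a rough sketch assuming lxml is installed; the wrapping body element is an assumption of mine, added only so that there is a context node for the relative paths, and tags is a hypothetical helper.

```python
from lxml import etree  # third-party; assumed to be installed

# The question's markup, wrapped in a hypothetical <body> that serves as the
# context node for the relative paths.
XML = """<body><div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
</div></body>"""
body = etree.fromstring(XML)


def tags(expr):
    """Evaluate expr relative to <body> and return the matched tag names."""
    return [el.tag for el in body.xpath(expr)]


# text() is reduced to the first text child, so <p> (whitespace first) is missed.
print(tags("./*/*[contains(text(), 'A')]"))  # ['div']

# "." is the whole string value of the element, including nested elements.
print(tags("./*/*[contains(., 'A')]"))       # ['div', 'p']

# Exact match after trimming whitespace; would not match a value like "HAT".
print(tags("*/*[normalize-space(.)='A']"))   # ['div', 'p']
```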

XPath Find any element with text() within node selection

I have a fairly deeply nested XML structure and I would like to find an element with a particular value after I have already selected a node. In the sample below I have an array of 'B' elements, and after selecting each of the 'B' nodes I would like to get the text of one of its descendants (which are not consistent) that starts with the word 'history'.
<A>
<Items>
<B>
<C>
<D>history: pressed K,E</D> <!-- could be nested anywhere -->
</C>
</B>
<B>
<C>
<E>history: pressed W</E>
</C>
</B>
</Items>
</A>
// Select the nodes from the array (B)
var nodes = select(xmldoc, "//A/Items/B");
// Iterate through the nodes.
nodes.forEach(function (node) {
// Is it possible to select any element that starts with the text 'history' from the already selected node?
var history = select(node, "???[starts-with(.,'history')]");
});
All the samples I have seen start with //*[text()], which searches from the root of the structure.
//B//*[starts-with(normalize-space(), 'history')]
looks like it would do what you intend.
It selects "any descendant element of <B> whose text content starts with 'history'".
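Manual iteration to find further nodes is not typically necessary. XPath does that for you. If you must iterate for some other reason, start the path at the context node (.) so that it is evaluated relative to each selected node, and use .// rather than ./ because the element may be nested at any depth.
nodes.forEach(function (node) {
var history = select(node, ".//*[starts-with(normalize-space(), 'history')]");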
});
If you are actually looking for "any text node..."
//B//text()[starts-with(normalize-space(), 'history')]
Or "any element node that has no further child elements..."
//B//*[not(*) and starts-with(normalize-space(), 'history')]
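For comparison, here is a small sketch of the relative lookup in Python, assuming lxml is installed. It runs two of the predicates above from each already selected B node; note that without not(*) the wrapping C element matches as well, because its normalized string value also starts with 'history'.

```python
from lxml import etree  # third-party; assumed to be installed

XML = """<A>
<Items>
<B>
<C>
<D>history: pressed K,E</D>
</C>
</B>
<B>
<C>
<E>history: pressed W</E>
</C>
</B>
</Items>
</A>"""
doc = etree.fromstring(XML)

results = []
for b in doc.xpath("/A/Items/B"):
    # Any descendant element of this B whose text content starts with 'history'.
    loose = b.xpath(".//*[starts-with(normalize-space(), 'history')]")
    # Same test, restricted to elements without child elements.
    leaves = b.xpath(".//*[not(*) and starts-with(normalize-space(), 'history')]")
    results.append(([el.tag for el in loose], [el.text for el in leaves]))

print(results)
# [(['C', 'D'], ['history: pressed K,E']), (['C', 'E'], ['history: pressed W'])]
```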

How can I select nodes that don't contain links but which do contain specific text using xpath

Given the following HTML:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the interim there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
I want all the nodes containing the word "interim" but not if the word "interim" is part of a link element.
The nodes I would expect back are the first P node and the LI node only.
I've tried the following:
'//*/text()[not(a) and contains(.,"interim")]'
... but this still returns the A and also returns part of its parent P node (the part after the A), neither of which is desired. You can see my attempt here: https://glot.io/snippets/ehp7hmmglm
If you use the XPath expression //*[not(self::a) and not(a) and text()[contains(.,"interim")]] then you get all elements that are not a elements, have no a child element, and have a text node child containing that word.
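A quick way to check that expression is the sketch below, which assumes lxml is installed. The HTML as posted shows no link, so the a element with href="#" in the second paragraph is my own assumption about what the original markup looked like.

```python
from lxml import etree  # third-party; assumed to be installed

# The question's HTML; the <a href="#"> in the second <p> is an assumed link.
XML = """<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a href="#">interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>"""
root = etree.fromstring(XML)

expr = '//*[not(self::a) and not(a) and text()[contains(., "interim")]]'
matches = root.xpath(expr)

print([el.tag for el in matches])  # ['p', 'li']
```

The second p is dropped because it has an a child, and the a itself is dropped by not(self::a).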

XPath: limit scope of result set

Given the XML
<a>
<c>
<b id="1" value="noob"/>
</c>
<b id="2" value="tube"/>
<a>
<c>
<b id="3" value="foo"/>
</c>
<b id="4" value="goo"/>
<b id="5" value="noob"/>
<a>
<b id="6" value="near"/>
<b id="7" value="bar"/>
</a>
</a>
</a>
and the Xpath 1.0 query
//b[@id=2]/ancestor::a[1]//b[@value="noob"]
The XPath above returns both node ids 1 and 5. The goal is to limit the result to just node id=1, since it is the only @value="noob" element that is a descendant of the same <a> that (//b[@id=2]) is also a descendant of.
In other words, "Find all b elements whose value is "noob" that are descendants of the a element which also has a descendant whose id is 2, but that are not descendants of any other a element". How's that for convoluted? In practice the id number and values would be variable and there would be hundreds of node types.
If the id=2, we would expect to return element id=1 but not id=5, since the latter is contained in another a element. If the id=4, we would expect to return id=5 but not id=1, since the latter is not in the same nearest ancestor a element as id=4.
Edit:
Based on the comments of Dimitre and Alejandro, I found this helpful blog entry explaining the use of count() with the | union operator as well as some other excellent tips.
Use:
//b[@value='noob']
[count(ancestor::a[1] | //b[@id=2]/ancestor::a[1]) = 1]
Explanation:
The second predicate ensures that both b elements have the same nearest ancestor a.
Remember: In XPath 1.0 the test for node identity is:
count($n1 | $n2) = 1
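The node identity test can be tried out against the question's XML. This is a sketch assuming lxml is installed; noob_ids is a hypothetical helper, and passing the start id as an XPath variable ($start) is my own choice for illustration.

```python
from lxml import etree  # third-party; assumed to be installed

XML = """<a>
<c>
<b id="1" value="noob"/>
</c>
<b id="2" value="tube"/>
<a>
<c>
<b id="3" value="foo"/>
</c>
<b id="4" value="goo"/>
<b id="5" value="noob"/>
<a>
<b id="6" value="near"/>
<b id="7" value="bar"/>
</a>
</a>
</a>"""
root = etree.fromstring(XML)


def noob_ids(start_id):
    """Ids of the 'noob' b elements sharing the nearest a ancestor of start_id."""
    expr = (
        "//b[@value='noob']"
        "[count(ancestor::a[1] | //b[@id=$start]/ancestor::a[1]) = 1]"
    )
    return [b.get("id") for b in root.xpath(expr, start=start_id)]


# The original query, which also reaches into the nested a element.
too_wide = [b.get("id") for b in
            root.xpath('//b[@id=2]/ancestor::a[1]//b[@value="noob"]')]

print(too_wide)       # ['1', '5']
print(noob_ids("2"))  # ['1']
print(noob_ids("4"))  # ['5']
```

The union of two node-sets has exactly one member only when both sides refer to the same single node, which is what makes count(...) = 1 work as an identity check.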
First, this:
"Is there some way to limit the result set to the <b> elements that are ONLY the children of the immediate <a> element of the start node (//b[@id=2])?"
is answered by:
//b[@value='noob'][ancestor::a[1]/b/@id=2]
It's not the same as:
"Starting at a node whose id is equal to 2, find all the elements whose value is "noob" that are descendants of the immediate parent c element without passing through another c element"
Which is:
//c[b/@id=2]//*[.='noob'][ancestor::c[1][b/@id=2]]
Besides these expressions, when you are dealing with "context marks" you can use the set's membership test as in:
$node[count(.|$node-set)=count($node-set)]
I leave you its use for this case as an exercise...
//b[@id=2]/ancestor::a[1]//b[@value="noob" and not(ancestor::a[2]=//b[@id=2]/ancestor::a[1])] ?
That works only for your case though, not sure how generic it should be!

XPATH filter tag-less children

Is there any way to specify that I want to select only tag-less child elements (in the following example - "text")?
<div>
<p>...</p>
"text"
</div>
The text() node test matches text nodes. For example, //div/text() matches all text node children of all div elements.
Use:
/*/text()[normalize-space()]
This selects all text nodes that are children of the top element of the document and that do not consist only of white-space characters.
In the concrete example this will select only the text node with string value:
'
"text"
'
The XPath expressions:
/*/text()
or
/div/text()
both select two text nodes, the first of which contains only white-space and the second is the same text node as above:
'
"text"
'
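The difference between the filtered and unfiltered forms can be checked with a few lines of Python. This is a sketch that assumes lxml is installed and that the sample is stored with plain newlines and no indentation.

```python
from lxml import etree  # third-party; assumed to be installed

# The question's sample, with newlines between the parts and no indentation.
XML = '<div>\n<p>...</p>\n"text"\n</div>'
root = etree.fromstring(XML)

# Every text node child of the top element, including whitespace-only ones.
all_text = root.xpath("/*/text()")

# Only text nodes whose normalized value is non-empty.
non_blank = root.xpath("/*/text()[normalize-space()]")

print(all_text)   # ['\n', '\n"text"\n']
print(non_blank)  # ['\n"text"\n']
```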
select only tag-less child elements
To me this sounds like selecting all elements that don't have other elements as children. But then again, "text" in your example is not an element but a text node, so I'm not really sure what you want to select...
Anyway, here is a solution for selecting such elements.
//*[not(*)]
This selects all elements that don't have an element as a child. Replace the first * with an element name if you only want to select certain elements that don't have child elements. Also note that using // is generally slow, since it runs through the whole document. Consider using a more specific path when possible (like /div/*[not(*)] in this case).
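A brief sketch of both forms, assuming lxml is installed. The ul/li pair is a hypothetical addition to the question's sample, included only to show that the narrower path can also return fewer nodes.

```python
from lxml import etree  # third-party; assumed to be installed

# The question's sample plus a hypothetical nested <ul><li> for contrast.
XML = '<div><p>...</p><ul><li>one</li></ul>"text"</div>'
root = etree.fromstring(XML)

# Elements anywhere in the document that have no child elements.
leaves_anywhere = [el.tag for el in root.xpath("//*[not(*)]")]

# Only direct children of the top-level div that have no child elements.
leaf_children = [el.tag for el in root.xpath("/div/*[not(*)]")]

print(leaves_anywhere)  # ['p', 'li']
print(leaf_children)    # ['p']
```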

Resources