restrict xpath within nested element - xpath

I want to find all index terms by section, but sections are nested. Here is a simple example.
<chapter>
<section><title>First Top Section</title>
<indexterm text="dog"/>
<para>
<indexterm text="tree"/>
</para>
<section><title>SubSection</title>
<indexterm text="cat"/>
</section>
</section>
<section><title>Second Top Section</title>
<indexterm text="elephant" />
</section>
</chapter>
Is there any xpath expression to get a result like this:
First Top Section = ["dog", "tree"]
SubSection = ["cat"]
Second Top Section = ["elephant"]
Of course I can get all the descendant indexterms under a section with an expression like this:
/chapter/section//indexterm
But indexterms can be inside other elements in a section--they're not necessarily children.
Is it possible to get indexterms specific to their parent section using xpath?

You can put a predicate at the section level:
/chapter/section[title = 'First Top Section']//indexterm
but this will include all indexterm elements under the given section, including those in subsections. To exclude them you could do something like
/chapter/section[title = 'First Top Section']//indexterm[count(ancestor::section) = 1]
to pick out those indexterm elements that have exactly one section ancestor (i.e. the "First Top Section" you started with).
More generally, if you have a reference to a specific section element then you can get all the indexterm elements inside it but not inside a subsection by first evaluating
count(ancestor-or-self::section)
as a number, and with the current section element as the context node, and then build up another expression
.//indexterm[count(ancestor::section) = thenumberyoujustcounted]
and evaluate that as a node set, again with the original section element as the context node.
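The two-step approach above can be sketched in Python. This is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree, which has no ancestor axis or count() function, so a parent map stands in for both. The helper names section_depth and own_indexterms are made up for this example.

```python
import xml.etree.ElementTree as ET

XML = """<chapter>
<section><title>First Top Section</title>
<indexterm text="dog"/>
<para>
<indexterm text="tree"/>
</para>
<section><title>SubSection</title>
<indexterm text="cat"/>
</section>
</section>
<section><title>Second Top Section</title>
<indexterm text="elephant" />
</section>
</chapter>"""

root = ET.fromstring(XML)
# ElementTree has no ancestor axis, so record each element's parent once.
parent_of = {child: parent for parent in root.iter() for child in parent}

def section_depth(elem, include_self):
    """Count section ancestors of elem, optionally counting elem itself."""
    depth = 1 if include_self and elem.tag == "section" else 0
    while elem in parent_of:
        elem = parent_of[elem]
        if elem.tag == "section":
            depth += 1
    return depth

def own_indexterms(section):
    """Index terms inside section but not inside any of its subsections."""
    # Step 1: count(ancestor-or-self::section) for the section itself.
    depth = section_depth(section, include_self=True)
    # Step 2: keep descendants whose count(ancestor::section) equals it.
    return [term.get("text") for term in section.iter("indexterm")
            if section_depth(term, include_self=False) == depth]

terms_by_section = {s.findtext("title"): own_indexterms(s)
                    for s in root.iter("section")}
print(terms_by_section)
```

With a full XPath 1.0 engine the parent map is unnecessary, since the two expressions can be evaluated directly.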

If you can use XPath 2.0, you could do:
XML Input
<chapter>
<section><title>First Top Section</title>
<indexterm text="dog"/>
<para>
<indexterm text="tree"/>
</para>
<section><title>SubSection</title>
<indexterm text="cat"/>
</section>
</section>
<section><title>Second Top Section</title>
<indexterm text="elephant" />
</section>
</chapter>
XPath 2.0
for $section in //section
return concat($section/title,' - ["',
string-join($section//indexterm[ancestor::section[1] is $section]/@text,
'", "'),'"]
')
Output
First Top Section - ["dog", "tree"]
SubSection - ["cat"]
Second Top Section - ["elephant"]

Related

xPath - Why is this exact text selector not working with the data test id?

I have a block of code like so:
<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>
I'm trying to select a menu item based on exact text like so in the dev tools:
$x('.//*[contains(@data-testid, "menu-item") and normalize-space() = "Text"]');
But this doesn't seem to be selecting the element. However, when I do:
$x('.//*[contains(@data-testid, "menu-item")]');
I can see both of the menu items.
UPDATE:
It seems that this works:
$x('.//*[contains(@class, "menu-item") and normalize-space() = "Text"]');
Not sure why using a class in this context works and not a data-testid. How can I get my xpath selector to work with my data-testid?
Why is this exact text selector not working
The fact that both li elements are matched by the XPath expression
when the condition normalize-space() = "Text" is omitted is a clue.
normalize-space() with no argument evaluates the string value of the whole li,
which is "... Text Here ..." for the first li in the posted markup and
"... Text ..." for the second (or whatever content actually stands in place
of ... in svg or div/small), so normalize-space() = "Text" fails for both.
In an update you say the same condition succeeds. This has nothing to
do with using @class instead of @data-testid; it must be caused
by some change in the content.
How can I get my xpath selector to work with my data-testid?
By testing for an exact text match in the li's descendant strong
element,
.//*[@data-testid = "menu-item" and div/strong = "Text"]
which matches the second li. Making the test more robust is usually
in order, e.g.
.//*[contains(@data-testid,"menu-item") and normalize-space(div/strong) = "Text"]
Append /div/small or /descendant::small, for example, to the XPath
expression to extract just the small text.
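To see why the comparison fails on the li but succeeds on the strong, here is a small Python sketch using the standard library's xml.etree.ElementTree. The normalize_space helper is a hypothetical stand-in that only approximates XPath's normalize-space(), and the ... placeholders from the question are kept as literal text.

```python
import xml.etree.ElementTree as ET

HTML = """<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>"""

root = ET.fromstring(HTML)

def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())

items = root.findall(".//li[@data-testid='menu-item']")

# String value of each li: the text of all its descendants, concatenated.
li_values = [normalize_space("".join(li.itertext())) for li in items]

# String value of only the strong element inside each li.
strong_values = [normalize_space(li.findtext("div/strong")) for li in items]

print(li_values)
print(strong_values)
```

Neither li value equals "Text" on its own, while the second strong value does, which is why the test has to target div/strong.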
data-testid="menu-item" matches both outer li elements, while the text you are looking for is inside the inner strong element.
So, to locate the outer li element based on its data-testid attribute value and the text of its inner strong element, you can use an XPath expression like this:
//*[contains(@data-testid, "menu-item") and .//normalize-space() = "Text"]
Or
.//*[contains(@data-testid, "menu-item") and .//*[normalize-space() = "Text"]]
I have tested both expressions, and they work correctly.

XPath expression to match across two associated elements

I’ve got the following XML of associated elements:
<doc>
<!-- A block of style elements. -->
<styles>
<style id='style-1' class='bar'>…</style>
<style id='style-2' class='baz'>…</style>
…
</styles>
<!-- Document content. -->
<p style='style-1'>…</p>
<p style='style-2'>…</p>
…
</doc>
For an XSLT template I’m looking for an XPath expression that matches “an element p whose style is of class bar”.
A pure XPath 1.0 expression that returns all p elements whose style is of class bar:
//p[@style = //style[@class='bar']/@id]
Basically, the XPath looks for <p> elements whose style attribute equals the id of a <style class='bar'> element.
Presuming that is an accurate representation of your document's structure, I would advise using this, without double-slashes (//) since double-slashes can be very inefficient:
/doc/p[@style = /doc/styles/style[@class = 'bar']/@id]
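The same join can be sketched outside XSLT in Python with the standard library's xml.etree.ElementTree, which cannot compare two node sets in one expression, so the lookup is split into two steps. The element text below is placeholder content standing in for the elided parts of the question.

```python
import xml.etree.ElementTree as ET

# Placeholder text replaces the elided content of the original document.
XML = """<doc>
<styles>
<style id='style-1' class='bar'>a</style>
<style id='style-2' class='baz'>b</style>
</styles>
<p style='style-1'>first</p>
<p style='style-2'>second</p>
</doc>"""

doc = ET.fromstring(XML)

# Step 1: /doc/styles/style[@class = 'bar']/@id
bar_ids = {style.get("id") for style in doc.findall("styles/style[@class='bar']")}

# Step 2: /doc/p[@style = <one of those ids>]
bar_paragraphs = [p for p in doc.findall("p") if p.get("style") in bar_ids]

print([p.text for p in bar_paragraphs])
```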

XPath difference between two similar path and other questions

I have to do some exercises, but
I don't really understand the difference between two similar paths.
I have this tree:
<b>
<t></t>
<a>
<n></n>
<p></p>
<p></p>
</a>
<a>
<n></n>
<p></p>
</a>
<a></a>
</b>
Assume that each leaf tag contains one text node.
I have to explain the difference between //a//text() and //a/text().
I see that //a//text() returns all the text nodes, which makes sense,
but why does //a/text() return only the text node of the last a element?
Another question:
why does //p[1] return the first p child of each a element?
-> I get two results:
<b>
<t></t>
<a>
<n></n>
**<p></p>**
<p></p>
</a>
<a>
<n></n>
**<p></p>**
</a>
<a></a>
</b>
Why isn't the answer the first p node in the whole document?
Thanks!
Difference between 1: //a//text() and 2: //a/text()
Let's break it down: //a selects all a elements, no matter where they are in the document. /a, by contrast, would select only a root a element.
When / follows another step in an XPath expression, it selects children of the nodes selected by that step.
When // follows another step, it selects descendants of those nodes, no matter how deep they are.
Applying this to your two XPath expressions:
//a//text(): Select all a elements no matter where they are in the document, and for those elements select the text() nodes at any depth below them.
//a/text(): Select all a elements no matter where they are in the document, and for those elements select only the text() nodes that are their direct children. In your tree only the last a element has a text node as a direct child, which is why only that one is returned.
Why does //p[1] return the first p child of each a element?
Suppose you were to write //a/p[1]: this would select the first p child of any a element anywhere in the document. By writing //p[1] you omit an explicit parent, but the predicate still applies per parent: //p[1] is short for /descendant-or-self::node()/child::p[1], so it selects the first p child of every element that has one.
In this case there are two parent a elements with p children, and the first p child of each is selected. To get the first p in the whole document, use (//p)[1].
It would be good to search for a good introduction to XPath on your favorite search engine. I've always found this one from w3schools.com to be a good one.
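The distinctions above can be checked with a short Python sketch using the standard library's xml.etree.ElementTree. The text values (t1, n1, p1, and so on) are assumptions added for illustration, since the question only states that each leaf holds one text node; child_texts is a hypothetical helper that imitates elem/text().

```python
import xml.etree.ElementTree as ET

# Assumed text content; written on one line to avoid whitespace-only text nodes.
XML = ("<b><t>t1</t>"
       "<a><n>n1</n><p>p1</p><p>p2</p></a>"
       "<a><n>n2</n><p>p3</p></a>"
       "<a>a3</a></b>")
root = ET.fromstring(XML)

def child_texts(elem):
    """Text nodes that are direct children of elem, like elem/text()."""
    parts = [elem.text] + [child.tail for child in elem]
    return [text for text in parts if text]

# //a/text(): only text nodes that are children of an a element.
direct = [text for a in root.iter("a") for text in child_texts(a)]

# //a//text(): text nodes at any depth below an a element.
nested = [text for a in root.iter("a") for text in a.itertext()]

# //p[1]: the first p child of each parent.
first_per_parent = [p.text for p in root.findall(".//p[1]")]

# (//p)[1]: the first p in the whole document.
first_overall = root.findall(".//p")[0].text

print(direct, nested, first_per_parent, first_overall)
```

Only the last a contributes to direct, because the other two a elements contain child elements rather than text of their own.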

xpath: check if element is within other element

I have quite a large XML structure that in its simplest form looks kinda like this:
<document>
<body>
<section>
<p>Some text</p>
</section>
</body>
<backm>
<section>
<p>Some text</p>
<figure><title>This</title></figure>
</section>
</backm>
</document>
The section levels can be almost limitless (both within the body and backm elements) so I can have a section in section in section in section, etc. and the figure element can be within a numlist, an itemlist, a p, and a lot more elements.
What I want to do is to check if the title in figure element is somewhere within the backm element. Is this possible?
A document could have multiple <backm> elements and it could have multiple <figure><title>Title</title></figure> elements in it. How you build your query depends on the situations you're trying to distinguish between.
//backm/descendant::figure/title
Will return the <title> elements that are the child of a <figure> element and the descendant of a <backm> element.
So:
count(//backm/descendant::figure/title) > 0
Will return True if there are 1 or more such title elements.
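You can also express this using double negation:
not(//backm[not(descendant::figure/title)])
I'm under the impression that this should have better performance. Note that it is not strictly equivalent: it is true when no <backm> lacks such a title, so it is also true for a document that has no <backm> at all.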
//title[parent::figure][ancestor::backm]
Lists all <title> elements with a <figure> parent and a <backm> ancestor.
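As a quick check of the descendant-based query, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, whose limited XPath subset is enough for this path. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

XML = """<document>
<body>
<section>
<p>Some text</p>
</section>
</body>
<backm>
<section>
<p>Some text</p>
<figure><title>This</title></figure>
</section>
</backm>
</document>"""

root = ET.fromstring(XML)

# Equivalent of //backm/descendant::figure/title
titles_in_backm = root.findall(".//backm//figure/title")

# The same query against body, for comparison
titles_in_body = root.findall(".//body//figure/title")

# Equivalent of count(...) > 0
has_backm_figure_title = len(titles_in_backm) > 0

print(has_backm_figure_title, [t.text for t in titles_in_backm])
```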

XPath selector by class AND index

I have the following HTML:
<div>
<p>foo</p>
<p class='foo'>foo</p>
<p class='foo'>foo</p>
<p>bar</p>
</div>
How can I select the second p tag with class 'foo' using XPath?
The following expression should do it:
//p[@class="foo"][2]
Edit: The use of [2] here selects elements according to their position among the matching children of the same parent, rather than from among all the matched nodes in the document. Since both your tables are the first children of their parent elements, [1] will match both of them, while [2] will match neither. If you want the second such element in the entire document, you need to put the expression in brackets so that [2] applies to the whole nodeset:
(//p[@class="foo"])[2]
(//table[@class="info"])[2]
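The difference between the two forms only shows up when the matches are spread over several parents. The Python sketch below uses hypothetical markup with two div elements for that reason, and does the per-parent selection by hand, since the position predicates in the standard library's xml.etree.ElementTree do not follow full XPath semantics when combined with other predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two parents, so the two expressions give different results.
HTML = """<body>
<div><p>foo</p><p class='foo'>a</p><p>bar</p></div>
<div><p class='foo'>b</p><p class='foo'>c</p></div>
</body>"""
root = ET.fromstring(HTML)

# //p[@class="foo"][2]: the second matching child within each parent.
second_per_parent = []
for parent in root.iter():
    matches = [child for child in parent
               if child.tag == "p" and child.get("class") == "foo"]
    if len(matches) >= 2:
        second_per_parent.append(matches[1].text)

# (//p[@class="foo"])[2]: the second match in the whole document.
all_matches = root.findall(".//p[@class='foo']")
second_overall = all_matches[1].text

print(second_per_parent, second_overall)
```

In the question's own HTML both forms select the same element, because the two matching p elements share one parent.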
