Select the xth element on a page that is a yth child of its parent - xpath

There are lots of similar questions, however I wasn't able to find an answer to this.
Imagine you have an HTML page like this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Page title</title>
</head>
<body>
<div id="content">
<table>
<tr>
<td>A</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
</table>
</div>
</body>
</html>
and you want to select the second <td> element on the page that is a first child of its parent. In this case, it's the element <td>D</td>.
Note that this wording should be kept intact, for example it's not the same as selecting the second <tr> and then its first child (results in the same element), because the original page I'm working with is far more complex than this minimal testcase and this approach wouldn't work there.
What I have done so far:
A CSS selector #content td:first-child finds A and D; I can then select the second element either via JS (document.querySelectorAll("query")[1]) or in Java (where I'm working with those elements in the end). However, it's quite inconsistent to use additional code for what could be done via a selector.
Similarly, I can use an XPath expression: id('content')//td[1]. It's equivalent to the CSS selector above. It returns a node-set, so I thought that id('content')//td[1][2] would work the way I wanted, but no luck.
After some time, I discovered ( id('content')//td[1] )[2] to be working the way I want so I went for that and am quite happy with it.
Still, it's a letdown for me to see that I couldn't do a single query to get my element, and therefore an academic question is in place: Is there any other solution, either with a CSS selector, or an XPath expression to do my query? What did I miss? Can it be done?

CSS selectors currently don't provide any way to select the nth element in a set of globally-matched elements or the nth occurrence of some element in the entire DOM. The structural :nth-*() functional pseudo-classes that are provided by both Selectors 3 and Selectors 4 all count by the nth child of its parent matching the criteria, rather than by the nth element in the entire DOM.
The current Selectors syntax doesn't provide an intuitive way to say "this is the nth of a set of matched elements in the DOM"; even :nth-match() and :nth-last-match() in Selectors 4 have a pretty awkward syntax as they currently stand. So that is indeed a letdown.
As for XPath, the expression to use is (id('content')//td[1])[2], as you have already found. The outer () simply means "this entire subexpression should be evaluated before the [2] predicate" or "the [2] predicate should operate on the result of this entire subexpression, not just //td[1]." Without them, the expression td[1][2] would be treated collectively, with two conflicting predicates that would never work together (you can't have the same element be both first and second!).
Having parentheses around a subexpression doesn't make it an extra query per se; if it were, then you could consider each of id('content'), //td, [1] and [2] a "query" as well in its own right, with implied (or optional) parentheses. And that's a lot of queries :)
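The two steps that the parentheses separate can be sketched in Python. This is only an illustration under stated assumptions: the standard-library xml.etree.ElementTree module supports a limited XPath subset without parenthesized expressions, so the outer [2] is emulated with a list index, and id('content') is replaced by an attribute test because no DTD declares the ID attribute. The names PAGE, first_cells and second_first_cell are made up for this example.

```python
import xml.etree.ElementTree as ET

# Reduced, well-formed copy of the page from the question (DOCTYPE and <head> omitted).
PAGE = """
<html><body><div id="content"><table>
<tr><td>A</td><td>B</td><td>C</td></tr>
<tr><td>D</td><td>E</td><td>F</td></tr>
</table></div></body></html>
"""

root = ET.fromstring(PAGE)
content = root.find(".//div[@id='content']")

# Step 1: td[1] is evaluated per parent, so it yields the first <td> of every row.
first_cells = content.findall(".//td[1]")

# Step 2: the outer [2] applies to that whole result, in document order.
# XPath positions are 1-based, Python indexes are 0-based.
second_first_cell = first_cells[1]

print([td.text for td in first_cells])
print(second_first_cell.text)
```

The index is applied only after the per-parent filter has produced its complete result, which is what the outer parentheses force in the XPath expression.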

Use this simple XPath expression:
(//td[1])[2]
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="node()|@*">
<xsl:copy-of select="(//td[1])[2]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Page title</title>
</head>
<body>
<div id="content">
<table>
<tr>
<td>A</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
</table>
</div>
</body>
</html>
the XPath expression is evaluated and the result of this evaluation is copied to the output:
<td>D</td>

Related

xpath: how to select items between item A and item B

I have an HTML page with this structure:
<big><b>Staff in:</b></big>
<br>
<a href='...'>Movie 1</a>
<br>
<a href='...'>Movie 2</a>
<br>
<a href='...'>Movie 3</a>
<br>
<br>
<big><b>Cast in:</b></big>
<br>
<a href='...'>Movie 4</a>
How do I select Movies 1, 2, and 3 using XPath?
I wrote this query
'//big/b[text()="Staff in:"]/following::a'
but it returns Movies 1, 2, 3, and 4. I guess I need to find a way to get items after <big><b>Staff in: but before the next <big>.
Thanks,
Assuming that <big><b>Staff in:</b></big> is a unique element that we can use as 'anchor', you can try this way :
//big[b='Staff in:']/following-sibling::a[preceding-sibling::big[1][b='Staff in:']]
Basically, the XPath finds every <a> that is a following sibling of the 'anchor' <big> element mentioned above, and restricts the result to those whose nearest preceding sibling <big> is that anchor element.
Output in an XPath tester, given the markup in the question as input (with minimal adjustment to make it well-formed XML):
Element='Movie 1'
Element='Movie 2'
Element='Movie 3'
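The "nearest preceding <big> must be the anchor" rule can also be sketched procedurally. This is a minimal Python illustration, not a replacement for the XPath above. It assumes the markup has been made well-formed (a wrapping <div> and self-closing <br/> are additions for that purpose), and links_under is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the markup from the question; the <div> wrapper is an assumption.
MARKUP = """
<div>
<big><b>Staff in:</b></big>
<br/>
<a href='...'>Movie 1</a>
<br/>
<a href='...'>Movie 2</a>
<br/>
<a href='...'>Movie 3</a>
<br/>
<br/>
<big><b>Cast in:</b></big>
<br/>
<a href='...'>Movie 4</a>
</div>
"""

def links_under(parent, heading):
    """Collect the text of <a> children whose nearest preceding <big> has the given heading."""
    selected = []
    current_heading = None
    for child in parent:  # children are visited in document order
        if child.tag == "big":
            current_heading = child.findtext("b")
        elif child.tag == "a" and current_heading == heading:
            selected.append(child.text)
    return selected

root = ET.fromstring(MARKUP)
staff_movies = links_under(root, "Staff in:")
print(staff_movies)
```

Tracking the most recent heading while walking the siblings plays the role of preceding-sibling::big[1] in the predicate.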
//a[preceding::b[text()="Staff in:"] and following::b[text()="Cast in:"]]
Returns all a after the element b with text Staff in: but before the element b with the text Cast in:.
You may need to add some more conditions to make it more specific depending on whether or not these b elements are unique on the page.
To add to the above, and following the Stack Overflow question "XPath axis, get all following nodes until", here is the complete solution that I worked out in an XSLT editor. First, /*/ is used instead of // because it is faster. Second, the logic is that all anchor (<a>) nodes that are siblings of the <big> element are returned if they satisfy the inner condition: their nearest preceding-sibling <big> node is the same <big> that they are following. This also presumes that the <big> nodes are distinct.
The x-path looks like
/*/big[b="Staff in:"]/following-sibling::a [1 = count(preceding-sibling::big[1]| ../big[b="Staff in:"])]
The xslt solution looks like
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<body>
<h2>My Movie Collection</h2>
<table border="1">
<tr bgcolor="#9acd32">
<th>Title</th>
</tr>
<xsl:variable name="placeholder" select="/*/big" />
<xsl:for-each select="$placeholder">
<xsl:variable name="i" select="position()" />
<b>
<xsl:value-of select="$i" />
<xsl:value-of select="$placeholder[$i]" />
</b>
<xsl:for-each
select="following-sibling::a[1 = count(preceding-sibling::big[1] | ../big[b=$placeholder[$i]])]">
<tr>
<td>
<xsl:value-of select="." />
</td>
</tr>
</xsl:for-each>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>

xpath: check if element is within other element

I have quite a large XML structure that in its simplest form looks kinda like this:
<document>
<body>
<section>
<p>Some text</p>
</section>
</body>
<backm>
<section>
<p>Some text</p>
<figure><title>This</title></figure>
</section>
</backm>
</document>
The section levels can be almost limitless (both within the body and backm elements), so I can have a section in a section in a section, etc., and the figure element can be within a numlist, an itemlist, a p, and many more elements.
What I want to do is to check if the title in figure element is somewhere within the backm element. Is this possible?
A document could have multiple <backm> elements and it could have multiple <figure><title>Title</title></figure> elements in it. How you build your query depends on the situations you're trying to distinguish between.
//backm/descendant::figure/title
Will return the <title> elements that are the child of a <figure> element and the descendant of a <backm> element.
So:
count(//backm/descendant::figure/title) > 0
Will return True if there are 1 or more such title elements.
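You can also express a related check using double negation. Note that it is not strictly equivalent: it is true when no <backm> lacks such a title, so it requires every <backm> to contain one, and it is also true when the document has no <backm> at all.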
not(//backm[not(descendant::figure/title)])
I'm under the impression that this should have better performance.
//title[parent::figure][ancestor::backm]
Lists all <title> elements with a parent of <figure> and a <backm> ancestor.
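The difference between the "at least one" check and the double-negation check can be seen in a small Python sketch. It uses the limited XPath subset of xml.etree.ElementTree, so .//backm//figure/title stands in for //backm/descendant::figure/title. The variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<document>
<body>
<section>
<p>Some text</p>
</section>
</body>
<backm>
<section>
<p>Some text</p>
<figure><title>This</title></figure>
</section>
</backm>
</document>"""

root = ET.fromstring(DOC)

# Counterpart of //backm/descendant::figure/title
titles_in_backm = root.findall(".//backm//figure/title")

# Counterpart of count(...) > 0: at least one such title exists.
any_title_in_backm = len(titles_in_backm) > 0

# Counterpart of not(//backm[not(descendant::figure/title)]):
# true when no <backm> lacks such a title (vacuously true with zero <backm> elements).
every_backm_has_title = all(
    backm.find(".//figure/title") is not None for backm in root.iter("backm")
)

print(any_title_in_backm, every_backm_has_title)
```

On this sample both checks agree. They diverge as soon as a second <backm> without a figure title is added, or when there is no <backm> at all.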

Select adjacent sibling elements without intervening non-whitespace text nodes

Given markup like:
<p>
<code>foo</code><code>bar</code>
<code>jim</code> and then <code>jam</code>
</p>
I need to select the first three <code>, but not the last. The logic is "Select all code elements that have a preceding-or-following-sibling-element that is also a code, unless there exist one or more text nodes with non-whitespace content between them."
Given that I am using Nokogiri (which uses libxml2) I can only use XPath 1.0 expressions.
Although a tricky XPath expression is desired, Ruby code/iterations to perform the same on a Nokogiri document are also acceptable.
Note that the CSS adjacent sibling selector ignores non-element nodes, and so selecting nokodoc.css('code + code') will incorrectly select the last <code> block.
Nokogiri.XML('<r><a/><b/> and <c/></r>').css('* + *').map(&:name)
#=> ["b", "c"]
Edit: More test cases, for clarity:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
All the Y above should be selected. None of the N should be selected. The content of the <code> are used only to indicate which should be selected: you may not use the content to determine whether or not to select an element.
The context elements in which the <code> appear are irrelevant. They may appear in <li>, they may appear in <p>, they may appear in something else.
I want to select all the consecutive runs of <code> at once. It is not a mistake that there is a space character in the middle of one of the sets of Y.
Use:
//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
the contained XPath expression is evaluated and the selected nodes are copied to the output:
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
//code[
(
following-sibling::node()[1][self::code]
or (
following-sibling::node()[1][self::text() and normalize-space() = ""]
and
following-sibling::node()[2][self::code]
)
)
or (
preceding-sibling::node()[1][self::code]
or (
preceding-sibling::node()[1][self::text() and normalize-space() = ""]
and
preceding-sibling::node()[2][self::code]
)
)
]
I think this does what you want, though I won’t claim you’d actually want to use it.
I’m assuming text nodes are always merged together so that there won’t be two adjacent to each other, which I believe is generally the case, but might not be if you’re doing DOM manipulations beforehand. I’ve also assumed that there won’t be any other elements between code elements, or that if there are they prevent selection like non-whitespace text.
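Since the question also accepts an iterative solution, the same rule can be sketched procedurally. The sketch below is in Python rather than Ruby/Nokogiri, purely as an illustration, and adjacent_codes is a hypothetical helper name. In xml.etree.ElementTree the text between an element and its next sibling is stored in the element's tail attribute, so "no non-whitespace text in between" becomes a check on that tail.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>"""

XML_WHITESPACE = " \t\r\n"  # the characters that normalize-space() treats as whitespace

def gap_is_blank(element):
    """True if the text between element and its next sibling is empty or whitespace only."""
    return not (element.tail or "").strip(XML_WHITESPACE)

def adjacent_codes(root):
    """Return <code> elements that directly neighbour another <code> under the same parent."""
    selected = []
    for parent in root.iter():
        children = list(parent)
        for i, el in enumerate(children):
            if el.tag != "code":
                continue
            prev = children[i - 1] if i > 0 else None
            nxt = children[i + 1] if i + 1 < len(children) else None
            joined_prev = prev is not None and prev.tag == "code" and gap_is_blank(prev)
            joined_next = nxt is not None and nxt.tag == "code" and gap_is_blank(el)
            if joined_prev or joined_next:
                selected.append(el)
    return selected

runs = adjacent_codes(ET.fromstring(SAMPLE))
print([c.text for c in runs])
```

As in the XPath versions, any other element between two <code> elements (such as the <br/>) interrupts the run, because only the immediately neighbouring child is inspected.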
I think this is what you want:
/p/code[not(preceding-sibling::text()[not(normalize-space(.)="")])]

how do I formulate this xpath expression?

given the following div element
<div class="info">
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
</div>
I want to retrieve contents of the span with class "b". However, some divs I want to parse lack the second two spans (of class "b" and "c"). For these divs, I want the contents of the span with class "a". Is it possible to create a single XPath expression that selects this?
If it is not possible, is it possible to create a selector that retrieves the entire contents of the div? ie retrieves
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
If I can do that, I can use a regex to find the data I want. (I can select the text within the div, but I'm not sure how to select the tags also. Just the text yields 123456789.)
More efficient -- requires no union:
//div/span
[@class='b'
or
@class='a'
and
not(parent::*[span[@class='b']])
]
An expression (like the one below) that is the union of two absolute "//" expressions typically performs two complete document tree traversals, and then the union operation does deduplication and sorting in document order. All of this can be significantly less efficient than a single tree traversal, unless the XPath processor has an intelligent optimizer.
An example of such an inefficient expression:
//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
"//div/span
[@class='b'
or
@class='a'
and
not(parent::*[span[@class='b']])
]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<div class="info">
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
</div>
The XPath expression is evaluated and the selected elements (in this case just one) are copied to the output:
<span class="b">456</span>
When the same transformation is applied on a different XML document, where there is no class='b':
<div class="info">
title
<span class="a">123</span>
<span class="x">456</span>
<span class="c">789</span>
</div>
the same XPath expression is evaluated and the correctly selected element is copied to the output:
<span class="a">123</span>
The xpath expression should be something like:
//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']
The expression left of the union operator | will select you all the b-class spans inside all divs, the expression on the right hand side will first query all divs that do not have a b-class span and then select their a-class span. The | operator combines the results of the two sets.
See here for selecting nodes with not() and here for combining results with the | operator.
Also, to refer to the second part of your question have a look here.
Using node() in your xpath you can select everything (nodes + text) that is below the node selected. So you can get everything in the div returned by
//div/node()
for future processing by other means.
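If the fallback ends up being handled in host code instead, the rule "class b if present, otherwise class a" is short to write. The following is a minimal Python sketch using xml.etree.ElementTree; preferred_span is a hypothetical helper name, and the second sample div (with only the class "a" span) is an assumed input based on the question's description.

```python
import xml.etree.ElementTree as ET

def preferred_span(div):
    """Return the class 'b' span if the div has one, otherwise the class 'a' span."""
    span_b = div.find("span[@class='b']")
    if span_b is not None:  # explicit None check: an Element without children is falsy
        return span_b
    return div.find("span[@class='a']")

with_b = ET.fromstring(
    '<div class="info">title<span class="a">123</span>'
    '<span class="b">456</span><span class="c">789</span></div>'
)
without_b = ET.fromstring('<div class="info">title<span class="a">123</span></div>')

print(preferred_span(with_b).text)
print(preferred_span(without_b).text)
```

This performs the same per-div decision as the predicate not(parent::*[span[@class='b']]), only outside the XPath expression.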
An expression that works on your input without the union operator:
//div/span[@class='a' or @class='b'][count(../span[@class='b']) + 1]
This is just for fun. I'd probably use something more like @inVader's answer in production code.

XPath query to identify untagged text

Consider this HTML:
<html>
<head>
</head>
<body>
<table>
<tr>
<td>
<h1>title</h1>
<h3>item 1</h3>
text details for item 1
<h3>item 2</h3>
text details for item 2
<h3>item 3</h3>
text details for item 3
</td>
</tr>
</table>
</body>
</html>
I'm not terribly familiar with XPath, but it seems to me that there is no notation which will match the "text details" sections individually. Can you confirm?
Use:
/html/body/table/tr/td/h3/following-sibling::text()[1]
This means: Get the first following sibling text node of every h3 element that is a child of every tr element that is a child of every table element that is a child of every body element that is a child of the html top element.
Or, if you only know that the wanted text nodes are the immediate following siblings of all h3 elements in the document, then this XPath expression selects them:
//h3/following-sibling::text()[1]
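For comparison, here is how the same text can be reached from Python's xml.etree.ElementTree, which has no separate text-node objects: the text that directly follows an element is stored in that element's tail attribute. This is a sketch under the assumption that each <h3> is immediately followed by its text, as in the question; if another element came first, tail would be empty, whereas following-sibling::text()[1] would still find a later text node. The names CELL and details are made up for this example.

```python
import xml.etree.ElementTree as ET

# The <td> from the question, isolated so that it parses as a standalone XML fragment.
CELL = """<td>
<h1>title</h1>
<h3>item 1</h3>
text details for item 1
<h3>item 2</h3>
text details for item 2
<h3>item 3</h3>
text details for item 3
</td>"""

td = ET.fromstring(CELL)

# Pair each heading with the untagged text that directly follows it.
details = [(h3.text, (h3.tail or "").strip()) for h3 in td.iter("h3")]

for heading, text in details:
    print(heading, "->", text)
```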
In the world of XML/XPath, text is a type of node (a text node), not an element.
So, considering your example:
the TD has 7 child nodes (ignoring whitespace-only text nodes);
TD.getChild(3) should return the "text details for item 1" value.
In XPath:
$x//table/tr/td/text()[1]

Resources