XPath query: preceding-sibling of a conditionally reduced set of nodes

I have HTML like the following:
<p style="margin:0 0 0.5em 0;"><b>Blablub</b></p>
<table> ... </table>
Now I want to query the content of the <b> right above the table, but only if the table does not have any attributes. I tried the following query:
//table[not(@*)]/preceding-sibling::p/b
If I remove the preceding-sibling::p/b part entirely, it works and gives me exactly the tables I need. With the full query, however, I also get the content of a <b> tag that precedes a table WITH attributes.

Use:
//table[not(@*)]/preceding-sibling::*[1][self::p]/b
This means: Select all b elements that are children of all p elements that are the first preceding sibling of a table that has no attributes.
This is quite different from the problematic expression cited in the question:
//table[not(@*)]/preceding-sibling::p[1]/b
The latter selects the b children of the first p preceding sibling -- there is no guarantee that the first p preceding sibling is also the first preceding element sibling.
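A small sketch with lxml may make the difference concrete. The markup below is a hypothetical sample, not the asker's real page: the first table has an attribute, the second has none, and a div sits between the last p and the third table.

```python
from lxml import etree

# Hypothetical sample markup (an assumption for illustration only).
html = """
<div>
  <p><b>First</b></p>
  <table class="x"><tr><td>1</td></tr></table>
  <p><b>Second</b></p>
  <table><tr><td>2</td></tr></table>
  <p><b>Third</b></p>
  <div>spacer</div>
  <table><tr><td>3</td></tr></table>
</div>
"""
tree = etree.HTML(html)


def b_texts(xpath):
    """Return the text of every b element selected by the expression."""
    return [b.text for b in tree.xpath(xpath)]


# Every p sibling before an attribute-less table, however far away.
all_p = b_texts("//table[not(@*)]/preceding-sibling::p/b")
# The nearest p sibling before each attribute-less table.
nearest_p = b_texts("//table[not(@*)]/preceding-sibling::p[1]/b")
# The immediately preceding element sibling, kept only if it is a p.
adjacent_p = b_texts("//table[not(@*)]/preceding-sibling::*[1][self::p]/b")

print(all_p)       # ['First', 'Second', 'Third']
print(nearest_p)   # ['Second', 'Third']
print(adjacent_p)  # ['Second']
```

The first query also returns "First" because that p is still a preceding sibling of the attribute-less tables further down, which matches the symptom described in the question.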

Related

XPath with specific following sibling case

I have a structure that looks something like this
<p>
<br>
<b>Text to fetch </b>
<br>
"Some random text"
<b>Text not to fetch</b>
I need an XPath that fetches the following sibling of the br element only if there is no text between the br element and its following sibling.
If I do something like this
//br/following-sibling::b/text()[1]
it fetches both Text to fetch and Text not to fetch, while I only need Text to fetch.
Another possible XPath:
//br/following-sibling::node()[normalize-space()][1][self::b]/text()
Brief explanation:
//br/following-sibling::node(): find all nodes that are following siblings of a br element, where the nodes are..
[normalize-space()]: not empty and not whitespace-only, then..
[1]: for each br found, take only the first such node, then..
[self::b]: check whether that node is a b element, and if it is..
/text(): return the text nodes that are children of the b element
Try the XPath below to avoid matching b nodes whose nearest preceding text sibling contains non-whitespace text:
//br/following-sibling::b[not(preceding-sibling::text()[1][normalize-space()])]/text()
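For comparison, here is a minimal lxml sketch that runs the original attempt and both suggested expressions against the snippet from the question. A closing </p> is added here as an assumption, since the snippet in the question is cut off.

```python
from lxml import etree

# Snippet from the question; the closing </p> is an assumption.
html = """<p>
<br>
<b>Text to fetch </b>
<br>
"Some random text"
<b>Text not to fetch</b>
</p>"""
tree = etree.HTML(html)


def texts(xpath):
    """Evaluate the expression and strip surrounding whitespace."""
    return [t.strip() for t in tree.xpath(xpath)]


# Original attempt: every b that follows any br.
naive = texts("//br/following-sibling::b/text()[1]")

# First suggestion: the first non-empty node after each br must be a b.
first_node = texts(
    "//br/following-sibling::node()[normalize-space()][1][self::b]/text()"
)

# Second suggestion: reject b elements whose nearest preceding
# text sibling contains something other than whitespace.
no_text_before = texts(
    "//br/following-sibling::b"
    "[not(preceding-sibling::text()[1][normalize-space()])]/text()"
)

print(naive)           # ['Text to fetch', 'Text not to fetch']
print(first_node)      # ['Text to fetch']
print(no_text_before)  # ['Text to fetch']
```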

How to find direct children which contain nodes with specified text with xpath?

I need to extract all children which have nodes with some text. Html structure might be the following:
<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
</div>
I need to extract child nodes which have "A" text. It should return div and p nodes
I tried the following xpaths:
./*/*[contains(text(), 'A')]
./*/*[./*[contains(text(), 'A')]]
but the first one returns only div with "A" text and the second one returns only p with "A" text
Is it possible to construct xpath which will return both children?
Node containing "A" text might be at any level in the child node
If you need an XPath that returns both child nodes, try
./*/*[contains(., "A")]
I suspect contains() is wrong here, unless you really want to select a node whose value is "HAT" as well as one whose value is "A".
Try
*/*[normalize-space(.)='A']
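The difference between the two suggestions shows up as soon as some other child merely contains the letter "A". The sketch below uses lxml on the structure from the question, wrapped in a hypothetical root element that serves as the context node, plus a hypothetical extra <p>HAT</p> child added for illustration.

```python
from lxml import etree

# Structure from the question; <root> and <p>HAT</p> are assumptions
# added only to provide a context node and a near-miss case.
xml = """<root>
<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
<p>HAT</p>
</div>
</root>"""
root = etree.fromstring(xml)

# Substring test on the string-value of each grandchild.
with_contains = [el.tag for el in root.xpath('./*/*[contains(., "A")]')]

# Exact match after trimming and collapsing whitespace.
with_equals = [el.tag for el in root.xpath("*/*[normalize-space(.)='A']")]

print(with_contains)  # ['div', 'p', 'p']  (the HAT paragraph matches too)
print(with_equals)    # ['div', 'p']
```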

XPath difference between two similar path and other questions

I have to do some exercises, but
I don't really understand the difference between two similar paths.
I have this tree:
<b>
<t></t>
<a>
<n></n>
<p></p>
<p></p>
</a>
<a>
<n></n>
<p></p>
</a>
<a></a>
</b>
And we assume that each leaf element contains one text node.
I have to explain the difference between //a//text() and //a/text().
I see that //a//text() returns all text nodes, which seems right,
but why does //a/text() return only the text node of the last "a" node?
Another question:
Why does //p[1] return the first "p" child node for each "a" node?
-> I get two results:
<b>
<t></t>
<a>
<n></n>
**<p></p>**
<p></p>
</a>
<a>
<n></n>
**<p></p>**
</a>
<a></a>
</b>
Why is the answer not the first "p" node of the whole document?
Thanks for all!
Difference between 1: //a//text() and 2: //a/text()
Let's break it down: //a selects all a elements, no matter where they are in the document. By contrast, /a would select only a root element named a.
If / comes after a step in an XPath expression, the next step selects children of the nodes selected so far.
If // comes after a step in an XPath expression, the next step selects descendants of the nodes selected so far, at any depth.
Applying this to your two XPath expressions:
//a//text(): Select all a elements no matter where they are in the document, and for each of them select every text node anywhere beneath it.
//a/text(): Select all a elements no matter where they are in the document, and for each of them select only the text nodes that are its direct children. In your tree only the last a has a text node as a direct child (ignoring whitespace-only text nodes); in the other two a elements the text nodes belong to the n and p children, which is why only that one text node is returned.
Why //p[1] returns for each "a node", the first "p" child node?
Suppose you were to write //a/p[1]: this would select the first p child element of any a element anywhere in the document. //p[1] omits the explicit parent, but the predicate [1] still belongs to the child step, so it is evaluated once per parent: it selects every p that is the first p child of its parent, whatever that parent is.
In this case there are two a elements that have p children, and the first p child of each is selected. To get only the first p in the whole document, parenthesize the path so that the predicate applies to the complete result: (//p)[1].
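These results can be reproduced with a short lxml sketch. The tree is the one from the question, written on a single line to avoid whitespace-only text nodes; the text values (t0, n1, p1, ...) are placeholders assumed here so that each leaf element contains one text node.

```python
from lxml import etree

# Tree from the question; the text values are assumed placeholders.
xml = (
    "<b><t>t0</t>"
    "<a><n>n1</n><p>p1</p><p>p2</p></a>"
    "<a><n>n2</n><p>p3</p></a>"
    "<a>a3</a></b>"
)
root = etree.fromstring(xml)

# Text nodes at any depth below an a element.
descendant_texts = root.xpath("//a//text()")
# Text nodes that are direct children of an a element.
child_texts = root.xpath("//a/text()")

# [1] is applied per parent: the first p child of each parent.
first_p_per_parent = [p.text for p in root.xpath("//p[1]")]
# [1] is applied to the whole result: the first p in the document.
first_p_overall = [p.text for p in root.xpath("(//p)[1]")]

print(descendant_texts)    # ['n1', 'p1', 'p2', 'n2', 'p3', 'a3']
print(child_texts)         # ['a3']
print(first_p_per_parent)  # ['p1', 'p3']
print(first_p_overall)     # ['p1']
```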
It would be good to search for a good introduction to XPath on your favorite search engine. I've always found this one from w3schools.com to be a good one.

using xpath to select an element after another

I've seen similar questions, but the solutions I've seen won't work on the following. I'm far from an XPath expert. I just need to parse some HTML. How can I select the table that follows Header 2? I thought my solution below should work, but apparently not. Can anyone help me out here?
content = """<div>
<p><b>Header 1</b></p>
<p><b>Header 2</b><br></p>
<table>
<tr>
<td>Something</td>
</tr>
</table>
</div>
"""
from lxml import etree
tree = etree.HTML(content)
tree.xpath("//table/following::p/b[text()='Header 2']")
Some alternatives to @Arup's answer:
tree.xpath("//p[b='Header 2']/following-sibling::table[1]")
select the first table sibling following the p containing the b header containing "Header 2"
tree.xpath("//b[.='Header 2']/following::table[1]")
select the first table in document order after the b containing "Header 2"
See XPath 1.0 specifications for details on the different axes:
the following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes and namespace nodes
the following-sibling axis contains all the following siblings of the context node; if the context node is an attribute node or namespace node, the following-sibling axis is empty
You can use the following XPath 1.0 expression, which relies on the preceding axis:
//table[preceding::p[1]/b[.='Header 2']]
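As a quick check, the following sketch runs the question's attempt and the three suggested expressions against the content string from the question using lxml. Only the variable names are new.

```python
from lxml import etree

content = """<div>
<p><b>Header 1</b></p>
<p><b>Header 2</b><br></p>
<table>
<tr>
<td>Something</td>
</tr>
</table>
</div>
"""
tree = etree.HTML(content)

# The attempt from the question: it starts at the table and looks
# forward for a p, but no p comes after the table.
attempt = tree.xpath("//table/following::p/b[text()='Header 2']")

# The three expressions suggested in the answers.
queries = [
    "//p[b='Header 2']/following-sibling::table[1]",
    "//b[.='Header 2']/following::table[1]",
    "//table[preceding::p[1]/b[.='Header 2']]",
]
results = [tree.xpath(q) for q in queries]

print(attempt)  # []
for query, found in zip(queries, results):
    # Each expression selects the single table after "Header 2".
    print(query, [el.tag for el in found])
```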

CRLF causing problems in finding an element via xpath

I am working on a test case and need to find text within a table. The only thing to key off of is the label in the previous column. The keys are Next Trckng/Dschrg, Next Full, Next Qtrly, Next Mdcr. I would like to create an xpath expression that will find the Text 1, Text 2, Text 3, and Text 4 based on the key. Since all the keys have the word Next in them, I have mocked this up to find all four of them at once.
//td[preceding-sibling::td[contains(descendant::text(),'Next')]]/a
The third one is not found because it does not have an 'a' element, which is fine. The problem comes in the very first td. It has a span in it, unlike the others, and the span is on a second physical line from the td. It appears that the CRLF is preventing FirePath from finding the first td: when I put the span on the same line as the td, it is found. The problem is that I cannot change the actual page; this is a test case.
Is this a FireBug issue or is this actually resulting in two text elements in the DOM? How do I tweak the xpath to find all four nodes?
Here is the HTML:
<table border=1>
<tbody>
<tr>
<td>
<span id="xxx"><a><img></a></span>
Next Trckng/Dschrg:</td>
<td><a>Text 1</a></td>
<td>Next Full:</td>
<td><a>Text 2</a></td>
<td>Next Qtrly:</td>
<td> <!-- Text 3 --></td>
<td>Next Mdcr:</td>
<td><a>Text 4</a></td>
<td>Change Of Therapy:</td>
</tr>
</tbody>
</table>
The problem is with the expression contains(descendant::text(),'Next'). The contains function takes two strings as arguments. Since you pass a node-set as first argument, it is converted to a string. The conversion works by calling the string function on the node-set which according to the spec returns the string-value of the node that is first in document order. In your case, this will be the first text child of a td element. For the first td element, this is a text node containing only whitespace.
The solution is simple: Pass the current td element to the contains function:
contains(., 'Next')
The string-value of this single node will contain the concatenation of the string-values of all text node descendants.
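The effect can be reproduced outside the browser. The sketch below uses lxml on a reduced, well-formed version of the row (an assumption made so that it parses as XML and the leading line break is certain to be kept as a text node); only the first four cells are included.

```python
from lxml import etree

# Reduced, well-formed version of the row from the question
# (an assumption for illustration; the real page is HTML).
xml = """<tr>
<td>
<span id="xxx"><a><img/></a></span>
Next Trckng/Dschrg:</td>
<td><a>Text 1</a></td>
<td>Next Full:</td>
<td><a>Text 2</a></td>
</tr>"""
row = etree.fromstring(xml)

# Original expression: contains() only sees the first descendant
# text node, which is a bare line break in the first td.
broken = [
    a.text
    for a in row.xpath(
        "//td[preceding-sibling::td[contains(descendant::text(),'Next')]]/a"
    )
]

# Fixed expression: contains() sees the whole string-value of the td.
fixed = [
    a.text
    for a in row.xpath("//td[preceding-sibling::td[contains(., 'Next')]]/a")
]

print(broken)  # ['Text 2']
print(fixed)   # ['Text 1', 'Text 2']
```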
