Get count of specific nodes between two specific sibling nodes - html-agility-pack

I'm using HtmlAgilityPack to get a filtered DOM of <h2> and <h3> nodes. Using XPath 1.0 (from my XPath 1.0 crash course this week), I need to get the number of <h3>s (the number varies) that sit between sibling <h2>s, as follows:
<div>
<h2>heading 1</h2>
<h3>sub 1.1</h3>
<h3>sub 1.2</h3>
<h2>heading 2</h2>
<h3>sub 2.1</h3>
<h2>heading 3</h2>
....
</div>
When I iterate (using C#) through the filtered nodes I want the exact number of <h3>'s that are after a <h2> and before the next <h2>. When I use the following I get all the <h3>'s as the result.
int countH3 = n.SelectNodes("./preceding-sibling::h2[2]/following-sibling::h2[3]/preceding-sibling::h3").Count(); //the [position] is set dynamically
For the node structure above I would like the result of the code line to be:
countH3 = 1
but it is:
countH3 = 3
I've found many similar SO questions regarding "sibling nodes between sibling nodes" and have to thank @LarsH for his comment in another question that /preceding::h3 returns ALL <h3>s, which helped explain the issue. I think I may need to use the Kayessian method of node-set intersection, but I get the "invalid token" error when I include the | union operator as follows:
countH3 = n.SelectNodes("./h2[2]/following-sibling::h2[3]
[count(.|./h2[2]/following-sibling::h2[3]/preceding-sibling::h3)=
count(./h2[2]/following-sibling::h2[3]/preceding-sibling::h3)]").Count();
Any suggestions appreciated.
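One way to sidestep node-set intersection is to ask, for each <h3>, how many <h2> siblings precede it; in XPath 1.0 terms that is roughly following-sibling::h3[count(preceding-sibling::h2) = k], evaluated from the k-th <h2>. The sketch below illustrates the same grouping idea in Python with the standard library's xml.etree.ElementTree, whose XPath subset has no sibling axes, so the walk is done in plain code. It is not HtmlAgilityPack code, and the function name count_h3_per_h2 is hypothetical.

```python
import xml.etree.ElementTree as ET

HTML = """<div>
<h2>heading 1</h2>
<h3>sub 1.1</h3>
<h3>sub 1.2</h3>
<h2>heading 2</h2>
<h3>sub 2.1</h3>
<h2>heading 3</h2>
</div>"""


def count_h3_per_h2(parent):
    """Return one count per <h2>: the <h3> siblings before the next <h2>."""
    counts = []
    for child in parent:
        if child.tag == "h2":
            counts.append(0)      # a new section starts here
        elif child.tag == "h3" and counts:
            counts[-1] += 1       # belongs to the most recent <h2>
    return counts


root = ET.fromstring(HTML)
print(count_h3_per_h2(root))  # [2, 1, 0]
```

The second entry is the count for "heading 2", which is the value of 1 asked for above.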

Related

Xpath - How to select a node but not its child nodes

I am trying to select a node but not any of its child nodes.
Example Input:
<Header attr1="Hello">
<child1> hello </child1>
<child2>world</child2>
</Header>
Expected Output: <Header attr1="Hello"> </Header>
Code:
Document xmlDoc = saxBuilder.build(inputStream);
XPath x = XPath.newInstance("/Header");
// selectSingleNode returns Object, so cast it to Element
Element eleMyElement = (Element) x.selectSingleNode(xmlDoc);
XMLOutputter output = new XMLOutputter();
output.outputString(eleMyElement); // --> this is the output
I tried with /Header as XPath, it gives me the header along with child nodes.
You need to distinguish what is selected from what is displayed.
The XPath expression /Header selects one node only, the Header element. You say "it gives me", but what is "it"? Something is displaying the results of the XPath selection, and it is choosing to display the results by rendering the selected element with all its children. You need to look at the code that is displaying the result.
In this case you can simply do
eleMyElement.getContent().clear();
and all child nodes will be deleted.
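If the original tree has to stay intact, an alternative to clearing the content is to build a shallow copy that carries only the tag and attributes. The sketch below shows that idea in Python with xml.etree.ElementTree rather than JDOM; shallow_copy is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML = """<Header attr1="Hello">
<child1> hello </child1>
<child2>world</child2>
</Header>"""


def shallow_copy(element):
    """Copy the tag and attributes only, leaving children and text behind."""
    return ET.Element(element.tag, dict(element.attrib))


header = ET.fromstring(XML)
bare = shallow_copy(header)

# Serializes the copy: just the Header element with its attribute.
print(ET.tostring(bare, encoding="unicode"))
```

The source element still has both children afterwards, so the selection and the display stay separate concerns.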

How to find direct children which contain nodes with specified text with xpath?

I need to extract all children which have nodes with some text. Html structure might be the following:
<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
</div>
I need to extract the child nodes which have "A" text. It should return the div and p nodes.
I tried the following xpaths:
./*/*[contains(text(), 'A')]
./*/*[./*[contains(text(), 'A')]]
but the first one returns only the div with "A" text and the second one returns only the p with "A" text.
Is it possible to construct an XPath which will return both children?
The node containing the "A" text might be at any level in the child node.
If you need XPath that returns both child nodes, try to use
./*/*[contains(., "A")]
I suspect contains() is wrong here, unless you really want to select a node whose value is "HAT" as well as one whose value is "A".
Try
*/*[normalize-space(.)='A']
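The difference between the two answers shows up as soon as a sibling holds a longer word containing the letter. The sketch below mimics contains(., 'A') and normalize-space(.) = 'A' in plain Python over xml.etree.ElementTree, whose XPath subset has neither function. The extra <i>HAT</i> child and the helper names are assumptions added for illustration, and the loop runs over the children of the outer div.

```python
import xml.etree.ElementTree as ET

XML = """<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
<i>HAT</i>
</div>"""


def string_value(element):
    """Concatenate all descendant text, like the XPath string value of a node."""
    return "".join(element.itertext())


def normalize_space(text):
    """Trim and collapse runs of whitespace, like XPath normalize-space()."""
    return " ".join(text.split())


root = ET.fromstring(XML)
by_contains = [c.tag for c in root if "A" in string_value(c)]
by_equality = [c.tag for c in root if normalize_space(string_value(c)) == "A"]

print(by_contains)  # ['div', 'p', 'i']
print(by_equality)  # ['div', 'p']
```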

How can I select nodes that don't contain links but which do contain specific text using xpath

Given the following HTML:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the interim there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
I want all the nodes containing the word "interim" but not if the word "interim" is part of a link element.
The nodes I would expect back are the first P node and the LI node only.
I've tried the following:
'//*/text()[not(a) and contains(.,"interim")]'
... but this still returns the A and also returns part of its parent P node (the part after the A), neither of which is desired. You can see my attempt here: https://glot.io/snippets/ehp7hmmglm
If you use the XPath expression //*[not(self::a) and not(a) and text()[contains(.,"interim")]] then you get all elements that do not contain an a element, are not a elements and contain a text node child containing that word.

Xpath get node that only contains nodes of a certain type

Take this (id attributes only added so I can refer to them below)
<div id="one">
<figure>foo</figure>
<figure>bar</figure>
</div>
<div id="two">
<figure>foo</figure>
<div>bar</div>
</div>
<div id="three">
<div>bar</div>
</div>
How can I select all div elements whose children are all figure elements, i.e. selecting div one only in the given example?
I sort of need //div[count(not figure)>0].
This is one possible way:
//div[not(*[name() != 'figure']) and not(text()[normalize-space()])]
The left side of the "and" makes sure the div has no child element named anything other than 'figure', and the right side makes sure it has no non-empty child text node.
Or, the same approach using count():
//div[count(*[name() != 'figure']|text()[normalize-space()]) = 0]
I did it like this:
//div[figure][count(figure) = count(*)]
This finds divs that contain at least one figure, and then checks that the count of figure children matches the count of all child elements; if it does, the div cannot contain any other element.
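The last expression counts elements only, so a div that also holds stray text would still match, whereas the first two answers exclude it. A rough Python equivalent over xml.etree.ElementTree makes both checks explicit. The wrapping <root> element and the function name only_figures are assumptions, added so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<div id="one">
<figure>foo</figure>
<figure>bar</figure>
</div>
<div id="two">
<figure>foo</figure>
<div>bar</div>
</div>
<div id="three">
<div>bar</div>
</div>
</root>"""


def only_figures(div):
    """True if div has at least one child, every child is a <figure>,
    and no non-whitespace text sits directly inside the div."""
    children = list(div)
    if not children or any(c.tag != "figure" for c in children):
        return False
    # Text directly inside the div: its own text plus each child's tail.
    direct_text = [div.text or ""] + [c.tail or "" for c in children]
    return not any(t.strip() for t in direct_text)


root = ET.fromstring(XML)
matches = [d.get("id") for d in root.iter("div") if only_figures(d)]
print(matches)  # ['one']
```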

using xpath to select an element after another

I've seen similar questions, but the solutions I've seen won't work on the following. I'm far from an XPath expert. I just need to parse some HTML. How can I select the table that follows Header 2? I thought my solution below should work, but apparently not. Can anyone help me out here?
content = """<div>
<p><b>Header 1</b></p>
<p><b>Header 2</b><br></p>
<table>
<tr>
<td>Something</td>
</tr>
</table>
</div>
"""
from lxml import etree
tree = etree.HTML(content)
tree.xpath("//table/following::p/b[text()='Header 2']")
Some alternatives to @Arup's answer:
tree.xpath("//p[b='Header 2']/following-sibling::table[1]")
select the first table sibling following the p containing the b header containing "Header 2"
tree.xpath("//b[.='Header 2']/following::table[1]")
select the first table in document order after the b containing "Header 2"
See XPath 1.0 specifications for details on the different axes:
the following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes and namespace nodes
the following-sibling axis contains all the following siblings of the context node; if the context node is an attribute node or namespace node, the following-sibling axis is empty
You can use the following XPath 1.0 expression, which relies on the preceding axis:
//table[preceding::p[1]/b[.='Header 2']]
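For readers without lxml, the following-sibling::table[1] idea can be approximated with the standard library. ElementTree's XPath subset has no sibling axes, so this sketch walks the children in document order instead. The helper name table_after_header is hypothetical, and <br> is written as <br/> because ElementTree needs well-formed XML.

```python
import xml.etree.ElementTree as ET

# <br> is written as <br/> here because ElementTree needs well-formed XML.
XML = """<div>
<p><b>Header 1</b></p>
<p><b>Header 2</b><br/></p>
<table>
<tr>
<td>Something</td>
</tr>
</table>
</div>"""


def table_after_header(parent, header_text):
    """Return the first <table> sibling after the <p> whose <b> equals header_text."""
    seen_header = False
    for child in parent:
        if child.tag == "p" and child.findtext("b") == header_text:
            seen_header = True
        elif seen_header and child.tag == "table":
            return child
    return None


root = ET.fromstring(XML)
table = table_after_header(root, "Header 2")
print(table.find("tr/td").text)  # Something
```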
