XPath to select n-th until last child? - xpath

Consider the following HTML:
<div>
<div>I do NOT want this</div>
<div>but I want this</div> <--------- 2
<div>and this</div>
<!-- many more entries -->
<div>and also this last one</div> <--------- last one
</div>
Using XPath I want to select all div/div[from 2 until last one]. How can I do this?

You can use the position() function. This should work for you:
//div/div[position() > 1]
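As a rough check, the same selection can be reproduced in Python. This is only a sketch: it assumes the standard-library xml.etree.ElementTree module, whose XPath subset does not support position() > 1, so the predicate is emulated with a list slice. The sample markup and variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample markup modeled on the question (the number of entries is made up).
html = """
<div>
  <div>I do NOT want this</div>
  <div>but I want this</div>
  <div>and this</div>
  <div>and also this last one</div>
</div>
"""

root = ET.fromstring(html)

# ElementTree has no position() > 1 predicate, so take all child divs and
# drop the first one, which is what div/div[position() > 1] selects.
selected = root.findall("div")[1:]

texts = [d.text for d in selected]
print(texts)
```

With a full XPath 1.0 engine the expression from the answer can be used directly instead of the slice.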

Related

Select only the second preceding tag if there is one, or then just the first one

I have this HTML:
<h4>block 1</h4>
<p>paragraph 1</p>
<p>paragraph 2</p>
<table></table>
<h4>block 2</h4>
<p>paragraph 1</p>
<table></table>
As you can see, the first block contains two <p></p> tags, while the second block only has one.
I am currently using this XPath: //table/preceding::p[1], which returns:
1. <p>paragraph 2</p>
2. <p>paragraph 1</p>
However, this is what I'd like to have:
1. <p>paragraph 1</p>
2. <p>paragraph 1</p>
So basically the farthest "preceding" p tag of each table, as explained in my question title.
I want to keep using //table/preceding, as this is very important in my case.
I already tried //table/preceding::p[1 or 2], but that selects both.
I also tried //table/preceding::p[2] but that will select both paragraphs from the first block, and none from the second one.
As you can probably notice, I'm pretty new to XPath. How can I achieve the desired result?
Try this one to select the desired paragraphs:
//table/preceding-sibling::h4[1]/following-sibling::p[1]
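The two steps of that expression (nearest preceding h4, then its first following p) can be traced in Python. This is a hedged sketch using only the standard library; ElementTree does not implement the sibling axes, so they are emulated by walking the list of siblings, and the <body> wrapper and the helper name first_p_of_block are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in a hypothetical <body> root
# so that it parses as a single XML document.
html = """
<body>
  <h4>block 1</h4>
  <p>paragraph 1</p>
  <p>paragraph 2</p>
  <table></table>
  <h4>block 2</h4>
  <p>paragraph 1</p>
  <table></table>
</body>
"""

root = ET.fromstring(html)
siblings = list(root)

def first_p_of_block(table_index):
    """Return the first <p> after the nearest <h4> that precedes the table."""
    # preceding-sibling::h4[1]: walk backwards to the closest <h4>.
    h4_index = next(
        (i for i in range(table_index - 1, -1, -1) if siblings[i].tag == "h4"),
        None,
    )
    if h4_index is None:
        return None
    # following-sibling::p[1]: first <p> after that <h4>.
    return next((el for el in siblings[h4_index + 1:] if el.tag == "p"), None)

results = []
for i, el in enumerate(siblings):
    if el.tag == "table":
        p = first_p_of_block(i)
        # Compare by identity so a node is kept once, as in an XPath node-set.
        if p is not None and all(p is not seen for seen in results):
            results.append(p)

print([p.text for p in results])
```

Both results read "paragraph 1", but they are two different elements, one per block.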

Select all nodes between two elements excluding unnecessary element from the intersection using XPath

There’s a document structured as follows:
<div class="document">
<div class="title">
<AAA/>
</div class="title">
<div class="lead">
<BBB/>
</div class="lead">
<div class="photo">
<CCC/>
</div class="photo">
<div class="text">
<!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="text">
<div class="more_text">
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="more_text">
<div class="other_stuff">
<DDD/>
</div class="other_stuff">
</div class="document">
The task is to grab all the elements between <div class="lead"> and <div class="other_stuff"> except the <div class="photo"> element.
The Kayessian method for node-set intersection $ns1[count(.|$ns2) = count($ns2)] works perfectly. After substituting $ns1 with //*[@class="lead"]/following::* and $ns2 with //*[@class="other_stuff"]/preceding::*,
the working code looks like this:
//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
= count(//*[#class="other_stuff"]/preceding::*)]/text()
It selects everything between <div class="lead"> and <div class="other_stuff"> including the <div class="photo"> element. I tried several ways to insert not() selector in the formula itself
//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[@class="lead"]/following::*[not(self::class="photo ")]
(the same things with /preceding::* part) but they don't work. It looks like this not() method is ignored – the <div class="photo"> element remains in the selection.
Question 1: How to exclude the unnecessary element from this intersection?
It’s not an option to select from <div class="photo"> element excluding it automatically because in other documents it can appear in any position or doesn't appear at all.
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
It initially selects everything up to the end and to the beginning of the whole document. Could it be better to specify the exact end point for the following:: and preceding:: ways? I tried //*[@class="lead"]/following::[@class="other_stuff"] but it doesn’t seem to work.
Question 1: How to exclude the unnecessary element from this intersection?
Adding another predicate, [not(self::div[@class='photo'])] in this case, to your working XPath should do it. For this particular case, the entire XPath would look like this (formatted for readability):
//*[@class="lead"]
/following::*[
count(. | //*[@class="other_stuff"]/preceding::*)
=
count(//*[@class="other_stuff"]/preceding::*)
][not(self::div[@class='photo'])]
/text()
/text()
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
I'm not sure if it would be 'better'; what I can tell you is that following::[@class="other_stuff"] is an invalid expression. You need to name the node test to which the predicate will be applied, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].
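To see what the extra predicate from Question 1 changes, the selection can be emulated in Python. This is a sketch under stated assumptions: it uses the standard-library xml.etree.ElementTree (which lacks the following:: and preceding:: axes, so document order is walked by hand), a simplified well-formed version of the document, and it stops at the element level rather than applying the final /text() step. It also assumes other_stuff is not nested inside any element that lies between the two markers.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the document (contents are illustrative).
doc = """
<div class="document">
  <div class="title"><AAA/></div>
  <div class="lead"><BBB/></div>
  <div class="photo"><CCC/></div>
  <div class="text"><DDD><EEE/></DDD></div>
  <div class="other_stuff"><DDD/></div>
</div>
"""

root = ET.fromstring(doc)
ordered = list(root.iter())  # every element, in document order

lead = root.find(".//*[@class='lead']")
stop = root.find(".//*[@class='other_stuff']")

# following::* excludes descendants of the context node,
# so everything inside the lead element itself must be skipped.
inside_lead = {id(e) for e in lead.iter()}

start_i = ordered.index(lead)
stop_i = ordered.index(stop)

# Intersection of "after lead" and "before other_stuff".
between = [e for e in ordered[start_i + 1:stop_i] if id(e) not in inside_lead]

# [not(self::div[@class='photo'])] drops only the photo div itself;
# its descendants (here <CCC/>) are still selected.
kept = [e for e in between if not (e.tag == "div" and e.get("class") == "photo")]

print([(e.tag, e.get("class")) for e in kept])
```

Note that the children of the photo div survive the filter. Excluding them as well would need a different predicate, for example one based on the ancestor axis.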

Xpath: select div that contains class AND whose specific child element contains text

With the help of this SO question I have an almost working xpath:
//div[contains(@class, 'measure-tab') and contains(., 'someText')]
However, this gets two divs: in one it's the child td that has someText, in the other it's the child span.
How do I narrow it down to the one with the span?
<div class="measure-tab">
<!-- table html omitted -->
<td> someText</td>
</div>
<div class="measure-tab"> <-- I want to select this div (and use contains @class)
<div>
<span> someText</span> <-- that contains a deeply nested span with this text
</div>
</div>
To find a div of a certain class that contains a span at any depth containing certain text, try:
//div[contains(@class, 'measure-tab') and contains(.//span, 'someText')]
That said, this solution looks extremely fragile. If the table happens to contain a span with the text you're looking for, the div containing the table will be matched, too. I'd suggest finding a more robust way of filtering the elements, for example by using IDs or top-level document structure.
You can use ancestor. I find that this is easier to read because the element you are actually selecting is at the end of the path.
//span[contains(text(),'someText')]/ancestor::div[contains(@class, 'measure-tab')]
You could use the xpath :
//div[@class="measure-tab" and .//span[contains(., "someText")]]
Input :
<root>
<div class="measure-tab">
<td> someText</td>
</div>
<div class="measure-tab">
<div>
<div2>
<span>someText2</span>
</div2>
</div>
</div>
</root>
Output :
Element='<div class="measure-tab">
<div>
<div2>
<span>someText2</span>
</div2>
</div>
</div>'
You can change your second condition to check only the span element:
...and contains(div/span, 'someText')]
If the span isn't always inside another div you can also use
...and contains(.//span, 'someText')]
This searches for the span anywhere inside the div.
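The "class contains X and some descendant span contains the text" condition can also be checked outside XPath. The following is a minimal sketch using only Python's standard library; since ElementTree's XPath subset has no contains() function, the test is written as a plain Python filter, and the helper name has_span_with_text is made up for this example.

```python
import xml.etree.ElementTree as ET

# The two candidate divs from the question, under a hypothetical <root>.
doc = """
<root>
  <div class="measure-tab">
    <td> someText</td>
  </div>
  <div class="measure-tab">
    <div>
      <span> someText</span>
    </div>
  </div>
</root>
"""

root = ET.fromstring(doc)

def has_span_with_text(div, needle):
    # Similar in spirit to .//span[contains(., needle)]: any descendant
    # span whose full string value contains the needle.
    return any(needle in "".join(span.itertext()) for span in div.iter("span"))

matches = [
    div for div in root.iter("div")
    if "measure-tab" in div.get("class", "") and has_span_with_text(div, "someText")
]

print(len(matches))
```

Only the second div is kept, because the first one carries the text in a td rather than a span.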

Xpath fetch specific nodes without their child nodes from XML

I have XML data that looks like this
<priceData>
<div class='price'>
<div class='price-old'>20.00</div>
<div class='price-new'>10.00</div>
<div class='price-tax'>8.00</div>
</div>
<div class='price'>
40.00 <div class='price-tax'>25.00</div>
</div>
</priceData>
I want to use XPath to extract the "price-new" value from the first price div, and the value 40.00 from the second price div. This must be done using a single expression.
I tried expressions like
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(contains(@class, '-old'))]
and
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(descendant::div[contains(@class, '-old') and not(contains(@class, '-tax'))]) and not(contains(@class, '-old'))]
and some others but I can't get it to work how it is supposed to.
I always end up with fetching extra nodes from the first case and I only need the single node (price-new or price if there are no more nodes in it).
You can try using xpath union (|) to combine 2 queries into one. Given markup in the question as XML input, the following xpath (formatted for readability) :
//div[@class='price']/div[@class='price-new']/text()
|
//div[@class='price']/text()[normalize-space()]
returned 'expected' result in xpath tester :
Text='10.00'
Text='40.00'
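The two branches of that union can be mirrored in Python to make clear which text nodes each one picks up. This is a sketch, not the tester used above: it relies on the standard-library xml.etree.ElementTree, which has no union operator or text() step, so the direct text of each price div is read from .text and from the .tail of its children, and values are stripped of surrounding whitespace.

```python
import xml.etree.ElementTree as ET

xml_data = """
<priceData>
  <div class='price'>
    <div class='price-old'>20.00</div>
    <div class='price-new'>10.00</div>
    <div class='price-tax'>8.00</div>
  </div>
  <div class='price'>
    40.00 <div class='price-tax'>25.00</div>
  </div>
</priceData>
"""

root = ET.fromstring(xml_data)
prices = []

for price in root.findall(".//div[@class='price']"):
    # Direct text before the first child lives in .text
    # (branch: //div[@class='price']/text()[normalize-space()]).
    if price.text and price.text.strip():
        prices.append(price.text.strip())
    for child in price:
        # Branch: //div[@class='price']/div[@class='price-new']/text()
        if child.get("class") == "price-new" and child.text:
            prices.append(child.text.strip())
        # Direct text that follows a child is stored in that child's .tail.
        if child.tail and child.tail.strip():
            prices.append(child.tail.strip())

print(prices)
```

The whitespace-only text nodes of the first price div are skipped, which is the job normalize-space() does in the XPath version.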

Extracting contents from a list split across different divs

Consider the following html
<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul>
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
</ul>
</div>
<div class="column">
<ul> <!-- Pay attention here -->
<li>item1e</li>
<li>item1f</li>
</ul>
<h1> Section-Header-2 </h1>
<ul>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
</ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul>
<li>item3a</li>
<li>item3b</li>
<li>item3c</li>
<li>item3d</li>
</ul>
</div>
</div>
My objective is to extract the items for each Section headers. However, inconveniently the designer of the webpage decided to break up the data into three columns, adding an additional div (with classes column right etc).
My current method of extraction uses XPath.
For section headers, I use this XPath (get all h1 elements within a div with the given id):
//div[@id="relevantID"]//h1
The above returns a list of h1 elements. Looping over each matched h1 element, I apply an additional selector that looks up the next ul node and retrieves all its li nodes:
following-sibling::ul//li
But thanks to the designer's aesthetics, I am failing in the one particular case I've marked in the HTML, where the items are split across two different column divs.
I can probably bypass this problem by stripping out the column divs entirely, but I don't think modifying the html to make a selector match is considered good (I haven't seen it needed anywhere in the examples I've browsed so far).
What would be a good way to extract data that has been formatted like this? Full solutions are not necessary, hints/tips will do. Thanks!
The columns do frustrate use of following-sibling:: and preceding-sibling::, but you could instead use the following:: and preceding:: axes if the columns at least keep the list items in proper document order. (That is indeed the case in your example.)
The following XPath will select all li items, regardless of column, occurring after the "Section-Header-1" h1 and before the "Section-Header-2" h1 header in document order:
//div[@id='relevantID']//li[normalize-space(preceding::h1) = 'Section-Header-1'
and normalize-space(following::h1) = 'Section-Header-2']
Specifically, it selects the following items from your example HTML:
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
<li>item1e</li>
<li>item1f</li>
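The document-order idea behind this answer can be sketched in Python as a single pass that assigns every li to the most recent h1 seen, regardless of which column div it sits in. This uses only the standard library and a shortened version of the question's markup; the dictionary name sections is an assumption for illustration, and the approach relies on the columns keeping items in document order, as noted above.

```python
import xml.etree.ElementTree as ET

# Shortened version of the question's markup (fewer items per section).
html = """
<div id="relevantID">
  <div class="column left">
    <h1> Section-Header-1 </h1>
    <ul><li>item1a</li><li>item1b</li></ul>
  </div>
  <div class="column">
    <ul><li>item1e</li></ul>
    <h1> Section-Header-2 </h1>
    <ul><li>item2a</li></ul>
  </div>
  <div class="column right">
    <h1> Section-Header-3 </h1>
    <ul><li>item3a</li></ul>
  </div>
</div>
"""

root = ET.fromstring(html)

sections = {}
current = None
# iter() walks elements in document order, ignoring column boundaries.
for el in root.iter():
    if el.tag == "h1":
        current = el.text.strip()
        sections[current] = []
    elif el.tag == "li" and current is not None:
        sections[current].append(el.text)

print(sections)
```

Here item1e lands under Section-Header-1 even though it lives in the second column, which is the case the sibling-based selector missed.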
You can combine following-sibling and preceding-sibling to also get any li elements that appear in the same div before the h1, joining both with the union operator |. As an example, for the second h1:
((//div[@id="relevantID"]//h1)[2]/preceding-sibling::ul//li) |
((//div[@id="relevantID"]//h1)[2]/following-sibling::ul//li)
Result:
<li>item1e</li>
<li>item1f</li>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
As you're already selecting all h1 using //div[@id="relevantID"]//h1 and retrieving all li items for each h1 using following-sibling::ul//li as a second step, you could change that second step to following-sibling::ul//li | preceding-sibling::ul//li.
