Make XPath stop at a certain depth? - xpath

I have the following HTML:
<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
</span>
I want an XPath that gets only the text 09:00, not Some more text, and without using text()[1], because that causes other problems. My current XPath looks like this:
("//span[1][contains(@class, 'day-time-clock')]/text()")
I want one that ignores this whole part of the HTML:
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>

You can limit which descendant:: nodes are returned with position(). Note that position() counts nodes in document order; it does not measure depth.
So the following expression does work:
span/descendant::node()[2 > position()]
Adjust the number in the predicate to your needs; 2 is only an example. A disadvantage of this approach is that the count of descendants is only accurate for the first child in the descending tree.
Another approach is to limit both the ancestors and the descendants:
span/descendant::node()[3 > count(ancestor::*) and 1 > count(descendant::*)]
Here, too, you have to adjust the numbers in the predicates to get any useful results.
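For illustration, both expressions can be tried against the HTML from the question. This is only a sketch: it assumes the third-party lxml library is installed, parses the snippet as XML, and uses the outer span as the context node, so the leading span/ step is dropped.

```python
from lxml import etree  # assumption: lxml is installed (third-party)

HTML = """<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
</span>"""

# The outer span is the root element and serves as the context node.
span = etree.fromstring(HTML)

# position() counts descendants in document order, so this keeps
# only the very first descendant node (here, a text node).
first = span.xpath("descendant::node()[2 > position()]")

# Keep nodes with fewer than 3 element ancestors and no element descendants.
shallow = span.xpath(
    "descendant::node()[3 > count(ancestor::*) and 1 > count(descendant::*)]"
)

# lxml returns text nodes as str subclasses; drop whitespace-only ones.
shallow_text = [n.strip() for n in shallow if isinstance(n, str) and n.strip()]

print([n.strip() for n in first])
print(shallow_text)
```

With this particular snippet both expressions end up at the 09:00 text node, but the second one also returns several whitespace-only text nodes that have to be filtered afterwards.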

Use normalize-space() in a predicate to select only the non-whitespace text nodes:
//span[contains(@class, 'day-time-clock')]/text()[normalize-space()]
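A quick way to check this against the HTML from the question is sketched below. It assumes the third-party lxml library is installed and parses the snippet as XML; the variable names are only illustrative.

```python
from lxml import etree  # assumption: lxml is installed (third-party)

HTML = """<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
</span>"""

root = etree.fromstring(HTML)

# Without the predicate, the whitespace after </div> is returned as well.
all_text = root.xpath("//span[contains(@class, 'day-time-clock')]/text()")

# With [normalize-space()], whitespace-only text nodes are filtered out.
clean = root.xpath(
    "//span[contains(@class, 'day-time-clock')]/text()[normalize-space()]"
)

print(len(all_text), [t.strip() for t in clean])
```

"Some more text" is never a candidate here, because it is a text child of the inner div and not of the outer span.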

I think (if I understand you correctly) that
"..//div[contains(@class, 'tooltip-box')]/parent::span"
gets you there.

Related

How to get specific xpath tag value

<div class="container">
<span class="price">
<bdi> 140 </bdi>
</span>
<span class="price">
<del>
<bdi>90</bdi>
</del>
<ins>
<bdi> 120 </bdi>
</ins>
</span>
</div>
I want to scrape a site whose HTML is formatted like the snippet above. I don't want the bdi value that is under the del tag; I want the bdi values that are directly under span class="price" or under the ins tag. Is there an XPath for this?
Doesn't the usual //span/ins/bdi/text() work for you?
It reads as "the text of a <bdi> whose parent is an <ins> whose parent is a <span>".
The CSS variant span>ins>bdi::text should also work, I suppose.
Sorry, I hadn't noticed that you need two values. In that case .xpath('//bdi[not(parent::del)]/text()').extract() will work well.
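The difference between the two suggestions can be sketched as follows. Note that this uses lxml rather than the Scrapy selector from the answer, and assumes lxml is installed; the XPath expressions themselves are unchanged.

```python
from lxml import etree  # assumption: lxml is installed (third-party)

HTML = """<div class="container">
<span class="price">
<bdi> 140 </bdi>
</span>
<span class="price">
<del>
<bdi>90</bdi>
</del>
<ins>
<bdi> 120 </bdi>
</ins>
</span>
</div>"""

root = etree.fromstring(HTML)

# The first suggestion only matches the <bdi> inside <ins>.
ins_only = [t.strip() for t in root.xpath("//span/ins/bdi/text()")]

# Every <bdi> whose parent is not <del>: both wanted values.
prices = [t.strip() for t in root.xpath("//bdi[not(parent::del)]/text()")]

print(ins_only, prices)
```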

Getting single element with similar xpaths but with different same level, "neighboring" node

I'm trying to get the XPath of an element whose XPath is similar to that of other elements, but which has a different "neighbor" element. Please see the example below.
<div>
<div id='a'> </div>
<span> Text here </span> #this is what i'm trying to get
</div>
<div>
<div id='b'> </div>
<span> Text here </span>
</div>
I tried using //div//span, but this gives me both spans. So I tried //div//child::div[@id='a']//ancestor::div//child::span, but it doesn't look pleasant and seems repetitive. Is there a better way to write this?
Try
//div[div[@id='a']]/span
It says: get the span child node of every div node that has a child div node (with an @id equal to 'a').
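As a rough check, the expression can be run against the markup from the question. The sketch assumes lxml is installed; the root wrapper element is hypothetical and only there to make the snippet a single well-formed document.

```python
from lxml import etree  # assumption: lxml is installed (third-party)

# The two <div> blocks from the question, wrapped in a hypothetical <root>.
HTML = """<root>
<div>
<div id='a'> </div>
<span> Text here </span>
</div>
<div>
<div id='b'> </div>
<span> Text here </span>
</div>
</root>"""

root = etree.fromstring(HTML)

both = root.xpath("//div//span")
only_a = root.xpath("//div[div[@id='a']]/span")

# Look up which sibling <div> sits next to each selected <span>.
neighbor_ids = [s.getparent().find("div").get("id") for s in only_a]

print(len(both), neighbor_ids)
```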

XPATH help needed for text()

<div class="from">
<span class="label">Reported by: Rhjj,
<span class="ocation">US</span>
</span> <span class="dat"> </span> </div>
Here I just want the output "Reported by: Rhjj". But when I use the XPath
//div[contains(@class,"from")]//span[contains(@class,"label")], "US" also gets selected.
Is there a way to select only "Reported by: Rhjj" other than using text() together with substring-before() on the comma? Even this is not consistent:
//div[contains(@class,"fromTime")]//span[contains(@class,"label")]/text()
The text you want is the first node under the span element whose class attribute is "label" (note I've taken the names from the XML, not your code). This works for the snippet of XML you've provided.
/div[@class="from"]/span[@class="label"]/node()[1]
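A small sketch of what node()[1] returns for the snippet, compared with the string value of the whole span. It assumes lxml is installed and that the div is the document element, as in the snippet.

```python
from lxml import etree  # assumption: lxml is installed (third-party)

HTML = """<div class="from">
<span class="label">Reported by: Rhjj,
<span class="ocation">US</span>
</span> <span class="dat"> </span> </div>"""

root = etree.fromstring(HTML)

# string() of the span concatenates all descendant text, including "US".
whole = root.xpath('string(/div[@class="from"]/span[@class="label"])')

# node()[1] is only the first child node: the text before the nested span.
first = root.xpath('/div[@class="from"]/span[@class="label"]/node()[1]')

print(repr(whole), [n.strip() for n in first])
```

The trailing comma is part of that first text node, so it would still have to be stripped separately.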

xpath:how to find a node that not contains text?

I have a html like:
...
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
...
I want to get the div that does not contain text, but the XPath
//div[@class='grid' and text()='']
doesn't seem to work. And if I don't know the text that the other divs have, how can I find the node?
Let's suppose I have inferred the requirement correctly as:
Find all <div> elements with @class='grid' that have no directly-contained non-whitespace text content, i.e. no non-whitespace text content unless it's within a child element like a <span>.
Then the answer to this is
//div[@class='grid' and not(text()[normalize-space(.)])]
You need not() combined with normalize-space():
//div[@class='grid' and not(normalize-space(text()))]
or
//div[@class='grid' and normalize-space(text())='']
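The expressions above are not fully equivalent: in XPath 1.0, normalize-space(text()) only converts the first text node to a string, whereas text()[normalize-space(.)] tests every direct text node. The sketch below illustrates this; it assumes lxml is installed, and the root wrapper and the third div (id="late-text") are hypothetical additions to the HTML from the question.

```python
from lxml import etree  # assumption: lxml is installed (third-party)

# The two <div> elements from the question plus a hypothetical third one
# (id="late-text") whose text comes after the child element.
HTML = """<root>
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
<div class="grid" id="late-text">
<span class="searchMatch">def</span>
trailing text
</div>
</root>"""

root = etree.fromstring(HTML)

# Tests every direct text node of the div.
strict = root.xpath("//div[@class='grid' and not(text()[normalize-space(.)])]")

# Only the first text node of the div is considered.
loose = root.xpath("//div[@class='grid' and not(normalize-space(text()))]")

print([d.get("id") for d in strict], [d.get("id") for d in loose])
```

For the two divs in the question both variants agree; they differ only when a div starts with whitespace and has real text later on.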

Select all nodes between two elements excluding unnecessary element from the intersection using XPath

There’s a document structured as follows:
<div class="document">
<div class="title">
<AAA/>
</div>
<div class="lead">
<BBB/>
</div>
<div class="photo">
<CCC/>
</div>
<div class="text">
<!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div>
<div class="more_text">
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div>
<div class="other_stuff">
<DDD/>
</div>
</div>
The task is to grab all the elements between <div class="lead"> and <div class="other_stuff"> except the <div class="photo"> element.
The Kayessian method for node-set intersection, $ns1[count(.|$ns2) = count($ns2)], works perfectly. After substituting $ns1 with //*[@class="lead"]/following::* and $ns2 with //*[@class="other_stuff"]/preceding::*,
the working code looks like this:
//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
= count(//*[@class="other_stuff"]/preceding::*)]/text()
It selects everything between <div class="lead"> and <div class="other_stuff">, including the <div class="photo"> element. I tried several ways to insert a not() selector into the formula itself:
//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[@class="lead"]/following::*[not(self::class="photo ")]
(and the same with the /preceding::* part), but they don't work. It looks as if the not() is ignored: the <div class="photo"> element remains in the selection.
Question 1: How to exclude the unnecessary element from this intersection?
It's not an option to start the selection from the <div class="photo"> element and so exclude it automatically, because in other documents it can appear in any position or not appear at all.
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
It initially selects everything up to the end and back to the beginning of the whole document. Would it be better to specify an exact end point for the following:: and preceding:: axes? I tried //*[@class="lead"]/following::[@class="other_stuff"] but it doesn't seem to work.
Question 1: How to exclude the unnecessary element from this intersection?
Adding another predicate, [not(self::div[@class='photo'])] in this case, to your working XPath should do. For this particular case, the entire XPath would look like this (formatted for readability):
//*[@class="lead"]
/following::*[
count(. | //*[@class="other_stuff"]/preceding::*)
=
count(//*[@class="other_stuff"]/preceding::*)
][not(self::div[@class='photo'])]
/text()
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
I'm not sure whether it would be 'better'; what I can tell is that following::[@class="other_stuff"] is an invalid expression. You need to name the element to which the predicate will be applied, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].
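The effect of the extra predicate can be sketched on a reduced document. Everything below is an assumption made for illustration: lxml is assumed to be installed, and the document is a hypothetical, shortened stand-in that reuses only the class names from the question. One thing the sketch shows is that not(self::div[@class='photo']) removes the photo div itself but not the elements nested inside it; if those should go too, a predicate on ancestor-or-self:: is one option.

```python
from lxml import etree  # assumption: lxml is installed (third-party)

# A reduced, hypothetical document using the class names from the question.
HTML = """<div class="document">
<div class="title"><h1>Title</h1></div>
<div class="lead"><p>Lead</p></div>
<div class="photo">Credit<span>Caption</span></div>
<div class="text"><p>First</p></div>
<div class="more_text"><p>Second</p></div>
<div class="other_stuff"><p>Other</p></div>
</div>"""

root = etree.fromstring(HTML)

# Kayessian intersection: elements after "lead" and before "other_stuff".
BETWEEN = (
    '//*[@class="lead"]/following::*['
    'count(. | //*[@class="other_stuff"]/preceding::*)'
    ' = count(//*[@class="other_stuff"]/preceding::*)]'
)

# No exclusion: text inside the photo div is included.
everything = root.xpath(BETWEEN + "/text()")

# Predicate from the answer: drops the photo div, keeps its child elements.
no_photo_div = root.xpath(BETWEEN + "[not(self::div[@class='photo'])]/text()")

# Variant that also drops everything nested inside the photo div.
no_photo_tree = root.xpath(
    BETWEEN + "[not(ancestor-or-self::div[@class='photo'])]/text()"
)

print(sorted(everything))
print(sorted(no_photo_div))
print(sorted(no_photo_tree))
```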
