XPath to select one element if one of two exists

I want a single XPath that selects one element, whichever of two alternatives exists, and that works on both of the following pages.
1st page (price with discount)
<div class="price">
<span class="originalRetailPrice">$2,990.00</span>
</div>
</div>
<div class="price">
<span class="salePrice">$1,794.00</span>
</div>
or 2nd page (only one price)
<div class="price">
$298.00
</div>
I have used
//span[@class="originalRetailPrice"] | (//div[@class="priceBlock"])[1]
but I get the price twice.
What I want is to select the first price, whether it is the span with class="originalRetailPrice" or the text node //div[@class="price"]/text()[1].
In short, I want one selection that works on both pages.

Use // to get texts at any level inside <div class="price">:
//div[@class="price"][1]//text()
Result:
Text=''
Text='$2,990.00'
Text=''
And filter the empty texts with: text()[normalize-space() and not(ancestor::a | ancestor::script | ancestor::style)]
//div[@class="price"][1]//text()[normalize-space() and not(ancestor::a | ancestor::script | ancestor::style)]
Result 1st page:
Text='$2,990.00'
Result 2nd page:
Text='$298.00'
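As a minimal sketch of how this could be checked from Python, assuming the third-party lxml package is installed: the two page samples are wrapped in a hypothetical outer <div> so that each parses as a single fragment, and first_price is an illustrative helper name, not part of any library.

```python
from lxml import html  # third-party package (assumed installed)

# Hypothetical samples: the question's markup inside an outer wrapper <div>.
PAGE_WITH_DISCOUNT = """
<div>
  <div class="price">
    <span class="originalRetailPrice">$2,990.00</span>
  </div>
  <div class="price">
    <span class="salePrice">$1,794.00</span>
  </div>
</div>
"""

PAGE_SINGLE_PRICE = """
<div>
  <div class="price">
    $298.00
  </div>
</div>
"""

# The expression from the answer: non-empty text nodes under the first price div.
XPATH = (
    '//div[@class="price"][1]//text()'
    "[normalize-space() and not(ancestor::a | ancestor::script | ancestor::style)]"
)


def first_price(markup):
    """Return the first non-empty price text, or None if nothing matches."""
    tree = html.fromstring(markup)
    texts = tree.xpath(XPATH)
    # Text nodes keep their surrounding whitespace, so strip it here.
    return texts[0].strip() if texts else None


print(first_price(PAGE_WITH_DISCOUNT))
print(first_price(PAGE_SINGLE_PRICE))
```

The same expression is used for both pages; only the surrounding whitespace is cleaned up in Python.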

You can try it this way:
//span[@class="originalRetailPrice"] | //div[@class="price" and not(span[@class="originalRetailPrice"])]/text()[1]
The second part (to the right of |) selects the first text node of a div[@class="price"] only if that div has no child span[@class="originalRetailPrice"].

Related

Make XPath stop at a certain depth?

I have the following HTML
<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
</span>
I want an XPath that gets only the text 09:00, not Some more text, and without using text()[1] because that causes other problems. My current XPath looks like this:
("//span[1][contains(@class, 'day-time-clock')]/text()")
I want one that ignores this whole part of the HTML:
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
You can limit the level of descendant:: nodes with position().
So the following expression does work:
span/descendant::node()[2 > position()]
Adjust the number in the predicate to your needs, 2 is only an example. A disadvantage of this approach is that the counting of the descendants is only accurate for the first child in the descending tree.
Another approach is to limit both the ancestors and the descendants:
span/descendant::node()[3 > count(ancestor::*) and 1 > count(descendant::*)]
Here, too, you have to adjust the numbers in the predicates to get any useful results.
Use normalize-space() to select only the non-whitespace text nodes that are direct children of the span:
//span[contains(@class, 'day-time-clock')]/text()[normalize-space()]
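A small sketch of this expression in Python, assuming the third-party lxml package. The snippet from the question happens to be well-formed, so it is parsed as XML here for simplicity; clock_text is an illustrative helper name.

```python
from lxml import etree  # third-party package (assumed installed)

SNIPPET = """
<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
</span>
"""

# text() only looks at direct child text nodes of the span, so the tooltip
# text (nested deeper) is never considered; normalize-space() then drops
# the whitespace-only child text node that follows the tooltip div.
XPATH = "//span[contains(@class, 'day-time-clock')]/text()[normalize-space()]"


def clock_text(markup):
    """Return the non-empty direct text children of the clock span."""
    root = etree.fromstring(markup.strip())
    return [t.strip() for t in root.xpath(XPATH)]


print(clock_text(SNIPPET))
```

The tooltip text is skipped because it is not a direct child of the outer span, not because of normalize-space().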
I think (if I understand you correctly) that
"..//div[contains(@class, 'tooltip-box')]/parent::span"
gets you there.

How do I exclude elements from an XPath query?

I'm trying to select the ingredients in an ingredients list, but there are also tooltips scattered amongst them (on the BBC Good Food site).
As a stripped-down example:
<li class="ingredients-list__item" itemprop="ingredients">
400g
<a href="/glossary/new-potatoes" class="ingredients-list__glossary-link tooltip-processed">
new potato
<div id="gf-tooltip-0" class="gf-tooltip" role="tooltip">
<div class="gf-tooltip__content">
<div class="gf-tooltip__text">
<p>unwanted tooltip</p>
</div>
</div>
</div>
</a>, halved if large
<span class="ingredients-list__glossary-element" id="ingredients-glossary"></span>
</li>
I'm trying to end up with '400g new potato, halved if large', or equally good, ['400g', 'new potato', ', halved if large'].
Amongst other things I've tried:
s.xpath("//li[@class='ingredients-list__item'][not(div[@class='gf-tooltip'])]//text()").extract()
But this still returns the text in the tooltip div.
One possible way is to exclude text nodes that have a tooltip div among their ancestors (broken into two lines for readability):
//li[@class='ingredients-list__item']
//text()[not(ancestor::div[@class='gf-tooltip'])]
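Here is a hedged sketch of that expression run with the third-party lxml package rather than Scrapy. The list item from the question is well-formed, so it is parsed as XML for simplicity; ingredient_parts is an illustrative helper name.

```python
from lxml import etree  # third-party package (assumed installed)

ITEM = """
<li class="ingredients-list__item" itemprop="ingredients">
400g
<a href="/glossary/new-potatoes" class="ingredients-list__glossary-link tooltip-processed">
new potato
<div id="gf-tooltip-0" class="gf-tooltip" role="tooltip">
<div class="gf-tooltip__content">
<div class="gf-tooltip__text">
<p>unwanted tooltip</p>
</div>
</div>
</div>
</a>, halved if large
<span class="ingredients-list__glossary-element" id="ingredients-glossary"></span>
</li>
"""

# Keep every text node under the list item unless one of its ancestors
# is the tooltip div.
XPATH = (
    "//li[@class='ingredients-list__item']"
    "//text()[not(ancestor::div[@class='gf-tooltip'])]"
)


def ingredient_parts(markup):
    """Return the non-empty text fragments outside the tooltip."""
    root = etree.fromstring(markup.strip())
    stripped = (t.strip() for t in root.xpath(XPATH))
    # Whitespace-only text nodes between tags become empty strings; drop them.
    return [s for s in stripped if s]


print(ingredient_parts(ITEM))
```

The whitespace cleanup is done in Python here; adding normalize-space() to the predicate would be an alternative way to drop the empty nodes.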

Select all nodes between two elements excluding unnecessary element from the intersection using XPath

There’s a document structured as follows:
<div class="document">
<div class="title">
<AAA/>
</div class="title">
<div class="lead">
<BBB/>
</div class="lead">
<div class="photo">
<CCC/>
</div class="photo">
<div class="text">
<!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="text">
<div class="more_text">
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="more_text">
<div class="other_stuff">
<DDD/>
</div class="other_stuff">
</div class="document">
The task is to grab all the elements between <div class="lead"> and <div class="other_stuff"> except the <div class="photo"> element.
The Kayessian method for node-set intersection $ns1[count(.|$ns2) = count($ns2)] works perfectly. After substituting $ns1 with //*[@class="lead"]/following::* and $ns2 with //*[@class="other_stuff"]/preceding::*,
the working code looks like this:
//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
= count(//*[@class="other_stuff"]/preceding::*)]/text()
It selects everything between <div class="lead"> and <div class="other_stuff"> including the <div class="photo"> element. I tried several ways to insert not() selector in the formula itself
//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[@class="lead"]/following::*[not(self::class="photo ")]
(the same things with /preceding::* part) but they don't work. It looks like this not() method is ignored – the <div class="photo"> element remains in the selection.
Question 1: How to exclude the unnecessary element from this intersection?
It’s not an option to select from <div class="photo"> element excluding it automatically because in other documents it can appear in any position or doesn't appear at all.
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
It initially selects everything up to the end and back to the beginning of the whole document. Would it be better to specify an exact end point for the following:: and preceding:: axes? I tried //*[@class="lead"]/following::[@class="other_stuff"] but it doesn’t seem to work.
Question 1: How to exclude the unnecessary element from this intersection?
Adding another predicate, [not(self::div[@class='photo'])] in this case, to your working XPath should do. For this particular case, the entire XPath would look like this (formatted for readability):
//*[@class="lead"]
/following::*[
count(. | //*[@class="other_stuff"]/preceding::*)
=
count(//*[@class="other_stuff"]/preceding::*)
][not(self::div[@class='photo'])]
/text()
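To see the effect of the extra predicate, here is a minimal sketch using the third-party lxml package. The document is a simplified, hypothetical stand-in for the question's structure (flat divs with made-up text content), and the variable names are illustrative.

```python
from lxml import etree  # third-party package (assumed installed)

# Simplified stand-in for the question's document; the text content is made up.
DOC = """
<div class="document">
  <div class="title">Title</div>
  <div class="lead">Lead</div>
  <div class="photo">Photo caption</div>
  <div class="text">Body text</div>
  <div class="more_text">More body text</div>
  <div class="other_stuff">Footer</div>
</div>
"""

# Kayessian intersection: elements after "lead" that also precede "other_stuff".
INTERSECTION = (
    '//*[@class="lead"]/following::*['
    'count(. | //*[@class="other_stuff"]/preceding::*)'
    ' = count(//*[@class="other_stuff"]/preceding::*)]'
)

# Same node-set with the photo div filtered out by an additional predicate.
WITHOUT_PHOTO = INTERSECTION + "[not(self::div[@class='photo'])]"

root = etree.fromstring(DOC.strip())
all_between = [str(t) for t in root.xpath(INTERSECTION + "/text()")]
filtered = [str(t) for t in root.xpath(WITHOUT_PHOTO + "/text()")]

print(all_between)
print(filtered)
```

Note that the added predicate removes only the photo div itself. If that div contained child elements, they would still be in the intersection, and a further predicate such as not(ancestor::div[@class='photo']) would be needed to drop them as well.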
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
I'm not sure whether it would be 'better'. What I can tell you is that following::[@class="other_stuff"] is an invalid expression. You need to name the node test the predicate applies to, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].

Xpath / find all elements which contains attribute

I want to find all elements which have an attribute that contains the word: "aut".
For example:
<div aut20="one" class="model"> Some text </div>
<span aut="two" class="model_1" ng-one="two"> Some text 2 </span>
<a class="three"> some text 2 </a>
Then the XPath query result would be the <div> and <span> elements, because they have the attributes "aut20" and "aut".
//@*[contains(local-name(),'aut')]/..
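A short sketch of this expression with the third-party lxml package. The three sample elements are wrapped in a hypothetical <root> element so that they form one well-formed document.

```python
from lxml import etree  # third-party package (assumed installed)

# The question's elements inside a hypothetical <root> wrapper.
DOC = """
<root>
<div aut20="one" class="model"> Some text </div>
<span aut="two" class="model_1" ng-one="two"> Some text 2 </span>
<a class="three"> some text 2 </a>
</root>
"""

root = etree.fromstring(DOC.strip())

# Select every attribute whose name contains "aut", then step up to the
# element that owns it.
matches = root.xpath("//@*[contains(local-name(), 'aut')]/..")
tags = [el.tag for el in matches]

print(tags)
```

An equivalent formulation that stays on the elements is //*[@*[contains(local-name(), 'aut')]].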

How to get a list of concatenated text nodes

My goal is to query an XML structure with a single XPath evaluation and get a list of strings, each containing the concatenation of text3 and text5 for one "my_class" div.
The structure example is given below:
<div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text2</div>
<div class="my_class_3">
text3
<div class="my_class_4">text4</div>
<div class="my_class_5">text5</div>
</div>
</div>
<div class="my_class_6"></div>
</div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text12</div>
<div class="my_class_3">
text13
<div class="my_class_4">text14</div>
<div class="my_class_5">text15</div>
</div>
</div>
</div>
</div>
This means I want to get this list of results:
- in index 0 => text3 text5
- in index 1 => text13 text15
Currently I can only get the my_class nodes, but including text12, which I want to exclude; or a list of the individual strings, not concatenated.
How could I proceed?
Thanks in advance for helping.
EDIT: I removed text4 and text14 from my search to make my example exact.
EDIT: Now the question has changed...
XPath 1.0: There is no such thing as a "list of strings" data type. You can use this expression to select all the container elements of the text nodes you want:
/div/div/div[@class='my_class']/div[@class='my_class_3']
Then, with the DOM methods of your host language, take each selected element and, rather than its full string value (the concatenation of all descendant text nodes, which would include text4), select only the nodes you want with a relative XPath and concatenate their string values:
text()[1]|div[@class='my_class_5']
XPath 2.0: There is a sequence data type.
/div/div/div[@class='my_class']
/div[@class='my_class_3']
/concat(text()[1],div[@class='my_class_5'])
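The XPath 1.0 route (select the containers, then concatenate per container in the host language) could look like the following sketch, assuming the third-party lxml package, which implements XPath 1.0. The helper name concatenated_texts and the use of normalize-space() for whitespace cleanup are choices made for this illustration.

```python
from lxml import etree  # third-party package (assumed installed)

DOC = """
<div>
  <div>
    <div class="my_class">
      <div class="my_class_1"></div>
      <div class="my_class_2">text2</div>
      <div class="my_class_3">
        text3
        <div class="my_class_4">text4</div>
        <div class="my_class_5">text5</div>
      </div>
    </div>
    <div class="my_class_6"></div>
  </div>
  <div>
    <div class="my_class">
      <div class="my_class_1"></div>
      <div class="my_class_2">text12</div>
      <div class="my_class_3">
        text13
        <div class="my_class_4">text14</div>
        <div class="my_class_5">text15</div>
      </div>
    </div>
  </div>
</div>
"""

CONTAINERS = "/div/div/div[@class='my_class']/div[@class='my_class_3']"

# Relative expression evaluated once per container: first direct text node
# plus the my_class_5 child, both whitespace-normalized.
PER_CONTAINER = (
    "concat(normalize-space(text()[1]), ' ',"
    " normalize-space(div[@class='my_class_5']))"
)


def concatenated_texts(markup):
    """Return one concatenated string per my_class_3 container."""
    root = etree.fromstring(markup.strip())
    return [str(c.xpath(PER_CONTAINER)) for c in root.xpath(CONTAINERS)]


print(concatenated_texts(DOC))
```

This uses two XPath steps plus a Python loop, so it does not satisfy the "single evaluation" wish; in XPath 1.0 a single evaluation cannot return a list of strings.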
Could you not just use:
//div[@class='my_class']/div[@class='my_class_3']
And then get the .innerText from that? There might be a bit of spacing cleanup to do, but it should contain all the inner text (including that from classes 4 and 5) without the tags.
Edit: After clarification
concat(/div/div/div[@class='my_class']/div[@class='my_class_3']/text(), ' ', /div/div/div[@class='my_class']/div[@class='my_class_3']/div[@class='my_class_5']/text())
That might work.
