I have the following problem: I want to create an XPath expression that results in 3 matches, where match1=text1text1text1, match2=text2text2text2, etc.
I'm able to find each p separately, but I just can't 'group' them.
What the expression should say is: "after a[@class="random-a-tag"], give me all the text of all the p tags that exist before each div[@id] element".
<a class="random-a-tag"></a>
<p>text1</p>
<p>text1</p>
<p>text1</p>
<div id="1"></div>
<p>text2</p>
<p>text2</p>
<p>text2</p>
<div id="2"></div>
<p>text3</p>
<p>text3</p>
<p>text3</p>
<div id="3"></div>
You can do this in two steps.
First, find all the divs:
//a[@class="random-a-tag"]/following-sibling::div[@id]
Then, for each div @id (here "3"), do:
//a[@class="random-a-tag"]/following-sibling::p[following-sibling::div[1][@id="3"]]/text()
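The two steps above can also be emulated in a host language. The following is only a sketch, assuming Python's standard-library ElementTree (which does not support the following-sibling axis, so the siblings are walked in document order instead) and assuming the fragment is wrapped in a hypothetical root element t. The function name group_p_text_by_div is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in a hypothetical <t> root
# so that it parses as a well-formed document.
XML = """<t>
<a class="random-a-tag"></a>
<p>text1</p>
<p>text1</p>
<p>text1</p>
<div id="1"></div>
<p>text2</p>
<p>text2</p>
<p>text2</p>
<div id="2"></div>
<p>text3</p>
<p>text3</p>
<p>text3</p>
<div id="3"></div>
</t>"""


def group_p_text_by_div(root):
    """Map each div id to the joined text of the p siblings just before it."""
    groups = {}
    pending = []          # text of p elements seen since the last div
    seen_anchor = False   # becomes True once the a.random-a-tag is passed
    for child in root:
        if child.tag == "a" and child.get("class") == "random-a-tag":
            seen_anchor = True
            pending = []
        elif not seen_anchor:
            continue
        elif child.tag == "p":
            pending.append(child.text or "")
        elif child.tag == "div" and child.get("id") is not None:
            groups[child.get("id")] = "".join(pending)
            pending = []
    return groups


groups = group_p_text_by_div(ET.fromstring(XML))
print(groups)
```

With the sample input this yields one joined string per div id, which is the grouping the question asks for.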
This cannot be done with a single XPath 1.0 expression.
The most you can select is the set of p for a given value of @id:
/*/a/following-sibling::div
        [@id=$pId]
          /preceding-sibling::p
            [count(.
                  |
                   /*/a/following-sibling::div
                        [@id=$pId]
                          /preceding-sibling::div[1]
                            /preceding-sibling::p
                   )
            =
             count(/*/a/following-sibling::div
                        [@id=$pId]
                          /preceding-sibling::div[1]
                            /preceding-sibling::p
                   )
             +1
            ]
If $pId is (substituted by) 2, and the above XPath expression is applied to this XML document (your XML fragment, wrapped in a top element to make it a well-formed XML document):
<t>
<a class="random-a-tag"></a>
<p>text1</p>
<p>text1</p>
<p>text1</p>
<div id="1"></div>
<p>text2</p>
<p>text2</p>
<p>text2</p>
<div id="2"></div>
<p>text3</p>
<p>text3</p>
<p>text3</p>
<div id="3"></div>
</t>
then this selects the following nodes:
<p>text2</p>
<p>text2</p>
<p>text2</p>
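The XPath expression above is based on the well-known Kayessian formula (named after Michael Kay) for node-set intersection:
$ns1[count(.|$ns2) = count($ns2)]
which selects the intersection of the node-sets $ns1 and $ns2. Here the test is count(.|$ns2) = count($ns2) + 1, which instead keeps the nodes of $ns1 that are not in $ns2, i.e. the set difference.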
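To see why the counting test works, here is a small sketch that models node-sets as Python collections of node labels (the labels and function names are hypothetical, not part of any XPath API). A node is in $ns2 exactly when adding it to $ns2 leaves the size unchanged; if the size grows by one, the node is not in $ns2, which is the "+1" variant used in the long expression above.

```python
def intersection(ns1, ns2):
    """Nodes of ns1 that are also in ns2: count(.|ns2) = count(ns2)."""
    ns2 = set(ns2)
    return [n for n in ns1 if len({n} | ns2) == len(ns2)]


def difference(ns1, ns2):
    """Nodes of ns1 that are not in ns2: count(.|ns2) = count(ns2) + 1."""
    ns2 = set(ns2)
    return [n for n in ns1 if len({n} | ns2) == len(ns2) + 1]


# Hypothetical labels standing in for the p elements of the sample document.
before_div2 = ["p1a", "p1b", "p1c", "p2a", "p2b", "p2c"]  # p before div 2
before_div1 = ["p1a", "p1b", "p1c"]                       # p before div 1

only_group2 = difference(before_div2, before_div1)
common = intersection(before_div2, before_div1)
print(only_group2)
print(common)
```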
II. XPath 2.0 solution:
(a/following-sibling::div
      [@id=$pId]
        /preceding-sibling::p
 except
 a/following-sibling::div
      [@id=$pId]
        /preceding-sibling::div[1]
          /preceding-sibling::p
)/string()
When this XPath 2.0 expression is evaluated against the same XML document (above) and $pId is 2, the result is exactly the wanted text:
text2 text2 text2
I have HTML like this:
...
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
...
I want to get the div which does not contain text, but the XPath
//div[@class='grid' and text()='']
doesn't seem to work. If I don't know the text that the other divs have, how can I find the node?
Let's suppose I have inferred the requirement correctly as:
Find all <div> elements with @class='grid' that have no directly-contained non-whitespace text content, i.e. no non-whitespace text content unless it's within a child element like a <span>.
Then the answer to this is
//div[@class='grid' and not(text()[normalize-space(.)])]
You need not() combined with normalize-space():
//div[@class='grid' and not(normalize-space(text()))]
or
//div[@class='grid' and normalize-space(text())='']
Note that in XPath 1.0, normalize-space(text()) only examines the first text node child of the div.
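The same check can be sketched in a host language. This is an illustration only, assuming Python's standard-library ElementTree and a hypothetical root wrapper element; the helper names are made up. In ElementTree the text nodes directly inside an element are its text plus the tail of each child, and str.strip() is used here as an approximation of normalize-space().

```python
import xml.etree.ElementTree as ET

# The two divs from the question inside a hypothetical <root> wrapper.
XML = """<root>
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
</root>"""


def has_direct_text(elem):
    """True if elem has a non-whitespace text node as a direct child."""
    pieces = [elem.text] + [child.tail for child in elem]
    return any(piece and piece.strip() for piece in pieces)


def grids_without_direct_text(root):
    """div.grid elements whose own text nodes are all blank."""
    return [
        div
        for div in root.iter("div")
        if div.get("class") == "grid" and not has_direct_text(div)
    ]


root = ET.fromstring(XML)
matches = grids_without_direct_text(root)
print(len(matches))
```

Unlike normalize-space(text()) in XPath 1.0, this looks at every direct text node, which matches the not(text()[normalize-space(.)]) form.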
There’s a document structured as follows:
<div class="document">
  <div class="title">
    <AAA/>
  </div>
  <div class="lead">
    <BBB/>
  </div>
  <div class="photo">
    <CCC/>
  </div>
  <div class="text">
    <!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
    <DDD>
      <EEE/>
      <DDD/>
      <CCC/>
      <FFF/>
      <FFF>
        <GGG/>
      </FFF>
    </DDD>
  </div>
  <div class="more_text">
    <DDD>
      <EEE/>
      <DDD/>
      <CCC/>
      <FFF/>
      <FFF>
        <GGG/>
      </FFF>
    </DDD>
  </div>
  <div class="other_stuff">
    <DDD/>
  </div>
</div>
The task is to grab all the elements between <div class="lead"> and <div class="other_stuff"> except the <div class="photo"> element.
The Kayessian method for node-set intersection $ns1[count(.|$ns2) = count($ns2)] works perfectly. After substituting $ns1 with //*[@class="lead"]/following::* and $ns2 with //*[@class="other_stuff"]/preceding::*,
the working code looks like this:
//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
    = count(//*[@class="other_stuff"]/preceding::*)]/text()
It selects everything between <div class="lead"> and <div class="other_stuff"> including the <div class="photo"> element. I tried several ways to insert not() selector in the formula itself
//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[@class="lead"]/following::*[not(self::class="photo ")]
(the same things with /preceding::* part) but they don't work. It looks like this not() method is ignored – the <div class="photo"> element remains in the selection.
Question 1: How to exclude the unnecessary element from this intersection?
It’s not an option to select from <div class="photo"> element excluding it automatically because in other documents it can appear in any position or doesn't appear at all.
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
It initially selects everything up to the end and to the beginning of the whole document. Would it be better to specify the exact end point for the following:: and preceding:: axes? I tried //*[@class="lead"]/following::[@class="other_stuff"] but it doesn't seem to work.
Question 1: How to exclude the unnecessary element from this intersection?
Adding another predicate, [not(self::div[@class='photo'])] in this case, to your working XPath should do it. For this particular case, the entire XPath would look like this (formatted for readability):
//*[@class="lead"]
    /following::*[
        count(. | //*[@class="other_stuff"]/preceding::*)
        =
        count(//*[@class="other_stuff"]/preceding::*)
    ][not(self::div[@class='photo'])]
    /text()
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
I'm not sure whether it would be 'better'; what I can tell you is that following::[@class="other_stuff"] is an invalid expression. You need to name the element to which the predicate applies, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].
I have XML data that looks like this
<priceData>
<div class='price'>
<div class='price-old'>20.00</div>
<div class='price-new'>10.00</div>
<div class='price-tax'>8.00</div>
</div>
<div class='price'>
40.00 <div class='price-tax'>25.00</div>
</div>
</priceData>
I want to use XPath to extract the data for "price-new" from the first price div, and the value 40.00 from the second price div. This must be done using a single expression.
I tried expressions like
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(contains(@class, '-old'))]
and
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(descendant::div[contains(@class, '-old') and not(contains(@class, '-tax'))]) and not(contains(@class, '-old'))]
and some others, but I can't get it to work the way it is supposed to.
I always end up fetching extra nodes in the first case, and I only need the single node (price-new, or price if there are no more nodes in it).
You can try using the XPath union operator (|) to combine 2 queries into one. Given the markup in the question as XML input, the following XPath (formatted for readability):
//div[@class='price']/div[@class='price-new']/text()
|
//div[@class='price']/text()[normalize-space()]
returned the expected result in an XPath tester:
Text='10.00'
Text='40.00'
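For comparison, the two branches of that union can be emulated without an XPath 1.0 engine. The following is a sketch under assumptions: it uses Python's standard-library ElementTree, the helper names are hypothetical, and the values are stripped of surrounding whitespace, which the XPath text nodes are not.

```python
import xml.etree.ElementTree as ET

XML = """<priceData>
  <div class='price'>
    <div class='price-old'>20.00</div>
    <div class='price-new'>10.00</div>
    <div class='price-tax'>8.00</div>
  </div>
  <div class='price'>
    40.00 <div class='price-tax'>25.00</div>
  </div>
</priceData>"""


def direct_text(elem):
    """Yield the non-blank text nodes that are direct children of elem."""
    pieces = [elem.text] + [child.tail for child in elem]
    for piece in pieces:
        if piece and piece.strip():
            yield piece.strip()


def extract_prices(root):
    prices = []
    for price in root.findall(".//div[@class='price']"):
        # First branch of the union: text of a price-new child, if any.
        for new in price.findall("div[@class='price-new']"):
            if new.text and new.text.strip():
                prices.append(new.text.strip())
        # Second branch: non-blank text directly inside the price div.
        prices.extend(direct_text(price))
    return prices


prices = extract_prices(ET.fromstring(XML))
print(prices)
```

As with the union, a price div that had both a price-new child and its own direct text would contribute both values.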
I want to find the first occurrence of a tree. Example:
<div id='post>
<p>text1</p>
<p>text2</p>
<img src="a.jpg">
<img src="b.jpg">
<p>text3</p>
<p>text4</p>
<img src="c.jpg">
<p>text5</p>
</div>
I want to find the first occurrence of "p/img/@src".
When I do the XPath search .//div/p/img[1]/@src
it gives 2 hits, a.jpg and c.jpg.
What is the XPath for only the first occurrence (a.jpg)?
I would have guessed .//div/(p/img)[1]/@src, but of course that does not work.
The best option would be:
(//img[@src])[1]/@src
or
(//p//img[@src])[1]/@src
which additionally requires the img to be inside a p element.
As Martin says, img is not a child of p. Moreover, your example is missing the closing single quote of the id attribute on the div and the closing of the img tags.
Here is your XML, corrected:
<div id='post'>
<p>text1</p>
<p>text2</p>
<img src="a.jpg"/>
<img src="b.jpg"/>
<p>text3</p>
<p>text4</p>
<img src="c.jpg"/>
<p>text5</p>
</div>
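Now, to select the first image, you can simply use //img[1]/@src or //img[@src="a.jpg"]. Note that //img[1] matches every img that is the first img child of its parent, which is only a.jpg here; (//img)[1] is the first img in the whole document.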
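As a quick check of the "first in document order" idea, here is a minimal sketch assuming Python's standard-library ElementTree and the corrected XML above. Element.find() returns the first matching element in document order, which plays the role of (//img[@src])[1]; the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

XML = """<div id='post'>
<p>text1</p>
<p>text2</p>
<img src="a.jpg"/>
<img src="b.jpg"/>
<p>text3</p>
<p>text4</p>
<img src="c.jpg"/>
<p>text5</p>
</div>"""

root = ET.fromstring(XML)

# First img carrying a src attribute, in document order.
first_img = root.find(".//img[@src]")
first_src = first_img.get("src") if first_img is not None else None

# For contrast: every img src in the document.
all_srcs = [img.get("src") for img in root.iter("img")]

print(first_src)
print(all_srcs)
```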
My goal is to query an XML structure, using only one XPath evaluation, in order to get a list of strings containing the concatenation of text3 and text5 for each "my_class" div.
An example of the structure is given below:
<div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text2</div>
<div class="my_class_3">
text3
<div class="my_class_4">text4</div>
<div class="my_class_5">text5</div>
</div>
</div>
<div class="my_class_6"></div>
</div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text12</div>
<div class="my_class_3">
text13
<div class="my_class_4">text14</div>
<div class="my_class_5">text15</div>
</div>
</div>
</div>
</div>
This means I want to get this list of results:
- in index 0 => text3 text5
- in index 1 => text13 text15
Currently I can only get the my_class nodes, which include the text12 that I want to exclude, or a list of the individual strings, not concatenated.
How could I proceed?
Thanks in advance for helping.
EDIT: I removed text4 and text14 from my search to make my example exact.
EDIT: Now the question has changed...
XPath 1.0: There is no such thing as a "list of strings" data type. You can use this expression to select all the container elements of the text nodes you want:
/div/div/div[@class='my_class']/div[@class='my_class_3']
Then, for each selected element, use the appropriate DOM method of your host language to get the descendant text nodes you want, with this relative XPath or an equivalent DOM method, and concatenate their string values:
text()[1]|div[@class='my_class_5']
XPath 2.0: There is a sequence data type.
/div/div/div[@class='my_class']
    /div[@class='my_class_3']
        /concat(text()[1],div[@class='my_class_5'])
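The host-language route described for XPath 1.0 could look roughly like the sketch below. It assumes Python's standard-library ElementTree rather than a DOM API; the function name is hypothetical, and the values are trimmed and joined with a single space to match the desired output in the question. Here block.text stands in for text()[1], which holds only because the text comes before the first child element in this structure.

```python
import xml.etree.ElementTree as ET

XML = """<div>
  <div>
    <div class="my_class">
      <div class="my_class_1"></div>
      <div class="my_class_2">text2</div>
      <div class="my_class_3">
        text3
        <div class="my_class_4">text4</div>
        <div class="my_class_5">text5</div>
      </div>
    </div>
    <div class="my_class_6"></div>
  </div>
  <div>
    <div class="my_class">
      <div class="my_class_1"></div>
      <div class="my_class_2">text12</div>
      <div class="my_class_3">
        text13
        <div class="my_class_4">text14</div>
        <div class="my_class_5">text15</div>
      </div>
    </div>
  </div>
</div>"""


def text3_and_text5(root):
    """One string per my_class div: leading text of my_class_3 plus my_class_5."""
    results = []
    path = "./div/div[@class='my_class']/div[@class='my_class_3']"
    for block in root.findall(path):
        first_text = (block.text or "").strip()
        fifth = block.find("div[@class='my_class_5']")
        fifth_text = (fifth.text or "").strip() if fifth is not None else ""
        results.append(f"{first_text} {fifth_text}".strip())
    return results


results = text3_and_text5(ET.fromstring(XML))
print(results)
```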
Could you not just use:
//div[@class='my_class']/div[@class='my_class_3']
And then get the .innerText from that? There might be a bit of spacing cleanup to do, but it should contain all the inside text (including that from classes 4 and 5) but without the tags.
Edit: After clarification
concat(/div/div/div[@class='my_class']/div[@class='my_class_3']/text(), ' ', /div/div/div[@class='my_class']/div[@class='my_class_3']/div[@class='my_class_5']/text())
That might work, although in XPath 1.0 concat() only uses the first node of each node-set, so it returns the string for the first my_class div only.