XPath: how to extract all text in a scope?

<div class="main">
<p>Peter got some troubles.</p>
<p>I gave him my hand.</p>
<p>But Sam didn't.</p>
</div>
How can I extract all the text in div.main with XPath?
I've tried string(//div[@class="main"]/p), but it only extracted the first line:
Peter got some troubles.
But I want to get all the lines, like:
Peter got some troubles.
I gave him my hand.
But Sam didn't.

The string value of the div element should give you what you want. In other words, take off the /p at the end of your XPath expression. The problem with your expression is that string() takes only the first node in the nodeset.
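As a rough illustration of the point about string values, here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. That module supports just a subset of XPath and has no string() function, so joining itertext() is used here as a stand-in for the string value of the div. The wrapper element and variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="main">
<p>Peter got some troubles.</p>
<p>I gave him my hand.</p>
<p>But Sam didn't.</p>
</div>"""

# Wrap the snippet in a root element so the div can be located with a path.
root = ET.fromstring("<body>" + HTML + "</body>")
div = root.find(".//div[@class='main']")

# Concatenating every descendant text node mirrors the XPath string value
# of the div element (all three paragraphs, not just the first one).
string_value = "".join(div.itertext())

# Drop the whitespace-only lines that sit between the paragraphs.
lines = [line.strip() for line in string_value.splitlines() if line.strip()]
print(lines)
```

Selecting the p elements and reading each one's text separately would work as well; the sketch only shows why dropping /p gives all three lines at once.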

Related

What's wrong with this xpath statement?

I'm trying to get the color WHITE out of this line of HTML.
<a href="javascript:void(0)" class="itemAttr current" title="WHITE" data-value="WHITE"><img src="https://gloimg.rglcdn.com/rosegal/pdm-product-pic/Clothing/2019/06/05thumb-img/1559762268621192281.jpg"></a>
I've tried this:
color = driver.find_element_by_xpath("""//p[@id="select-attr-0"]/a[@href="javascript:void(0)"]@title""").click()
I get this error message:
The string '//p[@id="select-attr-0"]/a[@href="javascript:void(0)"]@title' is not a valid XPath expression.
What I want is to get "WHITE".
It looks like you are missing a / before the @title attribute. Try this XPath instead:
//p[@id="select-attr-0"]/a[@href="javascript:void(0)"]/@title
In order to get an attribute value of an element, you need to put '/' before the '@title', so the following should work (provided the parent element p is correctly addressed):
//p[@id="select-attr-0"]/a[@href="javascript:void(0)"]/@title
When working with XPath, it is often useful to use one of the free online testers to get instant feedback on an expression.
Try using the XPath below, which selects the a element by its data-value attribute.
//p[@id='select-attr-0']//child::a[@data-value='WHITE']
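A common alternative is to locate the element and then read the attribute from it in the host language. Selenium locators are expected to resolve to elements rather than attribute nodes, so there the usual route would be calling get_attribute("title") on the located element instead of click(). The sketch below shows the same two-step idea with the standard library's xml.etree.ElementTree; the enclosing p element is an assumption taken from the XPath in the question, since the posted HTML only shows the a element.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the <a> element appears in the question, so the
# enclosing <p id="select-attr-0"> is assumed from the XPath that was tried.
SNIPPET = """<div>
  <p id="select-attr-0">
    <a href="javascript:void(0)" class="itemAttr current" title="WHITE" data-value="WHITE"></a>
  </p>
</div>"""

root = ET.fromstring(SNIPPET)

# Step 1: locate the element itself (no attribute step in the path).
link = root.find(".//p[@id='select-attr-0']/a[@href='javascript:void(0)']")

# Step 2: read the attribute value from the located element.
color = link.get("title")
print(color)
```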

Avoid parentheses in path using XPath 1.0

The following XML structure represents a website with many articles. Every article contains, among many other things, date of its creation and possibly arbitrarily many dates of its modification. I want to get the date of the last access (either creation or last modification) to every article using XPath 1.0.
<website>
<article>
<date><strong>22.11.2017</strong></date>
<edits>
<edit><strong>17.12.2017</strong></edit>
</edits>
</article>
<article>
<date><strong>17.4.2016</strong></date>
<edits></edits>
</article>
<article>
<date><strong>3.5.2011</strong></date>
<edits>
<edit><strong>4.5.2011</strong></edit>
<edit><strong>12.8.2012</strong></edit>
</edits>
</article>
<article>
<date><strong>12.2.2009</strong></date>
<edits></edits>
</article>
<article>
<date><strong>23.11.1987</strong></date>
<edits>
<edit><strong>3.4.2001</strong></edit>
<edit><strong>11.5.2006</strong></edit>
<edit><strong>13.9.2012</strong></edit>
</edits>
</article>
</website>
In other words, the expected output is:
<strong>17.12.2017</strong>
<strong>17.4.2016</strong>
<strong>12.8.2012</strong>
<strong>12.2.2009</strong>
<strong>13.9.2012</strong>
So far I've only created this path:
//article/*[self::date or self::edits/edit][last()]
that looks for date and nonempty edits nodes in every article and selects the latter one. But I don't know how to access the latest strong of every such selection and the naive //strong[last()] appended to the end of the path doesn't work.
I found a solution in XPath 2.0. Either of these paths should work, if I'm not mistaken:
//article/(*[self::date or self::edits/edit][last()]//strong)[last()]
//article/(*//strong)[last()]
Such use of parentheses within a path is invalid in XPath 1.0, though.
This XPath 1.0 expression
/website/article/descendant::strong[parent::date|parent::edit][last()]
selects the nodes:
<strong>17.12.2017</strong>
<strong>17.4.2016</strong>
<strong>12.8.2012</strong>
<strong>12.2.2009</strong>
<strong>13.9.2012</strong>
Tested in http://www.xpathtester.com/xpath/56d8f7bc4b9c8c064fdad16f22469026
Note that positional predicates act on the node list selected by the current step, so last() here is evaluated separately for each article.
Here is a simpler XPath that produces your output:
//article/descendant-or-self::strong[last()]

XPath (1.0) Match consecutive elements until specific child or end

This is for XPath 1.0.
Here is an example of the mark up that I am matching against. The actual number of elements is not known ahead of time and thus varies, but following this sort of of pattern:
<div class="entry">
<p><iframe /></p>
<p>Text 1</p>
<p>Text 2</p>
<p>Test 3</p>
<p><iframe /></p>
<p>
<a>Test 4</a>
<br />
<a>Test 5</a>
</p>
</div>
I am trying to to match every <p> that does not contain an <iframe>, up until the next <p> that does contain an <iframe> or until the end of the enclosing <div> element.
To make things slightly more complicated, for specific reasons I need to use each <iframe> as the base, a la //div[#class='entry']//iframe, so that each nodeset is based from
(//div[#class='entry']//iframe)[1]
(//div[#class='entry']//iframe)[2]
...
and thus, in this case, matching
<p>Text 1</p>
<p>Text 2</p>
<p>Test 3</p>
and
<p>
<a>Test 4</a>
<br />
<a>Test 5</a>
</p>
respectively.
I tried some of the following for testing to no avail:
(//div[#class='entry']//iframe)/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
(or for testing):
(//div[#class='entry']//iframe)[1]/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
(//div[#class='entry']//iframe)[2]/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
and some variations thereof but what happens for the first set is it gets all <iframe>-less <p> elements all the way to the end instead of stopping at the next <p> that contains a <iframe>.
I've been at this for a while and even though I'm usually quite handy with this sort of thing, I can't quite work my way thorigh this one and none of the search results from Google and such have helped.
Thanks. Any help is always appreciated.
Edit: It can be assumed that there is only one occurrence of <div class="entry"> in the document.
What you are asking for can't be done in one single XPath 1.0 expression without help. The problem is that the question you want to ask is
Starting from an element X (the p-containing-an-iframe), find the other p elements for which that element's nearest preceding p-with-an-iframe is the original node X
If we had a variable $x holding a reference to the top-level context node (the p[iframe] we're starting from) then you could say something like the following (in XPath 2.0)
following-sibling::p[not(iframe)][preceding-sibling::p[iframe][1] is $x]
XPath 1.0 doesn't have an is operator to compare node identity but there are other proxies you can use for this, for example
following-sibling::p[not(iframe)][count(preceding-sibling::p[iframe])
= (count($x/preceding-sibling::p[iframe]) + 1)]
i.e. those following p elements that have one more preceding-sibling::p[iframe] than $x has.
The nub of the problem then is how to get at the outer context node from inside the inner predicate - pure XPath 1.0 has no way to do this. In XSLT you have the current() function, but otherwise you have two basic choices:
If your XPath library allows you to provide variable bindings to your expressions, then inject a variable $x containing the context node and use the expression I've given above.
If you can't inject variables then use two separate XPath queries in sequence.
First execute the expression
count(preceding-sibling::p[iframe]) + 1
with the relevant p[iframe] as context node, and take the result as a number. Or alternatively, if you're already iterating over these p[iframe] elements in your host language then just take the iteration number from there directly, you don't need to count it up using XPath. Either way, you can then build a second expression dynamically:
following-sibling::p[not(iframe)][count(preceding-sibling::p[iframe]) = N]
(where N is the result of the first expression/iteration counter) and evaluate that with the same context node, taking the final result as a node set.
I'm not sure I understood completely, but sometimes it helps to comment on an attempted solution rather than trying to explain.
Please try the following XPath expression:
//div[#class='entry']//iframe//p[not(descendant::iframe)]
And let me know if this yields the correct result.
If not,
explain how the result differs from what you need
please show a more complete HTML sample: a reasonable document with multiple div elements, and more than one where div[#class = 'entry'] - and otherwise covering all the complexity you describe.
explain why you added [1] and [2] to your expressions
give more details about the platform you're using XPath with, perhaps post code

Extracting content between two tags with XPath

I've just started working with XPath recently and run into a problem. Here is the code I want to extract from:
<h3>Some Company</h3>
Mainstreet 1234
<br>
98776, Country
<br>
How would I extract the content between the closing h3 and br tag?
Try //h3/following-sibling::text()[following::br]
This could work h3/following-sibling::node()[not(preceding-sibling::br) and not(self::br)] (returns "Mainstreet 1234" for me).
But I'm affraid your real xml and real needs are more complicated than provided sample so it is possible you will need to further adjust it to fit you requirements.
If your code was in the block below:
<par>
<h3>Some Company</h3>
Mainstreet 1234
<br>
98776, Country
</br>
</par>
You will need to tell XPath to give you the text inside every par node that is after an h3 node and before a br node.
In XPath terms this translates to:
//par/text()[preceding::*[name()='h3'] and following::*[name()='br']]
The above would search everywhere in the document for a par node. You can get more specific about the content of the h3 and/or br nodes as well:
//par/text()[preceding::*[name()='h3' and text()='Some Company'] and following::*[name()='br']]
Please let me know if the above does not resolve your problem.

How to match the brother of a certain XML element in ruby?

I played around with nokogiri in ruby and the XML searching feature, e.g.:
a = Nokogiri.XML(open 'a.xml')
x = a.search('//div[#class="foo"]').text
which works quite nice.
But how can I specify to match the next (brother) element on the same level (and only the next)?
For example for this input:
<div>
<div>...</div>
<div>...</div>
<div class="foo"></div>
<div>EXTRACT ME</dev>
...
</div>
The actual input is some non-XHTML html, but so far Nokogiri.XML does not complain.
Btw, what filter syntax f.search actually expects? xpath?
Taking the hint from Brian Agnew and DevNull I guess that f.search actually expects xpath syntax and using the following-sibling predicate the following expression matches what was asked:
a = x.search('//div[#class="foo"]/following-sibling::div[1]')
I think you want XPath's following-sibling predicate.

Resources