XPath selector by class AND index - xpath

I have the following HTML:
<div>
<p>foo</p>
<p class='foo'>foo</p>
<p class='foo'>foo</p>
<p>bar</p>
</div>
How can I select the second p tag with class 'foo' using XPath?

The following expression should do it:
//p[@class="foo"][2]
Edit: The use of [2] here selects elements according to their position among the matching siblings under each parent, rather than among all matched nodes in the document. If each matching element is the first such child of its own parent (as with your two tables), [1] will match all of them, while [2] will match none. If you want the second such element in the entire document, you need to put the expression in parentheses so that [2] applies to the whole node-set:
(//p[@class="foo"])[2]
(//table[@class="info"])[2]
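As a rough illustration of the whole-document variant, here is a minimal Python sketch using only the standard library. It assumes the markup parses as XML. Note that xml.etree.ElementTree implements only a subset of XPath and has no parenthesized expressions, so the sketch collects all matches and indexes the list in Python instead, which gives the same element as (//p[@class="foo"])[2] for this input.

```python
import xml.etree.ElementTree as ET

# The markup from the question, parsed as XML.
DOC = """<div>
<p>foo</p>
<p class='foo'>foo</p>
<p class='foo'>foo</p>
<p>bar</p>
</div>"""

root = ET.fromstring(DOC)

# Collect every p with class "foo" in document order, then index the list.
# Python lists are 0-based, so [1] is the second match.
matches = root.findall(".//p[@class='foo']")
second = matches[1]

print(len(matches))       # 2
print(second is root[2])  # True: it is the third child of the div
```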

Related

How to find direct children which contain nodes with specified text with xpath?

I need to extract all children which have nodes with some text. Html structure might be the following:
<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
</div>
I need to extract child nodes which have "A" text. It should return div and p nodes
I tried the following xpaths:
./*/*[contains(text(), 'A')]
./*/*[./*[contains(text(), 'A')]]
but the first one returns only the div with "A" text and the second one returns only the p with "A" text.
Is it possible to construct an XPath expression that returns both children?
The node containing the "A" text might be at any level inside the child node.
If you need an XPath expression that returns both child nodes, try:
./*/*[contains(., "A")]
I suspect contains() is wrong here, unless you really want to select a node whose value is "HAT" as well as one whose value is "A".
Try
*/*[normalize-space(.)='A']
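The difference between the two suggestions can be sketched in plain Python. This is only an approximation under stated assumptions: the standard library's ElementTree does not support contains() or normalize-space(), so the helpers string_value and normalize_space below are hypothetical stand-ins that mimic them, and the outer div from the question is used as the context node.

```python
import xml.etree.ElementTree as ET

DOC = """<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
</div>"""

outer = ET.fromstring(DOC)

def string_value(elem):
    """Concatenate all descendant text, like the XPath string value '.'."""
    return "".join(elem.itertext())

def normalize_space(text):
    """Trim and collapse whitespace, like XPath normalize-space()."""
    return " ".join(text.split())

# Mimics *[contains(., "A")]: any child whose text contains "A" somewhere.
contains_a = [c.tag for c in outer if "A" in string_value(c)]

# Mimics *[normalize-space(.)='A']: the whole text must be exactly "A".
equals_a = [c.tag for c in outer if normalize_space(string_value(c)) == "A"]

print(contains_a)  # ['div', 'p']
print(equals_a)    # ['div', 'p']
```

For this input both give the same result; they would differ if a child contained text such as "HAT".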

How can I select nodes that don't contain links but which do contain specific text using xpath

Given the following HTML:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the interim there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
I want all the nodes containing the word "interim" but not if the word "interim" is part of a link element.
The nodes I would expect back are the first P node and the LI node only.
I've tried the following:
'//*/text()[not(a) and contains(.,"interim")]'
... but this still returns the A and also returns part of its parent P node (the part after the A), neither of which is desired. You can see my attempt here: https://glot.io/snippets/ehp7hmmglm
If you use the XPath expression //*[not(self::a) and not(a) and text()[contains(.,"interim")]] then you get all elements that do not contain an a element, are not a elements and contain a text node child containing that word.
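The same three conditions can be sketched in Python with the standard library. Two assumptions apply: the sample below is hypothetical (the link placed in the second paragraph is my guess at the kind of markup the question describes, not the asker's actual input), and since ElementTree has no text() node test, the helper child_text_nodes approximates an element's text node children from .text and the .tail of each child.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <a> element in the second paragraph is an assumption.
DOC = """<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a href="#">interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>"""

def child_text_nodes(elem):
    """Yield the text that belongs directly to elem (its text node children)."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tail:
            yield child.tail

def wanted(elem, word):
    """not(self::a) and not(a) and text()[contains(., word)]"""
    return (
        elem.tag != "a"
        and elem.find("a") is None
        and any(word in text for text in child_text_nodes(elem))
    )

root = ET.fromstring(DOC)
hits = [elem.tag for elem in root.iter() if wanted(elem, "interim")]
print(hits)  # ['p', 'li']
```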

XPath difference between two similar path and other questions

I have to do some exercises, but
I don't really understand the difference between two similar paths.
I have this tree:
<b>
<t></t>
<a>
<n></n>
<p></p>
<p></p>
</a>
<a>
<n></n>
<p></p>
</a>
<a></a>
</b>
We assume that each final (leaf) tag contains one text node.
I have to explain the difference between //a//text() and //a/text()
I see that //a//text() returns all the text nodes, which seems right,
but why does //a/text() return only the text node of the last "a" node?
Another question:
why does //p[1] return the first "p" child node for each "a" node?
-> I get two results
<b>
<t></t>
<a>
<n></n>
**<p></p>**
<p></p>
</a>
<a>
<n></n>
**<p></p>**
</a>
<a></a>
</b>
Why is the answer not the first "p" node of the whole document?
Thanks!
Difference between 1: //a//text() and 2: //a/text()
Let's break it down: //a selects all a elements, no matter where they are in the document. By contrast, /a would select only a root element named a.
If / comes after another step in an XPath expression, it selects nodes that are children of the nodes selected by the previous step.
If // comes after another step in an XPath expression, it selects nodes that are descendants of the nodes selected by the previous step, no matter how deep they are.
Applying to your two XPath expressions:
//a//text(): Select all a elements no matter where they are in the document, and for those elements select text() no matter where they are under the a elements selected.
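//a/text(): Select all a elements no matter where they are in the document, and for those elements select only the text nodes that are their direct children. In your tree only the last a element has a text node as a direct child, which is why you get a single result.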
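To make the child versus descendant distinction concrete, here is a small Python sketch with the standard library. It assumes a compact version of the tree from the question in which every leaf tag holds one made-up text value (t, n1, p1 and so on are placeholders). ElementTree has no text() node test, so child_text_nodes is a hypothetical helper that approximates a/text(), and itertext() approximates a//text().

```python
import xml.etree.ElementTree as ET

# Same shape as the tree in the question; the text values are placeholders.
DOC = (
    "<b><t>t</t>"
    "<a><n>n1</n><p>p1</p><p>p2</p></a>"
    "<a><n>n2</n><p>p3</p></a>"
    "<a>a3</a></b>"
)
root = ET.fromstring(DOC)

def child_text_nodes(elem):
    """Text nodes that are direct children of elem (like elem/text())."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tail:
            yield child.tail

# Like //a/text(): only text sitting directly inside an a element.
child_text = [t for a in root.iter("a") for t in child_text_nodes(a)]

# Like //a//text(): text anywhere below an a element.
descendant_text = [t for a in root.iter("a") for t in a.itertext()]

print(child_text)       # ['a3']
print(descendant_text)  # ['n1', 'p1', 'p2', 'n2', 'p3', 'a3']
```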
Why //p[1] returns for each "a node", the first "p" child node?
Suppose you were to write //a/p[1]: this would select the first p child element of every a element anywhere in the document. By writing //p[1] you omit the explicit parent, but the predicate is still evaluated per parent: it selects every p that is the first p child of whatever element contains it. To get the first p in the whole document you would write (//p)[1] instead.
In this case there are two parent a elements, for which the first p child element is selected.
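The per-parent behaviour of the positional predicate can be checked with a short Python sketch. It assumes the same tree shape as the question with placeholder text values. ElementTree's limited XPath supports a simple [1] predicate after a tag name, but not parenthesized expressions, so the whole-document case is done by indexing the result list.

```python
import xml.etree.ElementTree as ET

# Same shape as the tree in the question; the text values are placeholders.
DOC = (
    "<b><t>t</t>"
    "<a><n>n1</n><p>p1</p><p>p2</p></a>"
    "<a><n>n2</n><p>p3</p></a>"
    "<a>a3</a></b>"
)
root = ET.fromstring(DOC)

# Like //p[1]: every p that is the first p child of its own parent.
first_per_parent = [p.text for p in root.findall(".//p[1]")]

# Like (//p)[1]: the first p in the whole document (index 0 in Python).
first_in_document = root.findall(".//p")[0].text

print(first_per_parent)   # ['p1', 'p3']
print(first_in_document)  # p1
```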
It would be good to search for a good introduction to XPath on your favorite search engine. I've always found this one from w3schools.com to be a good one.

xpath translate two expressions

I can't figure out two XPath expressions. Can someone help?
Here they are
substring-after(substring-before(//ul[@id='biblio']/li[3], ']'), '[')
//h2[normalize-space(string())='name']/preceding::h1[1]
Your first expression:
substring-after(substring-before(//ul[@id='biblio']/li[3], ']'), '[')
First, //ul[@id='biblio'] finds all ul elements anywhere in the document that have an id attribute with the value 'biblio'. From each of these it selects the third li child element.
It then applies the substring functions to the string value of the selected li element(s): substring-before keeps everything before the first ']', and substring-after then keeps everything after the first '['.
So for example, if the text of a matched li element was hello [world], you would end up with just world as the result. As a more complete example, given the XML input:
<div>
<ul id="biblio">
<li>thing [one]</li>
<li>thing [two]</li>
<li>thing [three]</li>
</ul>
<ul id="biblio">
<li>other [a]</li>
<li>other [b]</li>
<li>other [c]</li>
</ul>
</div>
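Applied to each matching li in turn, the expression yields three and c. Be aware that when the whole node-set is passed at once, XPath 1.0 uses only the string value of the first node (giving just three), while XPath 2.0 and later raise a type error if more than one item is supplied. Note that the <div> in the example input is just a container and could be any element.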
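The string handling can be sketched in Python to see what each function contributes. This is an approximation, not an XPath engine: substring_before and substring_after are hypothetical helpers written to mimic the XPath functions (including returning an empty string when the separator is missing), and they are applied to each matched li separately.

```python
import xml.etree.ElementTree as ET

DOC = """<div>
<ul id="biblio">
<li>thing [one]</li>
<li>thing [two]</li>
<li>thing [three]</li>
</ul>
<ul id="biblio">
<li>other [a]</li>
<li>other [b]</li>
<li>other [c]</li>
</ul>
</div>"""

def substring_before(text, sep):
    """Part of text before the first sep, or '' if sep is absent."""
    head, found, _ = text.partition(sep)
    return head if found else ""

def substring_after(text, sep):
    """Part of text after the first sep, or '' if sep is absent."""
    _, found, tail = text.partition(sep)
    return tail if found else ""

root = ET.fromstring(DOC)

# Third li child of every ul whose id is "biblio".
items = root.findall(".//ul[@id='biblio']/li[3]")

results = [
    substring_after(substring_before("".join(li.itertext()), "]"), "[")
    for li in items
]
print(results)  # ['three', 'c']
```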
Your second expression:
//h2[normalize-space(string())='name']/preceding::h1[1]
First this finds all h2 elements anywhere in the document whose string value, after whitespace normalization, is equal to name. From each of these it then selects the nearest preceding h1 in document order (preceding is a reverse axis, so [1] means the closest one before the h2).
So for example, given the XML input:
<div>
<h1>title1</h1>
<p>stuff</p>
<h1>title2</h1>
<p>more stuff</p>
<h2>name</h2>
<p>other stuff</p>
</div>
You would get the following XML output as a result of your XPath expression:
<h1>title2</h1>
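Python's standard library has no preceding axis, so the following is only a sketch of the idea: walk the elements in document order and, for each matching h2, take the last h1 seen before it. It assumes an h1 is never an ancestor of the h2 (the real preceding axis excludes ancestors, this sketch does not), and normalize_space is a hypothetical helper mimicking the XPath function.

```python
import xml.etree.ElementTree as ET

DOC = """<div>
<h1>title1</h1>
<p>stuff</p>
<h1>title2</h1>
<p>more stuff</p>
<h2>name</h2>
<p>other stuff</p>
</div>"""

def normalize_space(text):
    """Trim and collapse whitespace, like XPath normalize-space()."""
    return " ".join(text.split())

root = ET.fromstring(DOC)
in_order = list(root.iter())  # all elements in document order

found = []
for i, elem in enumerate(in_order):
    if elem.tag == "h2" and normalize_space("".join(elem.itertext())) == "name":
        # Every h1 that starts before this h2; the last one is the nearest.
        earlier_h1 = [e for e in in_order[:i] if e.tag == "h1"]
        if earlier_h1:
            found.append(earlier_h1[-1].text)

print(found)  # ['title2']
```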
Hope that helps you understand...

Parsing HTML with Nokogiri in Ruby

With this HTML code:
<div class="one">
.....
</div>
<div class="one">
.....
</div>
<div class="one">
.....
</div>
<div class="one">
.....
</div>
How can I select with Nokogiri the second or third div whose class is one?
You can use Ruby to pare down a large results set to specific items:
page.css('div.one')[1,2] # Two items starting at index 1 (2nd item)
page.css('div.one')[1..2] # Items with indices between 1 and 2, inclusive
Because Ruby indexing starts at zero you must take care with which items you want.
Alternatively, you can use CSS selectors to find the nth item:
# Second and third items from the set, jQuery-style
page.css('div.one:eq(2),div.one:eq(3)')
# Second and third children, CSS3-style
page.css('div.one:nth-child(2),div.one:nth-child(3)')
Or you can use XPath to get back specific matches:
# Second and third div.one children of each parent
page.xpath("//div[@class='one'][position()=2 or position()=3]")
# Second and third items in the result set
page.xpath("(//div[@class='one'])[position()=2 or position()=3]")
With both the CSS and XPath alternatives note that:
Numbering starts at 1, not 0
You can use at_css and at_xpath instead to get back the first-such matching element, instead of a NodeSet.
# A NodeSet with a single element in it:
page.css('div.one:eq(2)')
# The second div element
page.at_css('div.one:eq(2)')
Finally, note that if you are selecting a single element by index with XPath, you can use a shorter format:
# First div.one seen that is the second div.one child of its parent
page.at_xpath('//div[@class="one"][2]')
# Second div.one in the entire document
page.at_xpath('(//div[@class="one"])[2]')
page.css('div.one')[1] # For the second
page.css('div.one')[2] # For the third

Resources