Parsing HTML with Nokogiri in Ruby

With this HTML code:
<div class="one">
.....
</div>
<div class="one">
.....
</div>
<div class="one">
.....
</div>
<div class="one">
.....
</div>
How can I select with Nokogiri the second or third div whose class is one?

You can use Ruby to pare down a large results set to specific items:
page.css('div.one')[1,2] # Two items starting at index 1 (2nd item)
page.css('div.one')[1..2] # Items with indices between 1 and 2, inclusive
Because Ruby indexing starts at zero you must take care with which items you want.
Alternatively, you can use CSS selectors to find the nth item:
# Second and third items from the set, jQuery-style
page.css('div.one:eq(2),div.one:eq(3)')
# Second and third children, CSS3-style
page.css('div.one:nth-child(2),div.one:nth-child(3)')
Or you can use XPath to get back specific matches:
# Second and third children
page.xpath("//div[@class='one'][position()=2 or position()=3]")
# Second and third items in the result set
page.xpath("(//div[@class='one'])[position()=2 or position()=3]")
With both the CSS and XPath alternatives note that:
Numbering starts at 1, not 0
You can use at_css and at_xpath instead to get back the first-such matching element, instead of a NodeSet.
# A NodeSet with a single element in it:
page.css('div.one:eq(2)')
# The second div element
page.at_css('div.one:eq(2)')
Finally, note that if you are selecting a single element by index with XPath, you can use a shorter format:
# First div.one seen that is the second child of its parent
page.at_xpath('//div[@class="one"][2]')
# Second div.one in the entire document
page.at_xpath('(//div[@class="one"])[2]')

page.css('div.one')[1] # For the second
page.css('div.one')[2] # For the third

Related

XPATH - grab content of div after named element

There are a number of labels, I want to specify them in xpath and then grab the text after them, example:
<div class="info-row">
<div class="info-label"><span>Variant:</span></div>
<div class="info-content">
<p>750 ml</p>
</div>
</div>
So in this case, I want to say "after the span named 'Variant', grab the p tag":
Result: 750ml
I tried:
//span[text()='Variant:']/following-sibling::p
and variations of this but to no avail.
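'following-sibling' is an axis that selects all siblings after the current node.
The span with the text 'Variant:' has no siblings, so your expression matches nothing. The p element sits inside a sibling of the span's parent div, so the search has to start from that parent.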
Here is an example which will work
//span[text()='Variant:']/ancestor::div[@class="info-label"]/following-sibling::div/p

How to find direct children which contain nodes with specified text with xpath?

I need to extract all children which have nodes with some text. Html structure might be the following:
<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
</div>
I need to extract child nodes which have "A" text. It should return div and p nodes
I tried the following xpaths:
./*/*[contains(text(), 'A')]
./*/*[./*[contains(text(), 'A')]]
but the first one returns only div with "A" text and the second one returns only p with "A" text
Is it possible to construct xpath which will return both children?
Node containing "A" text might be at any level in the child node
If you need XPath that returns both child nodes, try to use
./*/*[contains(., "A")]
I suspect contains() is wrong here, unless you really want to select a node whose value is "HAT" as well as one whose value is "A".
Try
*/*[normalize-space(.)='A']

How can I select nodes that don't contain links but which do contain specific text using xpath

Given the following HTML:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the interim there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
I want all the nodes containing the word "interim" but not if the word "interim" is part of a link element.
The nodes I would expect back are the first P node and the LI node only.
I've tried the following:
'//*/text()[not(a) and contains(.,"interim")]'
... but this still returns the A and also returns part of its parent P node (the part after the A), neither of which are desired. You can see my attempt here: https://glot.io/snippets/ehp7hmmglm
If you use the XPath expression //*[not(self::a) and not(a) and text()[contains(.,"interim")]] then you get all elements that do not contain an a element, are not a elements and contain a text node child containing that word.

Xpath get node that only contains nodes of a certain type

Take this (id attributes only added so I can refer to them below)
<div id="one">
<figure>foo</figure>
<figure>bar</figure>
</div>
<div id="two">
<figure>foo</figure>
<div>bar</div>
</div>
<div id="three">
<div>bar</div>
</div>
How can I select all div elements whose children are all figure elements, i.e. selecting div one only in the given example?
I sort of need //div[count(not figure)>0].
This is one possible way:
//div[not(*[name() != 'figure']) and not(text()[normalize-space()])]
The left-hand side of the and makes sure the div has no child element named anything other than 'figure', and the right-hand side makes sure it has no non-empty child text node.
Or, the same approach using count():
//div[count(*[name() != 'figure']|text()[normalize-space()]) = 0]
I did it like this:
//div[figure][count(figure) = count(*)]
This finds divs that contain at least one figure, and then checks that the count of figure children matches the count of all child elements; if it does, the div cannot contain any other element.

XPath selector by class AND index

I have the following HTML:
<div>
<p>foo</p>
<p class='foo'>foo</p>
<p class='foo'>foo</p>
<p>bar</p>
</div>
How can I select the second P tag with class 'foo' using XPath?
The following expression should do it:
//p[@class="foo"][2]
Edit: The use of [2] here selects elements according to their position among their siblings, rather than from among the matched nodes. Since both your tables are the first children of their parent elements, [1] will match both of them, while [2] will match neither. If you want the second such element in the entire document, you need to put the expression in parentheses so that [2] applies to the node-set:
(//p[@class="foo"])[2]
(//table[@class="info"])[2]

Resources