Nokogiri: apply class to element that has a certain descendant - ruby

Let's say I have this html that has various depths of descendants and a mixture of element types:
<div class="foo">
<div class="bar"></div>
</div>
<div class="foo">
<div class="baz"></div>
</div>
<div class="foo">
<u><span class="duh">
<div class="bar"></div>
</span></u>
</div>
<div class="foo">
<div class="baz"></div>
</div>
I want to replace the class with bex on every foo that contains a descendant with class bar, so the result looks like:
<div class="bex">
<div class="bar"></div>
</div>
<div class="foo">
<div class="baz"></div>
</div>
<div class="bex">
<u><span class="duh">
<div class="bar"></div>
</span></u>
</div>
<div class="foo">
<div class="baz"></div>
</div>
How would I do that with Ruby/Nokogiri? I've tried all sorts of things and can't quite get it. Thanks.
Edit: closed the duh, oops.

I spent a long time wondering why the second matching foo wasn't found.
Your original data was broken: the "duh" span wasn't closed.
To select the nodes, you can use:
doc.xpath("//div[@class='foo' and .//div[@class='bar']]")
As an example:
data = %q(<div class="foo">
<div class="bar"></div>
</div>
<div class="foo">
<div class="baz"></div>
</div>
<div class="foo">
<u><span class="duh">
<div class="bar"></div>
</span></u>
</div>
<div class="foo">
<div class="baz"></div>
</div>)
require 'nokogiri'
doc = Nokogiri::HTML(data)
# Select every div.foo that has a div.bar anywhere beneath it
doc.xpath("//div[@class='foo' and .//div[@class='bar']]").each do |node|
  node["class"] = 'bex'
end
puts doc

Related

xpath: How to combine multiple conditions on different axes

I try to extract all links based on these three conditions:
Must be part of <div data-test="cond1">
Must have a <a href="..." class="cond2">
Must not have a <img src="..." class="cond3">
The result should be "/product/1234".
<div data-test="test1">
<div>
<div data-test="cond1">
Link 1
<div class="test4">
<div class="test5">
<div class="test6">
<div class="test7">
<div class="test8">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div data-test="test2">
<div>
<div data-test="cond1">
Link 2
<div class="test4">
<div class="test5">
<div class="test6">
<div class="test7">
<div class="test8">
<img src="bild.jpg" class="cond3">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
I'm able to extract the links with the following XPath query:
//div[starts-with(@data-test,"cond")]/a[starts-with(@class,"cond")]/@href
(I know the first part is not really necessary. But better safe than sorry.)
But I'm still struggling to exclude the links whose div contains a descendant img tag, and to add that condition to the query above.
This should do what you want:
//div[@data-test="cond1" and not(.//img[@class="cond3"])]
/a[@class="cond2"]
/@href
Result:
/product/1234

How to get last child while parsing HTML with Goutte in Laravel

<div class="table">
<div class="table-head">
<div class="table-head-title">Ranking Equipos</div>
</div>
<div class="table-body">
<div class="table-body-row active">
<div class="col-key">Mark</div>
<div class="col-value">9233</div>
</div>
<div class="table-body-row">
<div class="col-key">Amanda</div>
<div class="col-value">7216</div>
</div>
<div class="table-body-row">
<div class="col-key">Mark</div>
<div class="col-value">6825</div>
</div>
<div class="table-body-row">
<div class="col-key">Paul</div>
<div class="col-value">6184</div>
</div>
<div class="table-body-row">
<div class="col-key">Amanda</div>
<div class="col-value">5866</div>
</div>
</div>
</div>
This is my HTML, and I want to get the last child of .table-body.
I tried JavaScript-style indexing like this:
$lastChild = $node->filter('.table-body .table-body-row')[4];
but it fails with the error: Cannot use object of type "Symfony\Component\DomCrawler\Crawler" as array
I was stuck in a similar situation recently and resolved it with the last() method:
$node->filter('.table-body .table-body-row')->last();

How to select text node without preceding text in XPath?

<div class="a">
<div class="a random number of div wrapers">
<div>Random1<em>Median</em>
<div class="b">
<div class="c">Edit</div>
</div>
</div>
<div>Random2<em>Median</em></div>
<div>
<em>Median</em>
</div>
<div>Random3<em>Median</em></div>
<div>Random4<em>Median</em>
<div>Random4<em>Median</em></div>
</div>
</div>
<div class="a">
<div class="a random number of div wrapers">
<div>Random1<em>Median</em></div>
<div>Random2<em>Median</em></div>
<div>
<em>Median</em>
</div>
<div>Random3<em>Median</em>
<div class="b">
<div class="c">Edit</div>
</div>
</div>
<div>Random4<em>Median</em>
</div>
</div>
In this case, how can I use XPath to get the two nodes containing 'Median' that have no text before them?
I'd prefer not to use an index because the node position could be random.
Maybe try:
//*[.='Median'][not(preceding-sibling::text()[normalize-space()])]

XPath: how to select elements that are related to other on the same level

The question is simple, but I don't have enough practice with this case :)
How do I get the price text from every "block" div, given that we only want the blocks that contain an item_promo element?
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">123</div>
</div>
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">456</div>
</div>
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">789</div>
</div>
<div class="block">
<div class="item">item</div>
<div class="item_price">222</div>
</div>
<div class="block">
<div class="item">item</div>
<div class="item_price">333</div>
</div>
You could use this XPath:
//div[@class='block']/*[@class='item_promo']/following-sibling::div[@class='item_price']/text()
It looks for elements whose class attribute is item_promo, then takes the following sibling div whose class is item_price and grabs its text.
This XPath,
//div[div/@class='item_promo']/div[@class='item_price']
will return those item_price class div elements that have a sibling item_promo class div element:
<div class="item_price">123</div>
<div class="item_price">456</div>
<div class="item_price">789</div>
This will work regardless of label/price order.

Xpath keeps selecting all objects of the given class instead of the first

This one has me stumped. I'm trying to select the first object with class csb-quantity-listbox in the markup below using the XPath //select[@class='csb-quantity-listbox'][1], but instead of selecting only the first quantity listbox, it selects ALL the listboxes on the page with that class (see image below).
What am I doing wrong?
<div class="gwt-product-detail-products-container">
<div class="gwt-product-detail-products-header-column">
</div>
<div id="gwt-product-detail-widget-id-12766" class="gwt-product-detail-widget">
<div class="gwt-product-detail-widget-image-column ui-draggable" title="12766">
<div class="gwt-product-detail-widget-options-column">
</div>
<div class="gwt-product-detail-widget-price-column">
</div>
<div class="gwt-product-detail-widget-quantity-panel">
<select class="csb-quantity-listbox" name="quantity_12766"></select>
</div>
<div class="gwt-bundle-add-to-cart-btn">
</div>
</div>
</div>
<div id="gwt-product-detail-widget-id-10617" class="gwt-product-detail-widget">
<div class="gwt-product-detail-widget-image-column ui-draggable" title="10617">
<div class="gwt-product-detail-widget-options-column">
</div>
<div class="gwt-product-detail-widget-price-column">
</div>
<div class="gwt-product-detail-widget-quantity-panel">
<select class="csb-quantity-listbox" name="quantity_10617"></select>
</div>
<div class="gwt-bundle-add-to-cart-btn">
</div>
</div>
</div>
</div>
Image:
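You just need to put parentheses around the expression before the [1]. Without them, [1] is applied to each select relative to its own parent, so every select that is the first match inside its parent is returned.
Like so:
(//select[@class='csb-quantity-listbox'])[1]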
