Using Nokogiri to find element before another element - ruby

I have a partial HTML document:
<h2>Destinations</h2>
<div>It is nice <b>anywhere</b> but here.
<ul>
<li>Florida</li>
<li>New York</li>
</ul>
<h2>Shopping List</h2>
<ul>
<li>Booze</li>
<li>Bacon</li>
</ul>
For every <li> item, I want to know which category the item is in, i.e., the text of the preceding <h2> tag.
This code does not work, but this is what I'm trying to do:
@page.search('li').each do |li|
li.previous('h2').text
end

Nokogiri allows you to use xpath expressions to locate an element:
categories = []
doc.xpath("//li").each do |elem|
categories << elem.parent.xpath("preceding-sibling::h2").last.text
end
categories.uniq!
p categories
The first part finds all "li" elements. For each one, we go to its parent (ul or ol), then look for an h2 element before it (preceding-sibling). There can be more than one, so we take the last (i.e., the one closest to the current position).
We need to call "uniq!" because we get the h2 once per 'li' (the 'li' is the starting point).
Using your own HTML example, this code outputs:
["Destinations", "Shopping List"]

You are close.
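@page.search('li').each do |li|
# [1] selects the nearest preceding h2; without it, the text of every preceding h2 is concatenated
category = li.xpath('../preceding-sibling::h2[1]').text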
puts "#{li.text}: category #{category}"
end

The code:
categories = []
Nokogiri::HTML("your HTML here").css("h2").each do |category|
categories << category.text
end
The result:
categories = ["Destinations", "Shopping List"]

Related

Nokogiri only get list items with links first

I have a document that looks like the following:
<ul>
<li>
<a href="/Synergies">Link</a>Content
</li>
<li>
Content <a href="...">Link</a>
</li>
</ul>
I would like to only obtain the list items that start with an <a> tag, i.e. the first <li> would be a hit but the second would not.
I tried getting all list items and regex matching on the html content but it doesn't appear to be working:
list.search('li').each do |item|
if /^<a href="\/Synergies".*$/.match(item)
puts link # hit?
end
end
Any advice would be appreciated!
You can check whether the item's first child is either not text or empty text:
list.search('li').each do |item|
if !item.children.first.text? || item.children.first.text.strip.empty?
puts item # hit?
end
end
If you want to exclude items that don't begin with a link, you can select the link that is the first child element and check its parent in the condition:
list.search('li > a:first-child').each do |item|
if !item.parent.children.first.text? || item.parent.children.first.text.strip.empty?
puts item # hit?
end
end

How to get text from list items with Mechanize?

<div class="carstd">
<ul>
<li class="cars">"Car 1"</li>
<li class="cars">"Car 2"</li>
<li class="cars">"Car 3"</li>
<li class="cars">"Car 4"</li>
</ul>
</div>
I want to strip the text from each list item with Mechanize and print it out. I've tried
puts page.at('.cars').text.strip but it only gets the first item. I've also tried
page.links.each do |x|
puts x.at('.cars').text.strip
end
But I get an error undefined method 'at' for #<Mechanize::Page::Link:0x007fe7ea847810>.
There are no links there. Links are <a> elements that Mechanize converts into special Mechanize objects.
You want something like:
page.search('li.cars').text # the text of all the li's mashed together as a string
or
page.search('li.cars').map{|x| x.text} # the text of each `li` as an array of strings

Match and exclude multiple classes with Watir

I would like to be able to match against a class while excluding certain classes as well.
I can use something like the following to get all li elements that match the specified class, but I'm not sure how to screen out classes at the same time.
b = Watir::Browser.new
free_boxes = b.lis(:class, /cellGridGameStandard/)
I would like to change this into something that will match all li elements with the cellGridGameStandard class, but excludes all elements that also contain either the notEligible class or the ownAlready class.
Here are a couple of options.
Let us assume that the html is like:
<ul>
<li class="cellGridGameStandard">
Element 1
</li>
<li class="cellGridGameStandard ownAlready">
Element 2
</li>
<li class="cellGridGameStandard notEligible">
Element 3
</li>
<li class="cellGridGameStandard">
Element 4
</li>
</ul>
The first and fourth li elements match the specified criteria.
One option would be to check for lis that do not have the ownAlready or notEligible class:
matching = browser.lis(:class => 'cellGridGameStandard').find_all do |li|
  # Keep the li only if it carries neither excluded class
  ['ownAlready', 'notEligible'].none? do |class_name|
    li.class_name.split.include?(class_name)
  end
end
p matching.collect(&:text)
#=> ["Element 1", "Element 4"]
Another option, which is easier to write but sometimes considered harder to read, is to use a css locator:
matching = browser.elements(:css => 'li.cellGridGameStandard:not(.ownAlready):not(.notEligible)')
p matching.collect(&:text)
#=> ["Element 1", "Element 4"]

looping through a collection of divs in Watir

We're using Watir for testing and wondered how to select a group of divs that meet particular criteria. In our case the (simplified) HTML looks like this:
<div class="month_main">
<div class="month_cell">
some divs
</div>
<div class="month_cell">
some_other_divs
</div>
<div class = "month_cell OverridenDay">
<div id = "month-2013-05-04"/>
</div>
</div>
We would like to loop through all divs with an id starting with 'month' that are contained in month_cell parent divs that also have classes of OverridenDay. Is there an Xpath or regular expression we can use in conjunction with the Watir browser class to do this?
General
You can get a collection of elements in a similar way to getting a single element. You basically need to pluralize the element type method. For example:
#Singular method returns first matching div
browser.div
#Pluralized method returns all matching divs
browser.divs
Collections accept the same locators as single elements.
Solution
For your problem, you can do:
#Iterate over divs that have the class 'month_cell OverridenDay'
browser.divs(:class => 'month_cell OverridenDay').each do |overridden_div|
#Within each div with class 'month_cell OverridenDay',
# iterate over any children divs where the id starts with month
overridden_div.divs(:id => /^month/).each do |div|
#Do something with the div that has id starting with month
puts div.id
end
end
#=> "month-2013-05-04"
If you need to create a single collection that includes all of the matching divs, you will need to use a css or xpath selector.
Using a css-selector (note that in watir-webdriver, only the elements method supports css-locators):
divs = browser.elements(:css => 'div.month_cell.OverridenDay div[id^=month]')
divs.each do |e|
puts e.id
end
#=> "month-2013-05-04"
Using xpath:
divs = browser.divs(:xpath => '//div[@class="month_cell OverridenDay"]//div[starts-with(@id, "month")]')
divs.each do |e|
puts e.id
end
#=> "month-2013-05-04"

Get element after another elements with Hpricot and Ruby

I have the following HTML:
<ul class="filtering_new" width="50%">
<li class="filter">1</li>
<li class="filter">2</li>
<script>Alert('1');</script>
<li class="filter">3</li>
</ul>
How can I get li with inner_html = 3?
I tried like this:
page.search("//ul.filtering_new").each do |list|
puts list.search("li").size
end
where page is the HTML document.
size = 2, but it should be 3.
I tried following the manual at https://github.com/hpricot/hpricot/wiki/hpricot-challenge
but I cannot even find the <script> element.
list.search("script")
returns nothing.
I don't think you can mix XPath and CSS selectors in a single search expression, which is what your example does. Try:
//ul[@class='filtering_new']
or
ul.filtering_new
inside search.
Most XML/HTML parsing in Ruby uses Nokogiri these days, so I'll recommend that parser. However, both Hpricot and Nokogiri support XPath and CSS, so they are fairly interchangeable.
I'd go about it this way:
html = <<EOT
<ul class="filtering_new" width="50%">
<li class="filter">1</li>
<li class="filter">2</li>
<script>Alert('1');</script>
<li class="filter">3</li>
</ul>
EOT
require 'nokogiri'
doc = Nokogiri::HTML(html)
li = doc.search('//li[@class="filter"]').select{ |n| n.text.to_i == 3 }
li # => [#<Nokogiri::XML::Element:0x8053fc84 name="li" attributes=[#<Nokogiri::XML::Attr:0x8053fb6c name="class" value="filter">] children=[#<Nokogiri::XML::Text:0x80546f98 "3">]>]
That finds the candidate nodes, then returns them as a NodeSet to be iterated over, where they are selected/rejected based on the node's text.
li = doc.search('//li[text() = "3"]')
li # => [#<Nokogiri::XML::Element:0x8053fc84 name="li" attributes=[#<Nokogiri::XML::Attr:0x8053fb6c name="class" value="filter">] children=[#<Nokogiri::XML::Text:0x80546f98 "3">]>]
That offloads the comparison to the underlying libxml2 library, where it runs much faster.
