Watir scraping sequential elements : so simple, but no - ruby

This seems like it should be simple...
I want to scrape a web page like the following with Watir (a Ruby gem):
<div class="Time">time1</div>
<div class="Locus">locus1</div>
<div class="Locus">locus2</div>
<div class="Time">time2</div>
<div class="Locus">locus3</div>
<div class="Time">time3</div>
<div class="Locus">locus4</div>
<div class="Locus">locus5</div>
<div class="Locus">locus6</div>
<div class="Time">time4</div>
etc.
The result should be an array like this:
time1 locus1
time1 locus2
time2 locus3
time3 locus4
time3 locus5
time3 locus6
time4 xxx
All the divs are at the same level (not nested).
I could not find a way to do this with the Watir methods.
Thanks for your help.

For each Locus element, you can retrieve the preceding Time element via the #preceding_sibling method:
result = browser.divs(class: 'Locus').map do |div|
  time = div.preceding_sibling(class: 'Time').text
  locus = div.text
  "#{time} #{locus}"
end
p result
#=> ["time1 locus1", "time1 locus2", "time2 locus3", "time3 locus4", "time3 locus5", "time3 locus6"]
Note that if the list is long, you may want to retrieve the HTML via Watir but then do the parsing in Nokogiri. This would save a lot of execution time, but at the cost of readability.
require 'nokogiri'
doc = Nokogiri::HTML.parse(browser.html) # where `browser` is the usual Watir::Browser
result = doc.css('.Locus').map do |div|
  # [1] selects the nearest preceding Time div; without it, the first one in document order is returned
  time = div.at_xpath('./preceding-sibling::div[@class="Time"][1]').text
  locus = div.text
  "#{time} #{locus}"
end
p result
#=> ["time1 locus1", "time1 locus2", "time2 locus3", "time3 locus4", "time3 locus5", "time3 locus6"]
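A third option, if looking up a sibling for every element is too slow, is a single pass over all the divs that remembers the last Time seen. The sketch below is only an illustration: the `divs` array of [class, text] pairs is a hypothetical stand-in for the page, and how you build it (for example from Watir elements via `attribute_value('class')` and `text`) is an assumption.

```ruby
# Hypothetical stand-in for the page: [class, text] pairs in document order.
# With Watir these might be built with something like
#   browser.divs.map { |d| [d.attribute_value('class'), d.text] }
divs = [
  ['Time', 'time1'], ['Locus', 'locus1'], ['Locus', 'locus2'],
  ['Time', 'time2'], ['Locus', 'locus3'],
  ['Time', 'time3'], ['Locus', 'locus4'], ['Locus', 'locus5'], ['Locus', 'locus6'],
  ['Time', 'time4']
]

current_time = nil
result = divs.each_with_object([]) do |(klass, text), rows|
  if klass == 'Time'
    current_time = text                  # remember the most recent Time
  else
    rows << "#{current_time} #{text}"    # pair each Locus with that Time
  end
end

p result
#=> ["time1 locus1", "time1 locus2", "time2 locus3", "time3 locus4", "time3 locus5", "time3 locus6"]
```

A trailing Time with no Locus after it (time4 here) produces no row in this sketch.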

Related

Watir::ElementCollection click action in loop

Going off the example from the documentation http://www.rubydoc.info/gems/watir-webdriver/0.6.11/Watir/ElementCollection#each-instance_method, I am trying to click each element on the page that has the same class.
This is a code snippet of what I've come up with so far:
@b.divs(:class => 'portal-thumbnail-card').each do |div|
  @b.div(:class => 'portal-thumbnail-card').click
  puts 'foo'
  # my puts statement outputs 'foo' 6 times (matches the number of elements with that class)
  # right now this only clicks on the FIRST element, having issues with the other part :(
end
Even though this doesn't involve any page reloading, are click actions possible?
The problem is that you are locating the div to click during each iteration of the loop. In English, your code actually says, "for each div element with the class 'portal-thumbnail-card', click the first div on the page with class 'portal-thumbnail-card'."
What you actually want to do is click the div element that is the subject of each iteration:
@b.divs(:class => 'portal-thumbnail-card').each do |div|
  div.click
  puts 'foo'
end
The divs method returns a Watir::DivCollection, which is a collection of Watir::Div objects. For example:
require 'watir-webdriver'
b = Watir::Browser.new
b.goto('http://example.org')
divs = b.divs
puts divs.class
#=> Watir::DivCollection
divs.each { |d| puts d.class}
#=> Watir::Div
So, within your iterator, you want to refer to the block-local variable (i.e. div.click) instead of the browser's instance variable (i.e. @b.div(:class => 'portal-thumbnail-card').click).
Use the flash method to see which element you are trying to click:
require 'watir-webdriver'
browser = Watir::Browser.new
browser.goto "data:text/html,#{DATA.read}"
browser.divs(:class => 'portal-thumbnail-card').each do |div|
  # browser.div(:class => 'portal-thumbnail-card').flash # your variant
  div.flash # correct variant
  puts 'foo'
end
browser.close
__END__
<html>
<div class='portal-thumbnail-card'>
<button id="button1">Button 1</button>
</div>
<div class='portal-thumbnail-card'>
<button id="button2">Button 2</button>
</div>
<div class='portal-thumbnail-card'>
<button id="button3">Button 3</button>
</div>
<div class='portal-thumbnail-card'>
<button id="button4">Button 4</button>
</div>
<div class='portal-thumbnail-card'>
<button id="button5">Button 5</button>
</div>
</html>

How to use Nokogiri to split content between successive h2 tags and wrap it under a chapter div

I want to split a document into "chapters". A chapter starts at an h2 and includes all siblings up to but not including the next h2 tag.
I.e. given this
<div id="content">
<h2>First</h2>
<p>one</p>
<h2>Second</h2>
<p>two</p>
<h2>Third</h2>
</div>
I want this
<div id="dad">
<div class="chapter">
<h2>First</h2>
<p>one</p>
</div>
<div class="chapter">
<h2>Second</h2>
<p>two</p>
</div>
<div class="chapter">
<h2>Third</h2>
</div>
</div>
Whilst I've used Nokogiri and XML to do some basic manipulation, I'm banging my head wondering how to first group the nodes into chapter blocks and then wrap them in place with the chapter div.
Can anyone help?
You should group your nodes by header (each header together with its following sibling nodes) and then transform the groups into the output format.
Here is a sketch of an algorithm for grouping the nodes:
array = [
  :header,
  :text,
  :text,
  :header,
  :text,
  :header,
  :text,
  :text,
]
# Start a new group at every :header and append each item to the latest group
grouped_array = array.reduce([]) do |res, item|
  res.tap do
    res << [] if item == :header
    res.last << item
  end
end
p grouped_array
Result:
➜ ruby group_nodes.rb
[[:header, :text, :text], [:header, :text], [:header, :text, :text]]
You should be able to apply the same idea to Nokogiri nodes without much trouble and then transform the result into your output format.
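The same grouping can also be written with Enumerable#slice_before, which starts a new group at every element for which the block returns true. The sketch below is an illustration under stated assumptions: the `nodes` array of [tag, text] pairs is a hypothetical stand-in for the children of the content div, and the string building replaces real node manipulation. A Nokogiri NodeSet is Enumerable, so the slice_before call itself should carry over to real nodes.

```ruby
# Hypothetical stand-in for the element children of div#content.
nodes = [
  [:h2, 'First'], [:p, 'one'],
  [:h2, 'Second'], [:p, 'two'],
  [:h2, 'Third']
]

# Begin a new chapter whenever an h2 is encountered.
chapters = nodes.slice_before { |node| node.first == :h2 }.to_a

# Wrap each group in a chapter div (plain strings here, not Nokogiri nodes).
html = chapters.map do |chapter|
  inner = chapter.map { |tag, text| "<#{tag}>#{text}</#{tag}>" }.join("\n")
  "<div class=\"chapter\">\n#{inner}\n</div>"
end.join("\n")

puts html
```

If the content starts with nodes before the first h2, slice_before puts them in a leading group of their own, so that case needs a decision about whether to wrap or skip it.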

Need clarification with 'each-do' block in my ruby code

Given an html file:
<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
<a href="http://www.site.com/data/1">1</a>
<a href="http://www.site.com/data/2">2</a>
</span>
</div>
...more divs
<div class="NormalMid">
<span class="style-span">
"Data 20:"
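<a href="http://www.site.com/data/20">20</a>
<a href="http://www.site.com/data/21">21</a>
<a href="http://www.site.com/data/22">22</a>
<a href="http://www.site.com/data/23">23</a>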
</span>
</div>
...more divs
</div>
Using these SO posts as reference:
How do I integrate these two conditions block codes to mine in Ruby?
and
How to understand this Arrays and loops in Ruby?
My code:
require 'nokogiri'
require 'pp'
require 'open-uri'
data_file = 'site.htm'
file = File.open(data_file, 'r')
html = open(file)
page = Nokogiri::HTML(html)
page.encoding = 'utf-8'
rows = page.xpath('//div[@class="NormalMid"]')
details = rows.collect do |row|
  detail = {}
  [
    [row.children.first.element_children,row.children.first.element_children],
  ].each do |part, link|
    data = row.children[0].children[0].to_s.strip
    links = link.collect {|item| item.at_xpath('@href').to_s.strip}
    detail[data.to_sym] = links
  end
  detail
end
details.reject! {|d| d.empty?}
pp details
The output:
[{:"Data 1:"=>
["http://www.site.com/data/1",
"http://www.site.com/data/2"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20",
"http://www.site.com/data/21",
"http://www.site.com/data/22",
"http://www.site.com/data/20",]},
...
}]
Everything is going good, exactly what I wanted.
BUT if you change these lines of code:
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
to:
detail = {}
[
[row.children.first.element_children],
].each do |link|
I get the output of
[{:"Data 1:"=>
["http://www.site.com/data/1"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20"]},
...
}]
Only the first anchor href is stored in the array.
I just need some clarification on why it's behaving that way. Because the block parameter part is not being used, I figured I didn't need it there. But my program doesn't work correctly if I also delete the corresponding row.children.first.element_children.
What is going on in the [[obj,obj],].each do block? I just started Ruby a week ago, and I'm still getting used to the syntax, so any help will be appreciated. Thank you :D
EDIT
rows[0].children.first.element_children[0] will have the output
#<Nokogiri::XML::Element:0xcea69c name="a" attributes=[#<Nokogiri::XML::Attr:0xcea648 name="href" value="http://www.site.com/data/1">] children=[#<Nokogiri::XML::Text:0xcea1a4 "1">]>
puts rows[0].children.first.element_children[0]
<a href="http://www.site.com/data/1">1</a>
You made your code overly complicated. Looking at your code, it seems you are trying to get something like the following:
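require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-eotl
<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
<a href="http://site.com/data/1">1</a>
<a href="http://site.com/data/2">2</a>
</span>
</div>
<div class="NormalMid">
<span class="style-span">
"Data 20:"
<a href="http://site.com/data/20">20</a>
<a href="http://site.com/data/21">21</a>
<a href="http://site.com/data/22">22</a>
<a href="http://site.com/data/23">23</a>
</span>
</div>
</div>
eotl
rows = doc.xpath("//div[@class='NormalMid']/span[@class='style-span']")
# For each span: the first text node becomes the key, all href values become the list
val = rows.map do |row|
  [row.at_xpath("./text()").to_s.tr('"','').strip,row.xpath(".//@href").map(&:to_s)]
end
Hash[val]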
# => {"Data 1:"=>["http://site.com/data/1", "http://site.com/data/2"],
# "Data 20:"=>
# ["http://site.com/data/20",
# "http://site.com/data/21",
# "http://site.com/data/22",
# "http://site.com/data/23"]}
What is going on in the [[obj,obj],].each do block?
Compare the two snippets below:
[[1],[4,5]].each do |a|
p a
end
# >> [1]
# >> [4, 5]
[[1,2],[4,5]].each do |a,b|
p a, b
end
# >> 1
# >> 2
# >> 4
# >> 5
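Applied to the question's code, the difference can be sketched as follows. This is only an illustration: plain strings stand in for the Nokogiri elements, and the variable names are hypothetical. With two block parameters the inner array is destructured, so link is the list of children itself. With one parameter, link is the whole inner array, which has a single item (the entire list). link.collect then runs once, and at_xpath on that whole list presumably returns only its first match, which would explain why only the first href came back.

```ruby
# Plain strings stand in for the Nokogiri elements (an assumption for illustration).
children = %w[a1 a2 a3]

# Two block parameters: the inner array [children, children] is destructured,
# so `link` is the list itself.
two_params = nil
[[children, children]].each { |_part, link| two_params = link }

# One block parameter: `link` is the whole inner array, i.e. [children].
one_param = nil
[[children]].each { |link| one_param = link }

p two_params        #=> ["a1", "a2", "a3"]
p one_param         #=> [["a1", "a2", "a3"]]
p one_param.length  #=> 1
```

Iterating over row.children.first.element_children directly, without wrapping it in an extra array, avoids the question altogether.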

capybara - Find with xPath is leaving the within scope

I am trying to build a date selector with Capybara using the default Rails date, time, and datetime fields. I am using the within method to find the select boxes for the field, but when I use XPath to find the correct box it leaves the within scope and finds the first occurrence of the element on the page.
Here is the code I am using. The page I am testing has 2 datetime fields, but I can only get it to change the first one because of this error. At the moment I have a div container with an id that wraps the datetime field, but I plan on switching the code to find by the label.
module Marketron
  module DateTime
    def select_date(field, options = {})
      date_parse = Date.parse(options[:with])
      year = date_parse.year.to_s
      month = date_parse.strftime('%B')
      day = date_parse.day.to_s
      within("div##{field}") do
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:year]}\")]").select(year)
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:month]}\")]").select(month)
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:day]}\")]").select(day)
      end
    end
    def select_time(field, options = {})
      require "time"
      time_parse = Time.parse(options[:with])
      hour = time_parse.hour.to_s.rjust(2, '0')
      minute = time_parse.min.to_s.rjust(2, '0')
      within("div##{field}") do
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:hour]}\")]").find(:xpath, "option[contains(@value, '#{hour}')]").select_option
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:minute]}\")]").find(:xpath, "option[contains(@value, '#{minute}')]").select_option
      end
    end
    def select_datetime(field, options = {})
      select_date(field, options)
      select_time(field, options)
    end
    private
    FIELDS = {year: "1i", month: "2i", day: "3i", hour: "4i", minute: "5i"}
  end
end
World(Marketron::DateTime)
You should specify in the xpath that you want to start with the current node by adding a . to the start:
find(:xpath, ".//select[contains(@id, \"_#{FIELDS[:year]}\")]")
Example:
I tested an HTML page of this (hopefully not over simplifying your page):
<html>
<div id='div1'>
<span class='container'>
<span id='field_01'>field 1</span>
</span>
</div>
<div id='div2'>
<span class='container'>
<span id='field_02'>field 2</span>
</span>
</div>
</html>
Using the within methods, you can see your problem when you do this:
within("div#div1"){ puts find(:xpath, "//span[contains(@id, \"field\")]").text }
#=> field 1
within("div#div2"){ puts find(:xpath, "//span[contains(@id, \"field\")]").text }
#=> field 1
But you can see that by specifying the XPath to look within the current node (i.e. using .), you get the results you want:
within("div#div1"){ puts find(:xpath, ".//span[contains(@id, \"field\")]").text }
#=> field 1
within("div#div2"){ puts find(:xpath, ".//span[contains(@id, \"field\")]").text }
#=> field 2

Remove all nodes after a specified node [duplicate]

I'm grabbing a div of text from a URL and would like to remove everything underneath a paragraph which has a backtotop class. I'd seen a traverse snippet of code here on Stack Overflow which looks promising, but I can't figure out how to incorporate it so that @el only contains everything up to the first p.backtotop in the div.
My code:
@doc = Nokogiri::HTML(open(url))
@el = @doc.css("div")[0]
end
traverse snippet:
doc = Nokogiri::HTML(code)
stop_node = doc.css("p.backtotop")
doc.traverse do |node|
break if node == stop_node
# else, do whatever, e.g. `puts node.name`
end
Find the div you want.
Find the 'stop' item you want, and then find all the following siblings.
Remove them.
For example:
<body>
<div id="a">
<h2>My Section</h2>
<p class="backtotop">Back to Top</p>
<p>More Content</p>
<p>Even More Content</p>
</div>
</body>
require 'nokogiri'
doc = Nokogiri::HTML(my_html)
div = doc.at('#a')
div.at('.backtotop').xpath('following-sibling::*').remove
puts div
#=> <div id="a">
#=> <h2>My Section</h2>
#=> <p class="backtotop">Back to Top</p>
#=>
#=>
#=> </div>
Here's a more complicated example, where the backtotop item may not be at the root of the div:
<body>
<div id="b">
<h2>Another Section</h2>
<section>
<p class="backtotop">Back to Top</p>
<p>More Content</p>
</section>
<p>Even More Content</p>
</div>
</body>
require 'nokogiri'
doc = Nokogiri::HTML(my_html)
div = doc.at('#b')
n = div.at('.backtotop')
until n==div
n.xpath('following-sibling::*').remove
n = n.parent
end
puts div
#=> <div id="b">
#=> <h2>Another Section</h2>
#=> <section><p class="backtotop">Back to Top</p>
#=>
#=> </section>
#=> </div>
If your HTML is more complicated than the above then please provide an actual sample along with the result you want. This is good advice for any future question you ask.

Resources