Ruby - how to retrieve text after a div with Nokogiri

I am trying to retrieve the date and time from the HTML below (target code). I can pull the class name but not the date or time.
class_name = events.at_css('div.classTitle b').text
date = events.at_css('.classTitle')["eventTime"]
time = events.at_css('.classTime span')
p class_name
p date
p time
I get the class name but nil for date and time.
Target code:
<div class="classTitle"><b>Astronomy 101</b></div>
<div class="classTime">
Friday, May 3, 2019<span class="smalltype"> at</span> 7:00PM</div>
<br>

You want the Node#content method:
This is index.html:
<div class="classTitle"><b>Astronomy 101</b></div>
<div class="classTime">
Friday, May 3, 2019<span class="smalltype"> at</span> 7:00PM
</div>
<br>
This is test.rb:
require 'nokogiri'
events = Nokogiri::HTML(open('index.html'))
date, time = events.at_css('div.classTime').content.strip.split(' at ')
puts date #=> Friday, May 3, 2019
puts time #=> 7:00PM
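One thing to watch when splitting this kind of text: the bare substring "at" also occurs inside some weekday names (Saturday), so the date can be cut in the wrong place. Below is a minimal sketch, assuming the text always has the form "<date> at <time>". The sample string is made up for illustration and stands in for the value returned by content.

```ruby
# Sample text standing in for events.at_css('div.classTime').content
# (hypothetical value, chosen because "Saturday" contains "at").
text = "\nSaturday, May 4, 2019 at 7:00PM\n"

# Splitting on the bare substring also cuts "Saturday" apart.
naive = text.strip.split('at')

# Splitting on "at" surrounded by whitespace only matches the separator;
# the limit of 2 keeps everything after the first separator together.
date, time = text.strip.split(/\s+at\s+/, 2)

p naive
puts date
puts time
```

The regular expression also tolerates extra spaces or newlines around the separator, which is common in scraped markup.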

Related

Using Nokogiri to scrape itemprop data

I have a div which looks like the following and I am trying to scrape the itemprop datetime data but I can't seem to get it to work.
<time itemprop="startDate" datetime="2019-03-28T19:00:00">
Thursday, March 28, 2019
</time>
The script below pulls the text for the date just fine (i.e., Thursday, March 28, 2019), but the time selector throws this error:
undefined method `text' for nil:NilClass (NoMethodError)
I've searched Stack Overflow, and I've tried to map the time data, but nothing works.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
my_local_filename = "C:/data-hold-classes/Santa Fe College" + ".html"
data = Nokogiri::HTML(File.open(my_local_filename, "r"))
classes = data.css(".col-xs-7")
classes.each do |item|
  class_name = item.at_css("a b").text.strip #=> All details
  date = item.at_css("a > div > time").text.strip #=> Thursday, March 28, 2019
  #time = item.at_css("a datetime").text.strip #=> raises the NoMethodError
  puts class_name
  puts date
  #puts time
  puts " "
end
My goal is to pull the datetime portion of the div so I can format it as time (e.g., 8:00PM)

The call item.at_css("a > div > time") returns the <time> element; a > div > time is a nested path to that element.
What you want is datetime, which is an attribute, not an HTML element. The selector "a datetime" looks for a <datetime> element, finds none, and returns nil, which is why calling text on it raises the NoMethodError.
You can read the attribute with:
item.at_css("a > div > time")["datetime"].strip
Hope it helps :D
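Once the attribute value is in hand, it can be turned into a clock time with the standard library. This is only a sketch, assuming the value always has the YYYY-MM-DDTHH:MM:SS shape shown in the question. The raw string below stands in for item.at_css("a > div > time")["datetime"].

```ruby
require 'time'

# Stand-in for the attribute value read from the <time> element.
raw = "2019-03-28T19:00:00"

# Parse the timestamp (no zone in the string, so it is read as local time).
starts_at = Time.strptime(raw, '%Y-%m-%dT%H:%M:%S')

# %-l is the 12-hour clock hour without padding, %p is AM/PM.
clock = starts_at.strftime('%-l:%M%p')
day = starts_at.strftime('%A, %B %-d, %Y')

puts clock
puts day
```

Time.strptime raises ArgumentError when the string does not match the format, so a malformed attribute is reported rather than silently misparsed.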

Watir scraping sequential elements : so simple, but no

This should be so simple...
I want to scrape a web page like this with Watir (a Ruby gem):
<div class="Time">time1</div>
<div class="Locus">locus1</div>
<div class="Locus">locus2</div>
<div class="Time">time2</div>
<div class="Locus">locus3</div>
<div class="Time">time3</div>
<div class="Locus">locus4</div>
<div class="Locus">locus5</div>
<div class="Locus">locus6</div>
<div class="Time">time4</div>
etc..
The result should be an array like this:
time1 locus1
time1 locus2
time2 locus3
time3 locus4
time3 locus5
time3 locus6
time4 xxx
All the divs are at the same level (not nested).
I can't find a way to do this with the Watir methods...
Thanks for your help.

For each Locus element, you can retrieve the preceding Time element via the #preceding_sibling method:
result = browser.divs(class: 'Locus').map do |div|
  time = div.preceding_sibling(class: 'Time').text
  locus = div.text
  "#{time} #{locus}"
end
p result
#=> ["time1 locus1", "time1 locus2", "time2 locus3", "time3 locus4", "time3 locus5", "time3 locus6"]
Note that if the list is long, you may want to retrieve the HTML via Watir but then do the parsing in Nokogiri. This would save a lot of execution time, but at the cost of readability.
doc = Nokogiri::HTML.parse(browser.html) # where `browser` is the usual Watir::Browser
result = doc.css('.Locus').map do |div|
  # [1] on the reverse preceding-sibling axis selects the nearest preceding Time div
  time = div.at('./preceding-sibling::div[@class="Time"][1]').text
  locus = div.text
  "#{time} #{locus}"
end
p result
#=> ["time1 locus1", "time1 locus2", "time2 locus3", "time3 locus4", "time3 locus5", "time3 locus6"]
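The pairing itself does not depend on Watir or Nokogiri. As a rough sketch, assume the divs have already been collected in page order as [class, text] pairs (the rows array and the pair_times_with_loci helper below are hypothetical names, not part of either library). A single pass that remembers the most recent Time then produces the same list:

```ruby
# Hypothetical input: [class, text] pairs in the order they appear on the page.
rows = [
  ['Time', 'time1'], ['Locus', 'locus1'], ['Locus', 'locus2'],
  ['Time', 'time2'], ['Locus', 'locus3'],
  ['Time', 'time3'], ['Locus', 'locus4'], ['Locus', 'locus5'], ['Locus', 'locus6'],
  ['Time', 'time4']
]

# Walk the rows once, remembering the last Time seen and
# attaching it to every Locus that follows.
def pair_times_with_loci(rows)
  current_time = nil
  rows.each_with_object([]) do |(kind, text), result|
    if kind == 'Time'
      current_time = text
    else
      result << "#{current_time} #{text}"
    end
  end
end

result = pair_times_with_loci(rows)
p result
```

Like the answers above, this sketch skips a Time that has no Locus after it (time4 here).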

Need clarification with 'each-do' block in my ruby code

Given an html file:
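<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
<a href="http://www.site.com/data/1">1</a>
<a href="http://www.site.com/data/2">2</a>
</span>
</div>
...more divs
<div class="NormalMid">
<span class="style-span">
"Data 20:"
<a href="http://www.site.com/data/20">20</a>
<a href="http://www.site.com/data/21">21</a>
<a href="http://www.site.com/data/22">22</a>
<a href="http://www.site.com/data/23">23</a>
</span>
</div>
...more divs
</div>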
Using these SO posts as reference:
How do I integrate these two conditions block codes to mine in Ruby?
and
How to understand this Arrays and loops in Ruby?
My code:
require 'nokogiri'
require 'pp'
require 'open-uri'
data_file = 'site.htm'
file = File.open(data_file, 'r')
html = open(file)
page = Nokogiri::HTML(html)
page.encoding = 'utf-8'
rows = page.xpath('//div[@class="NormalMid"]')
details = rows.collect do |row|
  detail = {}
  [
    [row.children.first.element_children,row.children.first.element_children],
  ].each do |part, link|
    data = row.children[0].children[0].to_s.strip
    links = link.collect {|item| item.at_xpath('@href').to_s.strip}
    detail[data.to_sym] = links
  end
  detail
end
details.reject! {|d| d.empty?}
pp details
The output:
[{:"Data 1:"=>
["http://www.site.com/data/1",
"http://www.site.com/data/2"]},
...
{:"Data 20:"=>
["http://www.site.com/data/20",
"http://www.site.com/data/21",
"http://www.site.com/data/22",
"http://www.site.com/data/23"]},
...
]
Everything is going good, exactly what I wanted.
BUT if you change these lines of code:
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
to:
detail = {}
[
[row.children.first.element_children],
].each do |link|
I get the output of
[{:"Data 1:"=>
["http://www.site.com/data/1"]},
...
{:"Data 20:"=>
["http://www.site.com/data/20"]},
...
]
Only the first anchor href is stored in the array.
I'd like some clarification on why it behaves this way. Since the block parameter part is never used, I figured I didn't need it. But my program doesn't work correctly if I also delete the corresponding row.children.first.element_children.
What is going on in the [[obj,obj],].each do block? I started Ruby a week ago and am still getting used to the syntax, so any help will be appreciated. Thank you :D
EDIT
rows[0].children.first.element_children[0] will have the output
#<Nokogiri::XML::Element:0xcea69c name="a" attributes=[#<Nokogiri::XML::Attr:0xcea648 name="href" value="http://www.site.com/data/1">] children=[#<Nokogiri::XML::Text:0xcea1a4 "1">]>
puts rows[0].children.first.element_children[0]
1

Your code is more complicated than it needs to be. Looking at it, you seem to be trying to get something like this:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-eotl
<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
<a href="http://site.com/data/1">1</a>
<a href="http://site.com/data/2">2</a>
</span>
</div>
<div class="NormalMid">
<span class="style-span">
"Data 20:"
<a href="http://site.com/data/20">20</a>
<a href="http://site.com/data/21">21</a>
<a href="http://site.com/data/22">22</a>
<a href="http://site.com/data/23">23</a>
</span>
</div>
</div>
eotl
rows = doc.xpath("//div[@class='NormalMid']/span[@class='style-span']")
val = rows.map do |row|
  # key: the span's first text node, unquoted; value: every href beneath the span
  [row.at_xpath("./text()").to_s.tr('"','').strip,
   row.xpath(".//@href").map(&:to_s)]
end
Hash[val]
# => {"Data 1:"=>["http://site.com/data/1", "http://site.com/data/2"],
# "Data 20:"=>
# ["http://site.com/data/20",
# "http://site.com/data/21",
# "http://site.com/data/22",
# "http://site.com/data/23"]}
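What is going on in the [[obj,obj],].each do block?
Compare the two snippets below. With a single block parameter, each inner array is passed to the block whole. With two parameters, each inner array is unpacked into them:
[[1],[4,5]].each do |a|
  p a
end
# >> [1]
# >> [4, 5]
[[1,2],[4,5]].each do |a,b|
  p a, b
end
# >> 1
# >> 2
# >> 4
# >> 5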
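Applied to the code in the question, the same rule would explain the missing links. The sketch below uses a plain array of strings as a stand-in for the Nokogiri node set (the names links, with_two_params and with_one_param are made up for illustration):

```ruby
# Stand-in for row.children.first.element_children (hypothetical values).
links = ['href1', 'href2', 'href3']

# Two block parameters: the inner array [links, links] is unpacked,
# so `link` is the list of links itself.
with_two_params = nil
[[links, links]].each do |_part, link|
  with_two_params = link
end

# One block parameter: the inner array [links] is passed whole,
# so `link` is a one-element array that wraps the list.
with_one_param = nil
[[links]].each do |link|
  with_one_param = link
end

p with_two_params
p with_one_param
p with_one_param.first
```

In the one-parameter version, link.collect therefore runs once, over the whole set rather than over each anchor, which is consistent with only the first href being returned. Dropping the outer array, or calling .first on the wrapper, gives the block the list itself.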

capybara - Find with xPath is leaving the within scope

I am trying to build a date selector with Capybara using the default Rails date, time, and datetime fields. I am using the within method to find the select boxes for the field, but when I use XPath to find the correct box, it leaves the within scope and finds the first occurrence of the element on the page.
Here is the code I am using. The page I am testing has two datetime fields, but I can only get it to change the first because of this problem. At the moment I have a div container with an id that wraps the datetime field, but I plan to switch the code to find by the label.
module Marketron
  module DateTime
    def select_date(field, options = {})
      date_parse = Date.parse(options[:with])
      year = date_parse.year.to_s
      month = date_parse.strftime('%B')
      day = date_parse.day.to_s
      within("div##{field}") do
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:year]}\")]").select(year)
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:month]}\")]").select(month)
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:day]}\")]").select(day)
      end
    end
    def select_time(field, options = {})
      require "time"
      time_parse = Time.parse(options[:with])
      hour = time_parse.hour.to_s.rjust(2, '0')
      minute = time_parse.min.to_s.rjust(2, '0')
      within("div##{field}") do
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:hour]}\")]").find(:xpath, "option[contains(@value, '#{hour}')]").select_option
        find(:xpath, "//select[contains(@id, \"_#{FIELDS[:minute]}\")]").find(:xpath, "option[contains(@value, '#{minute}')]").select_option
      end
    end
    def select_datetime(field, options = {})
      select_date(field, options)
      select_time(field, options)
    end
    private
    FIELDS = {year: "1i", month: "2i", day: "3i", hour: "4i", minute: "5i"}
  end
end
World(Marketron::DateTime)

An XPath that begins with // searches from the document root, regardless of the within scope. Start the expression with a . to make it relative to the current node:
find(:xpath, ".//select[contains(@id, \"_#{FIELDS[:year]}\")]")
Example:
I tested with this HTML page (hopefully not oversimplifying yours):
<html>
<div id='div1'>
<span class='container'>
<span id='field_01'>field 1</span>
</span>
</div>
<div id='div2'>
<span class='container'>
<span id='field_02'>field 2</span>
</span>
</div>
</html>
Using the within method, you can see the problem when you do this:
within("div#div1"){ puts find(:xpath, "//span[contains(@id, \"field\")]").text }
#=> field 1
within("div#div2"){ puts find(:xpath, "//span[contains(@id, \"field\")]").text }
#=> field 1
But by making the XPath look within the current node (i.e. using a leading .), you get the results you want:
within("div#div1"){ puts find(:xpath, ".//span[contains(@id, \"field\")]").text }
#=> field 1
within("div#div2"){ puts find(:xpath, ".//span[contains(@id, \"field\")]").text }
#=> field 2

Why does Date.new not call initialize?

I want to create a subclass of Date.
A normal, healthy, young rubyist, unscarred by the idiosyncrasy of Date's implementation would go about this in the following manner:
require 'date'
class MyDate < Date
  def initialize(year, month, day)
    @original_month = month
    @original_day = day
    # Christmas comes early!
    super(year, 12, 25)
  end
end
And proceed to use it in the most expected manner...
require 'my_date'
mdt = MyDate.new(2012, 1, 28)
puts mdt.to_s
... only to be double-crossed by the fact that the Date::new method is actually an alias for Date::civil, which never calls initialize. In this case, the last piece of code prints "2012-01-28" instead of the expected "2012-12-25".
Dear Ruby-community, wtf is this?
Is there some very good reason for aliasing new, so that it ignores initialize, and as a result, any common sense and regard for the client's programmer's mental health?

You define initialize, but you create the new instance with new. new returns a new instance of the class, not the result of initialize.
You may do:
require 'date'
class MyDate < Date
  def self.new(year, month, day)
    @original_month = month
    @original_day = day
    # Christmas comes early!
    super(year, 12, 25)
  end
end
mdt = MyDate.new(2012, 1, 28)
puts mdt.to_s
Remark:
@original_month and @original_day are not available in this solution. The following solution extends Date, so you can access the original month and day. For normal dates, the values will be nil.
require 'date'
class Date
  attr_accessor :original_month
  attr_accessor :original_day
end
class MyDate < Date
  def self.new(year, month, day)
    # Christmas comes early!
    date = super(year, 12, 25)
    date.original_month = month
    date.original_day = day
    date
  end
end
mdt = MyDate.new(2012, 1, 28)
puts mdt.to_s
puts mdt.original_month
But I would recommend:
require 'date'
class MyDate < Date
  def self.create(year, month, day)
    @original_month = month
    @original_day = day
    # Christmas comes early!
    new(year, 12, 25)
  end
end
mdt = MyDate.create(2012, 1, 28)
puts mdt.to_s
or
require 'date'
class Date
  def this_year_christmas
    # Christmas comes early!
    self.class.new(year, 12, 25)
  end
end
mdt = Date.new(2012, 1, 28).this_year_christmas
puts mdt.to_s
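For the factory-method approach, the original month and day can be kept on the returned object by defining the accessors on the subclass instead of reopening Date. This is only a sketch of that idea. The create method and the two accessors are names chosen here for illustration, and Date itself is left untouched:

```ruby
require 'date'

class MyDate < Date
  # Accessors live on the subclass only; plain Date objects are not changed.
  attr_accessor :original_month, :original_day

  # Hypothetical factory: builds the shifted date, then records
  # the values the caller originally asked for on that instance.
  def self.create(year, month, day)
    date = new(year, 12, 25) # Christmas comes early!
    date.original_month = month
    date.original_day = day
    date
  end
end

mdt = MyDate.create(2012, 1, 28)
puts mdt.to_s
puts mdt.original_month
puts mdt.original_day
```

Because create does not override new or rely on initialize, it should behave the same whether or not Date.new calls initialize.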
