Given an HTML file:
<div>
  <div class="NormalMid">
    <span class="style-span">
      "Data 1:"
      <a href="http://www.site.com/data/1">1</a>
      <a href="http://www.site.com/data/2">2</a>
    </span>
  </div>
  ...more divs
  <div class="NormalMid">
    <span class="style-span">
      "Data 20:"
      <a href="http://www.site.com/data/20">20</a>
      <a href="http://www.site.com/data/21">21</a>
      <a href="http://www.site.com/data/22">22</a>
      <a href="http://www.site.com/data/23">23</a>
    </span>
  </div>
  ...more divs
</div>
Using these SO posts as reference:
How do I integrate these two conditions block codes to mine in Ruby?
and
How to understand this Arrays and loops in Ruby?
My code:
require 'nokogiri'
require 'pp'
require 'open-uri'
data_file = 'site.htm'
file = File.open(data_file, 'r')
html = open(file)
page = Nokogiri::HTML(html)
page.encoding = 'utf-8'
rows = page.xpath('//div[@class="NormalMid"]')
details = rows.collect do |row|
  detail = {}
  [
    [row.children.first.element_children, row.children.first.element_children],
  ].each do |part, link|
    data = row.children[0].children[0].to_s.strip
    links = link.collect { |item| item.at_xpath('@href').to_s.strip }
    detail[data.to_sym] = links
  end
  detail
end
details.reject! { |d| d.empty? }
pp details
The output:
[{:"Data 1:"=>
["http://www.site.com/data/1",
"http://www.site.com/data/2"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20",
"http://www.site.com/data/21",
"http://www.site.com/data/22",
"http://www.site.com/data/20",]},
...
}]
Everything is working well; this is exactly what I wanted.
BUT if you change these lines of code:
detail = {}
[
  [row.children.first.element_children, row.children.first.element_children],
].each do |part, link|
to:
detail = {}
[
  [row.children.first.element_children],
].each do |link|
I get this output:
[{:"Data 1:"=>
["http://www.site.com/data/1"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20"]},
...
}]
Only the first anchor href is stored in the array.
I just need some clarification on why it's behaving that way. Since the argument part in the block's argument list is not being used, I figured I didn't need it there, but my program doesn't work correctly if I also delete the corresponding row.children.first.element_children.
What is going on in the [[obj,obj],].each do block? I just started Ruby a week ago and I'm still getting used to the syntax; any help will be appreciated. Thank you :D
EDIT
rows[0].children.first.element_children[0] gives this output:
#<Nokogiri::XML::Element:0xcea69c name="a" attributes=[#<Nokogiri::XML::Attr:0xcea648
name="href" value="http://www.site.com/data/1">] children=[#<Nokogiri::XML::Text:0xcea1a4
"1">]>
puts rows[0].children.first.element_children[0]
1
You've made your code overly complicated. Looking at it, it seems you are trying to get something like this:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-eotl
<div>
  <div class="NormalMid">
    <span class="style-span">
      "Data 1:"
      <a href="http://site.com/data/1">1</a>
      <a href="http://site.com/data/2">2</a>
    </span>
  </div>
  <div class="NormalMid">
    <span class="style-span">
      "Data 20:"
      <a href="http://site.com/data/20">20</a>
      <a href="http://site.com/data/21">21</a>
      <a href="http://site.com/data/22">22</a>
      <a href="http://site.com/data/23">23</a>
    </span>
  </div>
</div>
eotl
rows = doc.xpath("//div[#class='NormalMid']/span[#class='style-span']")
val = rows.map do |row|
[row.at_xpath("./text()").to_s.tr('"','').strip,row.xpath(".//#href").map(&:to_s)]
end
Hash[val]
# => {"Data 1:"=>["http://site.com/data/1", "http://site.com/data/2"],
# "Data 20:"=>
# ["http://site.com/data/20",
# "http://site.com/data/21",
# "http://site.com/data/22",
# "http://site.com/data/23"]}
What is going on in the [[obj,obj],].each do block?
Look at the two examples below:
[[1],[4,5]].each do |a|
p a
end
# >> [1]
# >> [4, 5]
[[1,2],[4,5]].each do |a,b|
p a, b
end
# >> 1
# >> 2
# >> 4
# >> 5
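Applied to the question's code, this is exactly what happens (a small sketch; the symbols below stand in for the <a> elements in row.children.first.element_children). With two block parameters the inner array is destructured, so link is the NodeSet and collect visits each <a>. With only one parameter, link is the wrapping array, so collect sees a single item, the whole NodeSet, and calling at_xpath('@href') on a NodeSet returns just its first match:
anchors = [:a1, :a2, :a3]   # stands in for the NodeSet of <a> elements

[[anchors, anchors]].each do |part, link|
  p link.collect { |item| item }   # one item per <a>
end
# >> [:a1, :a2, :a3]

[[anchors]].each do |link|
  p link.collect { |item| item }   # a single item: the whole set
end
# >> [[:a1, :a2, :a3]]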
I'm trying to find a way to pull content directly below a header tag and group it into an array based on the header text.
I think I found a solution that is VERY similar to this, but it won't work, and I'm wondering if that's because the website I'm scraping from does not have the 'li' elements grouped into 'ul' tags.
My code:
require 'nokogiri'
require 'open-uri'

BASE_URL = "https://www.hornellanimalshelter.org/donate.html"
doc = Nokogiri::HTML(open(BASE_URL))

cats = doc.search('.box-09_cnt h4')
cats_and_items = cats.map{ |cat|
  items = cat.next_element.search('li')
  {name: cat.text, items: items.map(&:text)}
}
=> [{:name=>"Toys & Enrichment", :items=>[]}, {:name=>"Office
Supplies", :items=>[]}, {:name=>"Cleaning Supplies", :items=>[]},
{:name=>"Food & Treats", :items=>[]}, {:name=>"Kennel Care", :items=>
[]}, {:name=>"& More!", :items=>[]}]
As you can see above, it won't pull any of the li elements, but the same approach works fine with something simple like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
</ul>
EOT
states = doc.search('h4')
states_and_cities = states.map{ |state|
cities = state.next_element.search('li')
[state.text, cities.map(&:text)]
}.to_h
states_and_cities
# => {"Alabama"=>["auburn", "birmingham"],
# "Alaska"=>["anchorage / mat-su", "fairbanks"]}
Any thoughts? Much appreciated in advance!
Something like this maybe (untested):
data = doc.search('h4').map do |h4|
  [h4.text, h4.search('+ ul li').map(&:text)]
end
and then to get a hash:
h = Hash[data]
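If the list items on the shelter page really aren't wrapped in ul tags, as the question suspects, neither selector will find them under the header. One alternative, sketched here against hypothetical markup, is to walk the siblings that follow each h4 until the next h4 is reached:
require 'nokogiri'

# Hypothetical markup: headers followed by bare <li> siblings, no <ul> wrapper.
doc = Nokogiri::HTML(<<EOT)
<h4>Toys &amp; Enrichment</h4>
<li>Balls</li>
<li>Catnip</li>
<h4>Office Supplies</h4>
<li>Pens</li>
EOT

# Collect <li> siblings following each <h4> until the next <h4> is reached.
data = doc.search('h4').map do |h4|
  items = []
  node = h4.next_element
  while node && node.name != 'h4'
    items << node.text.strip if node.name == 'li'
    node = node.next_element
  end
  [h4.text, items]
end

Hash[data]
# => {"Toys & Enrichment"=>["Balls", "Catnip"], "Office Supplies"=>["Pens"]}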
How do I remove spaces in my code? If I parse this HTML with Nokogiri:
<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>
I get the following output:
Kühlungsborner Straße
10
which is not left-justified.
My code is:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text
Please try strip:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text.strip
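Note that strip only trims the ends of the string, so the newline and indentation between the street and the house number will survive. If those should be collapsed too, one option (a sketch building on the same variable) is:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text
address_street.gsub(/\s+/, ' ').strip  # => "Kühlungsborner Straße 10"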
Consider this:
require 'nokogiri'
doc = Nokogiri::HTML('<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>')
doc.search('div').text
# => "Kühlungsborner Straße\n 10\n "
puts doc.search('div').text
# >> Kühlungsborner Straße
# >> 10
# >>
The given HTML doesn't replicate the problem you're having. It's really important to present valid input that duplicates the problem. Moving on....
Don't use xpath, css or search with text. You usually won't get what you expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<span>foo</span>
<span>bar</span>
</div>
</body>
</html>
EOT
doc.search('span').class # => Nokogiri::XML::NodeSet
doc.search('span') # => [#<Nokogiri::XML::Element:0x3fdb6981bcd8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981b5d0 "foo">]>, #<Nokogiri::XML::Element:0x3fdb6981aab8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981a054 "bar">]>]
doc.search('span').text
# => "foobar"
Note that text returned the concatenated text of all nodes found.
Instead, walk the NodeSet and grab the individual node's text:
doc.search('span').map(&:text)
# => ["foo", "bar"]
I am trying to scrape and make a CSV file from this HTML:
<ul class="object-props">
<li class="object-props-item price">
<strong>CHF 14'800.-</strong>
</li>
<li class="object-props-item milage">31'000 km</li>
<li class="object-props-item date">08.2012</li>
</ul>
I want to extract the price and mileage using:
require 'rubygems'
require 'nokogiri'
require 'csv'
require 'open-uri'

url = "/tto.htm"
data = Nokogiri::HTML(open(url))

CSV.open('csv.csv', 'wb') do |csv|
  csv << %w[ price mileage ]
  price = data.css('.price').text
  mileage = data.css('.mileage').text
  csv << [price, mileage]
end
The result is not really what I'm expecting. Two columns are created, but how can I remove characters like CHF and km, and why is the mileage data not displaying?
My guess is that the text in the HTML includes units of measure; CHF for Swiss Francs for the price, and km for kilometers for the mileage.
You could add split.first or split.last to get the number without the unit of measure, e.g.:
2.3.0 :007 > 'CHF 100'.split.last
=> "100"
2.3.0 :008 > '99 km'.split.first
=> "99"
Removing/ignoring the unwanted text is not a Nokogiri problem, it's a String processing problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>
EOT
str = doc.at('strong').text # => "CHF 14'900.-"
At this point str contains the text of the <strong> node.
A simple regex will extract the number, which is the most straightforward way to grab the data:
str[/[\d']+/] # => "14'900"
sub could be used to remove the 'CHF ' substring:
str.sub('CHF ', '') # => "14'900.-"
delete could be used to remove the characters C, H, F and the space:
str.delete('CHF ') # => "14'900.-"
tr could be used to remove everything that is NOT 0..9, ', . or -:
str.tr("^0-9'.-", '') # => "14'900.-"
Modify one of the above if you don't want ', . or -.
why is the mileage data not displaying
Because there's a mismatch between the CSS selector and the actual class attribute:
require 'nokogiri'
doc = Nokogiri::HTML(%(<li class="object-props-item milage">61'000 km</li>))
doc.at('.mileage').text # =>
# ~> NoMethodError
# ~> undefined method `text' for nil:NilClass
# ~>
# ~> /var/folders/yb/whn8dwns6rl92jswry5cz87dsgk2n1/T/seeing_is_believing_temp_dir20160428-96035-1dajnql/program.rb:5:in `<main>'
Instead it should be:
doc.css('.milage').text # => "61'000 km"
But that's not all that's wrong. There's a subtle problem waiting to bite you later.
css or search returns a NodeSet whereas at or at_css returns an Element:
doc.css('.milage').class # => Nokogiri::XML::NodeSet
doc.at('.milage').class # => Nokogiri::XML::Element
Here's what happens when text is called on a NodeSet containing multiple matching nodes:
doc = Nokogiri::HTML('<p>foo</p><p>bar</p>')
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "foo"
When text is used with a NodeSet it returns the text of all nodes concatenated into a single string. This can make it really difficult to separate the text of one node from another. Instead, use at or one of the at_* equivalents to get the text from a single node. If you want to extract the text from each node individually and get an array, use:
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
Finally, notice that your HTML sample isn't valid:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>li class="object-props-item price"
# >> <strong>CHF 14'900.-</strong>
# >> </p>
# >> <li class="object-props-item milage">61'000 km</li>')
# >> </body></html>
Here's what happens:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price') # => nil
Nokogiri has to do a fix-up to make sense of the first line, so it wraps it in <p>. By doing so, the .price class no longer exists, so your code will fail again.
Fixing the tag results in a correct response:
doc = Nokogiri::HTML(<<EOT)
<li class="object-props-item price">
<strong>CHF 14'900.-</strong>
</li>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price').to_html # => "<li class=\"object-props-item price\">\n<strong>CHF 14'900.-</strong>\n</li>"
This is why it's really important to make sure your input is valid. Trying to duplicate your problem is difficult without it.
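Putting the pieces together, a corrected version of the question's script could look something like this (a sketch run against the HTML sample from the question, using the regex extraction shown above):
require 'nokogiri'
require 'csv'

doc = Nokogiri::HTML(<<EOT)
<ul class="object-props">
  <li class="object-props-item price">
    <strong>CHF 14'800.-</strong>
  </li>
  <li class="object-props-item milage">31'000 km</li>
</ul>
EOT

CSV.open('csv.csv', 'wb') do |csv|
  csv << %w[price mileage]
  price   = doc.at('.price').text[/[\d']+/]   # => "14'800"
  mileage = doc.at('.milage').text[/[\d']+/]  # => "31'000"
  csv << [price, mileage]
end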
I have the following script that reads a file and then puts the records into an array, splitting on </h1>. How do I read only the contents between <h1> and </h1>?
This is my script:
out_array = []
open('foo.html') do |f|
  f.each('</h1>') do |record|
    record.gsub!("\n", ' ')
    out_array.push record
  end
end

# print array
p out_array
This is my HTML:
</h1>
akwotdfg
<h1>
<h1>I am foo</h1>
<h1>
Stubborn quaz
</h1>
<h3>
iThis
is a reas
long one line shit
</h3>
<h1>I am foo</h1>
This is my output:
["</h1>", " akwotdfg <h1> <h1>I am foo</h1>", " <h1> Stubborn quaz </h1>", " <h3> iThis is a reas long one line shit </h3> <h1>I am foo</h1>", " "]
Please take a look at the following code:
out_array = open('foo.html') do |f|
  f.read.scan(/<h1>(.*)<\/h1>/)
end
puts out_array
execution result:
I am foo
I am foo
Updated for a multi-line scan:
out_array = open('tempdir/foo.html') do |f|
  f.read.scan(/<h1>([^<]*?)<\/h1>/m)
end
out_array.map! { |e| e[0].strip }
p out_array
execution result:
["I am foo", "Stubborn quaz", "I am foo"]
Don't use regular expressions to deal with HTML or XML. For trivial content that you manage it's possible, but your code becomes liable to break for anything that can change at someone else's bidding.
Instead use a parser, like Nokogiri:
require 'nokogiri'
html = '
</h1>
akwotdfg
<h1>
<h1>I am foo</h1>
<h1>
Stubborn quaz
</h1>
<h3>
iThis
is a reas
long one line
</h3>
<h1>I am foo</h1>
'
doc = Nokogiri::HTML(html)
h1_contents = doc.search('h1').map(&:text)
puts h1_contents
Which outputs:
# >>
# >> I am foo
# >>
# >> Stubborn quaz
# >>
# >>
# >> iThis
# >> is a reas
# >> long one line
# >>
# >> I am foo
# >> I am foo
# >>
# >> Stubborn quaz
# >>
# >> I am foo
Notice that Nokogiri is returning the content inside the <h3> block. This is correct/expected behavior because the HTML is malformed. Nokogiri fixes malformed HTML in an attempt to help retrieve usable content, but because there are many possible locations for the closing tag, Nokogiri inserts the closing tag at the last location that would be syntactically correct. Humans know to do it earlier, but this is software trying to be helpful.
This situation requires you to preprocess the HTML to make it correct. I'm using a single, simple sub to fix the first <h1> found:
doc = Nokogiri::HTML(html.sub(/^(<h1>)$/, '\1</h1>'))
h1_contents = doc.search('h1').map(&:text)
puts h1_contents
# >> I am foo
# >>
# >> Stubborn quaz
# >> I am foo
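If the empty heading left behind by that repair is unwanted, one option is to strip and filter the results after extraction (a small follow-on sketch):
h1_contents = doc.search('h1').map { |h1| h1.text.strip }.reject(&:empty?)
# => ["I am foo", "Stubborn quaz", "I am foo"]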
I have the following HTML, which has a couple of duplicate hrefs. How do I extract only the unique links?
<div class="pages">
1
2
3
4
5
next ›
last »
</div>
# p => is the page that has this html
# The below gives 7 as expected, but I don't need the next/last links as they are duplicates
p.css(".pages a").count

# So I tried uniq, which obviously didn't work
p.css(".pages").css("a").uniq      #=> didn't work
p.css(".pages").css("a").to_a.uniq #=> didn't work
Try extracting the "href" attribute from the matching elements (el.attr('href')):
html = Nokogiri::HTML(your_html_string)
html.css('a').map { |el| el.attr('href') }.uniq
# /search_results.aspx?f=Technology&Page=1
# /search_results.aspx?f=Technology&Page=2
# /search_results.aspx?f=Technology&Page=3
# /search_results.aspx?f=Technology&Page=4
# /search_results.aspx?f=Technology&Page=5
# /search_results.aspx?f=Technology&Page=6
The same can be done using #xpath. I would do it as below:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-HTML
<div class="pages">
1
2
3
4
5
next ›
last »
</div>
HTML
doc.xpath("//a/#href").map(&:to_s).uniq
# => ["/search_results.aspx?f=Technology&Page=1",
# "/search_results.aspx?f=Technology&Page=2",
# "/search_results.aspx?f=Technology&Page=3",
# "/search_results.aspx?f=Technology&Page=4",
# "/search_results.aspx?f=Technology&Page=5",
# "/search_results.aspx?f=Technology&Page=6"]
Another way to do the same job, where unique value selection is handled in the XPath expression itself:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-HTML
<div class="pages">
1
2
3
4
5
next ›
last »
</div>
HTML
doc.xpath("//a[not(#href = preceding-sibling::a/#href)]/#href").map(&:to_s)
# => ["/search_results.aspx?f=Technology&Page=1",
# "/search_results.aspx?f=Technology&Page=2",
# "/search_results.aspx?f=Technology&Page=3",
# "/search_results.aspx?f=Technology&Page=4",
# "/search_results.aspx?f=Technology&Page=5",
# "/search_results.aspx?f=Technology&Page=6"]