How to export HTML data to a CSV file - ruby

I am trying to scrape and make a CSV file from this HTML:
<ul class="object-props">
<li class="object-props-item price">
<strong>CHF 14'800.-</strong>
</li>
<li class="object-props-item milage">31'000 km</li>
<li class="object-props-item date">08.2012</li>
</ul>
I want to extract the price and mileage using:
require 'rubygems'
require 'nokogiri'
require 'csv'
require 'open-uri'
url= "/tto.htm"
data = Nokogiri::HTML(open(url))
CSV.open('csv.csv', 'wb') do |csv|
csv << %w[ price mileage ]
price=data.css('.price').text
mileage=data.css('.mileage').text
csv << [price, mileage]
end
The result is not really what I'm expecting. Two columns are created, but how can I remove characters like CHF and km, and why is the mileage data not showing up in the result?

My guess is that the text in the HTML includes units of measure; CHF for Swiss Francs for the price, and km for kilometers for the mileage.
You could add split.first or split.last to get the number without the unit of measure, e.g.:
2.3.0 :007 > 'CHF 100'.split.last
=> "100"
2.3.0 :008 > '99 km'.split.first
=> "99"

Removing/ignoring the unwanted text is not a Nokogiri problem, it's a String processing problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>
EOT
str = doc.at('strong').text # => "CHF 14'900.-"
At this point str contains the text of the <strong> node.
A simple regex will extract the digits and apostrophes, which is the most straightforward way to grab the data:
str[/[\d']+/] # => "14'900"
sub could be used to remove the 'CHF ' substring:
str.sub('CHF ', '') # => "14'900.-"
delete could be used to remove the characters C, H, F and the space character:
str.delete('CHF ') # => "14'900.-"
tr could be used to remove everything that is NOT 0..9, ', . or -:
str.tr("^0-9'.-", '') # => "14'900.-"
Modify one of the above if you don't want ', . or -.
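If the goal is a plain number in the CSV, the regex approach above can be combined with delete and to_i. This is only a minimal sketch: to_number is a hypothetical helper name, and it assumes the apostrophe is always a thousands separator, as in the sample values.

```ruby
# Hypothetical helper: take the first run of digits and apostrophes,
# drop the apostrophes (assumed to be thousands separators), and
# convert what is left to an Integer. Returns nil if no digits are found.
def to_number(str)
  digits = str[/[\d']+/]
  return nil if digits.nil?

  digits.delete("'").to_i
end

price   = to_number("CHF 14'800.-")
mileage = to_number("31'000 km")
missing = to_number("n/a")

puts price    # 14800
puts mileage  # 31000
p missing     # nil
```

Returning nil for text without digits lets the caller decide whether to skip the row or write an empty cell.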
why are the data of the mileage not displaying
Because the CSS selector doesn't match the actual class attribute. The HTML spells the class milage, while the selector looks for .mileage:
require 'nokogiri'
doc = Nokogiri::HTML(%q(<li class="object-props-item milage">61'000 km</li>))
doc.at('.mileage').text # =>
# ~> NoMethodError
# ~> undefined method `text' for nil:NilClass
# ~>
# ~> /var/folders/yb/whn8dwns6rl92jswry5cz87dsgk2n1/T/seeing_is_believing_temp_dir20160428-96035-1dajnql/program.rb:5:in `<main>'
Instead it should be:
doc.css('.milage').text # => "61'000 km"
But that's not all that's wrong. There's a subtle problem waiting to bite you later.
css or search returns a NodeSet whereas at or at_css returns an Element:
doc.css('.milage').class # => Nokogiri::XML::NodeSet
doc.at('.milage').class # => Nokogiri::XML::Element
Here's what happens when text is called on a NodeSet containing multiple matching nodes:
doc = Nokogiri::HTML('<p>foo</p><p>bar</p>')
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "foo"
When text is called on a NodeSet, it returns the text of all nodes concatenated into a single string, which makes it hard to tell where one node's text ends and the next begins. Instead, use at or one of the at_* equivalents to get the text from a single node. To extract the text of each node individually as an array, use:
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
Finally, notice that your HTML sample isn't valid:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>li class="object-props-item price"
# >> <strong>CHF 14'900.-</strong>
# >> </p>
# >> <li class="object-props-item milage">61'000 km</li>')
# >> </body></html>
Here's what happens:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price') # => nil
Nokogiri has to do a fix-up to make sense of the first line, so it wraps it in <p>. By doing so the .price class no longer exists so your code will fail again.
Fixing the tag results in a correct response:
doc = Nokogiri::HTML(<<EOT)
<li class="object-props-item price">
<strong>CHF 14'900.-</strong>
</li>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price').to_html # => "<li class=\"object-props-item price\">\n<strong>CHF 14'900.-</strong>\n</li>"
This is why it's really important to make sure your input is valid. Trying to duplicate your problem is difficult without it.

Related

How to remove white space from HTML text

How do I remove spaces in my code? If I parse this HTML with Nokogiri:
<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>
I get the following output:
Kühlungsborner Straße
10
which is not left-justified.
My code is:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text
Please try strip:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text.strip
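Note that strip only trims the two ends of the string, so a newline and indentation between the street name and the house number would remain. Below is a small sketch of collapsing every whitespace run into one space; the raw string is an assumed stand-in for what text returns, and the exact amount of inner whitespace is a guess.

```ruby
# Assumed stand-in for the text Nokogiri returns for the <div>.
raw = "Kühlungsborner Straße\n      10\n    "

# strip removes leading and trailing whitespace only.
stripped = raw.strip

# split with no arguments splits on any whitespace run and drops empty
# pieces, so joining with a single space normalizes the whole string.
collapsed = raw.split.join(' ')

# Equivalent result using a regex substitution.
squeezed = raw.gsub(/\s+/, ' ').strip

p stripped   # "Kühlungsborner Straße\n      10"
p collapsed  # "Kühlungsborner Straße 10"
p squeezed   # "Kühlungsborner Straße 10"
```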
Consider this:
require 'nokogiri'
doc = Nokogiri::HTML('<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>')
doc.search('div').text
# => "Kühlungsborner Straße\n 10\n "
puts doc.search('div').text
# >> Kühlungsborner Straße
# >> 10
# >>
The given HTML doesn't replicate the problem you're having. It's really important to present valid input that duplicates the problem. Moving on....
Don't use xpath, css or search with text. You usually won't get what you expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<span>foo</span>
<span>bar</span>
</div>
</body>
</html>
EOT
doc.search('span').class # => Nokogiri::XML::NodeSet
doc.search('span') # => [#<Nokogiri::XML::Element:0x3fdb6981bcd8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981b5d0 "foo">]>, #<Nokogiri::XML::Element:0x3fdb6981aab8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981a054 "bar">]>]
doc.search('span').text
# => "foobar"
Note that text returned the concatenated text of all nodes found.
Instead, walk the NodeSet and grab the individual node's text:
doc.search('span').map(&:text)
# => ["foo", "bar"]

How to get Nokogiri inner_HTML object to ignore/remove escape sequences

Currently, I am trying to get the inner HTML of an element on a page using Nokogiri. However, I'm not just getting the text of the element, I'm also getting the whitespace escape sequences around it. Is there a way I can suppress or remove them with Nokogiri?
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://the.page.url.com"))
page.at_css("td[custom-attribute='foo']").parent.css('td').css('a').inner_html
This returns: "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"
What is the most effective and direct nokogiri (or ruby) way of doing this?
page.at_css("td[custom-attribute='foo']")
.parent
.css('td')
.css('a')
.text # since you need a text, not inner_html
.strip # this will strip a result
See String#strip for details.
Sidenote: css('td a') is likely more efficient than css('td').css('a').
It's important to drill in to the closest node containing the text you want. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
doc.at('body').inner_html # => "\n <p>foo</p>\n "
doc.at('body').text # => "\n foo\n "
doc.at('p').inner_html # => "foo"
doc.at('p').text # => "foo"
at, at_css and at_xpath return a Node/XML::Element. search, css and xpath return a NodeSet. There's a big difference in how text or inner_html return information when looking at a Node or NodeSet:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.at('p') # => #<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>, #<Nokogiri::XML::Element:0x3fd635cf32bc name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf30dc "bar">]>]
doc.at('p').class # => Nokogiri::XML::Element
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.at('p').text # => "foo"
doc.search('p').text # => "foobar"
Notice that search returned a NodeSet and that text returned the text of all its nodes concatenated together. This is rarely what you want.
Also notice that Nokogiri is smart enough to figure out whether a selector is CSS or XPath 99% of the time, so using the generic search and at for either type of selector is very convenient.

Need clarification with 'each-do' block in my ruby code

Given an html file:
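<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
<a href="http://www.site.com/data/1">1</a>
<a href="http://www.site.com/data/2">2</a>
</span>
</div>
...more divs
<div class="NormalMid">
<span class="style-span">
"Data 20:"
<a href="http://www.site.com/data/20">20</a>
<a href="http://www.site.com/data/21">21</a>
<a href="http://www.site.com/data/22">22</a>
<a href="http://www.site.com/data/23">23</a>
</span>
</div>
...more divs
</div>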
Using these SO posts as reference:
How do I integrate these two conditions block codes to mine in Ruby?
and
How to understand this Arrays and loops in Ruby?
My code:
require 'nokogiri'
require 'pp'
require 'open-uri'
data_file = 'site.htm'
file = File.open(data_file, 'r')
html = open(file)
page = Nokogiri::HTML(html)
page.encoding = 'utf-8'
rows = page.xpath('//div[@class="NormalMid"]')
details = rows.collect do |row|
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
data = row.children[0].children[0].to_s.strip
links = link.collect {|item| item.at_xpath('@href').to_s.strip}
detail[data.to_sym] = links
end
detail
end
details.reject! {|d| d.empty?}
pp details
The output:
[{:"Data 1:"=>
["http://www.site.com/data/1",
"http://www.site.com/data/2"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20",
"http://www.site.com/data/21",
"http://www.site.com/data/22",
"http://www.site.com/data/20",]},
...
}]
Everything is going good, exactly what I wanted.
BUT if you change these lines of code:
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
to:
detail = {}
[
[row.children.first.element_children],
].each do |link|
I get the output of
[{:"Data 1:"=>
["http://www.site.com/data/1"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20"]},
...
}]
Only the first anchor href is stored in the array.
I just need some clarification on why it's behaving that way. Because the argument part in the argument list is not being used, I figured I didn't need it there. But my program doesn't work correctly if I delete the corresponding row.children.first.element_children as well.
What is going on in the [[obj,obj],].each do block? I just started Ruby a week ago, and I'm still getting used to the syntax, so any help will be appreciated. Thank you :D
EDIT
rows[0].children.first.element_children[0] will have the output
#<Nokogiri::XML::Element:0xcea69c name="a" attributes=[#<Nokogiri::XML::Attr:0xcea648 name="href" value="http://www.site.com/data/1">] children=[#<Nokogiri::XML::Text:0xcea1a4 "1">]>
puts rows[0].children.first.element_children[0]
<a href="http://www.site.com/data/1">1</a>
You made your code overly complicated. Looking at your code, it seems you are trying to get something like the following:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-eotl
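<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
<a href="http://site.com/data/1">1</a>
<a href="http://site.com/data/2">2</a>
</span>
</div>
<div class="NormalMid">
<span class="style-span">
"Data 20:"
<a href="http://site.com/data/20">20</a>
<a href="http://site.com/data/21">21</a>
<a href="http://site.com/data/22">22</a>
<a href="http://site.com/data/23">23</a>
</span>
</div>
</div>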
eotl
rows = doc.xpath("//div[@class='NormalMid']/span[@class='style-span']")
val = rows.map do |row|
[row.at_xpath("./text()").to_s.tr('"','').strip,row.xpath(".//@href").map(&:to_s)]
end
Hash[val]
# => {"Data 1:"=>["http://site.com/data/1", "http://site.com/data/2"],
# "Data 20:"=>
# ["http://site.com/data/20",
# "http://site.com/data/21",
# "http://site.com/data/22",
# "http://site.com/data/23"]}
What is going on in the [[obj,obj],].each do block?
Look at the two examples below:
[[1],[4,5]].each do |a|
p a
end
# >> [1]
# >> [4, 5]
[[1,2],[4,5]].each do |a,b|
p a, b
end
# >> 1
# >> 2
# >> 4
# >> 5
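Applied to the question, the difference comes down to block-parameter destructuring. In the sketch below a plain array of strings stands in for row.children.first.element_children (an assumption made so the example runs without Nokogiri); the variable names are illustrative only.

```ruby
# Stand-in for the NodeSet of <a> elements in one row.
anchors = %w[a1 a2 a3]

# One block parameter: the yielded element [anchors] arrives intact,
# so link is a one-element array whose only item is the whole collection.
one_param = []
[[anchors]].each { |link| one_param << link }

# Two block parameters: the yielded element [anchors, anchors] is
# destructured, so link is the collection itself.
two_params = []
[[anchors, anchors]].each { |part, link| two_params << link }

# Parentheses destructure explicitly, without needing a dummy second element.
unpacked = []
[[anchors]].each { |(link)| unpacked << link }

p one_param.first   # [["a1", "a2", "a3"]]
p two_params.first  # ["a1", "a2", "a3"]
p unpacked.first    # ["a1", "a2", "a3"]
```

With a single parameter, link.collect in the question iterates over a one-element array, so at_xpath runs once on the whole NodeSet and returns only the first href. Dropping the outer array literal and iterating over row.children.first.element_children directly would avoid the issue as well.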

how to get html class values using regular expression in ruby

I have this below string from which I want to extract class values "ruby", "html", "java". My objective here is understanding / learning regular expressions that I have always dreaded :-).
<div class="ruby" name="ruby_doc">
<div class="html" name="html_doc">
<div class="java" name="java_doc">
This is what I have so far
str = <<END
<div class="ruby" name="ruby_doc">
<div class="html" name="html_doc">
<div class="java" name="java_doc">
END
str.scan(/"[^"]+/) #=> returns
["\"ruby", "\" name=", "\"ruby_doc", "\">\n<div class=", "\"html",...]
str.scan(/class="[^"]+/) #=> ["class=\"ruby", "class=\"html", "class=\"java"]
str.scan(/"(\w+?)"/) #=> [["ruby"], ["ruby_doc"], ["html"], ["html_doc"], ...]
str.scan(/\b(?<=class=\")[^"]+(?=\")/)
# => ["ruby", "html", "java"]
Use Nokogiri for this :
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-_html_
<div class="ruby" name="ruby_doc">
<div class="html" name="html_doc">
<div class="java" name="java_doc">
_html_
# to get values of class attribute
doc.xpath('//div/@class').map(&:to_s)
# => ["ruby", "html", "java"]
# to get values of name attribute
doc.xpath('//div/@name').map(&:to_s)
# => ["ruby_doc", "html_doc", "java_doc"]
Parsing HTML with regex is not recommended. If you had to write a somewhat ok regex, then you could try with
str.scan /<div\s+class=\s*"([^"]+)/
#=> [["ruby"], ["html"], ["java"]]
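Since the question is about learning regular expressions, it may help to see why some of these calls return nested arrays. As a short sketch: String#scan returns whole matches when the pattern has no capture groups, and one inner array per match when it has groups. The two-group pattern at the end is an extra illustration and assumes class always comes directly before name, as in the sample.

```ruby
str = <<~HTML
  <div class="ruby" name="ruby_doc">
  <div class="html" name="html_doc">
  <div class="java" name="java_doc">
HTML

# No capture group: each result is the entire matched text.
whole_matches = str.scan(/class="[^"]+"/)

# One capture group: each result is an array holding that group.
one_group = str.scan(/class="([^"]+)"/)

# flatten turns the one-element arrays into a flat list.
classes = one_group.flatten

# Two capture groups: each result is a [class, name] pair, which to_h
# converts into a Hash.
pairs = str.scan(/class="([^"]+)"\s+name="([^"]+)"/).to_h

p whole_matches
p one_group
p classes
p pairs
```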
You really should use Nokogiri as per @Arup's answer. But, if you insist...
str.scan(/(?:class\=\")(\w+)(?:\")/).flatten
Live test in Ruby console
2.0.0p247 :001 > str = <<END
2.0.0p247 :002"> <div class="ruby" name="ruby_doc">
2.0.0p247 :003"> <div class="html" name="html_doc">
2.0.0p247 :004"> <div class="java" name="java_doc">
2.0.0p247 :005"> END
=> "<div class=\"ruby\" name=\"ruby_doc\">\n<div class=\"html\" name=\"html_doc\">\n<div class=\"java\" name=\"java_doc\">\n"
2.0.0p247 :006 > str.scan(/(?:class\=\")(\w+)(?:\")/).flatten
=> ["ruby", "html", "java"]
Howsabout:
str.scan /"(.*?)"/
#=> [["ruby"], ["ruby_doc"], ["html"], ["html_doc"], ["java"], ["java_doc"]]

capturing specific text between tags

The explanation is in the comment. I put it there because the markup is interpreted as bold or something, and it screws up the post.
# I need to capture text that is
# enclosed in tags that are both <b> and
# <i>, but if there is more than one
# text enclosed in <i> in the same <b>
# block, then I only want the text
# enclosed in the first <i> tag, For
# example, for the following line:
#
# <b> <i> Important text here </i>
# irrelevant text everywhere else <i>
# irrelevant text here </i> </b> <b>
# <i> Also Important </i> not important
# <i> not important </i> </b>
#
# I want to retrieve only:
# - Important text here
# - Also Important
#
# I also must not retrieve text inside an
# <h2> block. I have been trying to
# delete the block with nodes.delete(nodes.search('h2')),
# but it doesn't actually delete the h2 block
require "rubygems"
require "nokogiri"
html = <<EOT
<b><i> Important text here </i> more text <i> not important text here </i> </b>
<b> <i> Also Important </i> more text <i> not important </i> </b>
<h2><b> <i> I don't want this text either</i></b></h2>
EOT
doc = Nokogiri::HTML(html)
nodes = doc.search('b i')
nodes.each { |e| puts e }
# Expected output:
# Important text here
# Also Important
require "nokogiri"
require 'pp'
html = <<EOT
<b><i>Important text here</i>more text<i>not important text here</i></b>
<b><i>Also Important</i>more text<i>not important</i></b>
<h2><b><i>I don't want this text either</i></b></h2>
EOT
doc = Nokogiri::HTML(html)
nodes = doc.search('b')
nodes.each { |e| puts e.children.children.first unless e.parent.name == "h2" }
or with xpath:
nodes = doc.xpath("//../*[local-name() != 'h2']/b/i[1]")
nodes.each { |e| puts e.children.first}
