How to get unique links in Nokogiri - ruby

I have the following HTML, which has a couple of duplicate hrefs. How do I extract only the unique links?
<div class="pages">
1
2
3
4
5
next ›
last »
</div>
# p => is the page that has this html
# The below gives 7 as expected. But I don't need next/last links as they are duplicate
p.css(".pages a").count
# So I tried uniq, which didn't work
p.css(".pages").css("a").uniq #=> didn't work
p.css(".pages").css("a").to_a.uniq #=> didn't work

Try extracting the "href" attribute from the matching elements (el.attr('href')):
html = Nokogiri::HTML(your_html_string)
html.css('a').map { |el| el.attr('href') }.uniq
# /search_results.aspx?f=Technology&Page=1
# /search_results.aspx?f=Technology&Page=2
# /search_results.aspx?f=Technology&Page=3
# /search_results.aspx?f=Technology&Page=4
# /search_results.aspx?f=Technology&Page=5
# /search_results.aspx?f=Technology&Page=6
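A plain uniq leaves all seven elements because each link is a separate node object, so none of them count as duplicates even when their href values match. The sketch below reproduces that effect in plain Ruby. FakeLink is a hypothetical stand-in for a Nokogiri element (it is not a Nokogiri class), and the choice of which pages "next" and "last" point to is an assumption made for illustration:

```ruby
# FakeLink is a hypothetical stand-in for a Nokogiri element; it is not part of Nokogiri.
class FakeLink
  attr_reader :href

  def initialize(href)
    @href = href
  end
end

# Seven links; assume "next" repeats page 2 and "last" points to page 6.
pages = [1, 2, 3, 4, 5, 2, 6]
links = pages.map { |n| FakeLink.new("/search_results.aspx?f=Technology&Page=#{n}") }

# Distinct objects are never equal to each other, so nothing is removed.
plain_count = links.uniq.size

# Comparing the href strings removes the repeat.
unique_hrefs = links.map(&:href).uniq

# A block tells uniq what to compare while keeping the original objects.
unique_links = links.uniq { |link| link.href }

puts plain_count        # 7
puts unique_hrefs.size  # 6
puts unique_links.size  # 6
```

If the elements themselves are needed rather than the strings, the block form should carry over to Nokogiri as something like p.css(".pages a").uniq { |a| a['href'] }, since a NodeSet is Enumerable.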

The same can be done using #xpath. I would do as below:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-HTML
<div class="pages">
1
2
3
4
5
next ›
last »
</div>
HTML
doc.xpath("//a/@href").map(&:to_s).uniq
# => ["/search_results.aspx?f=Technology&Page=1",
# "/search_results.aspx?f=Technology&Page=2",
# "/search_results.aspx?f=Technology&Page=3",
# "/search_results.aspx?f=Technology&Page=4",
# "/search_results.aspx?f=Technology&Page=5",
# "/search_results.aspx?f=Technology&Page=6"]

Another way to do the same job, where the unique-value selection is handled in the XPath expression itself:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-HTML
<div class="pages">
1
2
3
4
5
next ›
last »
</div>
HTML
doc.xpath("//a[not(@href = preceding-sibling::a/@href)]/@href").map(&:to_s)
# => ["/search_results.aspx?f=Technology&Page=1",
# "/search_results.aspx?f=Technology&Page=2",
# "/search_results.aspx?f=Technology&Page=3",
# "/search_results.aspx?f=Technology&Page=4",
# "/search_results.aspx?f=Technology&Page=5",
# "/search_results.aspx?f=Technology&Page=6"]

Related

How to remove white space from HTML text

How do I remove spaces in my code? If I parse this HTML with Nokogiri:
<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>
I get the following output:
Kühlungsborner Straße
10
which is not left-justified.
My code is:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text
Please try strip:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text.strip
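Note that strip only trims the ends of the string; the newline and indentation between the street name and the house number stay in place. A minimal sketch, assuming the goal is a single line with single spaces (normalize_whitespace is a hypothetical helper name, not a Nokogiri method):

```ruby
# normalize_whitespace is a hypothetical helper, not part of Nokogiri.
def normalize_whitespace(text)
  # split with no arguments breaks on any run of whitespace (newlines included)
  # and ignores leading/trailing whitespace; join puts single spaces back.
  text.split.join(' ')
end

raw = "Kühlungsborner Straße\n        10\n    "

puts raw.strip                  # the inner newline and indentation remain
puts normalize_whitespace(raw)  # Kühlungsborner Straße 10
```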
Consider this:
require 'nokogiri'
doc = Nokogiri::HTML('<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>')
doc.search('div').text
# => "Kühlungsborner Straße\n 10\n "
puts doc.search('div').text
# >> Kühlungsborner Straße
# >> 10
# >>
The given HTML doesn't replicate the problem you're having. It's really important to present valid input that duplicates the problem. Moving on....
Don't use xpath, css or search with text. You usually won't get what you expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<span>foo</span>
<span>bar</span>
</div>
</body>
</html>
EOT
doc.search('span').class # => Nokogiri::XML::NodeSet
doc.search('span') # => [#<Nokogiri::XML::Element:0x3fdb6981bcd8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981b5d0 "foo">]>, #<Nokogiri::XML::Element:0x3fdb6981aab8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981a054 "bar">]>]
doc.search('span').text
# => "foobar"
Note that text returned the concatenated text of all nodes found.
Instead, walk the NodeSet and grab the individual node's text:
doc.search('span').map(&:text)
# => ["foo", "bar"]

How to export HTML data to a CSV file

I am trying to scrape and make a CSV file from this HTML:
<ul class="object-props">
<li class="object-props-item price">
<strong>CHF 14'800.-</strong>
</li>
<li class="object-props-item milage">31'000 km</li>
<li class="object-props-item date">08.2012</li>
</ul>
I want to extract the price and mileage using:
require 'rubygems'
require 'nokogiri'
require 'CSV'
require 'open-uri'
url= "/tto.htm"
data = Nokogiri::HTML(open(url))
CSV.open('csv.csv', 'wb') do |csv|
csv << %w[ price mileage ]
price=data.css('.price').text
mileage=data.css('.mileage').text
csv << [price, mileage]
end
The result is not really what I'm expecting. Two columns are created, but how can I remove characters like CHF and km, and why is the mileage data not showing up?
My guess is that the text in the HTML includes units of measure; CHF for Swiss Francs for the price, and km for kilometers for the mileage.
You could add split.first or split.last to get the number without the unit of measure, e.g.:
2.3.0 :007 > 'CHF 100'.split.last
=> "100"
2.3.0 :008 > '99 km'.split.first
=> "99"
Removing/ignoring the unwanted text is not a Nokogiri problem, it's a String processing problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>
EOT
str = doc.at('strong').text # => "CHF 14'900.-"
At this point str contains the text of the <strong> node.
A simple regex will extract the number, which is the most straightforward way to grab the data:
str[/[\d']+/] # => "14'900"
sub could be used to remove the 'CHF ' substring:
str.sub('CHF ', '') # => "14'900.-"
delete could be used to remove the characters C, H, F and ' ' (space):
str.delete('CHF ') # => "14'900.-"
tr could be used to remove everything that is NOT 0..9, ', . or -:
str.tr("^0-9'.-", '') # => "14'900.-"
Modify one of the above if you don't want ', . or -.
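If the values are meant to be used as numbers in the CSV, the regex approach can be taken one step further. This is a sketch under the assumption that the apostrophe is only a thousands separator and that whole numbers are enough; to_number is a hypothetical helper name:

```ruby
# to_number is a hypothetical helper: grab the first run of digits and
# apostrophes, drop the apostrophes, and convert what is left to an Integer.
def to_number(text)
  text[/[\d']+/].to_s.delete("'").to_i
end

puts to_number("CHF 14'800.-")  # 14800
puts to_number("31'000 km")     # 31000
```

Text with no digits at all comes back as 0, because the nil from the failed match is turned into an empty string before to_i is called.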
why are the data of the mileage not displaying
Because you have a mismatch between the CSS selector (.mileage) and the actual class attribute in the HTML (milage):
require 'nokogiri'
doc = Nokogiri::HTML('<li class="object-props-item milage">61\'000 km</li>')
doc.at('.mileage').text # =>
# ~> NoMethodError
# ~> undefined method `text' for nil:NilClass
# ~>
# ~> /var/folders/yb/whn8dwns6rl92jswry5cz87dsgk2n1/T/seeing_is_believing_temp_dir20160428-96035-1dajnql/program.rb:5:in `<main>'
Instead it should be:
doc.css('.milage').text # => "61'000 km"
But that's not all that's wrong. There's a subtle problem waiting to bite you later.
css or search returns a NodeSet whereas at or at_css returns an Element:
doc.css('.milage').class # => Nokogiri::XML::NodeSet
doc.at('.milage').class # => Nokogiri::XML::Element
Here's what happens when text is called on a NodeSet containing multiple matching nodes:
doc = Nokogiri::HTML('<p>foo</p><p>bar</p>')
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "foo"
When text is used with a NodeSet it returns the text of all nodes concatenated into a single string. This can make it really difficult to separate the text from one node from another. Instead, use at or one of the at_* equivalents to get the text from a single node. If you want to extract the text from each node individually and get an array use:
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
Finally, notice that your HTML sample isn't valid:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>li class="object-props-item price"
# >> <strong>CHF 14'900.-</strong>
# >> </p>
# >> <li class="object-props-item milage">61'000 km</li>')
# >> </body></html>
Here's what happens:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price') # => nil
Nokogiri has to do a fix-up to make sense of the first line, so it wraps it in <p>. By doing so the .price class no longer exists so your code will fail again.
Fixing the tag results in a correct response:
doc = Nokogiri::HTML(<<EOT)
<li class="object-props-item price">
<strong>CHF 14'900.-</strong>
</li>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price').to_html # => "<li class=\"object-props-item price\">\n<strong>CHF 14'900.-</strong>\n</li>"
This is why it's really important to make sure your input is valid. Trying to duplicate your problem is difficult without it.

Need clarification with 'each-do' block in my ruby code

Given an html file:
<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
1
2
</span>
</div>
...more divs
<div class="NormalMid">
<span class="style-span">
"Data 20:"
20
21
22
23
</span>
</div>
...more divs
</div>
Using these SO posts as reference:
How do I integrate these two conditions block codes to mine in Ruby?
and
How to understand this Arrays and loops in Ruby?
My code:
require 'nokogiri'
require 'pp'
require 'open-uri'
data_file = 'site.htm'
file = File.open(data_file, 'r')
html = open(file)
page = Nokogiri::HTML(html)
page.encoding = 'utf-8'
rows = page.xpath('//div[@class="NormalMid"]')
details = rows.collect do |row|
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
data = row.children[0].children[0].to_s.strip
links = link.collect {|item| item.at_xpath('@href').to_s.strip}
detail[data.to_sym] = links
end
detail
end
details.reject! {|d| d.empty?}
pp details
The output:
[{:"Data 1:"=>
["http://www.site.com/data/1",
"http://www.site.com/data/2"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20",
"http://www.site.com/data/21",
"http://www.site.com/data/22",
"http://www.site.com/data/20",]},
...
}]
Everything is going good, exactly what I wanted.
BUT if you change these lines of code:
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
to:
detail = {}
[
[row.children.first.element_children],
].each do |link|
I get the output of
[{:"Data 1:"=>
["http://www.site.com/data/1"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20"]},
...
}]
Only the first anchor href is stored in the array.
I just need some clarification on why it's behaving that way. Since the part parameter in the block's parameter list is never used, I figured I didn't need it. But my program doesn't work correctly if I also delete the corresponding row.children.first.element_children.
What is going on in the [[obj,obj],].each do block? I just started Ruby a week ago and I'm still getting used to the syntax, so any help will be appreciated. Thank you :D
EDIT
rows[0].children.first.element_children[0] will have the output
Nokogiri::XML::Element:0xcea69c name="a" attributes=[#<Nokogiri::XML::Attr:0xcea648
name="href" value="http://www.site.com/data/1">] children[<Nokogiri::XML::Text:0xcea1a4
"1">]>
puts rows[0].children.first.element_children[0]
1
You made your code overly complicated. Looking at it, it seems you are trying to get something like the following:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-eotl
<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
1
2
</span>
</div>
<div class="NormalMid">
<span class="style-span">
"Data 20:"
20
21
22
23
</span>
</div>
</div>
eotl
rows = doc.xpath("//div[@class='NormalMid']/span[@class='style-span']")
val = rows.map do |row|
[row.at_xpath("./text()").to_s.tr('"','').strip,row.xpath(".//@href").map(&:to_s)]
end
Hash[val]
# => {"Data 1:"=>["http://site.com/data/1", "http://site.com/data/2"],
# "Data 20:"=>
# ["http://site.com/data/20",
# "http://site.com/data/21",
# "http://site.com/data/22",
# "http://site.com/data/23"]}
What is going on in the [[obj,obj],].each do block?
Look at the two examples below:
[[1],[4,5]].each do |a|
p a
end
# >> [1]
# >> [4, 5]
[[1,2],[4,5]].each do |a,b|
p a, b
end
# >> 1
# >> 2
# >> 4
# >> 5
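Applied to the question's code, the difference is block-parameter destructuring: with two parameters Ruby splits each inner array across them, while with a single parameter the whole inner array is passed as-is. The sketch below uses a plain array of strings (hrefs, a made-up stand-in for the NodeSet of anchors) to show what link ends up holding in each case:

```ruby
hrefs = ['/data/1', '/data/2'] # stands in for the NodeSet of <a> elements

# Two-element inner array, two block parameters: the inner array is split,
# so link is the collection itself.
[[hrefs, hrefs]].each do |part, link|
  p link # ["/data/1", "/data/2"]
end

# One-element inner array, one block parameter: nothing is split,
# so link is the wrapper array that contains the collection.
[[hrefs]].each do |link|
  p link # [["/data/1", "/data/2"]]
end

# Parentheses around the parameter force destructuring again.
[[hrefs]].each do |(link)|
  p link # ["/data/1", "/data/2"]
end
```

In the single-parameter version of the question's code, link.collect therefore runs once with the whole collection as item, which would explain why only one href comes back. Dropping the wrapper arrays and calling collect on row.children.first.element_children directly avoids the issue altogether.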

Convert HTML to plain text (with inclusion of <br>s)

Is it possible to convert HTML to plain text with Nokogiri? I also want <br /> tags to show up as line breaks.
For example, given this HTML:
<p>ala ma kota</p> <br /> <span>i kot to idiota </span>
I want this output:
ala ma kota
i kot to idiota
When I just call Nokogiri::HTML(my_html).text, the <br /> tag is dropped:
ala ma kota i kot to idiota
Instead of writing complex regexp I used Nokogiri.
Working solution (K.I.S.S!):
def strip_html(str)
document = Nokogiri::HTML.parse(str)
document.css("br").each { |node| node.replace("\n") }
document.text
end
Nothing like this exists by default, but you can easily hack something together that comes close to the desired output:
require 'nokogiri'
def render_to_ascii(node)
blocks = %w[p div address] # els to put newlines after
swaps = { "br"=>"\n", "hr"=>"\n#{'-'*70}\n" } # content to swap out
dup = node.dup # don't munge the original
# Get rid of superfluous whitespace in the source
dup.xpath('.//text()').each{ |t| t.content=t.text.gsub(/\s+/,' ') }
# Swap out the swaps
dup.css(swaps.keys.join(',')).each{ |n| n.replace( swaps[n.name] ) }
# Slap a couple newlines after each block level element
dup.css(blocks.join(',')).each{ |n| n.after("\n\n") }
# Return the modified text content
dup.text
end
frag = Nokogiri::HTML.fragment "<p>It is the end of the world
as we
know it<br>and <i>I</i> <strong>feel</strong>
<a href='blah'>fine</a>.</p><div>Capische<hr>Buddy?</div>"
puts render_to_ascii(frag)
#=> It is the end of the world as we know it
#=> and I feel fine.
#=>
#=> Capische
#=> ----------------------------------------------------------------------
#=> Buddy?
Try
Nokogiri::HTML(my_html.gsub('<br />',"\n")).text
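The literal '<br />' only matches that exact spelling, so <br>, <br/> or <BR /> would slip through. A minimal sketch of a more tolerant substitution, assuming the tags carry no attributes (br_to_newline is a hypothetical helper name):

```ruby
# br_to_newline is a hypothetical helper name.
def br_to_newline(html)
  # Matches <br>, <br/>, <br /> and their upper-case variants.
  html.gsub(/<br\s*\/?>/i, "\n")
end

puts br_to_newline("ala ma kota<br>i kot<BR/>to<br />idiota")
```

The result can then be passed to Nokogiri::HTML(...).text as in the one-liner above.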
Nokogiri will strip out links, so I use this first to preserve links in the text version:
html_version.gsub!(/<a href.*(http:[^"']+).*>(.*)<\/a>/i) { "#{$2}\n#{$1}" }
that will turn this:
<a href="http://google.com">link to google</a>
to this:
link to google
http://google.com
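One caveat with that pattern: the greedy .* can run across several links when more than one appears on the same line. The following is a sketch of a narrower pattern, not a full HTML parser; links_to_text is a hypothetical helper name, and it assumes the href value is quoted and starts with http: or https:.

```ruby
# links_to_text is a hypothetical helper; the pattern is a sketch, not an HTML parser.
def links_to_text(html)
  # [^>]* keeps each match inside one opening tag; (.*?) stops at the first </a>.
  pattern = /<a\s[^>]*?href=["'](https?:[^"']+)["'][^>]*>(.*?)<\/a>/im
  html.gsub(pattern) { "#{$2}\n#{$1}" }
end

puts links_to_text('<a href="http://google.com">link to google</a>')
# link to google
# http://google.com
```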
If you use HAML, you can output the HTML unescaped by passing it through raw, e.g.:
= raw @product.short_description

How to get text after or before certain tags using Nokogiri

I have an HTML document, something like this:
<root><template>title</template>
<h level="3" i="3">Something</h>
<template element="1"><title>test</title></template>
# one
# two
# three
# four
<h level="4" i="5">something1</h>
some random test
<template element="1"><title>test</title></template>
# first
# second
# third
# fourth
<template element="2"><title>testing</title></template>
I want to extract:
# one
# two
# three
# four
# first
# second
# third
# fourth
</root>
In other words, I want "all text after <template element="1"><title>test</title></template> and before the next tag that starts after that."
I can get all text between root using '//root/text()' but how do I get all text before and after certain tags?
This seems to work:
require 'nokogiri'
xml = '<root>
<template>title</template>
<h level="3" i="3">Something</h>
<template element="1">
<title>test</title>
</template>
# one
# two
# three
# four
<h level="4" i="5">something1</h>
some random test
<template element="1">
<title>test</title>
</template>
# first
# second
# third
# fourth
<template element="2">
<title>testing</title>
</template>
</root>
'
doc = Nokogiri::XML(xml)
text = (doc / 'template[@element="1"]').map{ |n| n.next_sibling.text.strip.gsub(/\n +/, "\n") }
puts text
# >> # one
# >> # two
# >> # three
# >> # four
# >> # first
# >> # second
# >> # third
# >> # fourth
I'm pretty sure krusty.ar is right that there's not a built-in method for achieving this. You can just remove all the tags inside the root tag one by one if you'd like. It's a hack, but it works:
doc = Nokogiri::HTML(open(url)) # or Nokogiri::HTML.parse(File.open(file_path))
doc.xpath('//template').remove
doc.xpath('//h').remove
doc
That gives the result you're looking for with the HTML you posted.
