Get link and href text from html doc with Nokogiri & Ruby? - ruby

I'm trying to use the nokogiri gem to extract all the URLs on the page as well as their link text, and store the link text and URL in a hash.
<html>
<body>
<a href=#foo>Foo</a>
<a href=#bar>Bar </a>
</body>
</html>
I would like to return
{"Foo" => "#foo", "Bar" => "#bar"}

Here's a one-liner:
Hash[doc.xpath('//a[@href]').map {|link| [link.text.strip, link["href"]]}]
#=> {"Foo"=>"#foo", "Bar"=>"#bar"}
Split up a bit to be arguably more readable:
h = {}
doc.xpath('//a[@href]').each do |link|
h[link.text.strip] = link['href']
end
puts h
#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

Another way:
h = doc.css('a[href]').each_with_object({}) { |n, h| h[n.text.strip] = n['href'] }
# yields {"Foo"=>"#foo", "Bar"=>"#bar"}
And if you're worried that the same text might link to different things, you can collect the hrefs in arrays:
h = doc.css('a[href]').each_with_object(Hash.new { |h,k| h[k] = [ ]}) { |n, h| h[n.text.strip] << n['href'] }
# yields {"Foo"=>["#foo"], "Bar"=>["#bar"]}

Related

Nokogiri-- To remove unwanted spaces between tags having no text

I have HTML content like this:
html = "<table id=\"soa_table\" class=\"table table-striped table-bordered table-condensed soa-table\"><thead><tr><th>SoA</th><th id=\"423\" class=\"soa-column text-center\">V1</th><th id=\"424\" class=\"soa-column text-center\">V2</th></tr></thead><tbody><tr><td class=\"soa-row\" id=\"631\">Label 1</td><td class=\"soa-element text-center\" form_id=\"631\" visit_id=\"423\" id=\"484\"><span class=\"glyphicon glyphicon-ok text-success\"></span></td><td class=\"soa-element\" form_id=\"631\" visit_id=\"424\" id=\"0\"> </td></tr><tr><td class=\"soa-row\" id=\"632\">Label 2</td><td class=\"soa-element text-center\" form_id=\"632\" visit_id=\"423\" id=\"485\"><span class=\"glyphicon glyphicon-ok text-success\"></span></td><td class=\"soa-element\" form_id=\"632\" visit_id=\"424\" id=\"0\"> </td></tr><tr><td class=\"soa-row\" id=\"633\">Label 3</td><td class=\"soa-element\" form_id=\"633\" visit_id=\"423\" id=\"0\"> </td><td class=\"soa-element text-center\" form_id=\"633\" visit_id=\"424\" id=\"486\"><span class=\"glyphicon glyphicon-ok text-success\"></span></td></tr></tbody></table>"
I parsed it with Nokogiri and tried to gsub the spaces away like this:
Nokogiri::HTML(html).at('table').to_html.gsub(/>\s+</, "><")
But it doesn't work.
remove unwanted spaces between tags having no text
I assume you mean this kind of space:
<td class="soa-element" form_id="631" visit_id="424" id="0"> </td>
^
That's a text node containing a single space.
Let's use a smaller example:
html = '<foo>value</foo><bar> </bar>'
doc = Nokogiri::HTML.fragment(html)
You can use PP to inspect the parsed document structure:
require 'pp'
pp doc
Output:
#(DocumentFragment:0x3fe819894018 {
name = "#document-fragment",
children = [
#(Element:0x3fe819891b9c { name = "foo", children = [ #(Text "value")] }),
#(Element:0x3fe819891ae8 { name = "bar", children = [ #(Text " ")] })]
})
The document contains two text nodes: one with "value" and the other with " ".
In order to remove the latter, we can traverse the document and remove all text nodes containing just whitespace:
doc.traverse { |node| node.remove if node.text? && node.text !~ /\S/ }
pp doc
Output:
#(DocumentFragment:0x3fe819894018 {
name = "#document-fragment",
children = [
#(Element:0x3fe819891b9c { name = "foo", children = [ #(Text "value")] }),
#(Element:0x3fe819891ae8 { name = "bar" })]
})
Finally, we can serialize the document:
doc.to_html
#=> "<foo>value</foo><bar></bar>"
gsub does not substitute into the source object. gsub! does. Also, you don't need Nokogiri at all.
require 'nokogiri'
puts 'Needlessly using Nokogiri'
html = "<p> </p>"
new_html = Nokogiri::HTML(html).at('p').to_html.gsub(/>\s+</, '><')
puts html
puts new_html
puts '-' * 20
puts 'Solution #1'
html = "<p> </p>"
new_html = html.gsub(/>\s+</, '><')
puts html
puts new_html
puts '-' * 20
puts 'Solution #2'
html = "<p> </p>"
puts html
html.gsub!(/>\s+</,'><')
puts html
The output of this program is:
Needlessly using Nokogiri
<p> </p>
<p></p>
--------------------
Solution #1
<p> </p>
<p></p>
--------------------
Solution #2
<p> </p>
<p></p>
Remove whitespace-only text nodes:
doc.search('//text()[normalize-space()=""]').remove
Update with example:
Nokogiri::HTML('<b></b> <b></b>').search('//text()[normalize-space()=""]').remove
#=> [#<Nokogiri::XML::Text:0x197ad78 " ">]

Nokogiri: Slop access a node named name

I'm trying to parse a xml that looks like this:
<lesson>
<name>toto</name>
<version>42</version>
</lesson>
Using Nokogiri::Slop.
I can access the version easily through lesson.version, but I cannot access lesson.name, because name refers in this case to the name of the node itself (lesson).
Is there any way to access the child?
As a variant you could try this one:
doc.lesson.elements.select{|el| el.name == "name"}
Why? Because of this benchmark:
require 'nokogiri'
require 'benchmark'
str = '<lesson>
<name>toto</name>
<version>42</version>
</lesson>'
doc = Nokogiri::Slop(str)
n = 50000
Benchmark.bm do |x|
x.report("select") { n.times do; doc.lesson.elements.select{|el| el.name == "name"}; end }
x.report("search") { n.times do; doc.lesson.search('name'); end }
end
Which gives us the result:
#=> user system total real
#=> select 1.466000 0.047000 1.513000 ( 1.528153)
#=> search 2.637000 0.125000 2.762000 ( 2.777278)
You can use search and give it an XPath or CSS selector:
doc.lesson.search('name').first
You can also do a bit of a hack using metaprogramming:
require 'nokogiri'
doc = Nokogiri::Slop <<-HTML
<lesson>
<name>toto</name>
<version>42</version>
</lesson>
HTML
name_val = doc.lesson.instance_eval do
self.class.send :undef_method, :name
self.name
end.text
p name_val # => toto
p doc.lesson.version.text # => '42'
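Nokogiri::XML::Node#name is the method that returns a node's name. The snippet undefines it on the node's class inside #instance_eval, so that the call to name falls through to Slop's child lookup instead. Be aware that undef_method is not scoped to the block: the method stays undefined for that class afterwards, so later calls to name on such nodes no longer return the node name.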

How to parse consecutive tags with Nokogiri?

I have HTML code like this:
<div id="first">
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</div>
My code does not work.
doc.css("first").each do |item|
label = item.css("dt")
value = item.css("dd")
end
It shows all the <dt> tags first and then the <dd> tags, but I need "label: value" pairs.
First of all, your HTML should have the <dt> and <dd> elements inside a <dl>:
<div id="first">
<dl>
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</dl>
</div>
but that won't change how you parse it. You want to find the <dt>s and iterate over them, then at each <dt> you can use next_element to get the <dd>; something like this:
doc = Nokogiri::HTML('<div id="first"><dl>...')
doc.css('#first').search('dt').each do |node|
puts "#{node.text}: #{node.next_element.text}"
end
That should work as long as the structure matches your example.
Under the assumption that some <dt> may have multiple <dd>, you want to find all <dt> and then (for each) find the following <dd> before the next <dt>. This is pretty easy to do in pure Ruby, but more fun to do in just XPath. ;)
Given this setup:
require 'nokogiri'
html = '<dl id="first">
<dt>Label1</dt><dd>Value1</dd>
<dt>Label2</dt><dd>Value2</dd>
<dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
<dt>Label4</dt><dd>Value4</dd>
</dl>'
doc = Nokogiri.HTML(html)
Using no XPath:
doc.css('dt').each do |dt|
dds = []
n = dt.next_element
begin
dds << n
n = n.next_element
end while n && n.name=='dd'
p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
Using a Little XPath:
doc.css('dt').each do |dt|
dds = dt.xpath('following-sibling::*').chunk{ |n| n.name }.first.last
p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
Using Lotsa XPath:
doc.css('dt').each do |dt|
ct = dt.xpath('count(following-sibling::dt)')
dds = dt.xpath("following-sibling::dd[count(following-sibling::dt)=#{ct}]")
p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
After looking at the other answer, here is a less efficient way of doing the same thing.
require 'nokogiri'
a = Nokogiri::HTML('<div id="first"><dt>Label1</dt><dd>Value1</dd><dt>Label2</dt><dd>Value2</dd></div>')
dt = []
dd = []
a.css("#first").each do |item|
item.css("dt").each {|t| dt << t.text}
item.css("dd").each {|t| dd << t.text}
end
dt.each_index do |i|
puts dt[i] + ': ' + dd[i]
end
In CSS, to reference an ID you need to put the # symbol before it. For a class, it's the . symbol.

ruby string to hash conversion

I have a string like this,
str = "uu#p, xx#m, yy#n, zz#m"
I want to know how to convert the given string into a hash. That is, my actual requirement is to find which values (the part before the # symbol) belong to m, n and p. I don't want counts; I need the exact values. Ideally the output would look like this:
{"m" => ["xx", "zz"], "n" => ["yy"], "p" => ["uu"]}
Can anyone help me, please?
Direct copy/paste of an IRB session:
>> str.split(/, /).inject(Hash.new{|h,k|h[k]=[]}) do |h, s|
.. v,k = s.split(/#/)
.. h[k] << v
.. h
.. end
=> {"p"=>["uu"], "m"=>["xx", "zz"], "n"=>["yy"]}
Simpler code for a newbie :)
str = "uu#p, xx#m, yy#n, zz#m"
h = {}
str.split(", ").each do |x|
v,k = x.split('#')
h[k] ||= []
h[k].push(v)
end
p h
FP style:
grouped = str
.split(", ")
.group_by { |s| s.split("#")[1] }
.transform_values { |ss| ss.map { |x| x.split("#")[0] } }
#=> {"m"=>["xx", "zz"], "n"=>["yy"], "p"=>["uu"]}
This is a pretty common pattern. Using Facets.map_by:
require 'facets'
str.split(", ").map_by { |s| s.split("#", 2).reverse }
#=> {"m"=>["xx", "zz"], "n"=>["yy"], "p"=>["uu"]}
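The answers above can be folded into one small helper. This is a sketch only: the method name `group_by_suffix` is hypothetical, and it assumes every item has the form value#key. It strips each item so that it tolerates uneven spacing around the commas.

```ruby
# Groups the part before '#' under the part after it.
def group_by_suffix(str)
  str.split(',').each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |item, acc|
    # strip removes stray spaces; the limit of 2 splits only on the first '#'.
    value, key = item.strip.split('#', 2)
    acc[key] << value
  end
end

result = group_by_suffix("uu#p, xx#m, yy#n, zz#m")
p result

# Sort by key when a stable key order is wanted.
sorted = result.sort.to_h
p sorted
```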

