Nokogiri: need to turn markup partitioned by `hr` into divs

Given markup inside an HTML document that looks like this
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
I'd like to produce this
<div>
<h3>test</h3>
<p>test</p>
</div>
<div>
<h3>test2</h3>
<p>test2</p>
</div>
What's the most elegant way to do this with Nokogiri?

Edit: Reworked answer to be a bit cleaner.
Edit2: Small rewrite to shorten by two lines
require 'nokogiri'
doc = Nokogiri::HTML <<ENDHTML
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
ENDHTML
body = doc.at_css('body') # Created by parsing as HTML
kids = body.xpath('./*') # Every child of the body
body.inner_html = "" # Empty the body now that we have our nodes
div = (body << "<div>").first # Create our first container in the body
kids.each do |node| # For every child that was in the body...
if node.name=='hr'
div = (body << '<div>').first # Create a new container for stuff
else
div << node # Move this into the last container
end
end
div.remove unless div.child # Get rid of a trailing, empty div
puts body.inner_html
#=> <div>
#=> <h3>test</h3>
#=> <p>test</p>
#=> </div>
#=> <div>
#=> <h3>test2</h3>
#=> <p>test2</p>
#=> </div>

This is how I'd go about it:
require 'nokogiri'
html = '
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
'
doc = Nokogiri::HTML(html)
doc2 = Nokogiri::HTML('<body />')
doc2_body = doc2.at('body')
doc.search('//h3 | //p').each_slice(2) do |ns|
nodeset = Nokogiri::XML::NodeSet.new(doc2, ns)
div = Nokogiri::XML::Node.new('div', doc2)
div.add_child(nodeset)
doc2_body.add_child(div)
end
puts doc2_body.inner_html
# >> <div>
# >> <h3>test</h3>
# >> <p>test</p>
# >> </div>
# >> <div>
# >> <h3>test2</h3>
# >> <p>test2</p>
# >> </div>

Here's an answer that uses Ruby 1.9.2's Enumerable#chunk to split the children into sections and also exercises Nokogiri's NodeSet class:
require 'nokogiri'
doc = Nokogiri::HTML <<ENDHTML
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
ENDHTML
result = Nokogiri::XML::NodeSet.new( doc,
doc.xpath('//body/*').chunk do |n|
n.name=='hr'
end.reject do |matched,nodes|
matched
end.map do |matched,nodes|
doc.create_element('div').tap do |div|
div << Nokogiri::XML::NodeSet.new( doc, nodes )
end
end )
puts result
#=> <div>
#=> <h3>test</h3>
#=> <p>test</p>
#=> </div>
#=> <div>
#=> <h3>test2</h3>
#=> <p>test2</p>
#=> </div>
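To see what chunk is doing on its own, here is a minimal sketch that groups a plain array of tag names instead of Nokogiri nodes. The helper name sections_between and the sample array are made up for illustration; the chain of chunk, reject and map mirrors the answer above.

```ruby
# Hypothetical helper (not part of Nokogiri): split a flat list into runs
# separated by a marker, the way the answer splits body children on <hr>.
def sections_between(items, separator)
  items
    .chunk { |item| item == separator }            # consecutive runs keyed true/false
    .reject { |is_separator, _run| is_separator }  # drop the runs of separators
    .map { |_is_separator, run| run }              # keep only the grouped items
end

names = %w[h3 p hr h3 p hr]
p sections_between(names, 'hr')
#=> [["h3", "p"], ["h3", "p"]]
```

Because chunk groups consecutive items, two separators in a row do not produce an empty section.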

Related

Nokogiri-- To remove unwanted spaces between tags having no text

I have HTML content like this:
html = "<table id=\"soa_table\" class=\"table table-striped table-bordered table-condensed soa-table\"><thead><tr><th>SoA</th><th id=\"423\" class=\"soa-column text-center\">V1</th><th id=\"424\" class=\"soa-column text-center\">V2</th></tr></thead><tbody><tr><td class=\"soa-row\" id=\"631\">Label 1</td><td class=\"soa-element text-center\" form_id=\"631\" visit_id=\"423\" id=\"484\"><span class=\"glyphicon glyphicon-ok text-success\"></span></td><td class=\"soa-element\" form_id=\"631\" visit_id=\"424\" id=\"0\"> </td></tr><tr><td class=\"soa-row\" id=\"632\">Label 2</td><td class=\"soa-element text-center\" form_id=\"632\" visit_id=\"423\" id=\"485\"><span class=\"glyphicon glyphicon-ok text-success\"></span></td><td class=\"soa-element\" form_id=\"632\" visit_id=\"424\" id=\"0\"> </td></tr><tr><td class=\"soa-row\" id=\"633\">Label 3</td><td class=\"soa-element\" form_id=\"633\" visit_id=\"423\" id=\"0\"> </td><td class=\"soa-element text-center\" form_id=\"633\" visit_id=\"424\" id=\"486\"><span class=\"glyphicon glyphicon-ok text-success\"></span></td></tr></tbody></table>"
I parsed it with Nokogiri and tried to remove the spaces with gsub:
Nokogiri::HTML(html).at('table').to_html.gsub(/>\s+</, "><")
But it doesn't work.

remove unwanted spaces between tags having no text
I assume you mean this kind of space:
<td class="soa-element" form_id="631" visit_id="424" id="0"> </td>
                                                            ^
That's a text node containing a single space.
Let's use a smaller example:
html = '<foo>value</foo><bar> </bar>'
doc = Nokogiri::HTML.fragment(html)
You can use PP to inspect the parsed document structure:
require 'pp'
pp doc
Output:
#(DocumentFragment:0x3fe819894018 {
name = "#document-fragment",
children = [
#(Element:0x3fe819891b9c { name = "foo", children = [ #(Text "value")] }),
#(Element:0x3fe819891ae8 { name = "bar", children = [ #(Text " ")] })]
})
The document contains two text nodes, one with "value" the other one with " ".
In order to remove the latter, we can traverse the document and remove all text nodes containing just whitespace:
doc.traverse { |node| node.remove if node.text? && node.text !~ /\S/ }
pp doc
Output:
#(DocumentFragment:0x3fe819894018 {
name = "#document-fragment",
children = [
#(Element:0x3fe819891b9c { name = "foo", children = [ #(Text "value")] }),
#(Element:0x3fe819891ae8 { name = "bar" })]
})
Finally, we can serialize the document:
doc.to_html
#=> "<foo>value</foo><bar></bar>"

gsub returns a new string and leaves the source object unchanged; gsub! modifies it in place. Also, you don't need Nokogiri at all.
require 'nokogiri'
puts 'Needlessly using Nokogiri'
html = "<p> </p>"
new_html = Nokogiri::HTML(html).at('p').to_html.gsub(/>\s+</, '><')
puts html
puts new_html
puts '-' * 20
puts 'Solution #1'
html = "<p> </p>"
new_html = html.gsub(/>\s+</, '><')
puts html
puts new_html
puts '-' * 20
puts 'Solution #2'
html = "<p> </p>"
puts html
html.gsub!(/>\s+</,'><')
puts html
The output of this program is:
Needlessly using Nokogiri
<p> </p>
<p></p>
--------------------
Solution #1
<p> </p>
<p></p>
--------------------
Solution #2
<p> </p>
<p></p>
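One more thing worth ruling out, as a guess about the input rather than anything the question confirms: Ruby's \s only matches ASCII whitespace, so a non-breaking space (U+00A0) between tags survives />\s+</. A minimal sketch using the POSIX class [[:space:]] instead; squeeze_tag_gaps is a made-up helper name.

```ruby
# Hypothetical helper: strip whitespace-only gaps between a ">" and a "<".
# [[:space:]] also matches Unicode whitespace such as U+00A0; \s does not.
def squeeze_tag_gaps(markup)
  markup.gsub(/>[[:space:]]+</, '><')
end

plain = '<td> </td>'
nbsp  = "<td>\u00A0</td>"   # assumed input: a cell holding a non-breaking space

p plain.gsub(/>\s+</, '><')         #=> "<td></td>"
p nbsp.gsub(/>\s+</, '><') == nbsp  #=> true
p squeeze_tag_gaps(nbsp)            #=> "<td></td>"
```

If the second line prints true for your data, the gap is not an ordinary space and the wider character class is the one to use.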

Remove whitespace-only text nodes:
doc.search('//text()[normalize-space()=""]').remove
Update with example:
Nokogiri::HTML('<b></b> <b></b>').search('//text()[normalize-space()=""]').remove
#=> [#<Nokogiri::XML::Text:0x197ad78 " ">]

How to parse consecutive tags with Nokogiri?

I have HTML code like this:
<div id="first">
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</div>
My code does not work.
doc.css("first").each do |item|
label = item.css("dt")
value = item.css("dd")
end
This shows all the <dt> tags first and then the <dd> tags, but I need "label: value" pairs.

First of all, your HTML should have the <dt> and <dd> elements inside a <dl>:
<div id="first">
<dl>
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</dl>
</div>
but that won't change how you parse it. You want to find the <dt>s and iterate over them, then at each <dt> you can use next_element to get the <dd>; something like this:
doc = Nokogiri::HTML('<div id="first"><dl>...')
doc.css('#first').search('dt').each do |node|
puts "#{node.text}: #{node.next_element.text}"
end
That should work as long as the structure matches your example.

Under the assumption that some <dt> may have multiple <dd>, you want to find all <dt> and then (for each) find the following <dd> before the next <dt>. This is pretty easy to do in pure Ruby, but more fun to do in just XPath. ;)
Given this setup:
require 'nokogiri'
html = '<dl id="first">
<dt>Label1</dt><dd>Value1</dd>
<dt>Label2</dt><dd>Value2</dd>
<dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
<dt>Label4</dt><dd>Value4</dd>
</dl>'
doc = Nokogiri.HTML(html)
Using no XPath:
doc.css('dt').each do |dt|
dds = []
n = dt.next_element
begin
dds << n
n = n.next_element
end while n && n.name=='dd'
p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
Using a Little XPath:
doc.css('dt').each do |dt|
dds = dt.xpath('following-sibling::*').chunk{ |n| n.name }.first.last
p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
Using Lotsa XPath:
doc.css('dt').each do |dt|
ct = dt.xpath('count(following-sibling::dt)')
dds = dt.xpath("following-sibling::dd[count(following-sibling::dt)=#{ct}]")
p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
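The same grouping idea can be tried without a parsed document. A minimal sketch, assuming the children have already been flattened to [tag, text] pairs and that the list starts with a dt; group_definitions is a made-up helper name, and Enumerable#slice_before starts a new group at every dt.

```ruby
# Hypothetical helper: group a flat [tag, text] list into [label, values],
# opening a new group at every "dt". Assumes the list begins with a "dt".
def group_definitions(pairs)
  pairs
    .slice_before { |(tag, _text)| tag == 'dt' }
    .map { |group| [group.first.last, group.drop(1).map(&:last)] }
end

flat = [
  %w[dt Label1], %w[dd Value1],
  %w[dt Label3], %w[dd Value3a], %w[dd Value3b]
]
group_definitions(flat).each { |label, values| p [label, values] }
#=> ["Label1", ["Value1"]]
#=> ["Label3", ["Value3a", "Value3b"]]
```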

After looking at the other answer, here is a less efficient way of doing the same thing.
require 'nokogiri'
a = Nokogiri::HTML('<div id="first"><dt>Label1</dt><dd>Value1</dd><dt>Label2</dt><dd>Value2</dd></div>')
dt = []
dd = []
a.css("#first").each do |item|
item.css("dt").each {|t| dt << t.text}
item.css("dd").each {|t| dd << t.text}
end
dt.each_index do |i|
puts dt[i] + ': ' + dd[i]
end
In CSS, an ID selector needs a leading # symbol; a class selector uses a leading . symbol.
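The index loop above can also be written with Array#zip. A minimal sketch, using plain arrays of strings in place of the dt and dd text collected from the document; pair_labels is a made-up name, and it still assumes exactly one <dd> per <dt>.

```ruby
# Hypothetical helper: pair each label with the value at the same position.
# Assumes both arrays line up one-to-one, as in the question's markup.
def pair_labels(labels, values)
  labels.zip(values).map { |label, value| "#{label}: #{value}" }
end

puts pair_labels(%w[Label1 Label2], %w[Value1 Value2])
#=> Label1: Value1
#=> Label2: Value2
```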

How do I wrap untagged text elements using Nokogiri?

For example, I have an HTML string:
<span class="no">1172</span><span class="r">case</span> primary_key_prefix_type
How can I use Nokogiri to wrap every piece of text that has no tag of its own, like this:
<span class="no">1172</span><span class="r">case</span> <span>primary_key_prefix_type</span>

This doesn't feel like the most elegant solution, but it works:
require 'nokogiri'
# Given a node, find each whitespace-delimited word
# and wrap it in the supplied markup
def wrap_text( node, wrapper='<span />' )
wrapper = Nokogiri::XML::DocumentFragment.parse(wrapper).children.first
node.xpath('child::text()').each do |text_node|
text_node.swap( text_node.text.gsub(/(\s*)(\S+)(\s*)/) do
"#{$1}#{
wrapper.clone.tap{ |w| w.inner_html = $2 }.to_html
}#{$3}"
end )
end
node
end
# Testing
html = Nokogiri::HTML '<body>
<p><span class="no">1172</span><span class="r">case</span> primary_key_prefix_type</p>
<p>Hello <b>cool</b> world #42!</p>
</body>'
html.search('p').each{ |para| wrap_text(para) }
puts html.at('body')
#=> <body>
#=> <p><span class="no">1172</span><span class="r">case</span> <span>primary_key_prefix_type</span></p>
#=> <p><span>Hello</span> <b>cool</b> <span>world</span> <span>#42!</span></p>
#=> </body>
Edit: More examples:
# If your lines don't have an element wrapping them...
raw = [
'<span class="no">1172</span><span class="r">case</span> primary_key',
'Hello <b>cool</b> world #42!'
]
puts raw.map{ |line| wrap_text(Nokogiri::HTML(line).at('body')).inner_html }
#=> <span class="no">1172</span><span class="r">case</span> <span>primary_key</span>
#=> <p>Hello <b>cool</b> world #42!</p>
# If your lines each have exactly one element wrapping them...
wrapped = [
'<a><span class="no">1172</span><span class="r">case</span> primary_key</a>',
'<b>Hello <b>cool</b> world #42!</b>'
]
body = Nokogiri::HTML(wrapped.join("\n")).at('body')
puts body.children.map{ |e| wrap_text(e) }
#=> <a><span class="no">1172</span><span class="r">case</span> <span>primary_key</span></a>
#=> <b><span>Hello</span> <b>cool</b> <span>world</span> <span>#42!</span></b>

finding common ancestor from a group of xpath?

Say I have these paths:
html/body/span/div/p/h1/i/font
html/body/span/div/div/div/div/table/tr/p/h1
html/body/span/p/h1/b
html/body/span/div
How can I get the common ancestor? In this case the common ancestor of "font, h1, b, div" would be "span".

To find common ancestry between two nodes:
(node1.ancestors & node2.ancestors).first
A more generalized function that works with multiple nodes:
# accepts node objects or selector strings
class Nokogiri::XML::Element
def common_ancestor(*nodes)
nodes = nodes.map do |node|
String === node ? self.document.at(node) : node
end
nodes.inject(self.ancestors) do |common, node|
common & node.ancestors
end.first
end
end
# usage:
node1.common_ancestor(node2, '//foo/bar')
# => <ancestor node>
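If all you have are the path strings from the question rather than parsed nodes, the shared leading steps can be compared directly. A minimal sketch under that assumption; common_ancestor_path is a made-up name, and it compares step names only, so two different sibling div elements would look identical to it.

```ruby
# Hypothetical helper: longest run of leading path steps shared by all paths.
def common_ancestor_path(paths)
  steps = paths.map { |path| path.split('/') }
  shortest = steps.map(&:length).min || 0
  common = []
  shortest.times do |i|
    step = steps.first[i]
    break unless steps.all? { |s| s[i] == step }  # stop at the first mismatch
    common << step
  end
  common.join('/')
end

paths = %w[
  html/body/span/div/p/h1/i/font
  html/body/span/div/div/div/div/table/tr/p/h1
  html/body/span/p/h1/b
  html/body/span/div
]
puts common_ancestor_path(paths)
#=> html/body/span
```

For real documents, the node-based intersection above is the safer route because it compares actual nodes, not names.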

The function common_ancestor below does what you want.
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(DATA)
def common_ancestor *elements
return nil if elements.empty?
elements.map! do |e| [ e, [e] ] end #prepare array
elements.map! do |e| # build array of ancestors for each given element
e[1].unshift e[0] while e[0].respond_to?(:parent) and e[0] = e[0].parent
e[1]
end
# merge corresponding ancestors and find the last where all ancestors are the same
elements[0].zip(*elements[1..-1]).select { |e| e.uniq.length == 1 }.flatten.last
end
i = doc.xpath('//*[@id="i"]').first
div = doc.xpath('//*[@id="div"]').first
h1 = doc.xpath('//*[@id="h1"]').first
p common_ancestor i, div, h1 # => gives the p element
__END__
<html>
<body>
<span>
<p id="common-ancestor">
<div>
<p><h1><i id="i"></i></h1></p>
<div id="div"></div>
</div>
<p>
<h1 id="h1"></h1>
</p>
<div></div>
</p>
</span>
</body>
</html>
