How to wrap Nokogiri nodeset in ONE span - ruby

My goal is to wrap all paragraphs after the first one in a single span. I'm trying to figure out how to wrap a nodeset in one span, but .wrap() wraps each node in its own span. That is, I want:
<p>First</p>
<p>Second</p>
<p>Third</p>
To become:
<p>First</p>
<span>
<p>Second</p>
<p>Third</p>
</span>
Any sample code to help? Thanks!

I'd do it as below:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<-html)
<p>First</p>
<p>Second</p>
<p>Third</p>
html
nodeset = doc.css("p")
new_node = Nokogiri::XML::Node.new('span',doc)
new_node << nodeset[1..-1]
nodeset.first.after(new_node)
puts doc.to_html
# >> <p>First</p><span><p>Second</p>
# >> <p>Third</p></span>
# >>

I'd do it something like this:
require 'nokogiri'
html = '<p>First</p>
<p>Second</p>
<p>Third</p>
'
doc = Nokogiri::HTML(html)
paragraphs = doc.search('p')[1..-1].unlink
doc.at('p').after('<span>')
doc.at('span').add_child(paragraphs)
puts doc.to_html
Which results in HTML looking like:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>First</p>
<span><p>Second</p>
<p>Third</p></span>
</body></html>
To give you an idea what's happening, here's a more verbose output showing intermediate changes to the doc:
paragraphs = doc.search('p')[1..-1].unlink
paragraphs.to_html
# => "<p>Second</p><p>Third</p>"
doc.at('p').after('<span>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>First</p>\n<span></span>\n\n</body></html>\n"
doc.at('span').add_child(paragraphs)
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>First</p>\n<span><p>Second</p>\n<p>Third</p></span>\n\n</body></html>\n"
Looking at the initial HTML, I'm not sure this approach will work well for normal, everyday HTML. However, if you are absolutely sure it will never change from the
<p>...</p>
<p>...</p>
<p>...</p>
layout, then you should be OK. Any answer based on the initial sample HTML will blow up miserably if the HTML really is something like:
<div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
...
<div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>

Related

How to parse the image href in Nokogiri

I am parsing a web page using Nokogiri, and would like to parse out an image URL. This is my setup:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://themeforest.net/search?sort=sales'))
I can see the following code block if I inspect the page on chrome:
<div class="_2_3rp " style="padding-top:50.847457627118644%">
<div style="">
<img class="_1xvs1" src="https://themeforest.img.customer.envatousercontent.com/files/274559780/screenshots/00-Preview.jpg?auto=compress%2Cformat&fit=crop&crop=top&w=590&h=300&s=37354d884fd0f3b574238e013b4ea423"
title="Avada | Responsive Multi-Purpose Theme"
alt="Avada | Responsive Multi-Purpose Theme" style="left: 0%;">
</div>
</div>
However, when I run:
puts doc.search("//div[@class = '_2_3rp ']")
I get the following:
<div class="_2_3rp " style="padding-top:50.847457627118644%"><div style="height:100%" class="lazyload-placeholder"></div></div>
<div class="_2_3rp " style="padding-top:50.847457627118644%"><div style="height:100%" class="lazyload-placeholder"></div></div>
.....
=> nil
Why am I not getting the img class, and instead getting lazyload-placeholder? Is there any way I can get over this, and escape the image placeholder?
Here's the minimal code I came up with that's necessary to test your assertion:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="12345">
<div>
<img class="67890" src="https://foo.bar">
</div>
</div>
EOT
doc.search('//div[@class=12345]').map(&:to_html)
# => ["<div class=\"12345\">\n" +
# " <div>\n" +
# " <img class=\"67890\" src=\"https://foo.bar\">\n" +
# " </div>\n" +
# "</div>"]
It looks like the img tag is there.
You're using Nokogiri::XML to parse. Don't, because strict parsing occurs and with HTML, which is anything but strict, problems can occur if the HTML is malformed.

How to print to web page using Ruby and Sinatra

I'm calling a ruby function in a post method and I'm trying to output the contents from the function to the web page but it prints the output in my console instead. How do I get it to print to the page?
I've tried
<%=rsg(params[:grammar_file])%> inside an erb file
and
rsg(params[:grammar_file])
inside of the post method and both just print to the console
require 'sinatra'
require 'sinatra/reloader' if development? #gem install sinatra-contrib
require './rsg.rb'
enable :sessions
get '/' do
erb :index
end
post '/' do
rsg(params[:grammar_file])
erb :index
end
<% title = "RANDOM SENTENCE GENERATOR" %>
<!doctype html>
<html lang="en">
<head>
<title><%= @title || "RSG" %></title>
<meta charset="UTF8">
</head>
<body>
<h1>RubyRSG Demo</h1>
<p>Select grammar file to create randomly generated sentence</p>
<form action="/" method="post">
<select name="grammar_file">
<option value="Select" hidden>Select</option>
<option value="Poem">Poem</option>
<option value="Insult">Insult</option>
<option value="Extension-request">Extension-request</option>
<option value="Bond-movie">Bond-movie</option>
</select>
<br><br>
<input type="submit" value="submit">
</form>
<section>
<p>Here</p>
<p><%= rsg(params[:grammar_file])%></p>
</section>
</body>
</html>
You need to tell your template what to do with the params.
This is what is happening:
post '/' do
rsg(params[:grammar_file])
# Your rsg method produces some output. I guess you have a line that `puts` your params to stdout somewhere. Instead, you should pass the output into the template.
erb :index
end
Like this:
post '/' do
erb :index, :locals => {:rsg => rsg(params[:grammar_file])}
end
Then, in your :index template you have a line like:
<%=rsg%>
To output the generated String.
The problem might also be that you're trying to return the result of a puts statement instead of the plain string:
def rsg(p)
puts "I love my daily #{p}. Good luck to you"
end
This will just print to the console, and the method returns nothing useful (nil, to be precise).
Better:
def rsg(p)
"I love my daily #{p}. Good luck to you"
end
Here you will just return the String from your method and calling rsg("sandwich") will return:
# => "I love my daily sandwich. Good luck to you"
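The difference can be checked without Sinatra, using Ruby's bundled ERB directly. This is a small sketch under that assumption; `rsg_printing`, `rsg_returning`, and the one-line template are hypothetical stand-ins for the real method and view.

```ruby
require 'erb'

# Prints the sentence to stdout; the method's return value is nil.
def rsg_printing(topic)
  puts "I love my daily #{topic}. Good luck to you"
end

# Returns the sentence so the caller can hand it to a template.
def rsg_returning(topic)
  "I love my daily #{topic}. Good luck to you"
end

template = ERB.new('<p><%= sentence %></p>')

# nil renders as an empty string, so nothing shows up on the page.
printed = template.result_with_hash(sentence: rsg_printing('sandwich'))

# The returned String is interpolated into the markup.
returned = template.result_with_hash(sentence: rsg_returning('sandwich'))

puts printed
puts returned
```

The first call writes the sentence to the console and leaves the paragraph empty; the second puts the sentence inside the paragraph.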

Parsing nodes with Nokogiri?

I'm parsing web pages and I want to get the link from the <img src> by finding the <div id="image">.
How do I do this in Nokogiri? I tried walking through the child nodes but it fails.
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
This is my code:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image").each do |node|
node.children().each do |c|
puts c.attr("src")
end
end
Any ideas?
Try this and let me know if it works for you
require 'nokogiri'
source = <<-HTML
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
HTML
doc = Nokogiri::HTML(source)
doc.css('div#image > div > img').each do |image|
puts image.attr('src')
end
Output:
img.jpg
Here is a great resource: http://ruby.bastardsbook.com/chapters/html-parsing/
Modifying an example a bit, I get this:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image img").each do |img|
puts img.attr("src")
end
Although you should use the ID selector, #image, rather than the class selector, .image, when you can. An ID should be unique within the document, so the match is more specific, and it is typically faster.

Use XPath to group siblings from an HTML/XML document?

I want to transform an HTML or XML document by grouping previously ungrouped sibling nodes.
For example, I want to take the following fragment:
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
Into this:
<section>
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
</section>
<section>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
</section>
Is this possible using simple Xpath selectors and an XML parser like Nokogiri? Or do I need to implement a SAX parser for this task?
Updated Answer
Here's a general solution that creates a hierarchy of <section> elements based on header levels and their following siblings:
class Nokogiri::XML::Node
# Create a hierarchy on a document based on heading levels
# wrap : e.g. "<section>" or "<div class='section'>"
# stops : array of tag names that stop all sections; use nil for none
# levels : array of tag names that control nesting, in order
def auto_section(wrap='<section>', stops=%w[hr], levels=%w[h1 h2 h3 h4 h5 h6])
levels = Hash[ levels.zip(0...levels.length) ]
stops = stops && Hash[ stops.product([true]) ]
stack = []
children.each do |node|
unless level = levels[node.name]
level = stops && stops[node.name] && -1
end
stack.pop while (top=stack.last) && top[:level]>=level if level
stack.last[:section].add_child(node) if stack.last
if level && level >=0
section = Nokogiri::XML.fragment(wrap).children[0]
node.replace(section); section << node
stack << { :section=>section, :level=>level }
end
end
end
end
Here is this code in use, and the result it gives.
The original HTML
<body>
<h1>Main Section 1</h1>
<p>Intro</p>
<h2>Subhead 1.1</h2>
<p>Meat</p><p>MOAR MEAT</p>
<h2>Subhead 1.2</h2>
<p>Meat</p>
<h3>Caveats</h3>
<p>FYI</p>
<h4>ProTip</h4>
<p>Get it done</p>
<h2>Subhead 1.3</h2>
<p>Meat</p>
<h1>Main Section 2</h1>
<h3>Jumpin' in it!</h3>
<p>Level skip!</p>
<h2>Subhead 2.1</h2>
<p>Back up...</p>
<h4>Dive! Dive!</h4>
<p>...and down</p>
<hr /><p id="footer">Copyright © All Done</p>
</body>
The conversion code
# Use XML only so that we can pretty-print the results; HTML works fine, too
doc = Nokogiri::XML(html,&:noblanks) # stripping whitespace allows indentation
doc.at('body').auto_section # make the magic happen
puts doc.to_xhtml # show the result with indentation
The result
<body>
<section>
<h1>Main Section 1</h1>
<p>Intro</p>
<section>
<h2>Subhead 1.1</h2>
<p>Meat</p>
<p>MOAR MEAT</p>
</section>
<section>
<h2>Subhead 1.2</h2>
<p>Meat</p>
<section>
<h3>Caveats</h3>
<p>FYI</p>
<section>
<h4>ProTip</h4>
<p>Get it done</p>
</section>
</section>
</section>
<section>
<h2>Subhead 1.3</h2>
<p>Meat</p>
</section>
</section>
<section>
<h1>Main Section 2</h1>
<section>
<h3>Jumpin' in it!</h3>
<p>Level skip!</p>
</section>
<section>
<h2>Subhead 2.1</h2>
<p>Back up...</p>
<section>
<h4>Dive! Dive!</h4>
<p>...and down</p>
</section>
</section>
</section>
<hr />
<p id="footer">Copyright All Done</p>
</body>
Original Answer
Here's an answer using no XPath, but Nokogiri. I've taken the liberty of making the solution somewhat flexible, handling arbitrary start/stops (but not nested sections).
html = "<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
<hr>
<p id='footer'>All done!</p>"
require 'nokogiri'
class Nokogiri::XML::Node
# Provide a block that returns:
# true - for nodes that should start a new section
# false - for nodes that should not start a new section
# :stop - for nodes that should stop any current section but not start a new one
def group_under(name="section")
group = nil
element_children.each do |child|
case yield(child)
when false, nil
group << child if group
when :stop
group = nil
else
group = document.create_element(name)
child.replace(group)
group << child
end
end
end
end
doc = Nokogiri::HTML(html)
doc.at('body').group_under do |node|
if node.name == 'hr'
:stop
else
%w[h1 h2 h3 h4 h5 h6].include?(node.name)
end
end
puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <section><h2>Header</h2>
#=> <p>First paragraph</p>
#=> <p>Second paragraph</p></section>
#=>
#=> <section><h2>Second header</h2>
#=> <p>Third paragraph</p>
#=> <p>Fourth paragraph</p></section>
#=>
#=> <hr>
#=> <p id="footer">All done!</p>
#=> </body></html>
For XPath, see XPath : select all following siblings until another sibling
One way using xpath is to select all the p elements that follow your h2 and from them subtract the p elements that also follow the next h2:
doc = Nokogiri::HTML.fragment(html)
doc.css('h2').each do |h2|
nodeset = h2.xpath('./following-sibling::p')
next_h2 = h2.at('./following-sibling::h2')
nodeset -= next_h2.xpath('./following-sibling::p') if next_h2
section_tag = h2.add_previous_sibling Nokogiri::XML::Node.new('section',doc)
h2.parent = section_tag
nodeset.each {|n| n.parent = section_tag}
end
XPath can only select things from your input document; it can't transform it into a new document. For that you need XSLT or some other transformation language. I guess if you're into Nokogiri then the previous answers will be useful, but for completeness, here's what it looks like in XSLT 2.0:
<xsl:for-each-group select="*" group-starting-with="h2">
<section>
<xsl:copy-of select="current-group()"/>
</section>
</xsl:for-each-group>

How do I exclude a nested element when grabbing content using Nokogiri?

I have a page with content that looks similar to this:
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
My goal is to capture the text in #level2 but the #level3 <div> is nested inside of it at the same level as the text I want.
Is it possible to somehow exclude that <div>? Should I be modifying the document and simply removing the element before parsing?
require 'nokogiri'
xml = <<-XML
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
XML
page = Nokogiri::XML(xml)
p page.xpath("//*[@id='level3']").remove.xpath("//*[@id='level2']").inner_text
# => "\n \n Here is some text i want\n \n Here is some more text i want\n \n Oh i want this text too :)\n "
Now, you may clean the output text if you wish.
If your HTML fragment is in html, then you could do something like this:
doc = Nokogiri::HTML(html)
div = doc.at_css('#level2') # Extract <div id="level2">
div.at_css('#level3').remove # Remove <div id="level3">
text_you_want = div.inner_text
You could also do it with XPath but I find CSS selectors a bit simpler for simple cases like this.
