Use XPath to group siblings from an HTML/XML document? - ruby

I want to transform an HTML or XML document by grouping previously ungrouped sibling nodes.
For example, I want to take the following fragment:
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
Into this:
<section>
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
</section>
<section>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
</section>
Is this possible using simple XPath selectors and an XML parser like Nokogiri? Or do I need to implement a SAX parser for this task?

Updated Answer
Here's a general solution that creates a hierarchy of <section> elements based on header levels and their following siblings:
class Nokogiri::XML::Node
  # Create a hierarchy on a document based on heading levels
  #   wrap   : e.g. "<section>" or "<div class='section'>"
  #   stops  : array of tag names that stop all sections; use nil for none
  #   levels : array of tag names that control nesting, in order
  def auto_section(wrap='<section>', stops=%w[hr], levels=%w[h1 h2 h3 h4 h5 h6])
    levels = Hash[ levels.zip(0...levels.length) ]
    stops  = stops && Hash[ stops.product([true]) ]
    stack  = []
    children.each do |node|
      unless level = levels[node.name]
        level = stops && stops[node.name] && -1
      end
      stack.pop while (top=stack.last) && top[:level]>=level if level
      stack.last[:section].add_child(node) if stack.last
      if level && level >=0
        section = Nokogiri::XML.fragment(wrap).children[0]
        node.replace(section); section << node
        stack << { :section=>section, :level=>level }
      end
    end
  end
end
Here is this code in use, and the result it gives.
The original HTML
<body>
<h1>Main Section 1</h1>
<p>Intro</p>
<h2>Subhead 1.1</h2>
<p>Meat</p><p>MOAR MEAT</p>
<h2>Subhead 1.2</h2>
<p>Meat</p>
<h3>Caveats</h3>
<p>FYI</p>
<h4>ProTip</h4>
<p>Get it done</p>
<h2>Subhead 1.3</h2>
<p>Meat</p>
<h1>Main Section 2</h1>
<h3>Jumpin' in it!</h3>
<p>Level skip!</p>
<h2>Subhead 2.1</h2>
<p>Back up...</p>
<h4>Dive! Dive!</h4>
<p>...and down</p>
<hr /><p id="footer">Copyright © All Done</p>
</body>
The conversion code
# Use XML only so that we can pretty-print the results; HTML works fine, too
doc = Nokogiri::XML(html,&:noblanks) # stripping whitespace allows indentation
doc.at('body').auto_section # make the magic happen
puts doc.to_xhtml # show the result with indentation
The result
<body>
<section>
<h1>Main Section 1</h1>
<p>Intro</p>
<section>
<h2>Subhead 1.1</h2>
<p>Meat</p>
<p>MOAR MEAT</p>
</section>
<section>
<h2>Subhead 1.2</h2>
<p>Meat</p>
<section>
<h3>Caveats</h3>
<p>FYI</p>
<section>
<h4>ProTip</h4>
<p>Get it done</p>
</section>
</section>
</section>
<section>
<h2>Subhead 1.3</h2>
<p>Meat</p>
</section>
</section>
<section>
<h1>Main Section 2</h1>
<section>
<h3>Jumpin' in it!</h3>
<p>Level skip!</p>
</section>
<section>
<h2>Subhead 2.1</h2>
<p>Back up...</p>
<section>
<h4>Dive! Dive!</h4>
<p>...and down</p>
</section>
</section>
</section>
<hr />
<p id="footer">Copyright All Done</p>
</body>
Original Answer
Here's an answer using no XPath, but Nokogiri. I've taken the liberty of making the solution somewhat flexible, handling arbitrary start/stops (but not nested sections).
html = "<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
<hr>
<p id='footer'>All done!</p>"
require 'nokogiri'
class Nokogiri::XML::Node
  # Provide a block that returns:
  #   true  - for nodes that should start a new section
  #   false - for nodes that should not start a new section
  #   :stop - for nodes that should stop any current section but not start a new one
  def group_under(name="section")
    group = nil
    element_children.each do |child|
      case yield(child)
      when false, nil
        group << child if group
      when :stop
        group = nil
      else
        group = document.create_element(name)
        child.replace(group)
        group << child
      end
    end
  end
end
doc = Nokogiri::HTML(html)
doc.at('body').group_under do |node|
if node.name == 'hr'
:stop
else
%w[h1 h2 h3 h4 h5 h6].include?(node.name)
end
end
puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <section><h2>Header</h2>
#=> <p>First paragraph</p>
#=> <p>Second paragraph</p></section>
#=>
#=> <section><h2>Second header</h2>
#=> <p>Third paragraph</p>
#=> <p>Fourth paragraph</p></section>
#=>
#=> <hr>
#=> <p id="footer">All done!</p>
#=> </body></html>
For an XPath-based approach, see the question "XPath: select all following siblings until another sibling".

One way using xpath is to select all the p elements that follow your h2 and from them subtract the p elements that also follow the next h2:
doc = Nokogiri::HTML.fragment(html)
doc.css('h2').each do |h2|
nodeset = h2.xpath('./following-sibling::p')
next_h2 = h2.at('./following-sibling::h2')
nodeset -= next_h2.xpath('./following-sibling::p') if next_h2
section_tag = h2.add_previous_sibling Nokogiri::XML::Node.new('section',doc)
h2.parent = section_tag
nodeset.each {|n| n.parent = section_tag}
end

XPath can only select things from your input document; it can't transform it into a new document. For that you need XSLT or some other transformation language. I guess if you're into Nokogiri then the previous answers will be useful, but for completeness, here's what it looks like in XSLT 2.0:
<xsl:for-each-group select="*" group-starting-with="h2">
<section>
<xsl:copy-of select="current-group()"/>
</section>
</xsl:for-each-group>

Related

How to wrap Nokogiri nodeset in ONE span

My goal is to wrap all paragraphs after the initial paragraph within a single span. I'm trying to figure out how to wrap a nodeset in one span; .wrap() wraps each node in its own span. That is, I want:
<p>First</p>
<p>Second</p>
<p>Third</p>
To become:
<p>First</p>
<span>
<p>Second</p>
<p>Third</p>
</span>
Any sample code to help? Thanks!
I'd do it as below:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<-html)
<p>First</p>
<p>Second</p>
<p>Third</p>
html
nodeset = doc.css("p")
new_node = Nokogiri::XML::Node.new('span',doc)
new_node << nodeset[1..-1]
nodeset.first.after(new_node)
puts doc.to_html
# >> <p>First</p><span><p>Second</p>
# >> <p>Third</p></span>
# >>
I'd do it something like this:
require 'nokogiri'
html = '<p>First</p>
<p>Second</p>
<p>Third</p>
'
doc = Nokogiri::HTML(html)
paragraphs = doc.search('p')[1..-1].unlink
doc.at('p').after('<span>')
doc.at('span').add_child(paragraphs)
puts doc.to_html
Which results in HTML looking like:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>First</p>
<span><p>Second</p>
<p>Third</p></span>
</body></html>
To give you an idea what's happening, here's a more verbose output showing intermediate changes to the doc:
paragraphs = doc.search('p')[1..-1].unlink
paragraphs.to_html
# => "<p>Second</p><p>Third</p>"
doc.at('p').after('<span>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>First</p>\n<span></span>\n\n</body></html>\n"
doc.at('span').add_child(paragraphs)
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>First</p>\n<span><p>Second</p>\n<p>Third</p></span>\n\n</body></html>\n"
Looking at the initial HTML, I'm not sure the approach asked about will work well for normal, everyday HTML. However, if you are absolutely sure it'll never change from the
<p>...</p>
<p>...</p>
<p>...</p>
layout, then you should be OK. Any answer based on the initial sample HTML will blow up miserably if the HTML really is something like:
<div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
...
<div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>

Distribute the returns of a loop among two sections

I've basically got these two templates:
<article>
<section>
...
</section>
<section>
...
</section>
</article>
and:
@posts.each do |post|
<h3> post.title </h3>
<p> post.body </p>
<hr />
I can replace ... in <section> with yield and have every post in this section. Now what I want to do is to alternate this every turn so that each of the two sections gets filled up equally. I'm sure there must be some relatively easy way to achieve this? I somehow can't think of one for now.
edit: don't get me wrong. the first template can (and probably should) be merged with the second.
Do this in the first section:
@posts.select.each_with_index{|_,i| i.even? }
and this in the other one:
@posts.select.each_with_index{|_,i| i.odd? }
If you don't mind that the elements are not alternating between the columns but go top to bottom, you can do:
<article>
  <% @posts.in_groups(2, false) do |grouped_posts| %>
    <section>
      <% grouped_posts.each do |post| %>
        <h3><%= post.title %></h3>
        <p><%= post.body %></p>
        <hr />
      <% end %>
    </section>
  <% end %>
</article>
Outside of Rails you can either pull in the necessary bits for in_groups from the active support gem
require 'active_support/core_ext/array/grouping'
or you can use this alternative approach which works very similarly:
@posts.each_slice((@posts.size/2.0).ceil).to_a
The only difference is how an empty array is handled:
@posts = []
@posts.in_groups(2, false) #=> [[], []]
@posts.each_slice((@posts.size/2.0).ceil).to_a #=> raises ArgumentError (invalid slice size), because the slice size is 0
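Both ways of splitting can be tried outside Rails with plain arrays. This is a sketch with made-up data; halves is a hypothetical helper name, and the max guard is an addition so that an empty list yields two empty halves.

```ruby
posts = %w[a b c d e]

# Alternating split: even indexes go left, odd indexes go right.
left, right = posts.partition.with_index { |_, i| i.even? }

# Top-to-bottom split into two halves (hypothetical helper).
def halves(items)
  size = [(items.size / 2.0).ceil, 1].max # never ask each_slice for a slice size of 0
  first, second = items.each_slice(size).to_a
  [first || [], second || []]
end

top, bottom = halves(posts)

puts "alternating:   #{left.inspect} / #{right.inspect}"
puts "top to bottom: #{top.inspect} / #{bottom.inspect}"
```

In a view, each half can then be rendered inside its own section.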

Parsing nodes with Nokogiri?

I'm parsing web pages and I want to get the link from the <img src> by finding the <div id="image">.
How do I do this in Nokogiri? I tried walking through the child nodes but it fails.
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
This is my code:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image").each do |node|
node.children().each do |c|
puts c.attr("src")
end
end
Any ideas?
Try this and let me know if it works for you
require 'nokogiri'
source = <<-HTML
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
HTML
doc = Nokogiri::HTML(source)
doc.css('div#image > div > img').each do |image|
puts image.attr('src')
end
Output:
img.jpg
Here is a great resource: http://ruby.bastardsbook.com/chapters/html-parsing/
Modifying an example a bit, I get this:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image img").each do |img|
puts img.attr("src")
end
That said, prefer the ID selector, #image, over the class selector, .image, when you can. An ID should be unique within the document, so the match is unambiguous, and the lookup is typically faster.

Ruby/Nokogiri inspect reveals more than class. I need the extra item inspect shows

In the following code:
page = Nokogiri::HTML($browser.html)
page_links = page.css("a").select
page_links.each do |link|
if not link.nil?
if not link['href'].nil? and !!link['href']["/about"]
puts link.class
puts link.inspect
end
end
end
the link.class outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623d3c name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb623c7e name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb623c74 name="class" value="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623c6a name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb623c60 name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Text:0x..fdb623792 "PetSmart Winchester">]>
And link.inspect outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623666 name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb6235a8 name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb62359e name="class" value="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623594 name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb62358a name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Element:0x..fdb6230bc name="div" attributes=[#<Nokogiri::XML::Attr:0x..fdb62304e name="style" value="height:110px; width:110px;">] children=[#<Nokogiri::XML::Element:0x..fdb622e1e name="img" attributes=[#<Nokogiri::XML::Attr:0x..fdb622db0 name="style" value=" height: 110px; width: 110px;">, #<Nokogiri::XML::Attr:0x..fdb622da6 name="class" value="mja">, #<Nokogiri::XML::Attr:0x..fdb622d9c name="src" value="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">]>]>]>
In Nokogiri I can access the link text by link.content and the link url by link['href'] . Yet neither of these methods work for image source from the inspect results.
How can I get the img src within this example code that inspect is revealing?
UPDATE: HERE IS THE HTML CODE
<div class="HWb">
<div class="erb">
<div class="ubb">
<div role="button" class="a-f-e c-b c-b-T c-b-Oe c-b-H-ra L0a X9" tabindex="0"
data-placeid="6817440171144926830" data-source="lo-gp" data-inline="true"
data-tooltip-delay="600" data-tooltip-align="b,l" data-oid="104882190640970316938"
data-size="small">
<span class="TIa c-b-fa"></span>
</div>
</div>
<h3 class="drb">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP"
action-type="8">PetSmart Winchester</a>
</h3>
</div>
<div class="Qbb">
<span class="vqb SIa">Pet Store</span>
<span class="lja SIa">
<a href="//www.google.com/url?sa=D&oi=plus&q=https://maps.google.com/maps?q%3DPetsmart%2Bloc:22601%26numal%3D1%26hl%3Den-US%26gl%3DUS%26mix%3D2%26opth%3Dplatter_request:2%26ie%3DUTF8%26cid%3D6817440171144926830%26iwloc%3DA"
target="_blank" class="a-n uqb">2310 Legge Boulevard, Winchester, VA</a>
</span>
<span class="SIa">(540) 662-5544</span>
</div>
<div class="crb">
<div class="Pbb a-f-e">
<div class="Fbb">
<div class="cca">
<div class="tob">
<div class="xob">“Do not bother with the grooming salon, the staff are unusually stupid.
Otherwise the store is a typical petsmart.”</div>
</div>
</div>
</div>
</div>
<div class="dWa">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP"
action-type="8"><div style="height:110px; width:110px;"><img src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg" class="mja" style=" height: 110px; width: 110px;"></div></a>
</div>
</div>
Without the HTML you're making it a lot harder, but after some digging into the inspect output, I think I have a reasonable HTML snippet.
This is how I'd go about getting to the <img src="..."> tag:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<a action-type="8" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP" target="_top" href="./104882190640970316938/about">
<div style="height:110px; width:110px;">
<img style=" height: 110px; width: 110px;" class="mja" src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">
</div>
</a>
EOT
doc.at('img')['src'] # => "https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg"
You'll need to take the time to improve your question and provide more detail if that doesn't work.
If you are not sure whether you will have 0, 1 or 1+ instances of a tag, use search because it returns a NodeSet, which acts like an Array, making it easy to deal with no, single or multiple occurrences:
doc.search('img').map{ |img| img['src'] }
will return all the <img src="..."> values in the document in an array. You can iterate over those easily or use empty? to see if there are no hits:
doc.search('img').map{ |img| img['src'] }.each do |src|
# do something with src if any are found.
end
If it's possible you'll have <img> tags without the src="..." parameter, use compact to filter them out before iterating:
doc.search('img').map{ |img| img['src'] }.compact.each do |src|
# do something with src if any are found.
end
If you only expect 0 or 1 occurrence, try:
src = doc.at('img') && doc.at('img')['src']
as in:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img src="blah">
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> "blah"
or, without the src parameter:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
or missing the <img> tag entirely:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
If you want to continue to use an if block:
if doc.at('img')
puts doc.at('img')['src']
end
will accomplish what your:
if not doc.at('img').nil?
puts doc.at('img')['src']
end
accomplishes, but in a more straightforward and concise manner, while maintaining readability.
The downside to doing two at lookups is it can be costly in big documents, especially inside a loop. You could get all Perlish and use:
if (img = doc.at('img'))
puts img['src']
end
but that's not really the Ruby way. For clarity and long-term maintenance, I'd probably use:
img = doc.at('img')
if (img)
puts img['src']
end
but that exposes the img variable, cluttering up things. It's programmer's choice at that point.
Your two outputs look like they come from two different links (i.e., each one shows both the link.class and the link.inspect output for a separate link).
Assuming we are talking about getting the image source in the second output, it looks like the HTML is something like:
<div><img src="image_src" /></div>
Assuming that is true, then you need to do:
puts link.at_css("img")['src']
I have found if you take the results from link.inspect, since they are a string, and use regex you can grab the image URL.
link.inspect[/http.*com.*"/].chop # Since all other urls are relative ./
I don't believe this is the best method. I will try working with the other answers first.

How do I do a regex search in Nokogiri for text that matches a certain beginning?

Given:
require 'rubygems'
require 'nokogiri'
value = Nokogiri::HTML.parse(<<-HTML_END)
"<html>
<body>
<p id='para-1'>A</p>
<div class='block' id='X1'>
<h1>Foo</h1>
<p id='para-2'>B</p>
</div>
<p id='para-3'>C</p>
<h2>Bar</h2>
<p id='para-4'>D</p>
<p id='para-5'>E</p>
<div class='block' id='X2'>
<p id='para-6'>F</p>
</div>
</body>
</html>"
HTML_END
I want to do something like what I can do in Hpricot:
divs = value.search('//div[@id^="para-"]')
How do I do a pattern search for elements in XPath style?
Where would I find the documentation to help me? I didn't see this in the rdocs.
Use the xpath function starts-with:
value.xpath('//p[starts-with(@id, "para-")]').each { |x| puts x['id'] }
divs = value.css('div[id^="para-"]')
And some docs you're seeking:
Nokogiri: http://nokogiri.org/
XPath: http://www.w3.org/TR/xpath20/
CSS3 Selectors: http://www.w3.org/TR/selectors/
# Adds an xpath_regex method that rewrites tag[@attr~=/pattern/] into a call to a custom regex() XPath function
Nokogiri::XML::Node.send(:define_method, 'xpath_regex') { |*args|
  xpath = args[0]
  rgxp = /\/([a-z]+)\[@([a-z\-]+)~=\/(.*?)\/\]/
  xpath.gsub!(rgxp) { |s| m = s.match(rgxp); "/#{m[1]}[regex(.,'#{m[2]}','#{m[3]}')]" }
  self.xpath(xpath, Class.new {
    def regex node_set, attr, regex
      node_set.find_all { |node| node[attr] =~ /#{regex}/ }
    end
  }.new)
}
Usage:
divs = Nokogiri::HTML(page.root.to_html).
  xpath_regex("//div[@class~=/axtarget$/]//div[@class~=/^carbo/]")
