Finding the common ancestor of a group of XPaths

Say I have these paths:
html/body/span/div/p/h1/i/font
html/body/span/div/div/div/div/table/tr/p/h1
html/body/span/p/h1/b
html/body/span/div
How can I get the common ancestor? In this case, span would be the common ancestor of font, h1, b, and div.

To find common ancestry between two nodes:
(node1.ancestors & node2.ancestors).first
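The reason `.first` gives the nearest shared ancestor is ordering: `ancestors` starts with the parent and works outward toward the root, and an order-preserving intersection leaves the closest shared node at the front. The sketch below illustrates this with plain Ruby arrays standing in for the two ancestor lists. The labels are made up, and it assumes Nokogiri's NodeSet#& keeps the receiver's order the way Array#& does.

```ruby
# Plain arrays standing in for node.ancestors (nearest ancestor first).
# Labels such as "h1#1" are hypothetical; they keep different elements
# with the same tag name distinct, as real node objects would be.
font_ancestors = %w[i h1#1 p#1 div#1 span body html]
b_ancestors    = %w[h1#2 p#2 span body html]

# Array#& keeps the order of the receiver, so the nearest shared
# ancestor ends up first in the result.
common  = font_ancestors & b_ancestors
nearest = common.first

p common
p nearest
```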
A more generalized function that works with multiple nodes:
# accepts node objects or selector strings
class Nokogiri::XML::Element
  def common_ancestor(*nodes)
    nodes = nodes.map do |node|
      String === node ? self.document.at(node) : node
    end
    nodes.inject(self.ancestors) do |common, node|
      common & node.ancestors
    end.first
  end
end
# usage:
node1.common_ancestor(node2, '//foo/bar')
# => <ancestor node>
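If you only have the path strings from the question rather than parsed nodes, the common ancestor is the longest run of leading steps that all paths share. Here is a minimal sketch in plain Ruby. The helper name `common_ancestor_path` is hypothetical, and it assumes equal steps at the same depth refer to the same element, which simple paths without positional predicates do not guarantee.

```ruby
# Hypothetical helper: works on path strings, not on parsed nodes.
# Returns the longest shared prefix of path steps, or nil for no input.
def common_ancestor_path(paths)
  first, *rest = paths.map { |path| path.split('/') }
  return nil if first.nil?

  # Keep steps from the first path while every other path has the
  # same step at the same position.
  shared = first.take_while.with_index do |step, i|
    rest.all? { |other| other[i] == step }
  end
  shared.join('/')
end

paths = [
  'html/body/span/div/p/h1/i/font',
  'html/body/span/div/div/div/div/table/tr/p/h1',
  'html/body/span/p/h1/b',
  'html/body/span/div'
]
puts common_ancestor_path(paths)
```

Note that when one path is a prefix of all the others, the result is that path itself rather than a strict ancestor of it.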

The function common_ancestor below does what you want.
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(DATA)
def common_ancestor *elements
  return nil if elements.empty?
  elements.map! do |e| [ e, [e] ] end #prepare array
  elements.map! do |e| # build array of ancestors for each given element
    e[1].unshift e[0] while e[0].respond_to?(:parent) and e[0] = e[0].parent
    e[1]
  end
  # merge corresponding ancestors and find the last where all ancestors are the same
  elements[0].zip(*elements[1..-1]).select { |e| e.uniq.length == 1 }.flatten.last
end
i = doc.xpath('//*[@id="i"]').first
div = doc.xpath('//*[@id="div"]').first
h1 = doc.xpath('//*[@id="h1"]').first
p common_ancestor i, div, h1 # => gives the p element
__END__
<html>
<body>
<span>
<p id="common-ancestor">
<div>
<p><h1><i id="i"></i></h1></p>
<div id="div"></div>
</div>
<p>
<h1 id="h1"></h1>
</p>
<div></div>
</p>
</span>
</body>
</html>

Related

How to parse consecutive tags with Nokogiri?

I have HTML code like this:
<div id="first">
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</div>
My code does not work.
doc.css("first").each do |item|
label = item.css("dt")
value = item.css("dd")
end
It shows all the <dt> tags first and then the <dd> tags, but I need "label: value" pairs.
First of all, your HTML should have the <dt> and <dd> elements inside a <dl>:
<div id="first">
<dl>
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</dl>
</div>
but that won't change how you parse it. You want to find the <dt>s and iterate over them, then at each <dt> you can use next_element to get the <dd>; something like this:
doc = Nokogiri::HTML('<div id="first"><dl>...')
doc.css('#first').search('dt').each do |node|
  puts "#{node.text}: #{node.next_element.text}"
end
That should work as long as the structure matches your example.
Under the assumption that some <dt> may have multiple <dd>, you want to find all <dt> and then (for each) find the following <dd> before the next <dt>. This is pretty easy to do in pure Ruby, but more fun to do in just XPath. ;)
Given this setup:
require 'nokogiri'
html = '<dl id="first">
<dt>Label1</dt><dd>Value1</dd>
<dt>Label2</dt><dd>Value2</dd>
<dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
<dt>Label4</dt><dd>Value4</dd>
</dl>'
doc = Nokogiri.HTML(html)
Using no XPath:
doc.css('dt').each do |dt|
  dds = []
  n = dt.next_element
  begin
    dds << n
    n = n.next_element
  end while n && n.name=='dd'
  p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
Using a Little XPath:
doc.css('dt').each do |dt|
  dds = dt.xpath('following-sibling::*').chunk{ |n| n.name }.first.last
  p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
Using Lotsa XPath:
doc.css('dt').each do |dt|
  ct = dt.xpath('count(following-sibling::dt)')
  dds = dt.xpath("following-sibling::dd[count(following-sibling::dt)=#{ct}]")
  p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
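All three variants do the same grouping: each <dt> collects the <dd> items that follow it until the next <dt>. As a rough illustration of that grouping on its own, here is a sketch using Enumerable#slice_before on plain [name, text] pairs that stand in for the element children of the <dl>. The pairs are an assumption made so the snippet runs without a parser; with Nokogiri the same call would be applied to the actual child elements.

```ruby
# [name, text] pairs standing in for the <dt>/<dd> children of the <dl>.
items = [
  ['dt', 'Label1'], ['dd', 'Value1'],
  ['dt', 'Label2'], ['dd', 'Value2'],
  ['dt', 'Label3'], ['dd', 'Value3a'], ['dd', 'Value3b'],
  ['dt', 'Label4'], ['dd', 'Value4']
]

# Start a new slice at every dt, then split each slice into the
# label (the dt text) and the texts of the dd items that follow it.
groups = items.slice_before { |name, _| name == 'dt' }.map do |(_, label), *dds|
  [label, dds.map(&:last)]
end

groups.each { |group| p group }
```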
After looking at the other answer here is an inefficient way of doing the same thing.
require 'nokogiri'
a = Nokogiri::HTML('<div id="first"><dt>Label1</dt><dd>Value1</dd><dt>Label2</dt><dd>Value2</dd></div>')
dt = []
dd = []
a.css("#first").each do |item|
  item.css("dt").each {|t| dt << t.text}
  item.css("dd").each {|t| dd << t.text}
end
dt.each_index do |i|
  puts dt[i] + ': ' + dd[i]
end
In a CSS selector, an ID must be prefixed with the # symbol. For a class, the prefix is the . symbol.

Nokogiri replace inner text with <span>ed words

Here's an example HTML fragment:
<p class="stanza">Thus grew the tale of Wonderland:<br/>
Thus slowly, one by one,<br/>
Its quaint events were hammered out -<br/>
And now the tale is done,<br/>
And home we steer, a merry crew,<br/>
Beneath the setting sun.<br/></p>
I need to surround each word with a <span id="w0">Thus </span> like this:
<span id='w1'>Anon,</span> <span id='w2'>to</span> <span id='w3'>sudden</span>
<span id='w4'>silence</span> <span id='w5'>won,</span> ....
I've written this, which creates the new fragment. How do I swap the new fragment in for the old text?
def callchildren(n)
  n.children.each do |n| # call recursively until we arrive at a node w/o children
    callchildren(n)
  end
  if n.node_type == 3 && n.to_s.strip.empty? != true
    new_node = ""
    n.to_s.split.each { |w|
      new_node = new_node + "<span id='w#{$word_number}'>#{w}</span> "
      $word_number += 1
    }
    # puts new_node
    # HELP? How do I get new_node swapped in?
  end
end
My attempt to provide a solution for your problem:
require 'nokogiri'
Inf = 1.0/0.0
def number_words(node, counter = nil)
  # define infinite counter (Ruby >= 1.8.7)
  counter ||= (1..Inf).each
  doc = node.document
  unless node.is_a?(Nokogiri::XML::Text)
    # recurse for children and collect all the returned
    # nodes into an array
    children = node.children.inject([]) { |acc, child|
      acc += number_words(child, counter)
    }
    # replace the node's children
    node.children = Nokogiri::XML::NodeSet.new(doc, children)
    return [node]
  end
  # for text nodes, we generate a list of span nodes
  # and return it (this is more secure than OP's original
  # approach that is vulnerable to HTML injection)
  node.to_s.strip.split.inject([]) { |acc, word|
    span = Nokogiri::XML::Node.new("span", node)
    span.content = word
    span["id"] = "w#{counter.next}"
    # add a space if we are not at the beginning
    acc << Nokogiri::XML::Text.new(" ", doc) unless acc.empty?
    # add our new span to the collection
    acc << span
  }
end
# demo
if __FILE__ == $0
  h = <<-HTML
<p class="stanza">Thus grew the tale of Wonderland:<br/>
Thus slowly, one by one,<br/>
Its quaint events were hammered out -<br/>
And now the tale is done,<br/>
And home we steer, a merry crew,<br/>
Beneath the setting sun.<br/></p>
  HTML
  doc = Nokogiri::HTML.parse(h)
  number_words(doc)
  p doc.to_xml
end
Given a Nokogiri::HTML::Document in doc, you could do something like this:
i = 0
doc.search('//p[@class="stanza"]/text()').each do |n|
  spans = n.content.scan(/\S+/).map do |s|
    "<span id=\"w#{i += 1}\">" + s + '</span>'
  end
  n.replace(spans.join(' '))
end
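One caveat with building the markup as a string: `n.content` returns decoded text, so a word such as `<b>` would be pasted back in as raw markup. Below is a minimal sketch of the per-text-node step with escaping added, using ERB::Util.html_escape from the standard library. The helper name `wrap_words` is hypothetical, and the returned string would still need to be passed to `n.replace` as in the snippet above.

```ruby
require 'erb'

# Hypothetical helper: wraps each whitespace-separated word of +text+
# in a numbered span, escaping the word so it cannot inject markup.
# +counter+ is any enumerator that yields the next word number.
def wrap_words(text, counter)
  text.split.map do |word|
    %(<span id="w#{counter.next}">#{ERB::Util.html_escape(word)}</span>)
  end.join(' ')
end

# Endless counter shared across text nodes, so numbering continues.
counter = (1..Float::INFINITY).each

puts wrap_words('Thus grew the tale', counter)
puts wrap_words('one <b> two', counter)
```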

Nokogiri: need to turn markup partitioned by `hr` into divs

Given markup inside an HTML document that looks like this
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
I'd like to produce this:
<div>
<h3>test</h3>
<p>test</p>
</div>
<div>
<h3>test2</h3>
<p>test2</p>
</div>
What's the most elegant way to do this with Nokogiri?
Edit: Reworked answer to be a bit cleaner.
Edit2: Small rewrite to shorten by two lines
require 'nokogiri'
doc = Nokogiri::HTML <<ENDHTML
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
ENDHTML
body = doc.at_css('body') # Created by parsing as HTML
kids = body.xpath('./*') # Every child of the body
body.inner_html = "" # Empty the body now that we have our nodes
div = (body << "<div>").first # Create our first container in the body
kids.each do |node| # For every child that was in the body...
  if node.name=='hr'
    div = (body << '<div>').first # Create a new container for stuff
  else
    div << node # Move this into the last container
  end
end
div.remove unless div.child # Get rid of a trailing, empty div
puts body.inner_html
#=> <div>
#=> <h3>test</h3>
#=> <p>test</p>
#=> </div>
#=> <div>
#=> <h3>test2</h3>
#=> <p>test2</p>
#=> </div>
This is how I'd go about it:
require 'nokogiri'
html = '
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
'
doc = Nokogiri::HTML(html)
doc2 = Nokogiri::HTML('<body />')
doc2_body = doc2.at('body')
doc.search('//h3 | //p').each_slice(2) do |ns|
  nodeset = Nokogiri::XML::NodeSet.new(doc2, ns)
  div = Nokogiri::XML::Node.new('div', doc2)
  div.add_child(nodeset)
  doc2_body.add_child(div)
end
puts doc2_body.inner_html
# >> <div>
# >> <h3>test</h3>
# >> <p>test</p>
# >> </div>
# >> <div>
# >> <h3>test2</h3>
# >> <p>test2</p>
# >> </div>
Here's an answer that uses Ruby 1.9.2's Enumerable#chunk to split the children into sections and also exercises Nokogiri's NodeSet class:
require 'nokogiri'
doc = Nokogiri::HTML <<ENDHTML
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
ENDHTML
result = Nokogiri::XML::NodeSet.new( doc,
  doc.xpath('//body/*').chunk do |n|
    n.name=='hr'
  end.reject do |matched,nodes|
    matched
  end.map do |matched,nodes|
    doc.create_element('div').tap do |div|
      div << Nokogiri::XML::NodeSet.new( doc, nodes )
    end
  end )
puts result
#=> <div>
#=> <h3>test</h3>
#=> <p>test</p>
#=> </div>
#=> <div>
#=> <h3>test2</h3>
#=> <p>test2</p>
#=> </div>
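To see what `chunk` hands to `reject` and `map` in the pipeline above, here is the same sequence of calls on plain strings standing in for the node names. The strings are an assumption made so the snippet runs without a parser.

```ruby
# Tag names standing in for the children of <body>.
names = %w[h3 p hr h3 p hr]

# chunk groups consecutive items by the block's result, yielding
# [result, items] pairs: runs of non-hr items and runs of hr items.
chunks = names.chunk { |name| name == 'hr' }.to_a

# Drop the hr runs and keep only the grouped items of the other runs.
sections = chunks.reject { |is_hr, _| is_hr }.map { |_, group| group }

p chunks
p sections
```

Because the separator runs are simply discarded, this grouping does not depend on a trailing hr being present, unlike an approach that assumes a fixed number of elements per section.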

Sorting a tree structure by folders first in Ruby

I have an array of paths, array = [
'a.txt',
'b/a.txt',
'a/a.txt',
'a/z/a.txt'
]
I need to create a tree structure (for the jTree plugin), but it has to be sorted with folders first (alphabetically) and then leaves (also alphabetically).
A sorted tree structure with the above example would look like this:
a
z
a.txt
a.txt
b
a.txt
a.txt
EDIT: I'm looking to build a tree of HTML lists and list items, where each node is an LI and, if it's a folder, it has another UL as a sibling. This is one of the formats the jTree plugin takes as input. Structure for the above example:
<ul>
<li class="folder">a</li>
<ul>
<li class="folder">z</li>
<ul>
<li class="leaf">a.txt</li>
</ul>
</ul>
<li class="folder">b</li>
<ul>
<li class="leaf">a.txt</li>
</ul>
<li class="leaf">a.txt</li>
</ul>
This will build the tree structure as a hash tree:
array = ["home", "about", "about/history", "about/company", "about/history/part1", "about/history/part2"]
auto_hash = Hash.new{ |h,k| h[k] = Hash.new &h.default_proc }
array.each{ |path|
  sub = auto_hash
  path.split( "/" ).each{ |dir| sub[dir]; sub = sub[dir] }
}
require 'rubygems'
require 'builder'
paths = ["home", "about", "about/history", "about/company", "about/history/part1", "about/history/part2"]
auto_hash = Hash.new{ |h,k| h[k] = Hash.new &h.default_proc }
paths.each do |path|
  sub = auto_hash
  path.split( "/" ).each{ |dir| sub[dir]; sub = sub[dir] }
end
def build_branch(branch, xml)
  directories = branch.keys.reject{|k| branch[k].empty? }.sort
  leaves = branch.keys.select{|k| branch[k].empty? }.sort
  directories.each do |directory|
    xml.li(directory, :class => 'folder')
    xml.ul do
      build_branch(branch[directory], xml)
    end
  end
  leaves.each do |leaf|
    xml.li(leaf, :class => 'leaf')
  end
end
xml = Builder::XmlMarkup.new(:indent => 2)
xml.ul do
  build_branch(auto_hash, xml)
end
puts xml.target!
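To check the folders-first ordering without the Builder gem, the same hash tree can be rendered as indented text. This is a minimal sketch applied to the paths from the question. The helper name `render` is hypothetical, and like the code above it treats any entry without children as a leaf, so an empty folder would be listed among the files.

```ruby
paths = ['a.txt', 'b/a.txt', 'a/a.txt', 'a/z/a.txt']

# Auto-vivifying hash: reading a missing key creates a nested hash.
tree = Hash.new { |hash, key| hash[key] = Hash.new(&hash.default_proc) }
paths.each { |path| path.split('/').inject(tree) { |node, part| node[part] } }

# Hypothetical helper: returns the lines of an indented listing,
# folders first (alphabetically), then leaves (alphabetically).
def render(branch, depth = 0)
  folders, leaves = branch.keys.sort.partition { |key| !branch[key].empty? }
  lines = []
  folders.each do |name|
    lines << ('  ' * depth) + name
    lines.concat(render(branch[name], depth + 1))
  end
  leaves.each { |name| lines << ('  ' * depth) + name }
  lines
end

puts render(tree)
```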
