finding common ancestor from a group of xpath?

Say I have these XPaths:
html/body/span/div/p/h1/i/font
html/body/span/div/div/div/div/table/tr/p/h1
html/body/span/p/h1/b
html/body/span/div
How can I get their common ancestor? In this case, the common ancestor of "font, h1, b, div" would be "span".

To find common ancestry between two nodes:
(node1.ancestors & node2.ancestors).first
A more generalized function that works with multiple nodes:
# accepts node objects or selector strings
class Nokogiri::XML::Element
  def common_ancestor(*nodes)
    nodes = nodes.map do |node|
      String === node ? self.document.at(node) : node
    end
    nodes.inject(self.ancestors) do |common, node|
      common & node.ancestors
    end.first
  end
end
# usage:
node1.common_ancestor(node2, '//foo/bar')
# => <ancestor node>
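If all you have is the path strings from the question rather than parsed nodes, a minimal sketch in plain Ruby is to take the longest run of leading segments that every path shares. The name common_ancestor_path is made up for this example. It assumes the paths carry no positional predicates, so two different sibling elements with the same name cannot be told apart, and if one path is a prefix of all the others the result is that path itself rather than its parent.

```ruby
# Hypothetical helper: works on path strings only, no document required.
def common_ancestor_path(*paths)
  return nil if paths.empty?

  first, *rest = paths.map { |path| path.split('/') }
  # Keep leading segments of the first path while every other path agrees.
  common = first.each_with_index.take_while do |segment, index|
    rest.all? { |other| other[index] == segment }
  end
  common.map(&:first).join('/')
end

paths = [
  'html/body/span/div/p/h1/i/font',
  'html/body/span/div/div/div/div/table/tr/p/h1',
  'html/body/span/p/h1/b',
  'html/body/span/div'
]
puts common_ancestor_path(*paths)
# prints html/body/span
```

The resulting string can then be handed to an XPath query if the node itself is needed.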

The function common_ancestor below does what you want.
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(DATA)
def common_ancestor *elements
  return nil if elements.empty?
  elements.map! do |e| [ e, [e] ] end # prepare array
  elements.map! do |e| # build array of ancestors for each given element
    e[1].unshift e[0] while e[0].respond_to?(:parent) and e[0] = e[0].parent
    e[1]
  end
  # merge corresponding ancestors and find the last where all ancestors are the same
  elements[0].zip(*elements[1..-1]).select { |e| e.uniq.length == 1 }.flatten.last
end
i = doc.xpath('//*[@id="i"]').first
div = doc.xpath('//*[@id="div"]').first
h1 = doc.xpath('//*[@id="h1"]').first
p common_ancestor i, div, h1 # => gives the p element
__END__
<html>
<body>
<span>
<p id="common-ancestor">
<div>
<p><h1><i id="i"></i></h1></p>
<div id="div"></div>
</div>
<p>
<h1 id="h1"></h1>
</p>
<div></div>
</p>
</span>
</body>
</html>

Related

How to parse consecutive tags with Nokogiri?

I have HTML code like this:
<div id="first">
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</div>
My code does not work.
doc.css("first").each do |item|
label = item.css("dt")
value = item.css("dd")
end
It shows all the <dt> tags first and then the <dd> tags, but I need "label: value" pairs.

First of all, your HTML should have the <dt> and <dd> elements inside a <dl>:
<div id="first">
<dl>
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</dl>
</div>
but that won't change how you parse it. You want to find the <dt>s and iterate over them, then at each <dt> you can use next_element to get the <dd>; something like this:
doc = Nokogiri::HTML('<div id="first"><dl>...')
doc.css('#first').search('dt').each do |node|
  puts "#{node.text}: #{node.next_element.text}"
end
That should work as long as the structure matches your example.

Under the assumption that some <dt> may have multiple <dd>, you want to find all <dt> and then (for each) find the following <dd> before the next <dt>. This is pretty easy to do in pure Ruby, but more fun to do in just XPath. ;)
Given this setup:
require 'nokogiri'
html = '<dl id="first">
<dt>Label1</dt><dd>Value1</dd>
<dt>Label2</dt><dd>Value2</dd>
<dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
<dt>Label4</dt><dd>Value4</dd>
</dl>'
doc = Nokogiri.HTML(html)
Using no XPath:
doc.css('dt').each do |dt|
  dds = []
  n = dt.next_element
  # collect consecutive <dd> siblings; stops safely if there are none
  while n && n.name == 'dd'
    dds << n
    n = n.next_element
  end
  p [dt.text, dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
Using a Little XPath:
doc.css('dt').each do |dt|
  dds = dt.xpath('following-sibling::*').chunk{ |n| n.name }.first.last
  p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
Using Lotsa XPath:
doc.css('dt').each do |dt|
  ct = dt.xpath('count(following-sibling::dt)')
  dds = dt.xpath("following-sibling::dd[count(following-sibling::dt)=#{ct}]")
  p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
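Another way to express the same grouping is Enumerable#slice_before, which starts a new group at every element matching a condition. The sketch below uses a small Struct (El, a name invented here) as a stand-in for parsed elements so it runs without Nokogiri. Since a Nokogiri NodeSet is Enumerable, the same call should work on something like doc.css('#first > *'), though that part is an assumption and is not exercised here.

```ruby
# Stand-in for parsed elements; only name and text are used below.
El = Struct.new(:name, :text)

children = [
  El.new('dt', 'Label1'), El.new('dd', 'Value1'),
  El.new('dt', 'Label2'), El.new('dd', 'Value2'),
  El.new('dt', 'Label3'), El.new('dd', 'Value3a'), El.new('dd', 'Value3b'),
  El.new('dt', 'Label4'), El.new('dd', 'Value4')
]

# Start a new group at every dt; the dd elements that follow land in that group.
pairs = children.slice_before { |el| el.name == 'dt' }.map do |dt, *dds|
  [dt.text, dds.map(&:text)]
end

pairs.each { |pair| p pair }
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]
```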

After looking at the other answer here is an inefficient way of doing the same thing.
require 'nokogiri'
a = Nokogiri::HTML('<div id="first"><dt>Label1</dt><dd>Value1</dd><dt>Label2</dt><dd>Value2</dd></div>')
dt = []
dd = []
a.css("#first").each do |item|
item.css("dt").each {|t| dt << t.text}
item.css("dd").each {|t| dd << t.text}
end
dt.each_index do |i|
puts dt[i] + ': ' + dd[i]
end
In CSS, an ID selector needs a leading # symbol (for example #first). For a class, it's the . symbol.

Nokogiri replace inner text with <span>ed words

Here's an example HTML fragment:
<p class="stanza">Thus grew the tale of Wonderland:<br/>
Thus slowly, one by one,<br/>
Its quaint events were hammered out -<br/>
And now the tale is done,<br/>
And home we steer, a merry crew,<br/>
Beneath the setting sun.<br/></p>
I need to surround each word with a numbered span, such as <span id="w0">Thus </span>, like this:
<span id='w1'>Anon,</span> <span id='w2'>to</span> <span id='w3'>sudden</span>
<span id='w4'>silence</span> <span id='w5'>won,</span> ....
I've written this, which creates the new fragment. How do I swap the new fragment in for the old text?
def callchildren(n)
n.children.each do |n| # call recursively until arrive at a node w/o children
callchildren(n)
end
if n.node_type == 3 && n.to_s.strip.empty? != true
new_node = ""
n.to_s.split.each { |w|
new_node = new_node + "<span id='w#{$word_number}'>#{w}</span> "
$word_number += 1
}
# puts new_node
# HELP? How do I get new_node swapped in?
end
end

My attempt to provide a solution for your problem:
require 'nokogiri'
Inf = 1.0/0.0
def number_words(node, counter = nil)
  # define infinite counter (Ruby >= 1.8.7)
  counter ||= (1..Inf).each
  doc = node.document
  unless node.is_a?(Nokogiri::XML::Text)
    # recurse for children and collect all the returned
    # nodes into an array
    children = node.children.inject([]) { |acc, child|
      acc += number_words(child, counter)
    }
    # replace the node's children
    node.children = Nokogiri::XML::NodeSet.new(doc, children)
    return [node]
  end
  # for text nodes, we generate a list of span nodes
  # and return it (this is more secure than OP's original
  # approach that is vulnerable to HTML injection)
  node.to_s.strip.split.inject([]) { |acc, word|
    # pass the document (not a node) as the owner of the new element
    span = Nokogiri::XML::Node.new("span", doc)
    span.content = word
    span["id"] = "w#{counter.next}"
    # add a space if we are not at the beginning
    acc << Nokogiri::XML::Text.new(" ", doc) unless acc.empty?
    # add our new span to the collection
    acc << span
  }
end
# demo
if __FILE__ == $0
h = <<-HTML
<p class="stanza">Thus grew the tale of Wonderland:<br/>
Thus slowly, one by one,<br/>
Its quaint events were hammered out -<br/>
And now the tale is done,<br/>
And home we steer, a merry crew,<br/>
Beneath the setting sun.<br/></p>
HTML
doc = Nokogiri::HTML.parse(h)
number_words(doc)
p doc.to_xml
end

Given a Nokogiri::HTML::Document in doc, you could do something like this:
i = 0
doc.search('//p[@class="stanza"]/text()').each do |n|
  spans = n.content.scan(/\S+/).map do |s|
    "<span id=\"w#{i += 1}\">" + s + '</span>'
  end
  n.replace(spans.join(' '))
end
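When the spans are built by string concatenation, as in the snippet above, any word containing &, < or > ends up in the markup unescaped. A minimal sketch of the escaping step in plain Ruby follows. The names HTML_ESCAPES and span_words are invented for this example, and the helper only produces the markup string; swapping it into the document is left to the code above.

```ruby
# Characters that must not reach the markup unescaped.
HTML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&#39;'
}.freeze

# Hypothetical helper: wraps each word of one text run in a numbered span,
# numbering from `start`, and escapes the word before interpolating it.
def span_words(text, start = 1)
  text.split.each_with_index.map do |word, offset|
    safe = word.gsub(/[&<>"']/, HTML_ESCAPES)
    "<span id=\"w#{start + offset}\">#{safe}</span>"
  end.join(' ')
end

puts span_words('Thus grew the <tale>')
# prints <span id="w1">Thus</span> <span id="w2">grew</span> <span id="w3">the</span> <span id="w4">&lt;tale&gt;</span>
```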

Nokogiri: need to turn markup partitioned by `hr` into divs

Given markup inside an HTML document that looks like this
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
I'd like to produce this:
<div>
<h3>test</h3>
<p>test</p>
</div>
<div>
<h3>test2</h3>
<p>test2</p>
</div>
What's the most elegant way to do this with Nokogiri?

Edit: Reworked answer to be a bit cleaner.
Edit2: Small rewrite to shorten by two lines
require 'nokogiri'
doc = Nokogiri::HTML <<ENDHTML
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
ENDHTML
body = doc.at_css('body') # Created by parsing as HTML
kids = body.xpath('./*') # Every child of the body
body.inner_html = "" # Empty the body now that we have our nodes
div = body.add_child('<div>').first # Create our first container in the body (add_child returns the new nodes)
kids.each do |node| # For every child that was in the body...
  if node.name=='hr'
    div = body.add_child('<div>').first # Create a new container for stuff
  else
    div << node # Move this into the last container
  end
end
div.remove unless div.child # Get rid of a trailing, empty div
puts body.inner_html
#=> <div>
#=> <h3>test</h3>
#=> <p>test</p>
#=> </div>
#=> <div>
#=> <h3>test2</h3>
#=> <p>test2</p>
#=> </div>

This is how I'd go about it:
require 'nokogiri'
html = '
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
'
doc = Nokogiri::HTML(html)
doc2 = Nokogiri::HTML('<body />')
doc2_body = doc2.at('body')
doc.search('//h3 | //p').each_slice(2) do |ns|
  nodeset = Nokogiri::XML::NodeSet.new(doc2, ns)
  div = Nokogiri::XML::Node.new('div', doc2)
  div.add_child(nodeset)
  doc2_body.add_child(div)
end
puts doc2_body.inner_html
# >> <div>
# >> <h3>test</h3>
# >> <p>test</p>
# >> </div>
# >> <div>
# >> <h3>test2</h3>
# >> <p>test2</p>
# >> </div>

Here's an answer that uses Ruby 1.9.2's Enumerable#chunk to split the children into sections and also exercises Nokogiri's NodeSet class:
require 'nokogiri'
doc = Nokogiri::HTML <<ENDHTML
<h3>test</h3>
<p>test</p>
<hr/>
<h3>test2</h3>
<p>test2</p>
<hr/>
ENDHTML
result = Nokogiri::XML::NodeSet.new( doc,
  doc.xpath('//body/*').chunk do |n|
    n.name=='hr'
  end.reject do |matched,nodes|
    matched
  end.map do |matched,nodes|
    doc.create_element('div').tap do |div|
      div << Nokogiri::XML::NodeSet.new( doc, nodes )
    end
  end )
puts result
#=> <div>
#=> <h3>test</h3>
#=> <p>test</p>
#=> </div>
#=> <div>
#=> <h3>test2</h3>
#=> <p>test2</p>
#=> </div>
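To see what chunk hands to the reject and map steps, here is a small sketch on plain tag names instead of nodes (the tags array is a stand-in made up for this example). Each chunk is a pair of the block's result and the run of consecutive items that produced it.

```ruby
# Tag names standing in for the children of <body>.
tags = %w[h3 p hr h3 p hr]

# chunk groups consecutive items by the block's return value.
chunks = tags.chunk { |name| name == 'hr' }.to_a
p chunks
#=> [[false, ["h3", "p"]], [true, ["hr"]], [false, ["h3", "p"]], [true, ["hr"]]]

# Drop the separator runs and keep only the grouped items.
sections = chunks.reject { |is_separator, _| is_separator }.map { |_, names| names }
p sections
#=> [["h3", "p"], ["h3", "p"]]
```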

Sorting a tree structure by folders first in Ruby

I have an array of paths, array = [
'a.txt',
'b/a.txt',
'a/a.txt',
'a/z/a.txt'
]
I need to create a tree structure (for the jTree plugin), but it has to be sorted by folders first (alphabetically) and then leafs (alphabetically too).
A sorted tree structure with the above example would look like this:
a
z
a.txt
a.txt
b
a.txt
a.txt
EDIT: I'm looking to build a tree of HTML lists and list items, where each node is an LI and, if it's a folder, it is followed by another UL as a sibling. This is one of the formats the jTree plugin takes as input. Structure for the above example:
<ul>
<li class="folder">a</li>
<ul>
<li class="folder">z</li>
<ul>
<li class="leaf">a.txt</li>
</ul>
</ul>
<li class="folder">b</li>
<ul>
<li class="leaf">a.txt</li>
</ul>
<li class="leaf">a.txt</li>
</ul>

This will build the tree structure as a hash tree:
array = ["home", "about", "about/history", "about/company", "about/history/part1", "about/history/part2"]
auto_hash = Hash.new{ |h,k| h[k] = Hash.new &h.default_proc }
array.each{ |path|
sub = auto_hash
path.split( "/" ).each{ |dir| sub[dir]; sub = sub[dir] }
}

require 'rubygems'
require 'builder'
paths = ["home", "about", "about/history", "about/company", "about/history/part1", "about/history/part2"]
auto_hash = Hash.new{ |h,k| h[k] = Hash.new &h.default_proc }
paths.each do |path|
sub = auto_hash
path.split( "/" ).each{ |dir| sub[dir]; sub = sub[dir] }
end
def build_branch(branch, xml)
  directories = branch.keys.reject{|k| branch[k].empty? }.sort
  leaves = branch.keys.select{|k| branch[k].empty? }.sort
  directories.each do |directory|
    xml.li(directory, :class => 'folder')
    xml.ul do
      build_branch(branch[directory], xml)
    end
  end
  leaves.each do |leaf|
    xml.li(leaf, :class => 'leaf')
  end
end
xml = Builder::XmlMarkup.new(:indent => 2)
xml.ul do
  build_branch(auto_hash, xml)
end
puts xml.target!
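
To check the folders-first ordering without the builder gem, the same hash tree can be rendered as indented text. This is a sketch using the array from the question. The render method is a name made up here, and it treats any key with an empty sub-hash as a leaf, as the code above does.

```ruby
paths = ['a.txt', 'b/a.txt', 'a/a.txt', 'a/z/a.txt']

# Auto-vivifying hash: reading a missing key creates a nested hash.
tree = Hash.new { |h, k| h[k] = Hash.new(&h.default_proc) }
paths.each { |path| path.split('/').inject(tree) { |node, part| node[part] } }

# Hypothetical helper: folders first (sorted), then leaves (sorted), at every level.
def render(branch, depth = 0, out = [])
  folders, leaves = branch.keys.sort.partition { |key| !branch[key].empty? }
  folders.each do |name|
    out << "#{'  ' * depth}#{name}"
    render(branch[name], depth + 1, out)
  end
  leaves.each { |name| out << "#{'  ' * depth}#{name}" }
  out
end

lines = render(tree)
puts lines
# prints:
# a
#   z
#     a.txt
#   a.txt
# b
#   a.txt
# a.txt
```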
