Nokogiri: non-destructively print node without its children - ruby

I have a piece of ruby code to replace the value of an attribute:
# -*- coding: utf-8 -*-
require "nokogiri"
xml = <<-eos
<a blubb="blah">
<b>irrelevant</b>
<b>also irrelevant</b>
<b blubb="blah">
<c>irrelevant</c>
<c>irrelevant</c>
</b>
<b blubb="foo">
<c>irrelevant</c>
<c>irrelevant</c>
</b>
</a>
eos
doc = Nokogiri::XML(xml) { |config| config.noent }
doc.xpath("//*[#blubb='blah']").each {|node|
puts "Node before:\n#{node.to_s}" ## replace me!
node['blubb'] = "NEW"
puts "Node after:\n#{node.to_s}" ## replace me!
}
When i execute this code, i get the whole node element printed, but I only need to see the start tag to confirm that my script works correctly. Is there a way to display only the start tags of node, or at least only the element itself without its child nodes? The important thing is that the node itself doesn't change when printed (beside the replacement in the attribute), so removing the children is not an option!

We can print name and attribute_nodes of the node
doc.xpath("//*[#blubb='blah']").each {|node|
puts "Node before:\n #{node.name} "+node.attribute_nodes.reduce('') { |out, n| out+="#{n.name}=#{n.value}'"}
node['blubb'] = "NEW"
puts "Node after:\n #{node.name} "+node.attribute_nodes.reduce('') { |out, n| out+="#{n.name}='#{n.value}'"}
}

Related

Nokogiri use of DOT '.' notation for XPATH

All my searches indicate that using the './/' in xpath should start the next search at the current node. In code below I thought I should return only the first h3 element "Top Level" and the search should terminate, instead I also return the second h3 tag which is in another node entirely. What am I missing.
=begin Sample HTML contains
<DIV><DIV><DIV id ='1'><h3> Top Level </h3></DIV></DIV></DIV>
<DIV><DIV><DIV id ='2'><h3> Bottom Level </h3></DIV></DIV></DIV>
=end
require 'rubygems'
require 'nokogiri'
page = Nokogiri::HTML(open("Sample.html"))
el = page.xpath("html/body/div/div/div[#id='1']") # set postion in tree
puts el.inspect
=begin
[#<Nokogiri::XML::Element:0x1990410 name="div" attributes= [#<Nokogiri::XML::Attr:0x1990200 name="id" value="1">]`
children=[#<Nokogiri::XML::Text:0x197dda4 " \r\n\t\t\t">, #<Nokogiri::XML::Element:0x197dcc0 name="h3"
children=[#<Nokogiri::XML::Text:0x197da74 " Top Level ">]>, #<Nokogiri::XML::Text:0x197d768 "\r\n
=end
el = page.xpath(".//h3")
puts el.inspect
=begin
[ #<Nokogiri::XML::Element:0x197dcc0 name="h3" children=[#<Nokogiri::XML::Text:0x197da74 " Top Level ">]>,
#<Nokogiri::XML::Element:0x197c37c name="h3" children=[#<Nokogiri::XML::Text:0x197c190 " Bottom Level ">]>]
=end

Nokogiri children method

I have the following XML here:
<listing>
<seller_info>
<payment_types>Visa, Mastercard, , , , 0, Discover, American Express </payment_types>
<shipping_info>siteonly, Buyer Pays Shipping Costs </shipping_info>
<buyer_protection_info/>
<auction_info>
<bid_history>
<item_info>
</listing>
The following code works fine for displaying first child of the first //listing node:
require 'nokogiri'
require 'open-uri'
html_data = open('http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.xml')
nokogiri_object = Nokogiri::XML(html_data)
listing_elements = nokogiri_object.xpath("//listing")
puts listing_elements[0].children[1]
This also works:
puts listing_elements[0].children[3]
I tried to access the second node <payment_types> with the the following code:
puts listing_elements[0].children[2]
but a blank line was displayed. Looking through Firebug, it is clearly the 2nd child of the listing node. In general, only odd numbers work with the children method.
Is this a bug in Nokogiri? Any thoughts?
It's not a bug, its the space created while parsing strings that contain "\n" (or empty nodes), but you could use the noblanks option to avoid them:
nokogiri_object = Nokogiri::XML(html_data) { |conf| conf.noblanks }
Use that and you will have no blanks in your array.
The problem is you are not parsing the document correctly. children returns more than you think, and its use is painting you into a corner.
Here's a simplified example of how I'd do it:
require 'nokogiri'
doc = Nokogiri::XML(DATA.read)
auctions = doc.search('listing').map do |listing|
seller_info = listing.at('seller_info')
auction_info = listing.at('auction_info')
hash = [:seller_name, :seller_rating].each_with_object({}) do |s, h|
h[s] = seller_info.at(s.to_s).text.strip
end
[:current_bid, :time_left].each do |s|
hash[s] = auction_info.at(s.to_s).text.strip
end
hash
end
__END__
<?xml version='1.0' ?>
<!DOCTYPE root SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.dtd">
<root>
<listing>
<seller_info>
<seller_name>537_sb_3 </seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $839.93</current_bid>
<time_left> 1 Day, 6 Hrs</time_left>
</auction_info>
</listing>
<listing>
<seller_info>
<seller_name> lapro8</seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $210.00</current_bid>
<time_left> 4 Days, 21 Hrs</time_left>
</auction_info>
</listing>
</root>
After running, auctions will be:
auctions
# => [{:seller_name=>"537_sb_3",
# :seller_rating=>"0",
# :current_bid=>"$839.93",
# :time_left=>"1 Day, 6 Hrs"},
# {:seller_name=>"lapro8",
# :seller_rating=>"0",
# :current_bid=>"$210.00",
# :time_left=>"4 Days, 21 Hrs"}]
Notice there are no empty text nodes to deal with because I told Nokogiri exactly which nodes to grab text from. You should be able to extend the code to grab any information you want easily.
A typically formatted XML or HTML document that displays nesting or indentation uses text nodes to provide that indenting:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
Here's what your code is seeing:
doc.at('body').children.map(&:to_html)
# => ["\n" +
# " ", "<p>foo</p>", "\n" +
# " "]
The Text nodes are what are confusing you:
doc.at('body').children.first.class # => Nokogiri::XML::Text
doc.at('body').children.first.text # => "\n "
If you don't drill down far enough you will pick up the Text nodes and have to clean up the results:
doc.at('body')
.text # => "\n foo\n "
.strip # => "foo"
Instead, explicitly find the node you want and extract the information:
doc.at('body p').text # => "foo"
In the suggested code above I used strip because the incoming XML had spaces surrounding some text:
h[s] = seller_info.at(s.to_s).text.strip
which is the result of the original XML creation code not cleaning the lines prior to generating the XML. So sometimes we have to clean up their mess, but the proper accessing of the node can reduce that a lot.
The problem is that children includes text nodes such as the whitespace between elements. If instead you use element_children you get just the child elements (i.e. the contents of the tags, not the surrounding whitespace).

Convert HTML to plain text (with inclusion of <br>s)

Is it possible to convert HTML with Nokogiri to plain text? I also want to include <br /> tag.
For example, given this HTML:
<p>ala ma kota</p> <br /> <span>i kot to idiota </span>
I want this output:
ala ma kota
i kot to idiota
When I just call Nokogiri::HTML(my_html).text it excludes <br /> tag:
ala ma kota i kot to idiota
Instead of writing complex regexp I used Nokogiri.
Working solution (K.I.S.S!):
def strip_html(str)
document = Nokogiri::HTML.parse(str)
document.css("br").each { |node| node.replace("\n") }
document.text
end
Nothing like this exists by default, but you can easily hack something together that comes close to the desired output:
require 'nokogiri'
def render_to_ascii(node)
blocks = %w[p div address] # els to put newlines after
swaps = { "br"=>"\n", "hr"=>"\n#{'-'*70}\n" } # content to swap out
dup = node.dup # don't munge the original
# Get rid of superfluous whitespace in the source
dup.xpath('.//text()').each{ |t| t.content=t.text.gsub(/\s+/,' ') }
# Swap out the swaps
dup.css(swaps.keys.join(',')).each{ |n| n.replace( swaps[n.name] ) }
# Slap a couple newlines after each block level element
dup.css(blocks.join(',')).each{ |n| n.after("\n\n") }
# Return the modified text content
dup.text
end
frag = Nokogiri::HTML.fragment "<p>It is the end of the world
as we
know it<br>and <i>I</i> <strong>feel</strong>
<a href='blah'>fine</a>.</p><div>Capische<hr>Buddy?</div>"
puts render_to_ascii(frag)
#=> It is the end of the world as we know it
#=> and I feel fine.
#=>
#=> Capische
#=> ----------------------------------------------------------------------
#=> Buddy?
Try
Nokogiri::HTML(my_html.gsub('<br />',"\n")).text
Nokogiri will strip out links, so I use this first to preserve links in the text version:
html_version.gsub!(/<a href.*(http:[^"']+).*>(.*)<\/a>/i) { "#{$2}\n#{$1}" }
that will turn this:
link to google
to this:
link to google
http://google.com
If you use HAML you can solve html converting by putting html with 'raw' option, f.e.
= raw #product.short_description

Get text of a paragraph with all the markup (and their content) removed

How can I get only the text of the node <p> which has other tags in it like:
<p>hello my website is click here <b>test</b></p>
I only want "hello my website is"
This is what I tried:
begin
node = html_doc.css('p')
node.each do |node|
node.children.remove
end
return (node.nil?) ? '' : node.text
rescue
return ''
end
Update 2: all right, well you are removing all children with node.children.remove, including the text nodes, a proposed solution might look like:
# 1. select all <p> nodes
doc.css('p').
# 2. map children, and flatten
map { |node| node.children }.flatten.
# 3. select text nodes only
select { |node| node.text? }.
# 4. get text and join
map { |node| node.text }.join(' ').strip
This sample returns "hello my website is", but note that doc.css('p') als finds <p> tags within <p> tags.
Update: sorry, misread your question, you only want "hello my website is", see solution above, original answer:
Not directly with nokogiri, but the sanitize gem might be an option: https://github.com/rgrove/sanitize/
Sanitize.clean(html, {}) # => " hello my website is click here test "
FYI, it uses nokogiri internally.
Your test case did not include any interesting text interleaved with the markup.
If you want to turn <p>Hello <b>World</b>!</p> into "Hello !", then removing the children is one way to do it. Simpler (and less destructive) is to just find all the text nodes and join them:
require 'nokogiri'
html = Nokogiri::HTML('<p>Hello <b>World</b>!</p>')
# Find the first paragraph (in this case the only one)
para = html.at('p')
# Find all the text nodes that are children (not descendants),
# change them from nodes into the strings of text they contain,
# and then smush the results together into one big string.
p para.search('text()').map(&:text).join
#=> "Hello !"
If you want to turn <p>Hello <b>World</b>!</p> into "Hello " (no exclamation point) then you can simply do:
p para.children.first.text # if you know that text is the first child
p para.at('text()').text # if you want to find the first text node
As #Iwe showed, you can use the String#strip method to removing leading/trailing whitespace from the result, if you like.
There's a different way to go about this. Rather than bother with removing nodes, remove the text that those nodes contain:
require 'nokogiri'
doc = Nokogiri::HTML('<p>hello my website is click here <b>test</b></p>')
text = doc.search('p').map{ |p|
p_text = p.text
a_text = p.at('a').text
p_text[a_text] = ''
p_text
}
puts text
>>hello my website is test
This is a simple example, but the idea is to find the <p> tags, then scan inside those for the tags that contain the text you don't want. For each of those unwanted tags, grab their text and delete it from the surrounding text.
In the sample code, you'd have a list of undesirable nodes at the a_text assignment, loop over them, and iteratively remove the text, like so:
text = doc.search('p').map{ |p|
p_text = p.text
%w[a].each do |bad_nodes|
bad_nodes_text = p.at(bad_nodes).text
p_text[bad_nodes_text] = ''
end
p_text
}
You get back text which is an array of the tweaked text contents of the <p> nodes.

capturing specific text between tags

The explanation is in the comment. I put it there because is interpreted as bold or something, and it screws up the post.
# I need to capture text that is
# enclosed in tags that are both <b> and
# <i>, but if there is more than one
# text enclosed in <i> in the same <b>
# block, then I only want the text
# enclosed in the first <i> tag, For
# example, for the following line:
#
# <b> <i> Important text here </i>
# irrelevant text everywhere else <i>
# irrelevant text here </i> </b> <b>
# <i> Also Important </i> not important
# <i> not important </i> </b>
#
# I want to retrieve only:
# - Important text here
# - Also Important
#
# I also must not retrieve text inside an
# <h2> block. I have been trying to
# delete the block with nodes.delete(nodes. search('h2')),
# but it doesn't actually delete the h2 block
require "rubygems"
require "nokogiri"
html = <<EOT
<b><i> Important text here </i> more text <i> not important text here </i> </b>
<b> <i> Also Important </i> more text <i> not important </i> </b>
<h2><b> <i> I don't want this text either</i></b></h2>
EOT
doc = Nokogiri::HTML(html)
nodes = doc.search('b i')
nodes.each { |e| puts e }
# Expected output:
# Important text here
# Also Important
require "nokogiri"
require 'pp'
html = <<EOT
<b><i>Important text here</i>more text<i>not important text here</i></b>
<b><i>Also Important</i>more text<i>not important</i></b>
<h2><b><i>I don't want this text either</i></b></h2>
EOT
doc = Nokogiri::HTML(html)
nodes = doc.search('b')
nodes.each { |e| puts e.children.children.first unless e.parent.name == "h2" }
or with xpath:
nodes = doc.xpath("//../*[local-name() != 'h2']/b/i[1]")
nodes.each { |e| puts e.children.first}

Resources