Replace markup (as a string) including certain inline elements [closed] - ruby

My intent is to modify a sentence within a tag.
For example change:
<div id="1">
This is text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
To this:
<div id="1">
This is modified text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
That means I need to traverse the nodes and, for each tag, grab all of its text and style (inline) nodes but not its child tags, modify the sentence, and put it back. I would need to do this for each tag with full text until all the content was modified.
For example grabbing the text and style nodes for div#1 would be:
"This is text in the TD with strong tags"
but as you can see, none of the other text underneath would be grabbed. It should be accessible and modifiable through a variable.
div#1.text_with_formating= "This is modified text in the TD with <strong> strong </strong> tags"
The code below removes all content, not just the child tags, while keeping contents leaves all content, even that of the tags under div#1. Therefore, I'm not sure how to proceed.
Sanitize.clean(h,{:elements => %w[b em i strong u],:remove_contents=>'true'})
How would you recommend solving this?

If you want to find all the text nodes underneath an element, use:
text_pieces = div.xpath('.//text()')
If you want to find only the text that is an immediate child of an element, use:
text_pieces = div.xpath('text()')
For each text node, you can change the content any way you like. Just be sure to assign with my_text_node.content = ... rather than calling my_text_node.content.gsub!(...), which only modifies a temporary copy of the string.
# Replace text that is a direct child of an element
def gsub_my_text!( el, find, replace=nil, &block )
  el.xpath('text()').each do |text|
    next if text.content.strip.empty?
    text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
  end
end
# Replace text beneath an element.
def gsub_text!( el, find, replace=nil, &block )
  el.xpath('.//text()').each do |text|
    next if text.content.strip.empty?
    text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
  end
end
d1 = doc.at('#d1')
gsub_my_text!( d1, /[aeiou]+/ ){ |found| found.upcase }
puts d1
#=> <div id="d1">
#=> ThIs Is tExt In thE TD wIth <strong> strong </strong> tAgs
#=> <p>This is a child node. with <b> bold </b> tags</p>
#=> <div id="d2">
#=> "another line of text to a link "
#=> <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
#=> </div>
#=> </div>
gsub_text!( d1, /\w+/, '(\\0)' )
puts d1
#=> <div id="d1">
#=> (ThIs) (Is) (tExt) (In) (thE) (TD) (wIth) <strong> (strong) </strong> (tAgs)
#=> <p>(This) (is) (a) (child) (node). (with) <b> (bold) </b> (tags)</p>
#=> <div id="d2">
#=> "(another) (line) (of) (text) (to) (a) (link) "
#=> <p> (This) (is) (text) (inside) (a) (div) <em>(inside)<em> (another) (div) (inside) (a) (paragraph) (tag)</em></em></p>
#=> </div>
#=> </div>
Edit: Here is code that allows you to extract runs of text+inline markup as a string, run a gsub on that, and replace the result with new markup.
require 'nokogiri'
doc = Nokogiri.HTML '<div id="d1">
Text with <strong>strong</strong> tag.
<p>This is a child node. with <b>bold</b> tags.</p>
<div id=d2>And now we are in another div.</div>
Hooray for <em>me!</em>
</div>'
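module Enumerable
  # http://stackoverflow.com/q/4800337/405017
  # Split into runs of consecutive items, dropping every item for which the
  # block is truthy (chunk discards items whose key is nil).
  def split_on() chunk{|o| yield(o) ? nil : true }.map{|_,run| run } end
end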
require 'set'
# Given a node, call gsub on each run of text + allowed inline markup among its children
def gsub_markup!( node, find, replace=nil, &replace_block )
  allowed = Set.new(%w[strong b em i u strike])
  runs = node.children.split_on{ |el| el.node_type==1 && !allowed.include?(el.name) }
  runs.each do |nodes|
    orig = nodes.map{ |n| n.node_type==3 ? n.content : n.to_html }.join
    next if orig.strip.empty? # Skip whitespace-only nodes
    result = replace ? orig.gsub(find,replace) : orig.gsub(find,&replace_block)
    puts "I'm replacing #{orig.inspect} with #{result.inspect}" if $DEBUG
    nodes[1..-1].each(&:remove)
    nodes.first.replace(result)
  end
end
d1 = doc.at('#d1')
$DEBUG = true
gsub_markup!( d1, /[aeiou]+/, &:upcase )
#=> I'm replacing "\n Text with <strong>strong</strong> tag.\n " with "\n TExt wIth <strOng>strOng</strOng> tAg.\n "
#=> I'm replacing "\n Hooray for <em>me!</em>\n" with "\n HOOrAy fOr <Em>mE!</Em>\n"
puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body><div id="d1">
#=> TExt wIth <strong>strOng</strong> tAg.
#=> <p>This is a child node. with <b>bold</b> tags.</p>
#=> <div id="d2">And now we are in another div.</div>
#=> HOOrAy fOr <em>mE!</em>
#=> </div></body></html>
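The key step in gsub_markup! is splitting the child nodes into runs of text and inline markup, separated by any other element. As a rough illustration of that splitting idea without Nokogiri, here is a pure-Ruby sketch. The helper name `split_runs` and the symbols standing in for nodes are hypothetical; it relies on `Enumerable#chunk` dropping items whose key is nil:

```ruby
# Hypothetical helper: split a list into runs of consecutive items,
# dropping every item for which the block is truthy (the "separators").
def split_runs(list)
  list.chunk { |item| yield(item) ? nil : true }.map { |_key, run| run }
end

# Symbols stand in for child nodes; :p and :div play the block-level elements.
nodes       = [:text, :strong, :text, :p, :div, :text, :em, :text]
block_level = %i[p div]

runs = split_runs(nodes) { |n| block_level.include?(n) }
p runs
#=> [[:text, :strong, :text], [:text, :em, :text]]
```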

The easiest way would be:
div = doc.at('div#1')
div.replace div.to_s.sub('text', 'modified text')

Related

How to use Nokogiri to split content between successive h2 tags and wrap it under a chapter div

I want to split a document into "chapters". A chapter starts at an h2 and includes all siblings up to, but not including, the next h2 tag.
I.e. given this
<div id="content">
<h2>First</h2>
<p>one</p>
<h2>Second</h2>
<p>two</p>
<h2>Third</h2>
</div>
I want this
<div id="dad">
<div class="chapter">
<h2>First</h2>
<p>one</p>
</div>
<div class="chapter">
<h2>Second</h2>
<p>two</p>
</div>
<div class="chapter">
<h2>Third</h2>
</div>
</div>
Whilst I've used Nokogiri and XML to do some basic manipulation, I'm banging my head wondering how to first group the nodes into chapter blocks and then wrap them in place with the chapter div.
Can anyone help?
You should group your nodes by header (each header together with its following sibling nodes) and then transform the groups into the output format.
Here is an idea for an algorithm to group the nodes:
array = [
  :header,
  :text,
  :text,
  :header,
  :text,
  :header,
  :text,
  :text,
]
# Start a new group at every :header and append each item to the latest group
grouped_array = array.reduce([]) do |res, item|
  res.tap do
    res << [] if item == :header
    res.last << item
  end
end
p grouped_array
Result:
➜ ruby group_nodes.rb
[[:header, :text, :text], [:header, :text], [:header, :text, :text]]
I think you can plug Nokogiri in here without much trouble and transform the result into your output format.

Nokogiri group flat structure

I have an HTML structure like:
<div class='content'>
<h2>Title</h2>
<p>Some content for Title</p>
<h2>Another Title</h2>
<p>Content for Another Title</p>
<p>Some more content for Another title</p>
<h2>Third</h2>
<p>Third Content</p>
</div>
I am trying to write code to output:
Title
- Some content for Title
Another Title
- Content for Another Title
- Some more content for Another title
Third
- Third Content
I had never used Nokogiri until five minutes ago; all I can come up with so far is:
content = doc.at_css('.content')
content.css('h2').each do |node|
  puts node.text
end
content.css('p').each do |node|
  puts " - "
  puts node.text
end
This obviously doesn't group the pieces together. How can I achieve my required grouping with Nokogiri?
You almost had it.
Here's how I would fix it.
content.css('h2').each do |node|
  puts node.text
  while node = node.at('+ p')
    puts " - #{node.text}"
  end
end
+ p selects the immediately following (adjacent) sibling, but only if it is a p, so the loop stops when it reaches the next h2.
There are many ways to do it; here's one:
doc.at_css('.content').element_children.each do |node|
  puts(node.name == "h2" ? node.text : " - #{node.text}")
end

Remove all nodes after a specified node [duplicate]

I'm grabbing a div of text from a URL and would like to remove everything underneath a paragraph which has a backtotop class. I've seen a traverse snippet here on Stack Overflow which looks promising, but I can't figure out how to incorporate it so that @el only contains everything up to the first p.backtotop in the div.
my code:
@doc = Nokogiri::HTML(open(url))
@el = @doc.css("div")[0]
traverse snippet:
doc = Nokogiri::HTML(code)
# at_css returns a single node; css returns a NodeSet, which never equals a node
stop_node = doc.at_css("p.backtotop")
doc.traverse do |node|
  break if node == stop_node
  # else, do whatever, e.g. `puts node.name`
end
Find the div you want.
Find the 'stop' item you want, and then find all the following siblings.
Remove them.
For example:
<body>
<div id="a">
<h2>My Section</h2>
<p class="backtotop">Back to Top</p>
<p>More Content</p>
<p>Even More Content</p>
</div>
</body>
require 'nokogiri'
doc = Nokogiri::HTML(my_html)
div = doc.at('#a')
div.at('.backtotop').xpath('following-sibling::*').remove
puts div
#=> <div id="a">
#=> <h2>My Section</h2>
#=> <p class="backtotop">Back to Top</p>
#=>
#=>
#=> </div>
Here's a more complicated example, where the backtotop item may not be at the root of the div:
<body>
<div id="b">
<h2>Another Section</h2>
<section>
<p class="backtotop">Back to Top</p>
<p>More Content</p>
</section>
<p>Even More Content</p>
</div>
</body>
require 'nokogiri'
doc = Nokogiri::HTML(my_html)
div = doc.at('#b')
n = div.at('.backtotop')
until n==div
  n.xpath('following-sibling::*').remove
  n = n.parent
end
puts div
#=> <div id="b">
#=> <h2>Another Section</h2>
#=> <section><p class="backtotop">Back to Top</p>
#=>
#=> </section>
#=> </div>
If your HTML is more complicated than the above then please provide an actual sample along with the result you want. This is good advice for any future question you ask.

capturing specific text between tags

The explanation is in the comment. I put it there because the markup is otherwise interpreted as bold or something, and it screws up the post.
# I need to capture text that is
# enclosed in tags that are both <b> and
# <i>, but if there is more than one
# text enclosed in <i> in the same <b>
# block, then I only want the text
# enclosed in the first <i> tag, For
# example, for the following line:
#
# <b> <i> Important text here </i>
# irrelevant text everywhere else <i>
# irrelevant text here </i> </b> <b>
# <i> Also Important </i> not important
# <i> not important </i> </b>
#
# I want to retrieve only:
# - Important text here
# - Also Important
#
# I also must not retrieve text inside an
# <h2> block. I have been trying to
# delete the block with nodes.delete(nodes.search('h2')),
# but it doesn't actually delete the h2 block
require "rubygems"
require "nokogiri"
html = <<EOT
<b><i> Important text here </i> more text <i> not important text here </i> </b>
<b> <i> Also Important </i> more text <i> not important </i> </b>
<h2><b> <i> I don't want this text either</i></b></h2>
EOT
doc = Nokogiri::HTML(html)
nodes = doc.search('b i')
nodes.each { |e| puts e }
# Expected output:
# Important text here
# Also Important
require "nokogiri"
require 'pp'
html = <<EOT
<b><i>Important text here</i>more text<i>not important text here</i></b>
<b><i>Also Important</i>more text<i>not important</i></b>
<h2><b><i>I don't want this text either</i></b></h2>
EOT
doc = Nokogiri::HTML(html)
nodes = doc.search('b')
nodes.each { |e| puts e.children.children.first unless e.parent.name == "h2" }
Or with XPath:
nodes = doc.xpath("//../*[local-name() != 'h2']/b/i[1]")
nodes.each { |e| puts e.children.first}

Parsing webpage with some html tags using Nokogiri

For example:
content=Nokogiri::HTML(open(url)).at_css(".appwindow").text
This example extracts only the text from .appwindow.
How can I get this text together with its <p> tags?
I think you want to find either the full HTML of the first element that has an appwindow class, or perhaps the inner HTML. If so:
require 'nokogiri'
html = Nokogiri::HTML <<ENDHTML
<div id='menu'>menu</div>
<div class='appwindow'><p>Hello <b>World</b>!</p></div>
ENDHTML
puts html.at_css('.appwindow').text
#=> Hello World!
puts html.at_css('.appwindow').to_html
#=> <div class="appwindow"><p>Hello <b>World</b>!</p></div>
puts html.at_css('.appwindow').inner_html
#=> <p>Hello <b>World</b>!</p>
See the list of methods on Nokogiri::XML::Node for other options available to you.

Resources