Remove all nodes after a specified node [duplicate] - ruby

I'm grabbing a div of text from a URL and would like to remove everything after a paragraph that has a backtotop class. I've seen a traverse snippet here on Stack Overflow that looks promising, but I can't figure out how to incorporate it so that @el contains only the content up to the first p.backtotop in the div.
My code:
@doc = Nokogiri::HTML(open(url))
@el = @doc.css("div")[0]
traverse snippet:
doc = Nokogiri::HTML(code)
stop_node = doc.at_css("p.backtotop") # at_css returns one node; css would return a NodeSet
doc.traverse do |node|
break if node == stop_node
# else, do whatever, e.g. `puts node.name`
end

Find the div you want.
Find the 'stop' item you want, and then find all the following siblings.
Remove them.
For example:
<body>
<div id="a">
<h2>My Section</h2>
<p class="backtotop">Back to Top</p>
<p>More Content</p>
<p>Even More Content</p>
</div>
</body>
require 'nokogiri'
doc = Nokogiri::HTML(my_html)
div = doc.at('#a')
div.at('.backtotop').xpath('following-sibling::*').remove
puts div
#=> <div id="a">
#=> <h2>My Section</h2>
#=> <p class="backtotop">Back to Top</p>
#=>
#=>
#=> </div>
Here's a more complicated example, where the backtotop item may not be at the root of the div:
<body>
<div id="b">
<h2>Another Section</h2>
<section>
<p class="backtotop">Back to Top</p>
<p>More Content</p>
</section>
<p>Even More Content</p>
</div>
</body>
require 'nokogiri'
doc = Nokogiri::HTML(my_html)
div = doc.at('#b')
n = div.at('.backtotop')
until n==div
n.xpath('following-sibling::*').remove
n = n.parent
end
puts div
#=> <div id="b">
#=> <h2>Another Section</h2>
#=> <section><p class="backtotop">Back to Top</p>
#=>
#=> </section>
#=> </div>
If your HTML is more complicated than the above then please provide an actual sample along with the result you want. This is good advice for any future question you ask.

Related

Get instance no of a tag in nokogiri

I would like to get the instance number of a tag, i.e. whether a given node is the 1st, 2nd, 3rd, etc. instance of a given tag.
For example, if I call node.path on a node, I get the following output:
/html/head/base/link/body/div/br/form/hr/chapter[1]/section[1]/ul/li[1]/a
How do I get that 1 next to section?
require 'nokogiri'
html_string=<<END
<html>
<body>
<div>
<span></span>
<span></span>
<span></span>
<span>
<h1></h1>
</span>
<span></span>
<span>
<strong></strong>
</span>
<span></span>
<span></span>
</div>
</body>
</html>
END
doc = Nokogiri::HTML(html_string)
h1 = doc.xpath("/html/body/div/span/h1")[0]
puts h1.path # output => /html/body/div/span[4]/h1
puts h1.parent.xpath("preceding-sibling::*").size + 1 # output => 4
strong = doc.xpath("/html/body/div/span/strong")[0]
puts strong.path # output => /html/body/div/span[6]/strong
puts strong.parent.xpath("preceding-sibling::*").size + 1 # output => 6

Watir::ElementCollection click action in loop

Going off the example from the documentation http://www.rubydoc.info/gems/watir-webdriver/0.6.11/Watir/ElementCollection#each-instance_method, I am trying to click each element on the page that has the same class.
This is a code snippet of what I've come up with so far:
@b.divs(:class => 'portal-thumbnail-card').each do |div|
@b.div(:class => 'portal-thumbnail-card').click
puts 'foo'
# the puts statement outputs 'foo' 6 times (matching the number of elements with that class),
# but right now this only clicks on the FIRST element
end
Are click actions possible inside the loop, given that this doesn't involve any page reloading?
The problem is that you are locating the div to click during each iteration of the loop. In English, your code actually says, "for each div element with the class 'portal-thumbnail-card', click the first div on the page with class 'portal-thumbnail-card'."
What you actually want to do is click the div element that is the subject of each iteration:
@b.divs(:class => 'portal-thumbnail-card').each do |div|
div.click
puts 'foo'
end
The divs method returns a Watir::DivCollection, which is a collection of Watir::Div objects. For example:
require 'watir-webdriver'
b = Watir::Browser.new
b.goto('http://example.org')
divs = b.divs
puts divs.class
#=> Watir::DivCollection
divs.each { |d| puts d.class}
#=> Watir::Div
So, within your iterator, you want to refer to the block-local variable (i.e. div.click) instead of going back to the browser's instance variable (i.e. @b.div(:class => 'portal-thumbnail-card').click).
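The same mistake can be reproduced without a browser. The sketch below uses a plain Struct named Card as a stand-in for a page element. It is not a Watir class, and the names are made up. It only shows why re-locating the first match inside the loop hits the same element every time.

```ruby
# Stand-in for a page element; NOT a Watir class.
Card = Struct.new(:id, :clicks) do
  def click
    self.clicks += 1
  end
end

cards = Array.new(3) { |i| Card.new(i + 1, 0) }

# Mistake: the loop runs three times but always clicks the first card.
cards.each { |_card| cards.first.click }
wrong = cards.map(&:clicks) # => [3, 0, 0]

cards.each { |c| c.clicks = 0 } # reset the counters

# Fix: click the element handed to the block.
cards.each { |card| card.click }
right = cards.map(&:clicks) # => [1, 1, 1]

p wrong, right
```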
Use the flash method to see which element you are actually trying to click:
require 'watir-webdriver'
browser = Watir::Browser.new
browser.goto "data:text/html,#{DATA.read}"
browser.divs(:class => 'portal-thumbnail-card').each do |div|
# browser.div(:class => 'portal-thumbnail-card').flash # your variant: always flashes the first div
div.flash # correct variant: flashes the div for this iteration
puts 'foo'
end
browser.close
__END__
<html>
<div class='portal-thumbnail-card'>
<button id="button1">Button 1</button>
</div>
<div class='portal-thumbnail-card'>
<button id="button2">Button 2</button>
</div>
<div class='portal-thumbnail-card'>
<button id="button3">Button 3</button>
</div>
<div class='portal-thumbnail-card'>
<button id="button4">Button 4</button>
</div>
<div class='portal-thumbnail-card'>
<button id="button5">Button 5</button>
</div>
</html>

How to use Nokogiri to split content between successive h2 tags and wrap it under a chapter div

I want to split a document into "chapters". A chapter starts at an h2 and includes all siblings up to, but not including, the next h2 tag.
I.e. given this
<div id="content">
<h2>First</h2>
<p>one</p>
<h2>Second</h2>
<p>two</p>
<h2>Third</h2>
</div>
I want this
<div id="dad">
<div class="chapter">
<h2>First</h2>
<p>one</p>
</div>
<div class="chapter">
<h2>Second</h2>
<p>two</p>
</div>
<div class="chapter">
<h2>Third</h2>
</div>
</div>
Whilst I've used Nokogiri and XML to do some basic manipulation, I'm banging my head wondering how to first group the nodes into chapter blocks and then wrap them in place with the chapter div.
Can anyone help?
You should group your nodes by header (each header together with the sibling nodes that follow it) and then transform the groups into the output format.
Here is a sketch of an algorithm for grouping the nodes:
array = [
:header,
:text,
:text,
:header,
:text,
:header,
:text,
:text,
]
grouped_array = array.reduce([]) do |res, item|
res.tap do
res << [] if item == :header # a header starts a new group
res.last << item
end
end
p grouped_array
Result:
➜ ruby group_nodes.rb
[[:header, :text, :text], [:header, :text], [:header, :text, :text]]
I think you can apply this to Nokogiri nodes without much trouble and transform the result into your output format.

Replace markup (as a string) including certain inline elements [closed]

My intent is to modify a sentence within a tag.
For example change:
<div id="1">
This is text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
To this:
<div id="1">
This is modified text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
This means I need to traverse the nodes, taking each tag and collecting all of its text and inline style nodes without its child tags, then modifying the sentences and putting them back. I would need to do this for each tag with full text until all the content has been modified.
For example grabbing the text and style nodes for div#1 would be:
"This is text in the TD with strong tags"
but as you can see, none of the other text underneath would be grabbed. It should be accessible and modifiable through a variable.
div#1.text_with_formating= "This is modified text in the TD with <strong> strong </strong> tags"
The code below removes all content, not just the child tags, while keeping contents leaves everything in place, including the tags under div#1. I'm therefore not sure how to proceed.
Sanitize.clean(h,{:elements => %w[b em i strong u],:remove_contents=>'true'})
How would you recommend solving this?
If you want to find all the text nodes underneath an element, use:
text_pieces = div.xpath('.//text()')
If you want to find only the text that is an immediate child of an element, use:
text_pieces = div.xpath('text()')
For each text node, you can change the content any way you like. Just be sure to use my_text_node.content = ... instead of my_text_node.content.gsub!(...), because content returns a new string each time, so mutating that string in place does not change the document.
# Replace text that is a direct child of an element
def gsub_my_text!( el, find, replace=nil, &block )
el.xpath('text()').each do |text|
next if text.content.strip.empty?
text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
end
end
# Replace text beneath an element.
def gsub_text!( el, find, replace=nil, &block )
el.xpath('.//text()').each do |text|
next if text.content.strip.empty?
text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
end
end
d1 = doc.at('#d1')
gsub_my_text!( d1, /[aeiou]+/ ){ |found| found.upcase }
puts d1
#=> <div id="d1">
#=> ThIs Is tExt In thE TD wIth <strong> strong </strong> tAgs
#=> <p>This is a child node. with <b> bold </b> tags</p>
#=> <div id="d2">
#=> "another line of text to a link "
#=> <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
#=> </div>
#=> </div>
gsub_text!( d1, /\w+/, '(\\0)' )
puts d1
#=> <div id="d1">
#=> (ThIs) (Is) (tExt) (In) (thE) (TD) (wIth) <strong> (strong) </strong> (tAgs)
#=> <p>(This) (is) (a) (child) (node). (with) <b> (bold) </b> (tags)</p>
#=> <div id="d2">
#=> "(another) (line) (of) (text) (to) (a) (link) "
#=> <p> (This) (is) (text) (inside) (a) (div) <em>(inside)<em> (another) (div) (inside) (a) (paragraph) (tag)</em></em></p>
#=> </div>
#=> </div>
Edit: Here is code that allows you to extract runs of text+inline markup as a string, run a gsub on that, and replace the result with new markup.
require 'nokogiri'
doc = Nokogiri.HTML '<div id="d1">
Text with <strong>strong</strong> tag.
<p>This is a child node. with <b>bold</b> tags.</p>
<div id=d2>And now we are in another div.</div>
Hooray for <em>me!</em>
</div>'
module Enumerable
# http://stackoverflow.com/q/4800337/405017
# Splits into runs of consecutive items, dropping every item for which the block is true.
def split_on() chunk{|o| !yield(o) || nil }.map{|_,run| run } end
end
require 'set'
# Given a node, call gsub on the `inner_html`
def gsub_markup!( node, find, replace=nil, &replace_block )
allowed = Set.new(%w[strong b em i u strike])
runs = node.children.split_on{ |el| el.node_type==1 && !allowed.include?(el.name) }
runs.each do |nodes|
orig = nodes.map{ |node| node.node_type==3 ? node.content : node.to_html }.join
next if orig.strip.empty? # Skip whitespace-only nodes
result = replace ? orig.gsub(find,replace) : orig.gsub(find,&replace_block)
puts "I'm replacing #{orig.inspect} with #{result.inspect}" if $DEBUG
nodes[1..-1].each(&:remove)
nodes.first.replace(result)
end
end
d1 = doc.at('#d1')
$DEBUG = true
gsub_markup!( d1, /[aeiou]+/, &:upcase )
#=> I'm replacing "\n Text with <strong>strong</strong> tag.\n " with "\n TExt wIth <strOng>strOng</strOng> tAg.\n "
#=> I'm replacing "\n Hooray for <em>me!</em>\n" with "\n HOOrAy fOr <Em>mE!</Em>\n"
puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body><div id="d1">
#=> TExt wIth <strong>strOng</strong> tAg.
#=> <p>This is a child node. with <b>bold</b> tags.</p>
#=> <div id="d2">And now we are in another div.</div>
#=> HOOrAy fOr <em>mE!</em>
#=> </div></body></html>
The easiest way would be:
div = doc.at('div#1')
div.replace div.to_s.sub('text', 'modified text')

Parsing webpage with some html tags using Nokogiri

For example:
content=Nokogiri::HTML(open(url)).at_css(".appwindow").text
This example extracts only the text from .appwindow.
How can I get this text together with its <p> tags?
I think you want to find either the full HTML of the first element that has an appwindow class, or perhaps the inner HTML. If so:
require 'nokogiri'
html = Nokogiri::HTML <<ENDHTML
<div id='menu'>menu</div>
<div class='appwindow'><p>Hello <b>World</b>!</p></div>
ENDHTML
puts html.at_css('.appwindow').text
#=> Hello World!
puts html.at_css('.appwindow').to_html
#=> <div class="appwindow"><p>Hello <b>World</b>!</p></div>
puts html.at_css('.appwindow').inner_html
#=> <p>Hello <b>World</b>!</p>
See the list of methods on Nokogiri::XML::Node for other options available to you.
