Nokogiri group flat structure - ruby

I have an HTML structure like:
<div class='content'>
<h2>Title</h2>
<p>Some content for Title</p>
<h2>Another Title</h2>
<p>Content for Another Title</p>
<p>Some more content for Another title</p>
<h2>Third</h2>
<p>Third Content</p>
</div>
I am trying to write code to output:
Title
- Some content for Title
Another Title
- Content for Another Title
- Some more content for Another title
Third
- Third Content
I had never used Nokogiri until five minutes ago, and all I can come up with so far is:
content = doc.at_css('.content')
content.css('h2').each do |node|
puts node.text
end
content.css('p').each do |node|
puts " - "
puts node.text
end
This obviously doesn't group the pieces together. How can I achieve my required grouping with Nokogiri?

You almost had it.
Here's how I would fix it.
content.css('h2').each do |node|
puts node.text
while node = node.at('+ p')
puts " - #{node.text}"
end
end
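Here + p is the CSS adjacent-sibling selector: it matches the p that immediately follows the current node, so the loop ends as soon as the next element is not a p.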

There are many ways to do it; here's one:
doc.at_css('.content').element_children.each do |node|
puts(node.name == "h2" ? node.text : " - #{node.text}")
end

Related

Why is the following Nokogiri/XPath code removing tags inside the node?

The document going in has a structure like this:
<span class="footnote">Hello there, link</span>
The XPath search is:
@doc = set_nokogiri(html)
footnotes = @doc.xpath(".//span[@class = 'footnote']")
footnotes.each_with_index do |footnote, index|
puts footnote
end
The above footnote becomes:
<span>Hello there, link</span>
I assume my XPath is wrong but I'm having a hard time figuring out why.
I had the wrong tag in the output and should have been more careful. The point being that the <a> tag is getting stripped but its contents are still included.
I also added the set_nokogiri line in case that's relevant.
I can't duplicate the problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span class="footnote">Hello there, link</span>
EOT
footnotes = doc.xpath(".//span[@class = 'footnote']")
footnotes.to_xml # => "<span class=\"footnote\">Hello there, link</span>"
footnotes.each do |f|
puts f
end
# >> <span class="footnote">Hello there, link</span>
An additional problem is that the <a> tag has an invalid href URL.
link
should be:
link

How to use Nokogiri to split content between successive h2 tags and wrap it under a chapter div

I want to split a document into "chapters". A chapter starts at an h2 and includes all siblings up to but not including the next h2 tag.
I.e. given this
<div id="content">
<h2>First</h2>
<p>one</p>
<h2>Second</h2>
<p>two</p>
<h2>Third</h2>
</div>
I want this
<div id="dad">
<div class="chapter">
<h2>First</h2>
<p>one</p>
</div>
<div class="chapter">
<h2>Second</h2>
<p>two</p>
</div>
<div class="chapter">
<h2>Third</h2>
</div>
</div>
Whilst I've used Nokogiri and XML to do some basic manipulation, I'm banging my head wondering how to first group the nodes into chapter blocks and then wrap them in place with the chapter div.
Can anyone help?
You should group your nodes by header (each header together with its related sibling nodes) and then transform the groups into the output format.
Here is a sketch of an algorithm for grouping the nodes:
array = [
:header,
:text,
:text,
:header,
:text,
:header,
:text,
:text,
]
groupped_array = array.reduce([]) do |res, item|
res.tap do
res << [] if item == :header
res.last << item
end
end
p groupped_array
Result:
➜ ruby group_nodes.rb
[[:header, :text, :text], [:header, :text], [:header, :text, :text]]
I think you can apply this to the Nokogiri nodes without much trouble and then transform the result into your output format.

Replace markup (as a string) including certain inline elements [closed]

My intent is to modify a sentence within a tag.
For example change:
<div id="1">
This is text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
To this:
<div id="1">
This is modified text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
That means I need to traverse the nodes and, for each tag, grab all of its text and inline style nodes (but not its child tags), modify the sentences, and put them back. I would need to do this for each tag with full text until all the content was modified.
For example grabbing the text and style nodes for div#1 would be:
"This is text in the TD with strong tags"
but as you can see, none of the other text underneath would be grabbed. It should be accessible and modifiable through a variable.
div#1.text_with_formating= "This is modified text in the TD with <strong> strong </strong> tags"
The code below removes all content, not just the child tags; keeping the contents instead leaves everything, including the tags under div#1. Therefore, I'm not sure how to proceed.
Sanitize.clean(h,{:elements => %w[b em i strong u],:remove_contents=>'true'})
How would you recommend solving this?
If you want to find all the text nodes underneath an element, use:
text_pieces = div.xpath('.//text()')
If you want to find only the text that is an immediate child of an element, use:
text_pieces = div.xpath('text()')
For each text node, you can change the content any way you like. Just be sure to assign the result with my_text_node.content = ... rather than calling my_text_node.content.gsub!(...), which only changes a temporary string.
# Replace text that is a direct child of an element
def gsub_my_text!( el, find, replace=nil, &block )
el.xpath('text()').each do |text|
next if text.content.strip.empty?
text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
end
end
# Replace text beneath an element.
def gsub_text!( el, find, replace=nil, &block )
el.xpath('.//text()').each do |text|
next if text.content.strip.empty?
text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
end
end
d1 = doc.at('#d1')
gsub_my_text!( d1, /[aeiou]+/ ){ |found| found.upcase }
puts d1
#=> <div id="d1">
#=> ThIs Is tExt In thE TD wIth <strong> strong </strong> tAgs
#=> <p>This is a child node. with <b> bold </b> tags</p>
#=> <div id="d2">
#=> "another line of text to a link "
#=> <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
#=> </div>
#=> </div>
gsub_text!( d1, /\w+/, '(\\0)' )
puts d1
#=> <div id="d1">
#=> (ThIs) (Is) (tExt) (In) (thE) (TD) (wIth) <strong> (strong) </strong> (tAgs)
#=> <p>(This) (is) (a) (child) (node). (with) <b> (bold) </b> (tags)</p>
#=> <div id="d2">
#=> "(another) (line) (of) (text) (to) (a) (link) "
#=> <p> (This) (is) (text) (inside) (a) (div) <em>(inside)<em> (another) (div) (inside) (a) (paragraph) (tag)</em></em></p>
#=> </div>
#=> </div>
Edit: Here is code that allows you to extract runs of text+inline markup as a string, run a gsub on that, and replace the result with new markup.
require 'nokogiri'
doc = Nokogiri.HTML '<div id="d1">
Text with <strong>strong</strong> tag.
<p>This is a child node. with <b>bold</b> tags.</p>
<div id=d2>And now we are in another div.</div>
Hooray for <em>me!</em>
</div>'
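module Enumerable
# http://stackoverflow.com/q/4800337/405017
# Split into runs, dropping every item for which the block is truthy.
# chunk discards items whose key is nil, so separators must map to nil.
def split_on() chunk{|o| !yield(o) || nil }.map{|_,run| run } end
end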
require 'set'
# Given a node, call gsub on the `inner_html`
def gsub_markup!( node, find, replace=nil, &replace_block )
allowed = Set.new(%w[strong b em i u strike])
runs = node.children.split_on{ |el| el.node_type==1 && !allowed.include?(el.name) }
runs.each do |nodes|
orig = nodes.map{ |node| node.node_type==3 ? node.content : node.to_html }.join
next if orig.strip.empty? # Skip whitespace-only nodes
result = replace ? orig.gsub(find,replace) : orig.gsub(find,&replace_block)
puts "I'm replacing #{orig.inspect} with #{result.inspect}" if $DEBUG
nodes[1..-1].each(&:remove)
nodes.first.replace(result)
end
end
d1 = doc.at('#d1')
$DEBUG = true
gsub_markup!( d1, /[aeiou]+/, &:upcase )
#=> I'm replacing "\n Text with <strong>strong</strong> tag.\n " with "\n TExt wIth <strOng>strOng</strOng> tAg.\n "
#=> I'm replacing "\n Hooray for <em>me!</em>\n" with "\n HOOrAy fOr <Em>mE!</Em>\n"
puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body><div id="d1">
#=> TExt wIth <strong>strOng</strong> tAg.
#=> <p>This is a child node. with <b>bold</b> tags.</p>
#=> <div id="d2">And now we are in another div.</div>
#=> HOOrAy fOr <em>mE!</em>
#=> </div></body></html>
The easiest way would be:
div = doc.at('div#1')
div.replace div.to_s.sub('text', 'modified text')

Nokogiri : Not able to access "p" tag text with "a" tag link

My HTML code is this :
<h3>Head1</h3>
<p>text before linkLink 1text after link</p>
<h3>Head2</h3>
<p>text before linkLink 2text after link</p>
<h3>Head3</h3>
<p>text before linkLink 3text after link</p>
I am using Nokogiri for the HTML parsing.
In the above case, suppose the HTML code is in @text:
@page_data = Nokogiri::HTML(@text)
@headings = @page_data.css('h3')
@desc = @page_data.css('p')
But @desc only gives me the text; it does not produce the links for "Link 1", "Link 2" and "Link 3".
Because each link sits in the middle of the text, I cannot add the link back separately.
How can i achieve the text with link in "p" tag in this case ?
Your question is not very clear about what you are trying to accomplish. If by this...
How can i achieve the text with link in "p" tag in this case?
...you mean, "How can I get the HTML contents of each <p> tag?" then this will do it:
require "nokogiri"
frag = Nokogiri::HTML.fragment(my_html)
frag.css('h3').each do |header|
puts header.text
para = header.next_element
puts para.inner_html
end
#=> Head1
#=> text before linkLink 1text after link
#=> Head2
#=> text before linkLink 2text after link
#=> Head3
#=> text before linkLink 3text after link
If instead you mean "How do I get just the text of the anchor in each paragraph?" then you could either do this:
frag.css('h3').each do |header|
anchor = header.next_element.at_css('a')
puts "#{header.text}: #{anchor.text}"
end
#=> Head1: Link 1
#=> Head2: Link 2
#=> Head3: Link 3
...or you could do this:
frag.xpath('.//p/a').each do |anchor|
puts anchor.text
end
#=> Link 1
#=> Link 2
#=> Link 3
If none of these are what you want, then please edit your question to more clearly explain what you want as the end result.

How do I get the next HTML element in Nokogiri?

Let's say my HTML document is like:
<div class="headline">News</div>
<p>Some interesting news here</p>
<div class="headline">Sports</div>
<p>Baseball is fun!</p>
I can get the headline divs with the following code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "mypage.html"
doc = Nokogiri::HTML(open(url))
doc.css(".headline").each do |item|
puts item.text
end
But how do I access the content in the following p tag so that News is related to Some interesting news here, etc?
You want Node#next_element:
doc.css(".headline").each do |item|
puts item.text
puts item.next_element.text
end
There is also item.next, but that can return text nodes as well, whereas item.next_element returns only element nodes (like p).

Resources