Nokogiri grab text with formatting and link tags, <em>,<strong>, <a>, etc - ruby

How can I recursively capture all the text with formatting tags using Nokogiri?
<div id="1">
This is text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
For example, I would like to capture:
"This is text in the TD with <strong> strong </strong> tags"
"This is a child node. with <b> bold </b> tags"
"another line of text to a link "
"This is text inside a div <em>inside<em> another div inside a paragraph tag"
I can't just use .text() because it strips the formatting tags and I'm not sure how to do it recursively.
ADDED DETAIL: Sanitize looks like an interesting gem, and I'm reading about it now. However, I have some added info that might clarify what I need to do.
I need to traverse each node, get the text, process it, and put it back. So I would grab the text from div#1, "This is text in the TD with strong tags", and modify it to something like "This is the modified text in the TD with strong tags". Then go to the next tag in div#1 and get its text, "This is a child node. with bold tags", modify it to "This is a modified child node. with bold tags.", and put it back. Then go to div#2 and grab the text, "another line of text to a link ", modify it to "another line of modified text to a link ", and put it back. Finally, go to the next node in div#2, grab the text from the paragraph tag, and modify it to "This is modified text inside a div inside another div inside a paragraph tag".
So after everything is processed, the new HTML should look like this:
<div id="1">
This is modified text in the TD with <strong> strong </strong> tags
<p>This is a modified child node. with <b> bold </b> tags</p>
<div id=2>
"another line of modified text to a link "
<p> This is modified text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
Below is my quasi-code, but I'm really stuck on two parts. First, grabbing just the text with its formatting: Sanitize helps with that, but it grabs all tags. I need to preserve the formatting of just the text (including spaces, etc.) without grabbing the unrelated child tags. Second, traversing down all the children that directly hold the full text.
# Quasi-code
doc = Nokogiri.HTML(html)
kids = doc.at('div#1')
text_kids = kids.descendant_elements
text_kids.each do |i|
  # grab full text (full sentences and paragraphs) with formatting tags
  # currently, I have no way to grab just the text with formatting and not the other tags
  modified_text = processing_code(i.full_text_w_formating())
  i.full_text_w_formating = modified_text
end
def processing_code(string)
  # code to process string (not relevant for this example)
  return modified_string
end
# Recursive 1
class Nokogiri::XML::Node
  def descendant_elements
    # This is flawed because it grabs every child and even
    # splits it based on any tag.
    # I need to traverse down only the text-related children.
    element_children.map { |kid|
      [kid, kid.descendant_elements]
    }.flatten
  end
end

I'd use two tactics, Nokogiri to extract the content you want, then a blacklist/whitelist program to strip tags you don't want or keep the ones you want.
require 'nokogiri'
require 'sanitize'
html = '
<div id="1">
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
'
doc = Nokogiri.HTML(html)
html_fragment = doc.at('div#1').to_html
This will capture the contents of <div id="1"> as an HTML string:
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id="2">
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
</div>
</strong></strong>
The trailing </strong></strong> is the result of two opening <strong> tags. That might be deliberate, but with no closing tags Nokogiri will do some fixup to make the HTML correct.
Passing html_fragment to the Sanitize gem:
doc = Sanitize.clean(
html_fragment,
:elements => %w[ a b em strong ],
:attributes => {
'a' => %w[ href ],
},
)
The returned text looks like:
This is text in the TD with <strong> strong <strong> tags
This is a child node. with <b> bold </b> tags
"another line of text to a link "
This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em>
</strong></strong>
Again, because the HTML was malformed with no closing </strong> tags, the two trailing closing tags are present.

Related

How can I xpath target text() "only" directly under a html tag, instead of the text contained under "other html child tags"

Consider
<li class="one">
<label class="two">
<span class="two-one">Unwanted text</span>
Wanted text only directly under label (not under span)
</label>
Two options :
If you have multiple lines to fetch :
//label/text()[normalize-space()]
If you have just one line to fetch, use position. For your sample data:
//label/text()[last()]
[last()] could be replaced with [1],[2],[3],... to specify the position of the text you're trying to get.

xpath:how to find a node that not contains text?

I have a html like:
...
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
...
I want to get the div that does not contain text, but the XPath
//div[@class='grid' and text()='']
doesn't seem to work. If I don't know the text that the other divs contain, how can I find the node?
Let's suppose I have inferred the requirement correctly as:
Find all <div> elements with @class='grid' that have no directly-contained non-whitespace text content, i.e. no non-whitespace text content unless it's within a child element like a <span>.
Then the answer to this is
//div[@class='grid' and not(text()[normalize-space(.)])]
You need a not() statement + normalize-space() :
//div[@class='grid' and not(normalize-space(text()))]
or
//div[@class='grid' and normalize-space(text())='']

Extract text and ignore next node

From this:
<span class="postbody">
<span style="color: #8e2fb6">
<span style="font-weight: bold">nickname</span>
</span>
<br>
Example text
<br>
Example text
<br>
<p class="signature">THIS IS WHAT I DO NOT WANT</p>
</span>
I want to extract:
<br>
Example text
<br>
Example text
<br>
I tried: span/text()[1] but it seems not to work. I always get unwanted p class. Is it even possible to do?
First you need to load your HTML string into an HtmlDocument or HtmlNode (using the .load() function).
The ChildNodes collection contains every child of your current node (basically every node under span.postbody).
After that, what you need to do is pretty obvious: just grab the #text and br nodes. Keep in mind that you will receive some #text nodes that contain only whitespace characters; you may want to filter those out of the result.
//load html to HtmlNode
node.ChildNodes.Where(n => n.Name.Equals("#text") || n.Name.Equals("br")) //It will return collection of HtmlNode
You can use the jQuery selector for postbody, then the .text method which should ignore the HTML. This will also ignore the .
$('.postbody').text();
An alternative would be to iterate through the children of $('.postbody').
'//text()[preceding-sibling::br and normalize-space()]'

How to extract inner text of multiple Paragraph tags which are nested withing an anchor tag

Here is the code:
<a id='Letter1'>
<p>Dear Sir, </p>
<p>This is with.........</p>
<p>I would be.......</p>
<p>Hoping to hear from you soon</p>
<p>Regards.</p>
</a>
Using Xpath I want to extract the inner text of all the Paragraph tags which are contained inside the anchor tag as a single text entity.
The final result i want is
string letterBody= document.DocumentNode.SelectSingleNode("//XPATH QUERY").innerText;
where letterBody="Dear Sir, This is with...................Regards."
You just need to get the <a> element; its inner text contains all the text nodes that are under <a>.
So your XPath would be /a[@id='Letter1'] or just /a.

select parent node containing text inside children's node

Basically, I want to select a node (div) whose child nodes (h1, b, h3) contain the specified text.
<html>
<div id="contents">
<p>
<h1> Child text 1</h1>
<b> Child text 2 </b>
...
</p>
<h3> Child text 3 </h3>
</div>
I am expecting /html/div/, not /html/div/h1.
I have the expression below, but unfortunately it returns the children instead of the div.
expression = "//div[contains(text(), 'Child text 1')]"
doc.xpath(expression)
So is there a way to do this simply with xpath syntax?
The following expression gives a node (div) in which any children nodes (not just h1,b,h3) contain specified text (not the div itself):
doc.xpath('//div[.//*[contains(text(), "Child text 1")]]')
You can refine that to return only the div with the id contents, as in your example:
doc.xpath('//div[@id="contents" and .//*[contains(text(), "Child text 1")]]')
It does not match if the text is a text node of the div itself (directly inside the div), which is my interpretation of the question.
You could append "/.." to anchor back to the parent. Not sure if there's a more robust method.
expression = "//div[contains(text(), 'Child text 1')]/.."
