How to sanitalize string with nested html tags but keep <em> tag? - ruby

I am trying to sanitalize Solr search results, cause it has html tags inside:
ActionController::Base.helpers.sanitize( result_string )
It is easy to sanitalize not highlighted string like: I know <ul><li>ruby</li> <li>rails</li></ul>.
But when results is highlighted I have additional important tags inside - <em> and </em>:
I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>.
So, when I sanitalize string with nested html and highlighting tags, I get string with peaces of htmls tags. And it is bad :)
How can I sanitalize highlighted string with <em> tags inside to get correct result (string with <em> tags only)?
I found the way, but it's slow and not pretty:
string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'
['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag|
string.gsub!( "<<em>#{tag}</em>>", '' )
string.gsub!( "</<em>#{tag}</em>>", '' )
end
string = ActionController::Base.helpers.sanitize string, tags: %w(em)
How can I optimize it or do it using some better solution?
to write some regex and remove html_tags, but keep <em> and </em> e.g.
Please help, thanks.

You could call gsub! to discard all tags but keep only tags that are independent, or that are not included in html tag.
result_string.gsub!(/(<\/?[^e][^m]>)|(<<em>\w*<\/em>>)|(<\/<em>\w*<\/em>>)/, '')
would do the trick
To explain:
# first group (<\/?[^e][^m]>)
# find all html tags that are not <em> or </em>
# second group (<<em>\w*<\/em>>)
# find all opening tags that have <em> </em> inside of them like:
# <<em>li</em>> or <<em>ul</em>>
# third group (<\/<em>\w*<\/em>>)
# find all closing tags that have <em> </em> inside of them:
# </<em>li</em>> or </<em>ul</em>>
# and gsub replaces all of this with empty string

I think you can use the sinitize:
Custom Use (only the mentioned tags and attributes are allowed, nothing else)
<%= sanitize #article.body, tags: %w(table tr td), attributes: %w(id class style) %>
So, something like that should work:
sanitize result_string, tags: %w(em)

With an additional parameter to sanitize, you can specify which tags are allowed.
In your example, try:
ActionController::Base.helpers.sanitize( result_string, tags: %w(em) )
It should do the trick

Related

Xpath getting text with mixed elements in same div

Here is some sample HTML
<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements
<p> Next paragraph....
</div>
what I tried was
//div[contains('#class','something')/text()
and
//div[contains('#class','something')/*/text()
and
//div[contains('#class','something')/p/text()
all of these seem to skip the 'b' tags and the 'a' tags.
Try " ".join(sel.xpath("//div[contains(#class,'something')]//text()").extract()) where sel is selector in your case may be response.
Use the XPath expression
//div[contains(#class,'something')]//text()
to get a concatenation of the text of all the text() nodes in the chosen div element.
Output:
This is a Paragraph with mixed elements
Next paragraph....
It depends on what and how you want to obtain. Anyway, there are couple of problems with what you tried:
You are missing closing bracket (]) after contains in the XPath expression.
#class should not be enclosed in (single) quotes when used inside contains.
If you want to get all the text of div element as one string, you might use
normalize-space(//div[contains(#class,'something')])

How can I extract the node names for fragmented XML document using Ruby?

I an XML-like document which is pre-processed by a system out of my control. The format of the document is like this:
<template>
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
</template>
However, I only get as a text string what is between the <template> tags.
I would like to be able to extract without specifying the tags ahead of time when parsing. I can do this with the Crack gem but only if the tags are at the end of the string and there is only one.
With Crack, I can put a string like
string = "<SETPROFILE><NAME>email</NAME><VALUE>go#go.com</VALUE></SETPROFILE>"
and my output from Crack is:
{"SETPROFILE"=>{"NAME"=>"email", "VALUE"=>"go#go.com"}}
Then I can use a case statement for the possible values I care about.
Given that I need to have multiple <tags> in the string and they cannot be at the end of the string, how can I parse out the node names and the values easily, similar to what I do with crack?
These tags also need to be removed. I would like to continue to use the excellent suggestion from #TinMan.
It works perfectly once I know the name of the tag. The number of tags will be finite. I send the tag to the appropriate method once I know it, but it needs to get parsed out easily first.
Using Nokogiri, you can treat the string as a DocumentFragment, then find the embedded nodes:
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse(<<EOT)
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
EOT
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n] = n.text
}
nodes # => {#<Nokogiri::XML::Element:0x3ff96083b744 name="RECALL" children=[#<Nokogiri::XML::Text:0x3ff96083a09c "first_name">]>=>"first_name", #<Nokogiri::XML::Element:0x3ff96083b5c8 name="SETPROFILE" children=[#<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>, #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>=>"", #<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">=>""}
Or, more legibly:
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n.name] = n.text
}
nodes # => {"RECALL"=>"first_name", "SETPROFILE"=>"email", "NAME"=>"email", "VALUE"=>"", "star"=>""}
Getting the content of a particular tag is easy then:
nodes['RECALL'] # => "first_name"
Iterating over all the tags is also easy:
nodes.keys.each do |k|
...
end
You can even replace a tag and its content with text:
doc.at('RECALL').replace('Fred')
doc.to_xml # => "Hello, there Fred. Thanks for giving me your email. \n<SETPROFILE>\n <NAME>email</NAME>\n <VALUE>\n <star/>\n </VALUE>\n</SETPROFILE>. I have just sent you something.\n"
How to replace the nested tags is left to you as an exercise.

get node text() with or without anchor tag

I can't figure out how to get a table cell's text() whether or not an anchor tag is parent to the text.
WITH:
<td class="c divComms" title="Komentarz|">
<a id="List1_Dividends_ctl01_HyperLink1" target="_blank" href="http://www.attrader.pl/pl/akcje/DRUKPAK/komunikat/EBI/none,20130104_090845_0000041461">uchwalona</a>
<div class="stcm">2013-01-29</div></td>
WITHOUT:
<td class="c divComms" title="Komentarz|Celem...">
proponowana
<div class="stcm">2012-10-05</div>
</td>
Composing elements of a hash, I would expect
details = rows.collect do |row|
detail = {}
[
[:paystatus, 'td[7]//text()[not(ancestor::div)]'],
[:paydate, 'td[7]/div/text()'], # the 2013-01-29 or 2012-10-05 above
].each do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
to catch either uchwalona or proponowana (notice without the date in the trailing div), but as it stands, it ignores the a tag text, unless I do td[7]/a/text(), in which case only the anchor's text "uchwalona" is read.
Using the union operator | should work:
[:paystatus, '(td[7]|td[7]/a)/text()']
(I think you won't need the [not(ancestor::div)] part if you don't use a double-slash)
The problem appeared to be resolved when I used the row.xpath method instead of .at_xpath, which somehow made the union operator | ineffective.
So changed
detail[name] = row.at_xpath(xpath).to_s.strip
to:
detail[name] = row.xpath(xpath).to_s.strip
This meant I also had to tighten a few xpath expressions in my other field |name, xpath| pairs, to not over-include as unnoticed before.

Ruby - remove part of the text

I have a text similar to this:
<p>some text ...</p><p>The post text... appeared first on some another text.</p>
I need to remove everything from <p>The post, so the results would be:
<p>some text ...</p>
I am trying ot do that this way:
text.sub!(/^<p>The post/, '')
But it returns just an empty string... how to fix that?
Your regex is incorrect. It matches every <p>The post that is in the beginning of the string. You want the opposite: match from its position to the end of the string. Check this out.
s = '<p>some text ...</p><p>The post text... appeared first on some another text.</p>'
s.sub(/<p>The\spost.*$/, '') # => "<p>some text ...</p>"
You have specified ^, which matches the beginning of a string. You should do
text.sub!(/<p>The post.*$/, '')
Play with this in http://rubular.com/r/c91EbHN0Af
'^' is matching the beginning of the whole string. try doing
text.sub!(/<p>The post/, '')
EDIT just read it more carefully...
text.sub!(/<p>The post.*$/, '')

Hpricot: How to extract inner text without other html subelements

I'm working on a vim rspec plugin (https://github.com/skwp/vim-rspec) - and I am parsing some html from rspec. It looks like this:
doc = %{
<dl>
<dt id="example_group_1">This is the heading text</dt>
Some puts output here
</dl>
}
I can get the entire inner of the using:
(Hpricot.parse(doc)/:dl).first.inner_html
I can get just the dt by using
(Hpricot.parse(doc)/:dl).first/:dt
But how can I access the "Some puts output here" area? If I use inner_html, there is way too much other junk to parse through. I've looked through hpricot docs but don't see an easy way to get essentially the inner text of an html element, disregarding its html children.
I ended up figuring out a route by myself, by manually parsing the children:
(#context/"dl").each do |dl|
dl.children.each do |child|
if child.is_a?(Hpricot::Elem) && child.name == 'dd'
# do stuff with the element
elsif child.is_a?(Hpricot::Text)
text=child.to_s.strip
puts text unless text.empty?
end
end
Note that this is bad HTML you have there. If you have control over it, you should wrap the content you want in a <dd>.
In XML terms what you are looking for is the TextNode following the <dt> element. In my comment I showed how you can select this node using XPath in Nokogiri.
However, if you must use Hpricot, and cannot select text nodes using it, then you could hack this by getting the inner_html and then stripping out the unwanted:
(Hpricot.parse(doc)/:dl).first.inner_html.sub %r{<dt>.+?</dt>}, ''

Resources