Rails HTML Sanitizing - ruby

I am trying to sanitize an HTML file and it isn't working correctly. I want it all to be entirely plain text except for paragraph and line break tags. Here is my sanitization code (the dots signify other code in my class that isn't relevant to the problem):
.
.
.
include ActionView::Helpers::SanitizeHelper
.
.
.
def remove_html(html_content)
sanitized_content_1 = sanitize(html_content, :tags => %w(p br))
sanitized_content_2 = Nokogiri::HTML(sanitized_content_1)
sanitized_content_2.css("style","script").remove
return sanitized_content_2
end
It isn't working correctly. Here is the original HTML file from which the function is reading its input, and here is the "sanitized" code it is returning. It is leaving in the body of CSS tags, JavaScript, and HTML Comment Tags. It might be leaving in other stuff as well that I have not noticed. Please advise on how to thoroughly remove all CSS, HTML, and JavaScript other than paragraph and line break tags?

I don't think you want to sanitize it. Sanitizing strips HTML, leaving the text behind, except for the HTML elements you deem OK. It is intended for allowing a user-input field to contain some markup.
Instead, you probably want to parse it. For example, the following will print the text content of the <p> tags in a given html string.
doc = Nokogiri::HTML.parse(html)
doc.search('p').each do |el|
puts el.text
end

You can also escape the HTML using the CGI module.
require 'cgi'
str = "<html><head><title>Hello</title></head><body></body></html>"
p str
p CGI::escapeHTML str
Running this script gives the following result:
$ ruby sanitize.rb
"<html><head><title>Hello</title></head><body></body></html>"
"&lt;html&gt;&lt;head&gt;&lt;title&gt;Hello&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;/body&gt;&lt;/html&gt;"

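It may help to see that CGI.escapeHTML does not remove any markup; it only rewrites the special characters so a browser shows the tags as text. A small sketch, with a made-up input string, showing that the escaping is reversible and that the tag text is still present afterwards:

```ruby
require "cgi"

markup  = "<p>Tom & Jerry</p>"          # made-up sample input
escaped = CGI.escapeHTML(markup)

puts escaped                             # => &lt;p&gt;Tom &amp; Jerry&lt;/p&gt;
puts CGI.unescapeHTML(escaped) == markup # => true, nothing was stripped
```

So this is useful when untrusted text must be displayed literally, but it is not a substitute for stripping tags.
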
Related

Add unescaped entities to document with Nokogiri [duplicate]

I would like to add things like bullet points ("&#8226;") to HTML using the XML Builder in Nokogiri, but everything is being escaped. How do I prevent it from being escaped?
I would like the result to be:
<span>&#8226;</span>
rather than:
<span>&amp;#8226;</span>
I'm just doing this:
xml.span {
xml.text "&#8226;\ "
}
What am I missing?
If you define
class Nokogiri::XML::Builder
def entity(code)
doc = Nokogiri::XML("<?xml version='1.0'?><root>&##{code};</root>")
insert(doc.root.children.first)
end
end
then this
builder = Nokogiri::XML::Builder.new do |xml|
xml.span {
xml.text "I can has "
xml.entity 8226
xml.text " entity?"
}
end
puts builder.to_xml
yields
<?xml version="1.0"?>
<span>I can has • entity?</span>
PS: this is only a workaround. For a clean solution, please refer to the libxml2 documentation (Nokogiri is built on libxml2). However, even those folks admit that handling entities can be quite, err, cumbersome sometimes.
When you're setting the text of an element, you really are setting text, not HTML source. < and & don't have any special meaning in plain text.
So just type a bullet: '•'. Of course your source code and your XML file will have to be using the same encoding for that to come out right. If your XML file is UTF-8 but your source code isn't, you'd probably have to say "\xe2\x80\xa2" (in double quotes, since single-quoted Ruby strings do not interpret \x escapes), which is the UTF-8 byte sequence for the bullet character as a string literal.
(In general non-ASCII characters in Ruby 1.8 are tricky. The byte-based interfaces don't mesh too well with XML's world of all-text-is-Unicode.)
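To illustrate the point about byte sequences, here is a small sketch assuming Ruby 1.9 or later with a UTF-8 source file. It shows that the bullet is the three bytes e2 80 a2, and that the escapes only work inside double quotes:

```ruby
# encoding: utf-8

bullet = "\u2022"                       # the bullet character, U+2022

# Its UTF-8 encoding is three bytes.
hex_bytes = bullet.bytes.map { |b| format("%02x", b) }
puts hex_bytes.join(" ")                # => e2 80 a2

# Double quotes interpret \x escapes; force_encoding marks the bytes as UTF-8.
from_bytes = "\xE2\x80\xA2".force_encoding("UTF-8")
puts from_bytes == bullet               # => true

# Single quotes keep the backslashes literally, giving 12 plain characters.
puts '\xe2\x80\xa2'.length              # => 12
```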

Why is my Ruby lookahead regex not working [duplicate]

I tested my regex in rubular.com and it works, but when I run the code it behaves differently.
I want to parse whole paragraphs out of some HTML code.
Here is my regex:
description = ad_page.body.scan(/(?<=<span id="preview-local-desc">).+(?=<\/span>)/m)
Here is some of the HTML source:
<span id="preview-local-desc"> I want to pick up everything typed here.
Paragraphs, everything.
</span>
The match begins where I need it to but then it keeps matching all the way to the end of the document.
Aside from the fact that you shouldn't parse HTML with regex, you want non-greedy matching:
/(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m
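The difference between the greedy and non-greedy versions is easiest to see side by side. The sketch below uses a made-up document with a second, unrelated span later on, which is the situation that makes the greedy pattern run to the end:

```ruby
# Made-up sample: the span we want, followed by unrelated markup with its own </span>.
html = <<~HTML
  <span id="preview-local-desc"> First paragraph.
  Second paragraph.
  </span>
  <p>Other <span>unrelated</span> markup</p>
HTML

# Greedy: .+ grabs as much as possible, so it stops at the LAST </span>.
greedy = html[/(?<=<span id="preview-local-desc">).+(?=<\/span>)/m]
# Non-greedy: .+? grabs as little as possible, so it stops at the FIRST </span>.
lazy   = html[/(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m]

puts greedy.include?("unrelated")  # => true
puts lazy.include?("unrelated")    # => false
puts lazy
```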
Parsing XML or HTML with a regex is marginally OK for trivial tasks, if you own or control the file's format. If you don't, then a simple change to the file could break your regex.
Using a parser will avoid that problem; I've parsed some horrible XML with Nokogiri and it didn't even notice. After writing an RSS aggregator that was handling 1000+ feeds I was hooked on using a parser.
require 'nokogiri'
html = '<span id="preview-local-desc"> I want to pick up everything typed here.
Paragraphs, everything.
</span>'
doc = Nokogiri.HTML(html)
doc.at('span').text
# => " I want to pick up everything typed here.\n Paragraphs, everything.\n "
If there are multiple <span> tags you want:
doc.search('span').map(&:text)
# => [" I want to pick up everything typed here.\n Paragraphs, everything.\n "]
If there are multiple <span> tags and you only want this one:
doc.at('span#preview-local-desc').text
# => " I want to pick up everything typed here.\n Paragraphs, everything.\n "

Find URLs in text and wrap in anchor tag

I'm basically writing my own Markdown parser. I want to detect a URL in a string and wrap it with an anchor tag if it's a valid URL. For example:
string = 'here is a link: http://google.com'
# if string matches regex (which it does)
# should return:
'here is a link: <a href="http://google.com">http://google.com</a>'
# but this would remain unchanged:
string = 'here is a link: google.com'
How can I achieve this?
Bonus points if you can point me to the code in an existing Ruby markdown parser that I can use as an example.
In general: use a regular expression to find URLs and wrap them in your HTML:
urls = %r{(?:https?|ftp|mailto)://\S+}i
html = str.gsub urls, '<a href="\0">\0</a>'
Note that this particular solution will turn this text:
See more at http://www.google.com.
…into…
See more at <a href="http://www.google.com.">http://www.google.com.</a>
So you may want to play with the regex a bit to figure out where the URL should really end.
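One way to handle the trailing-punctuation problem is to let the regex match generously and then peel punctuation off the end of each match before building the anchor. This is only a sketch: the autolink name, the URL_PATTERN constant, and the choice of which punctuation characters to peel are assumptions for illustration, not taken from any Markdown library.

```ruby
require "cgi"

# Assumed pattern: a scheme, then everything up to whitespace or "<".
URL_PATTERN = %r{(?:https?|ftp)://[^\s<]+}i

# Hypothetical helper: wraps each URL in an anchor tag and leaves
# sentence punctuation at the end of the URL outside the link.
def autolink(text)
  text.gsub(URL_PATTERN) do |match|
    url      = match.sub(/[.,;:!?)]+\z/, "")   # URL without trailing punctuation
    trailing = match[url.length..-1]           # the punctuation that was peeled off
    href     = CGI.escapeHTML(url)             # escape &, ", <, > for use in HTML
    %(<a href="#{href}">#{href}</a>#{trailing})
  end
end

puts autolink("See more at http://www.google.com.")
# => See more at <a href="http://www.google.com">http://www.google.com</a>.
puts autolink("here is a link: google.com")
# => here is a link: google.com
```

A URL that legitimately ends in one of the peeled characters (a closing parenthesis, for instance) would be cut short, so the character list is a trade-off rather than a rule.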
You can use this jquery plugin
http://www.jquery.gr/linker/

How do I cut phrases off a string in ruby?

I wasn't sure what to name my question. I have an HTML page I got using Nokogiri. Now I want to cut some tags off that page. I tried using Ruby's delete method after converting the HTML to a string, but it deletes all the letters I entered. The best result I got was using .gsub('<stuff>', ''), though it still leaves some space. Is it possible to actually cut stuff out of a string, such as specific phrases?
Another question: can I remove spaces?
What I've done so far:
doc = Nokogiri::HTML(open("http://www.example.com/"))
tester = doc.css(".example").to_s.gsub('<div class="example">', '')
I'd suggest trying to do it at the xml tree level rather than string editing.
I think the nokogiri api gives you some tools for doing this.
Another approach might be selecting the data you want, with css or xpath, rather than deleting the parts you don't want?
There's also an xpath function for normalising space in strings, there's an example in this question
Some nokogiri help:
Intro article on Engineyard
Railscast/Asciicasts
Official tutorials
Check out Nokogiri's Tutorials. In particular, you want to read "Modifying an HTML / XML Document", Changing text contents.
Nokogiri's XML accessors are very friendly, because you don't need to use XPath. You can use CSS accessors also, and for people who aren't in XML all day long they can help a lot.
In that particular example, they're using the at_css method, which searches for the first occurrence of the target. You have many alternate methods, which are synonyms: at, %, at_css and at_xpath handle "find the first one" cases. search, css, xpath, / similarly handle "find all occurrences".
For instance:
require 'nokogiri'
html = '<h1>Snap, Crackle and Pop</h1>'
doc = Nokogiri::HTML(html)
h1 = doc.at('h1')
h1.content = h1.content[0, h1.content.length - 3] + '...'
puts doc.to_html
>> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>> <html><body><h1>Snap, Crackle and ...</h1></body></html>
That creates a new HTML document in Nokogiri, searches for the first H1, and trims the trailing three characters in its contents, replacing them with an ellipsis.

How do I extract links from HTML using regex?

I want to extract links from google.com; My HTML code looks like this:
<a href="http://www.test.com/" class="l"
It took me around five minutes to find a regex that works using www.rubular.com.
It is:
"(.*?)" class="l"
The code is:
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read()
links = source.scan(/"(.*?)" class="l"/)
links.each { |link| puts #{link}
}
The problem is that it is not outputting the websites' links.
Those links actually have class=l not class="l". By the way, to figure this out I added some logging to the script so that you can see the output at various stages and debug it. I searched for the string you were expecting to find and didn't find it, which is why your regex failed. So I looked for the right string you actually wanted and changed the regex accordingly. Debugging skills are handy.
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read
puts "--- PAGE SOURCE ---"
puts source
links = source.scan(/<a.+?href="(.+?)".+?class=l/)
puts "--- FOUND THIS MANY LINKS ---"
puts links.size
puts "--- PRINTING LINKS ---"
links.each do |link|
puts "- #{link}"
end
I also improved your regex. You are looking for some text that starts with the opening of an a tag (<a), then some characters that you don't care about (.+?), an href attribute (href="), the contents of the href attribute that you want to capture ((.+?)), some spaces or other attributes (.+?), and lastly the class attribute (class=l).
I have .+? in three places there. The . means any character, the + means there must be one or more of the thing right before it, and the ? means that the .+ should try to match as short a string as possible.
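To try the pattern without fetching a live page, it can be run against a small hand-written string. The sample markup below is made up. Note that scan returns one array per match when the regex contains a capture group, so flatten is used here to get plain strings:

```ruby
# Made-up sample markup with two links that use the unquoted class=l form.
sample = '<a href="http://www.test.com/" class=l>Test</a> ' \
         '<a href="http://example.org/" class=l>Example</a>'

# Each match is an array holding the one captured group (the href value).
matches = sample.scan(/<a.+?href="(.+?)".+?class=l/)
links   = matches.flatten

links.each { |link| puts "- #{link}" }
# - http://www.test.com/
# - http://example.org/
```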
To put it bluntly, the problem is that you're using regexes. HTML is what is known as a context-free language, while regular expressions can only recognize the class of languages known as regular languages.
What you should do is send the page data to a parser that can handle HTML code, such as Hpricot, and then walk the parse tree you get from the parser.
What am I doing wrong?
You're trying to parse HTML with regex. Don't do that. Regular expressions cannot cover the range of syntax allowed even by valid XHTML, let alone real-world tag soup. Use an HTML parser library such as Hpricot.
FWIW, when I fetch ‘http://www.google.com/search?q=ruby’ I do not receive ‘class="l"’ anywhere in the returned markup. Perhaps it depends on which local Google you are using and/or whether you are logged in or otherwise have a Google cookie. (Your script, like me, would not.)
