Clean/sanitize HTML but preserve HTML characters with Ruby/Rails + Nokogiri + Sanitize (?)

We were using a combination of the Sanitize gem and HTMLEntities to do some clean up of user input HTML. The Sanitize gem used Hpricot, but now uses Nokogiri. I need to get Hpricot out of the app.
Here are two test strings, each followed by the output I'm expecting:
Test string 1:
"SOME TEXT < '<span style='background-image: url(\"http://evil.ru/webbug.png\")'>MORE' & TEXT!!!</span>"
expected_text = "SOME TEXT < 'MORE' & TEXT!!!"
Second test string (a slightly different path):
'Support <i>odd</i> chars like " < \' ‽'
expected_text = 'Support <i>odd</i> chars like " < ' ‽'
Is this something you've solved? What tools did you use?

You may want to try the Loofah gem:
Loofah.document("SOME TEXT < '<span style='background-image: url(\"http://evil.ru/webbug.png\")'>MORE' & TEXT!!!</span>").to_html
=> "SOME TEXT MORE' & TEXT!!!"
Loofah isn't handling the Unicode character in the second example for some reason, but I'd be happy to look into it if you file a GitHub issue on Loofah (full disclosure: I'm the author of Loofah and co-author of Nokogiri).
Some more links:
http://rubydoc.info/github/flavorjones/loofah/master/frames
https://github.com/flavorjones/loofah

Related

Printing list with polish letters

I am writing a simple program for Windows using Python 2.7. It opens an email, takes some words from it and puts them in a form on the web. The problem starts when the email contains Polish letters like Ó, Ź, Ł etc. Whenever I try to print the list I get something like: ['\xc4\x84', '\xc5\xbb', '\xc3\x93', '\xc4\x86', '\xc5\xb9'].
I already know it is because of encoding and that Python 3 has no such problem. Here is what I tried already:
mail = " Ą Ż Ó Ć Ź"
mail = mail.split()
mail = mail.decode("UTF-8")
print mail
or
mail = " Ą Ż Ó Ć Ź"
mail = mail.split()
[x.encode('UTF8') for x in mail]
print mail
Can anyone please show me how to make the list print properly?
Python 2.x assumes ASCII as the default source encoding. To declare that your source file is UTF-8, add this line to the top of your program:
# -*- coding: utf-8 -*-
You should also prefix any string literals with 'u', e.g.:
polishLetters = u'Ą Ż Ó Ć Ź'

Ruby: Remove invisible characters after converting string to UTF-8

I am working with text coming from this website with windows-1252 charset. Converting the text to UTF-8 was done using force_encoding, but the text still contains whitespace that I can't get rid of. The whitespace cannot be removed using text.gsub!(/\s/, ' ') or a similar technique.
The iconv gem doesn't do the trick either, as explained here. It is clear that the whitespace is a remnant of the original text and the windows-1252 charset, as I get an invalid multibyte char (US-ASCII) warning if I don't specify the encoding as UTF-8.
I'm not an expert of text encoding so I may be overlooking something trivial.
Update: This is the script that I currently use.
#!/bin/env ruby
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
html = Nokogiri.HTML(open(URL))
# Extract Paragraphs
text = ''
html.css('p').each do |p|
text += p.text
end
# Clean Up Text
text.gsub!(/\s+/, ' ')
puts text
This is a sample of the text that contains invisible characters that I try to remove. The space before the number 16 is what I am referring to.
cobraron aliento para conversar con él.   16 Al punto corrió la voz, y
se divulgó generalmente esta noticia en el palacio del rey: Han
Without seeing your code, it's hard to know exactly what's going on for you. I'll point out, however, that String#force_encoding doesn't transcode the String; it's a way of saying, "No, really, this is UTF-8", for example. To transcode from one encoding to another, use String#encode.
This seems to work for me:
require 'net/http'
s = Net::HTTP.get('www.eximsystems.com', '/LaVerdad/Antiguo/Gn/Genesis.htm')
s.force_encoding('windows-1252')
s.encode!('utf-8')
In general, /[[:space:]]/ should capture more kinds of whitespace than /\s/ (which is equivalent to /[ \t\r\n\f]/), but it doesn't appear to be necessary in this case. I can't find any abnormal whitespace in s at this point. If you're still having problems, you'll need to post your code and a more precise description of the issue.
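To make the difference between relabeling and transcoding concrete, here is a minimal sketch. The byte string is a made-up sample (not taken from the page in question), chosen so that it contains one accented letter and one non-breaking space as windows-1252 would encode them:

```ruby
# Bytes as a windows-1252 page would deliver them:
# 0xE9 is "é" and 0xA0 is a non-breaking space (hypothetical sample data).
raw = "caf\xE9\xA016".b

# force_encoding only relabels the bytes; these bytes are not valid UTF-8.
relabeled = raw.dup.force_encoding('UTF-8')
puts relabeled.valid_encoding?

# Label the bytes with their real encoding, then transcode them to UTF-8.
transcoded = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
puts transcoded.valid_encoding?
puts transcoded.inspect

# /\s/ only knows ASCII whitespace; /[[:space:]]/ also matches U+00A0.
puts((transcoded =~ /\s/).inspect)
puts((transcoded =~ /[[:space:]]/).inspect)
```

The last two lines are the reason /[[:space:]]/ is worth trying when /\s/ seems to miss a space.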
Update: Thanks for updating your question with your code and an example of the problem. It looks like the issue is non-breaking spaces. I think it's simplest to get rid of them at the source:
require 'nokogiri'
require 'open-uri'
URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
# Fetch the raw HTML first (on Ruby 3.0+, use URI.open(URL) instead of open(URL))
s = open(URL).read
# Replace &nbsp; entities with normal spaces in the source, before
# Nokogiri converts them to Unicode non-breaking spaces
s.gsub!('&nbsp;', ' ')
html = Nokogiri.HTML(s)
# Extract Paragraphs
text = ''
html.css('p').each do |p|
text += p.text
end
# Clean Up Text
text.gsub!(/\s+/, ' ')
puts text
There's now just a single, normal space between the period at the end of 15 and the number 16:
15) Besó también José a todos sus hermanos, orando sobre cada uno de ellos; después de cuyas demostraciones cobraron aliento para conversar con él. 16 Al punto corrió la voz, y se divulgó generalmente esta noticia en el palacio del rey: Han venido los hermanos de José; y holgóse de ello Faraón y toda su corte.
You can try to use text.strip for removing the whitespaces.
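Note that strip only trims leading and trailing ASCII whitespace, so it won't touch non-breaking spaces. If you'd rather clean up after parsing, something along these lines might work. This is only a sketch: normalize_whitespace is a hypothetical helper name and the sample string is made up to resemble the text in the question.

```ruby
# Hypothetical helper: collapse every run of Unicode whitespace
# (including U+00A0 non-breaking spaces) into one normal space, then trim.
def normalize_whitespace(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

# Made-up sample with non-breaking spaces before "16" and at the end.
sample = "con él.\u00A0\u00A0 16 Al punto\u00A0"

puts sample.strip.inspect                 # strip alone leaves the U+00A0 characters
puts normalize_whitespace(sample).inspect # only single normal spaces remain
```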

How do I parse Google image URLs using Ruby and Nokogiri?

I'm trying to make an array of all the image files on a Google images webpage.
I want a regular expression to pull everything after "imgurl=" and ending before "&amp" as seen in this HTML:
<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>
I feel like I can do this with a regex, but I can't find a way to search my parsed document with one.
str = '<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>'
str.split('imgurl=')[1].split('&amp')[0]
#=> "http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg"
Is that what you're looking for?
The problem with using a regex is you assume too much knowledge about the order of parameters in the URL. If the order changes, or & disappears the regex won't work.
Instead, parse the URL, then split the values out:
# encoding: UTF-8
require 'nokogiri'
require 'cgi'
require 'uri'
doc = Nokogiri::HTML.parse('<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>')
img_url = doc.search('a').each do |a|
query_params = CGI::parse(URI(a['href']).query)
puts query_params['imgurl']
end
Which outputs:
http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg
Both URI and CGI are used because URI's decode_www_form raises an exception when trying to decode the query.
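If your Ruby's URI.decode_www_form does accept the query (an assumption; it may depend on the Ruby version and on the query itself), a stdlib-only variant could look roughly like this. The href below is a hypothetical stand-in, not one from the actual page:

```ruby
require 'uri'

# Hypothetical href, shaped like the links on a Google Images result page.
href = '/imgres?imgurl=http://www.example.com/a.jpg' \
       '&imgrefurl=http://www.example.com/page.html?id=7&h=400'

# Pull out the query string, then decode it into [key, value] pairs.
query  = URI(href).query
params = URI.decode_www_form(query).to_h

puts params['imgurl']
puts params['imgrefurl']
puts params['h']
```

Unlike CGI::parse, which returns an array of values per key, this gives one value per key, so a repeated parameter would keep only its last value.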
I've also been known to decode the query string into a hash using something like:
Hash[URI(a['href']).query.split('&').map{ |p| p.split('=') }]
That will return:
{"imgurl"=>
"http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg",
"imgrefurl"=>
"http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html",
"usg"=>"__YJdf3xc4ydSfLQa9tYnAzavKHYQ",
"h"=>"400",
"w"=>"400",
"sz"=>"58",
"hl"=>"en",
"start"=>"19",
"zoom"=>"1",
"tbnid"=>"ajDcsGGs0tgE9M:",
"tbnh"=>"124",
"tbnw"=>"124",
"ei"=>"qagfUbXmHKfv0QHI3oG4CQ",
"itbs"=>"1",
"sa"=>"X",
"ved"=>"0CE4QrQMwEg"}
To get all the image URLs you want, do:
# get all links
url = 'some-google-images-url'
links = Nokogiri::HTML( open(url) ).css('a')
# get regex match or nil on desired img
img_urls = links.map {|a| a['href'][/imgurl=(.*?)&/, 1] }
# get rid of nils
img_urls.compact
The regex you want is /imgurl=(.*?)&/ because you want a non-greedy match between imgurl= and &, otherwise the greedy .* would take everything to the last & in the string.
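A quick way to see the difference between the two, using hypothetical hrefs (one of them nil, as happens for anchors without an href attribute, hence the to_s):

```ruby
# Hypothetical hrefs; a nil entry stands for an <a> tag with no href.
hrefs = [
  '/imgres?imgurl=http://www.example.com/a.jpg&imgrefurl=http://www.example.com/a.html&h=400',
  nil,
  '/search?q=chapel'
]

# Non-greedy: stops at the first & after imgurl=. to_s guards against nil.
lazy = hrefs.map { |href| href.to_s[/imgurl=(.*?)&/, 1] }.compact

# Greedy: runs on to the last & in the string.
greedy = hrefs.first[/imgurl=(.*)&/, 1]

puts lazy.inspect
puts greedy.inspect
```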

Programmatically get a list of characters a certain .ttf font file supports

Is there a way to programmatically get a list of characters a .ttf file supports using Ruby and/or Bash? I am trying to pipe the supported character codes into a text file for later processing.
(I would prefer not to use Font Forge.)
Found a Ruby gem called ttfunk which can be found here.
After a gem install ttfunk, you can get all unicode characters by running the following script:
require 'ttfunk'
file = TTFunk::File.open("path/to/font.ttf")
cmap = file.cmap
chars = {}
unicode_chars = []
cmap.tables.each do |subtable|
next if !subtable.unicode?
chars = chars.merge( subtable.code_map )
end
unicode_chars = chars.keys.map{ |dec| dec.to_s(16) }
puts "\n -- Found #{unicode_chars.length} characters in this font \n\n"
p unicode_chars
Which will output something like:
 -- Found 2815 characters in this font
["20", "21", "22", "23", ... , "fef8", "fef9", "fefa", "fefb", "fefc", "fffc", "ffff"]
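Since the goal was a text file for later processing, the code points could be written out one per line. A rough sketch follows; the code_points array is a hypothetical stand-in for chars.keys from the script above, and the output path is just an example:

```ruby
require 'tmpdir'

# Hypothetical stand-in for chars.keys (integer code points from the cmap).
code_points = [0x21, 0x20, 0xFEF8, 0x20]

# De-duplicate, sort, and format each code point as U+XXXX.
lines = code_points.uniq.sort.map { |cp| format('U+%04X', cp) }

# Example output location; adjust the path to suit.
path = File.join(Dir.tmpdir, 'supported_chars.txt')
File.write(path, lines.join("\n") + "\n")

puts lines
```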

trying to get content inside cdata tags in xml file using nokogiri

I have seen several things on this, but nothing has seemed to work so far. I am parsing an xml via a url using nokogiri on rails 3 ruby 1.9.2.
A snippet of the xml looks like this:
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
I am trying to parse this out to get the text associated with the NewsLineText
r = node.at_xpath('.//newslinetext') if node.at_xpath('.//newslinetext')
s = node.at_xpath('.//newslinetext').text if node.at_xpath('.//newslinetext')
t = node.at_xpath('.//newslinetext').content if node.at_xpath('.//newslinetext')
puts r
puts s.blank? ? 'NOTHING' : s
puts t.blank? ? 'NOTHING' : t
What I get in return is
<newslinetext></newslinetext>
NOTHING
NOTHING
So I know my tags are named/spelled correctly to get at the newslinetext data, but the cdata text never shows up.
What do I need to do with nokogiri to get this text?
You're trying to parse XML using Nokogiri's HTML parser. If node were from the XML parser then r would be nil since XML is case sensitive; your r is not nil so you're using the HTML parser, which is case insensitive.
Use Nokogiri's XML parser and you will get things like this:
>> r = doc.at_xpath('.//NewsLineText')
=> #<Nokogiri::XML::Element:0x8066ad34 name="NewsLineText" children=[#<Nokogiri::XML::Text:0x8066aac8 "\n ">, #<Nokogiri::XML::CDATA:0x8066a9c4 "\n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n ">, #<Nokogiri::XML::Text:0x8066a8d4 "\n">]>
>> r.text
=> "\n \n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n \n"
and you'll be able to get at the CDATA through r.text or r.children.
Ah, I see. What @mu said is correct. But to get at the CDATA directly, maybe:
require 'nokogiri'
xml = <<EOF
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
EOF
node = Nokogiri::XML xml
cdata = node.search('NewsLineText').children.find{|e| e.cdata?}
