How to remove non-UTF8 characters in RSS with Ruby

I'm using Ruby's RSS library to parse an RSS feed, but I occasionally hit errors when a bullet-point character appears in the feed as a �.
require 'rss'
rss = RSS::Parser.parse('rss_url_here', false)
which results in
#<ArgumentError: invalid byte sequence in UTF-8>
due to the � character. How can I remove � characters?
Update:
I have tried using
require 'net/http'
require 'rss'
uri = URI('https://newyork.craigslist.org/search/jjj?query=graphic%20design&s=100&sort=date&format=rss')
json = Net::HTTP.get(uri)
json.force_encoding('CP1252')
json.force_encoding('utf-8')
rss = RSS::Parser.parse(json, false)
Still getting
ArgumentError: invalid byte sequence in UTF-8

You can use the HTMLEntities gem to decode HTML entities in the feed content:
require 'htmlentities'
HTMLEntities.new.decode(rss_feed_content)

The key is to distinguish the two methods mentioned in the comments: force_encoding merely re-labels a string with a new encoding and leaves its bytes untouched, while encode actually transcodes the bytes into the target encoding.
require 'net/http'
require 'rss'
uri = URI('https://newyork.craigslist.org/search/jjj?query=graphic%20design&s=100&sort=date&format=rss')
text = Net::HTTP.get(uri)
rss = RSS::Parser.parse(text.force_encoding('CP1252').encode('utf-8'), false)
#⇒ #<RSS::RDF:0x000000053791a0 .....
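To make the distinction concrete, here is a minimal sketch (the \x95 byte is CP1252's bullet character, the same kind of character that breaks the feed above):
s = "list item \x95".dup.force_encoding('CP1252')
s.bytes.last      # => 149 -- force_encoding relabels; the bytes are unchanged
u = s.encode('UTF-8')
u                 # => "list item •" (0x95 is a bullet in CP1252)
u.bytes.last(3)   # => [226, 128, 162] -- encode rewrote the byte as UTF-8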

I like to remove junk char codes like this:
json = json.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
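A quick sketch of what that does to a string carrying a stray invalid byte:
s = "graphic \x92 design"   # invalid byte in a string tagged UTF-8
s.valid_encoding?           # => false
s = s.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
s                           # => "graphic  design"
s.valid_encoding?           # => true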

Related

I am getting an (eval):1: invalid Unicode codepoint error while trying to scrape Instagram

I am trying to scrape data from Instagram. Here is my code:
require 'open-uri'
require 'nokogiri'
require 'json'
require "unicode/emoji"
def get_html
  url = 'https://www.instagram.com/muriithi_kabogo/'
  html = open(url)
end

def pass_data
  html = get_html
  doc = Nokogiri::HTML(html)
end

def get_data
  profiles = []
  body = pass_data.at('body')
  script = body.at('script').text
  myText = script
  json_object_data = eval(myText)
end
get_data()
When I try to change the text into JSON format, I get an error:
(eval):1: invalid Unicode codepoint (SyntaxError)
usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr
How do I move past this error?
JSON escapes characters outside the Basic Multilingual Plane as UTF-16 surrogate pairs (that is what \ud83d\ude0a is), which Ruby's eval chokes on.
Do not use eval. For one thing, Ruby will reject \ud83d\ude0a as invalid codepoints, as it should; for another, eval on data fetched from the web is a security hole; and lastly, it slows down your code.
Use JSON.parse, which is safer, faster, and knows how to decode surrogate pairs:
require 'json'
json_str = '"usinessmen #beautiful #smile\ud83d\ude0a #teambringit #shebr"'
JSON.parse(json_str)
# => "usinessmen #beautiful #smile😊 #teambringit #shebr"

ruby 1.9 character conversion errors while testing regex

I know there are tons of docs and debates out there, but still:
This is my best attempt at testing scraped data from various websites in my Rails app. The strange thing is that if I manually copy and paste the source of a URL, everything works.
What can I do?
# encoding: utf-8
require 'rubygems'
require 'iconv'
require 'nokogiri'
require 'open-uri'
require 'uri'

url = 'http://www.website.com/url/test'
sio = open(url)
@cur_encoding = sio.charset
doc = Nokogiri::HTML(sio, nil, @cur_encoding)
txtdoc = doc.to_s

# 1) String manipulation test
p doc.search('h1')[0].text        # "Nove36  "
p doc.search('h1')[0].text.strip! # nil <- ERROR

# 2) Regex test
# txtdoc = "test test 44.00 € test test" # <- THIS WORKS
regex = "[0-9.]+ €"
p /#{regex}/i =~ txtdoc # integer expected
I realize that my OS (Ubuntu) plus my text editor are probably doing some helpful encoding conversion over a broken encoding. That's fine, but how can I fix this problem in my app while it's running live?
@cur_encoding = doc.encoding # ISO-8859-15
ISO-8859-15 is not the correct encoding for the quoted page; it should have been UTF-8. iconving it to UTF-8 as if it were 8859-15 only compounds the problem.
This encoding is coming from a faulty <meta> tag in the document. A browser will ignore that tag and use the overriding encoding from the Content-Type: text/html;charset=utf-8 HTTP response header.
However, Nokogiri appears not to be able to read this header from the open()ed stream. With the caveat that I know nothing about Ruby: looking at the source, the problem would seem to be that it uses the encoding property of the string-or-IO instead of charset, which seems to be what open-uri writes.
You can pass in an override encoding of your own, so I guess try:
sio = open(url)
doc = Nokogiri::HTML.parse(sio, nil, sio.charset) # should be UTF-8?
The problems you're having are caused by non-breaking space characters (Unicode U+00A0) in the page.
In your first problem, the string:
"Nove36 "
actually ends with U+00A0, and String#strip! doesn't consider this character to be whitespace, so nothing is removed:
1.9.3-p125 :001 > s = "Foo \u00a0"
=> "Foo  "
1.9.3-p125 :002 > s.strip
=> "Foo  " #unchanged
In your second problem, the space between the price and the euro sign is again a non-breaking space, so the regex simply doesn't match, as it is looking for a normal space:
# s as before
1.9.3-p125 :003 > s =~ /Foo  / # 2 spaces, no match
=> nil
1.9.3-p125 :004 > s =~ /Foo / #1 space, match
=> 0
1.9.3-p125 :005 > s =~ /Foo \u00a0/ #space and non breaking space, match
=> 0
When you copy and paste the source, the browser probably normalises the non-breaking spaces, so you only copy normal space characters, which is why it works that way.
The simplest fix would be to do a global substitution of \u00a0 for space before you start processing:
sio = open(url)
@cur_encoding = sio.charset
txt = sio.read            # read the whole file
txt.gsub!("\u00a0", " ")  # global replace
doc = Nokogiri::HTML(txt, nil, @cur_encoding) # use this new string instead...

HTML fetched by Nokogiri is wrongly encoded

I use Nokogiri to parse an HTML page. I need both the content and the image tags on the page, so I use the inner_html method instead of the content method. But the value returned by content is encoded correctly, while the value from inner_html is wrongly encoded. One note: the page is in Chinese and does not use UTF-8 encoding.
Here is my code:
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'iconv'

doc = Nokogiri::HTML.parse(open("http://www.sfzt.org/advise/view.asp?id=536"), nil, 'gb18030')
doc.css('td.font_info').each do |link|
  # output: correctly encoded, but not what I want (no img tag): 目前市面上影响比
  puts link.content
  # output: what I want, but wrongly encoded: <img ....></img>Ŀǰ??????Ӱ??Ƚϴ?Ľ????
  # I expect: <img ....></img>目前市面上影响比
  puts link.inner_html
end
This is covered in the 'Encoding' section of the Nokogiri README: http://nokogiri.org/
Strings are always stored as UTF-8 internally. Methods that return
text values will always return UTF-8 encoded strings. Methods that
return XML (like to_xml, to_html and inner_html) will return a string
encoded like the source document.
So, you should convert the inner_html string manually if you want to get it as a UTF-8 string:
puts link.inner_html.encode('utf-8') # for 1.9.x
I think content strips out tags well; however, the inner_html method does not handle this well, or at all.
"I think you can end up with some pretty weird states if you change the inner_html (which contain tags) while you are traversing. In other words, if you are traversing a node tree, you shouldn’t do anything that could add or remove nodes."
Try this:
doc.css('td.font_info').each do |link|
  puts link.content
  some_stuff = link.inner_html
  link.children = Nokogiri::HTML.fragment(some_stuff, 'utf-8')
end

open-uri returning ASCII-8BIT from webpage encoded in iso-8859

I am using open-uri to read a webpage which claims to be encoded in iso-8859-1. When I read the contents of the page, open-uri returns a string encoded in ASCII-8BIT.
open("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310") {|f| p f.content_type, f.charset, f.read.encoding }
=> ["text/html", "iso-8859-1", #<Encoding:ASCII-8BIT>]
I am guessing this is because the webpage has the byte (or character) \x92 which is not a valid iso-8859 character. http://en.wikipedia.org/wiki/ISO/IEC_8859-1.
I need to store webpages as UTF-8 encoded files. Any ideas on how to deal with a webpage whose declared encoding is incorrect? I could catch the exception and try to guess the correct encoding, but that seems cumbersome and error-prone.
ASCII-8BIT is an alias for BINARY
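You can verify the alias, then transcode once you know (or guess) the real source encoding. A minimal sketch; for what it's worth, \x92 is a curly apostrophe in Windows-1252, a common culprit on pages that claim to be ISO-8859-1:
Encoding::ASCII_8BIT.equal?(Encoding::BINARY)  # => true, the very same object

raw = "caf\x92".b                     # String#b returns a BINARY-tagged copy
raw.encoding                          # => #<Encoding:ASCII-8BIT>
raw.encode('UTF-8', 'Windows-1252')   # => "caf’"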
open-uri does a funny thing: if the file is less than 10kb (or something like that), it returns a StringIO; if it's bigger, it returns a Tempfile. That can be confusing if you're trying to deal with encoding issues.
If the files aren't huge, I would recommend manually loading them into strings:
require 'uri'
require 'net/http'
require 'net/https'

uri = URI.parse(url_to_file)
http = Net::HTTP.new(uri.host, uri.port)
if uri.scheme == 'https'
  http.use_ssl = true
  # possibly useful if you see SSL errors:
  # http.verify_mode = ::OpenSSL::SSL::VERIFY_NONE
end
body = http.start { |session| session.get(uri.request_uri) }.body
Then you can use the https://rubygems.org/gems/ensure-encoding gem
require 'ensure/encoding'
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :transcode)
I have been pretty happy with ensure-encoding... we use it in production at http://data.brighterplanet.com
Note that you can also say :invalid_characters => :ignore instead of :transcode.
Also, if you know the encoding somehow, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff
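Putting those options together, a sketch based on the gem's API as described above:
require 'ensure/encoding'

# Sniff the source encoding and transcode anything invalid:
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff,
                                          :invalid_characters => :transcode)

# Or, if you trust the server's charset header, name the encoding and drop bad bytes:
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => 'ISO-8859-1',
                                          :invalid_characters => :ignore)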

ruby 1.9: invalid byte sequence in UTF-8

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understood, the net/http library doesn't have any encoding-specific options, and the stuff that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...
In Ruby 1.9.3 it is possible to use String#encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (iconv) and 1.9 (String#encode):
require 'iconv' unless String.method_defined?(:encode)

if String.method_defined?(:encode)
  # Caveat: transcoding from UTF-8 to UTF-8 is a no-op in many Ruby versions,
  # so invalid bytes can survive; the UTF-16 round trip below is more reliable.
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
Or, if you have really troublesome input, you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:
require 'iconv' unless String.method_defined?(:encode)

if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
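A quick sketch of the round trip on a string with a single bad byte:
bad = "caf\xE9 au lait"  # a lone Latin-1 byte inside a string tagged UTF-8
bad.valid_encoding?      # => false
bad.encode('UTF-16', 'UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')
# => "caf au lait"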
Neither the accepted answer nor the other answers worked for me. I found this post, which suggested
string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
This fixed the problem for me.
My current solution is to run:
my_string.unpack("C*").pack("U*")
This will at least get rid of the exceptions, which was my main problem. (Note that it reinterprets each raw byte as a codepoint, so multi-byte UTF-8 characters come out mangled; it trades correctness for not crashing.)
Try this:
def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end
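For example (the .dup guards against frozen string literals, since the method calls force_encoding on its argument):
to_utf8("caf\x92 special".dup)  # invalid byte dropped => "caf special"
to_utf8("already fine".dup)     # valid input passes straight through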
I recommend using an HTML parser. Just find the fastest one.
Parsing HTML is not as easy as it may seem.
Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by substituting the "�" symbol, so once the invalid sequence in the HTML gets parsed, the resulting text is a valid string.
Even inside attribute values, you have to decode HTML entities like &amp;.
Here is a great question that sums up why you can not reliably parse HTML with a regular expression:
RegEx match open tags except XHTML self-contained tags
attachment = file.read

begin
  # Try it as UTF-8 directly
  cleaned = attachment.dup.force_encoding('UTF-8')
  unless cleaned.valid_encoding?
    # Some of it might be an old Windows code page
    cleaned = attachment.encode('UTF-8', 'Windows-1252')
  end
  attachment = cleaned
rescue EncodingError
  # Force it to UTF-8, throwing out invalid bits
  attachment = attachment.force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
end
This seems to work:
def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end
I've encountered strings that had a mix of English, Russian, and some other alphabets, which caused an exception. I only need Russian and English, and this currently works for me:
ec1 = Encoding::Converter.new("UTF-8", "Windows-1251", :invalid => :replace, :undef => :replace, :replace => "")
ec2 = Encoding::Converter.new("Windows-1251", "UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
t = ec2.convert(ec1.convert(t))
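For example, a hedged sketch (Windows-1251 covers Cyrillic plus basic Latin, so anything outside those alphabets is dropped on the round trip):
t = "Hello Привет 你好"
ec2.convert(ec1.convert(t))  # => "Hello Привет " -- the Chinese characters are gone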
While Nakilon's solution works, at least as far as getting past the error, in my case I had a weird mangled character originating from Microsoft Excel's CSV export that was registering in Ruby as (get this) a bolded Cyrillic K. To fix this I used 'iso-8859-1', viz. CSV.parse(f, :encoding => "iso-8859-1"), which turned my freaky-deaky Cyrillic Ks into a much more manageable /\xCA/, which I could then remove with string.gsub!(/\xCA/, '').
Before you use scan, make sure that the requested page's Content-Type header is text/html, since there can be links to things like images which are not encoded in UTF-8. The page could also be non-HTML if you picked up a href in something like a <link> element. How to check this varies depending on which HTTP library you are using. Then, make sure the result is only ASCII by checking String#ascii_only? (not UTF-8, because HTML is only supposed to use ASCII; entities can be used otherwise). If both of those tests pass, it is safe to use scan, as shown below.
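A minimal sketch of those two checks with Net::HTTP (the URL is a placeholder):
require 'net/http'

res = Net::HTTP.get_response(URI('http://example.com/page'))

if res['Content-Type'].to_s.start_with?('text/html') && res.body.ascii_only?
  links = res.body.scan(/href="(.*?)"/i)
else
  # binary content or non-ASCII bytes: fall back to a real HTML parser
end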
There is also String#scrub (Ruby 2.1+) to filter out invalid bytes:
string.scrub('')
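For example:
s = "r\xE9sum\xE9"    # Latin-1 bytes in a string tagged UTF-8
s.valid_encoding?     # => false
s.scrub('')           # => "rsum"
s.scrub('?')          # => "r?sum?"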
If you don't "care" about the data you can just do something like:
search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"
I just used valid_encoding? to get past it. Mine is a search field, and I was finding the same weirdness over and over, so I used something like the above just to keep the system from breaking. Since I don't control the user experience to auto-validate prior to sending this info (like auto feedback to say "dummy up!"), I can just take it in, strip it out, and return blank results.
