Detect encoding - ruby

I'm getting some string data from the web, and I suspect that it's not always what it says it is. I don't know where the problem is, and I just don't care any more. From day one on this project I've been fighting Ruby string encoding. I really want some way to say: "Here's a string. What is it?", and then use that data to get it to UTF-8 so that it doesn't explode gsub() 2,000 lines down in the depths of my app. I've checked out rchardet, but even though it supposedly works for 1.9 now, it just blows up given any input with multiple bytes... which is not helpful.

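You can't really detect the encoding; you can only make an educated guess.
For applications in most Western languages, the following construct will work. The traditional encoding is usually "ISO-8859-1", and the newer, preferred encoding is UTF-8. Why not simply check whether the bytes are valid UTF-8, and fall back to the old encoding if they are not?
def detect_encoding(str)
  # Calling str.encode("UTF-8") on a string already tagged as UTF-8 is a no-op
  # and does not raise, so test the bytes for validity instead.
  if str.dup.force_encoding("UTF-8").valid_encoding?
    "UTF-8"
  else
    "ISO-8859-1"
  end
end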
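The same fallback idea can also hand back a converted string instead of an encoding name. This is only a sketch: to_utf8 and its fallback parameter are names made up for illustration, and it assumes that anything which is not valid UTF-8 really is ISO-8859-1, which is a guess rather than a detection.

```ruby
# Hypothetical helper: return a UTF-8 copy of +raw+, guessing between
# UTF-8 and a single-byte fallback encoding.
def to_utf8(raw, fallback = Encoding::ISO_8859_1)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  # If the bytes already form valid UTF-8, only the encoding tag needed fixing.
  return candidate if candidate.valid_encoding?

  # Otherwise assume the fallback encoding and transcode to UTF-8.
  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

utf8_bytes   = "caf\xC3\xA9".b  # "café" encoded as UTF-8
latin1_bytes = "caf\xE9".b      # "café" encoded as ISO-8859-1

puts to_utf8(utf8_bytes) == to_utf8(latin1_bytes)
```

Both inputs should come back as the same UTF-8 string. Text in some other legacy encoding would be converted without error but to the wrong characters, so this only fits the "UTF-8 or ISO-8859-1" situation described above.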

It is impossible to tell from a string what encoding it is in. You always need some additional metadata that tells you what the string's encoding is.
If you get the string from the web, that metadata is in the HTTP headers. If the HTTP headers are wrong, there is absolutely nothing that you or Ruby or anyone else can do. You need to file a bug with the webmaster of the site where you got the string from and wait till he fixes it. If you have a Service Level Agreement with the website, file a bug, wait a week, then sue them.
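As a rough sketch of using that metadata, the charset can be pulled out of a Content-Type header value and applied to the raw body. Both helper names below are hypothetical, the regular expression is a simplification of real header parsing, and the UTF-8 default is an assumption for responses that declare no charset.

```ruby
# Hypothetical helper: extract the charset parameter from a Content-Type value.
def charset_from_content_type(header)
  match = header.to_s.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match && match[1]
end

# Hypothetical helper: tag the body with the declared charset, then convert
# it to UTF-8. Unknown charset names fall back to the default.
def decode_body(body, content_type, default = "UTF-8")
  name = charset_from_content_type(content_type) || default
  encoding = begin
    Encoding.find(name)
  rescue ArgumentError
    Encoding.find(default)
  end
  # Replace anything that cannot be converted instead of raising.
  body.dup.force_encoding(encoding).encode("UTF-8", invalid: :replace, undef: :replace)
end

body = "caf\xE9".b  # bytes as they might arrive from the network
puts decode_body(body, "text/html; charset=ISO-8859-1")
```

If the header lies about the charset, this produces wrong characters without complaint, which is exactly the situation the answer above describes.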

Old question, but chardet works on 1.9: http://rubygems.org/gems/chardet

Why not try https://github.com/brianmario/charlock_holmes to detect the most likely encoding, and then use it to convert to UTF-8 as well?
require 'charlock_holmes'
class EncodeParser
  def initialize(text)
    @text = text
  end
  def detected_encoding
    CharlockHolmes::EncodingDetector.detect(@text)[:encoding]
  end
  def convert_to_utf8
    CharlockHolmes::Converter.convert(@text, detected_encoding, "UTF-8")
  end
end
Then just use EncodeParser.new(text).detected_encoding or EncodeParser.new(text).convert_to_utf8

We had some fine experience with ensure_encoding. It actually does the job for us to convert resource files having unknown encoding to UTF-8.
The README will give you some hints which options would be a good fit for your situation.
I have never tried chardet since ensure_encoding did the job just fine for us.
I covered here how we use ensure_encoding.

Try setting these in your environment:
export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
Try adding -EBINARY or -EASCII-8BIT to your ruby command line.
Try adding -Ku or -Kn to your ruby command line.
Could you paste the error message?
Also try this: http://github.com/candlerb/string19/blob/master/string19.rb

Might try reading this: http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

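I know it's an old question, but in modern versions of Ruby it's as simple as str.encoding. You get a return value something like this: #<Encoding:UTF-8>. Keep in mind that this is the encoding the string is tagged with, which is not necessarily the encoding its bytes were written in.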
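A small illustration of what str.encoding does and does not tell you, using a hand-made byte string (the variable names are just for the example):

```ruby
bytes  = "caf\xE9".b                        # ISO-8859-1 bytes, tagged as binary
tagged = bytes.dup.force_encoding("UTF-8")  # relabel without changing any bytes

puts tagged.encoding         # the tag: UTF-8
puts tagged.valid_encoding?  # false, because \xE9 on its own is not valid UTF-8
```

So str.encoding combined with valid_encoding? can rule an encoding out, but neither call identifies what the bytes actually are.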

Related

Strange characters returned after screen scraping using Ruby/Nokogiri?

I'm using Ruby and Nokogiri to scrape data off a client's legacy system.
The text I'm getting contains a trademark symbol. But when I display it on the console or save it to the database, the TM gets converted to a different character.
Diet™ BECOMES Dietâ¢
I'm pretty sure it's just an encoding problem and I'm pretty sure Ruby has an easy way to deal with it, but after several minutes of googling and trying a few obvious options, I'm not any closer.
Thanks in advance!
You have an encoding mismatch, but you haven't told us enough to help you.
Things to check:
What encoding does the server say their page is? It'll be in the HTTP headers returned.
Is the document REALLY encoded as the server says, or are there characters that are not in that codeset?
Typically, you'll get documents as UTF-8, ISO-8859-1 or Win-1252, so try using those values to give Nokogiri a hint. The documentation for Nokogiri::HTML.parse says:
parse(thing, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML, &block)
Where:
encoding is the encoding that should be used when processing the document.
One way to figure out what the server is sending back is:
require 'open-uri'
# URI.open on Ruby 2.5+; on older versions call open directly
URI.open('http://www.example.net') { |io| io.charset }
# => "iso-8859-1"
Warning: What the server sends back is not necessarily what the content really is, so it's only a preliminary hint. The document returned could be anything, and at that point you're on your own to figure out what it is.
Typically we use Nokogiri::HTML('some html to parse'), but you can use:
Nokogiri::HTML('some html to parse', nil, 'UTF-8')
Look at Ruby's Encoding to figure out what the available codesets are:
Encoding.constants
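The Diet™ symptom from the question can be reproduced without any network access, which may help confirm the diagnosis. The sketch below assumes the page was UTF-8 and was read as ISO-8859-1; the variable names are illustrative only.

```ruby
original = "Diet\u2122"  # "Diet™" as correct UTF-8

# Misread the UTF-8 bytes as ISO-8859-1, then convert to UTF-8 for display.
# The three bytes of the trademark sign become three separate characters.
garbled = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Undo the damage: back to the single-byte form, then relabel as UTF-8.
repaired = garbled.encode("ISO-8859-1").force_encoding("UTF-8")

puts garbled
puts repaired == original
```

The middle character of the garbled form is an invisible control character, which is why only two odd characters show up on the console. The repair step only works while the garbled text has not been altered further.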

"invalid byte sequence in UTF-8" in rspec controller response

We encounter this error on some of our newer virtual machines, while other machines remain unaffected. We wonder why, and how to get rid of it.
The two main differences are as follows:
vm_old:
debian squeeze
ruby1.9.2p0
vm_new:
debian wheezy
ruby1.9.2p320 (over rvm)
There are naturally more changes within the VMs, but I don't know which would affect this behavior.
We have a response containing umlauts within some of our controllers (e.g. '{"message": "ü"}') and we have set # encoding: utf-8
Within the spec we test the response against a fixed string with this umlaut
it 'should test something' do
get :some_controller, format: :json
response.status.should == 200
json = ActiveSupport::JSON.decode(response.body)
json["message"].should == 'ü' # breaks on this line
# ... some more tests
end
The substitute for ü seems to be a random 4 digit string.
On occasion this string seems to be valid UTF-8 and can be transferred.
We then have a failed spec instead of the error message in the title, since the random string is not the same as ü.
The spec file itself also has the # encoding: utf-8 on the first line.
We tried playing with the locale or with force_encoding('utf-8')
The question now becomes:
Has someone else encountered a problem like this?
and
How to solve it?
Edit: turns out it is not always starting with P\.
Edit 2:
Testing around showed it is a problem with the json decode.
The controller response is something like "{\"foo\": \"\u00fc\"}", decoding that results in random output where the sequence \u00fc used to be.
for simple reproduction:
bundle exec rails c
> ActiveSupport::JSON.decode(ActiveSupport::JSON.encode({:foo => "ü"}))
rails version is 3.0.4
Edit 3:
Changing the JSON backend to YAML seems to be a valid workaround.
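For comparison, a check like the following may help narrow down whether a given JSON decoder handles the \u00fc escape. It uses Ruby's bundled json library rather than ActiveSupport::JSON, so it is a sketch of the expected behaviour, not a reproduction of the Rails 3.0.4 problem.

```ruby
require "json"

# Single quotes keep the backslash-u sequence literal, as in the controller response.
payload = '{"foo": "\u00fc"}'
decoded = JSON.parse(payload)["foo"]

puts decoded.encoding
puts decoded.valid_encoding?
puts decoded == "\u00FC"
```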
I'm not certain if this will be of help to you, but I figured I'd toss it out there. For me, adding this code:
.encode('UTF-16le', :invalid => :replace, :replace => '').encode('UTF-8')
totally saved me. Essentially, it involves converting your UTF-8 encoding to UTF-16, and then encoding it back to UTF-8. More information is available here.
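A runnable sketch of that round trip, with drop_invalid_bytes as a hypothetical wrapper name and a hand-made broken string as input. Note that the offending bytes are dropped, not repaired.

```ruby
# Hypothetical helper wrapping the UTF-16 round trip quoted above.
def drop_invalid_bytes(str)
  str.encode("UTF-16LE", invalid: :replace, replace: "").encode("UTF-8")
end

# \xFC is "ü" in ISO-8859-1 but is not a valid byte in UTF-8.
dirty = "f\xFCr".dup.force_encoding("UTF-8")
clean = drop_invalid_bytes(dirty)

puts dirty.valid_encoding?
puts clean.valid_encoding?
puts clean
```

The detour through UTF-16 forces Ruby to actually transcode, and therefore inspect, every byte of the string.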

Nokogiri - Encoding Issue - Invalid UTF8 characters

Can someone take a look at this? I think there are invalid UTF-8 characters in the response when making this call.
Nokogiri::HTML(open("http://www.next.co.uk/x502062s2"))
Is there a way around this? And is this the issue? I am writing a new open source screen scraper designed for product information capture (when a site does not supply a feed) before anyone says I am doing something a little shifty :-)
Before passing anything to Nokogiri, you can encode the content of the page, and ignore all invalid UTF characters using Iconv.
I was using it like this:
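require 'iconv' # part of the standard library up to Ruby 1.9; a separate gem since 2.0
require 'open-uri'
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(open('http://example.com').read)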
You can also check "Fixing invalid UTF-8 in Ruby, revisited."
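On Ruby 2.1 or later, String#scrub gives a similar result to the //IGNORE conversion without needing Iconv. A minimal sketch, with strip_invalid_utf8 as a made-up helper name and a hard-coded byte string standing in for the downloaded page:

```ruby
# Hypothetical helper: treat the bytes as UTF-8 and delete whatever is invalid.
def strip_invalid_utf8(raw)
  raw.dup.force_encoding(Encoding::UTF_8).scrub("")
end

page = "abc\xFFdef".b  # stand-in for the bytes read from the site
puts strip_invalid_utf8(page)
```

Calling scrub without an argument keeps a U+FFFD replacement character in place of each bad sequence, which may be preferable when you want to see where data was lost.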

Ruby hexacode to unicode conversion

I crawled a website which contains Unicode, and the results look something like this in code:
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
How do I convert it back to the original Unicode text, in UTF-8, in Ruby?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
short answer: you should be able to 'puts a', and see the string printed out. for me, at least, I can print out that string in both 1.8.7 and 1.9.2
long answer:
First thing: it depends on if you're using ruby 1.8.7, or 1.9.2, since the way strings and encodings were handled changed.
in 1.8.7:
strings are just lists of bytes. when you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. if you do a[0], you'll get the first byte. if you want to get each character, things are pretty darn tricky.
in 1.9.2
strings are lists of bytes, with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. if not, you'll have to set it (as per Mike Lewis's answer). if you do a[0], you'll get the first character (the heart). if you want each byte, you can do a.bytes.
If your OS, for whatever reason, is giving you those literal ASCII characters, my previous answer is obviously invalid, disregard it. :P
Here's what you can do:
a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
This will scan for the ASCII string '\u' followed by four hexadecimal digits, and replace it with the corresponding Unicode character.
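Wrapped up as a runnable sketch using the string from the question. The helper name unescape_unicode is made up, and the sketch assumes exactly four hex digits per escape, so characters written as surrogate pairs are not handled.

```ruby
# Hypothetical helper: turn literal "\uXXXX" sequences into real characters.
def unescape_unicode(text)
  text.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
end

a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
converted = unescape_unicode(a)

puts converted
puts converted.encoding
```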
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.
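A sketch of the IO-level approach on Ruby 1.9 or later: the mode string names the encoding of the data on disk and the encoding the strings should be converted to. The helper name read_as_utf8 is hypothetical, and the source encoding has to be known or assumed by the caller.

```ruby
require "tempfile"

# Hypothetical helper: read a file stored in +source_encoding+ and hand back
# UTF-8 strings. "r:EXTERNAL:INTERNAL" makes the IO object do the transcoding.
def read_as_utf8(path, source_encoding)
  File.open(path, "r:#{source_encoding}:UTF-8") { |io| io.read }
end

Tempfile.create("latin1") do |file|
  file.binmode
  file.write("caf\xE9".b)  # ISO-8859-1 bytes on disk
  file.flush

  text = read_as_utf8(file.path, "ISO-8859-1")
  puts text.encoding
  puts text
end
```

Every string read through that IO object arrives already converted, so nothing downstream has to remember to call force_encoding.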

What options do exist now to implement UTF8 in Ruby and RoR?

Following the development of Ruby very closely I learned that detailed character encoding is implemented in Ruby 1.9. My question for now is: How may Ruby be used at the moment to talk to a database that stores all data in UTF8?
Background: I am involved in a new project where Ruby/RoR is at least an option. But the project needs to rely on an internationalized character set (it's spread over many countries), preferably UTF8.
So how do you deal with that? Thanks in advance.
Ruby 1.8 works fine with UTF-8 strings for basic operations with the strings. Depending on your application's need, some operations will either not work or not work as expected.
Eg:
1) The size of strings will give you bytes, not characters, since the multi-byte support is not there yet. But do you need to know the size of your strings in characters?
2) No splitting a string at a character boundary. But do you need this? Etc.
3) Sorting order will be funky if sorted in Ruby. The suggestion of using the db to sort is a good idea.
etc.
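For contrast, on Ruby 1.9 and later the byte count and the character count are separate methods, and slicing works on characters. A short sketch with an arbitrary example word:

```ruby
word = "est\u00E1s"  # "estás": five characters, one of them two bytes long

puts word.bytesize  # 6
puts word.length    # 5
puts word[0, 4]     # the first four characters

# Cutting at a byte offset can still split a character in half.
puts word.byteslice(0, 4).valid_encoding?  # false
```

On 1.8, size behaves like bytesize here, which is the limitation the list above refers to.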
Re poster's comment about sorting data after reading from db: As noted, results will probably not match users' expectations. So the solution is to sort on the db. And it will usually be faster, anyhow--databases are designed to sort data.
Summary: My Ruby 1.8.6 RoR app works fine with international Unicode characters processed and stored as UTF-8 on modern browsers. Right to left languages work fine too. Main issues: be sure that your db and all web pages are set to use UTF-8. If you already have some data in your db, then you'll need to go through a conversion process to change it to UTF-8.
Regards,
Larry
"Unicode ahoy! While Rails has always been able to store and display unicode with no beef, it’s been a little more complicated to truncate, reverse, or get the exact length of a UTF-8 string. You needed to fool around with KCODE yourself and while plenty of people made it work, it wasn’t as plug’n’play easy as you could have hoped (or perhaps even expected).
So since Ruby won’t be multibyte-aware until this time next year, Rails 1.2 introduces ActiveSupport::Multibyte for working with Unicode strings. Call the chars method on your string to start working with characters instead of bytes."
Although I haven't tested it, the character-encodings library (currently in alpha) adds methods to the String class to handle UTF-8 and others. Its page on RubyForge is here. It is designed for Ruby 1.8.
It is my experience, however, that, using Ruby 1.8, if you store data in your database as UTF-8, Ruby will not get in the way as long as your character encoding in the HTTP header is UTF-8. It may not be able to operate on the strings, but it won't break anything. Example:
file.txt:
¡Hola! ¿Como estás? Leí el artículo. ¡Fue muy excellente!
Pardon my poor Spanish; it was the best example of Unicode I could come up with.
in irb:
str = File.read("file.txt")
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\n"
str += "Foo is equal to bar."
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
str = " " + str + " "
=> " \302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar. "
str.strip
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
Basically, it will just treat the UTF-8 as ASCII with odd characters in it. It will not sort lexicographically if the code points are out of order; however, it will sort by code point, because strings are compared byte by byte and UTF-8 byte order matches code point order. Example:
"\302" <=> "\301"
=> 1
How much are you planning on operating on the data in the Rails app, anyway? Most sorting etc. is usually done by your database engine.
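A quick sketch of what code point ordering means in practice, with made-up sample words. The comparison is byte-wise in current Ruby versions as well, so the result is the same there.

```ruby
words = ["zebra", "\u00E9clair", "apple"]  # the middle word is "éclair"

# Every byte of "é" is larger than the byte for "z", so the accented word
# ends up last instead of between "apple" and "zebra".
puts words.sort.inspect
```

This is the kind of result that will not match users' expectations, and a reason to let the database sort with a suitable collation.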

Resources