Convert a unicode string to characters in Ruby? - ruby

I have the following string:
l\u0092issue
My question is how to convert it to utf8 characters ?
I have tried that
1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
=> "l\u0092issue"

You seem to have got your encodings into a bit of a mix up. If you haven’t already, you should first read Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) which provides a good introduction into this type of thing. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.
In this specific case, the string l\u0092issue means that the second character is the character with the unicode codepoint 0x92. This codepoint is PRIVATE USE TWO (see the chart), which basically means this position isn’t used.
However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’, so if this is the missing character the the string would be l’issue, whick looks a lot more likely even though I don’t speak French.
What I suspect has happened is your program has received the string l’issue encoded in CP-1252, but has assumed it was encoded in ISO-8859-1 (ISO-8859-1 and CP-1252 are quite closely related) and re-encoded it to UTF-8 leaving you with the string you now have.
The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
To transform your string to l’issue, you can encode it back to ISO-8859-1, then use force_encoding to tell Ruby the real encoding of CP-1252, and then you can re-encode to UTF-8:
2.1.0 :001 > s = "l\u0092issue"
=> "l\u0092issue"
2.1.0 :002 > s = s.encode('iso-8859-1')
=> "l\x92issue"
2.1.0 :003 > s.force_encoding('cp1252')
=> "l\x92issue"
2.1.0 :004 > s.encode('utf-8')
=> "l’issue"
This is only really a demonstration of what is going on though. The real solution is to make sure you’re handling encodings correctly.

That is encoded as UTF-8 (unless you changed the original string encoding). Ruby is just showing you the escape sequences when you inspect the string (which is why IRB does there). \u0092 is the escape sequence for this character.
Try puts "l\u0092issue" to see the rendered character, if your terminal font supports it.

Related

Why does Ruby Integer method 'chr' use ASCII-8bit, not UTF-8 by default?

According to it's source code https://www.rubydoc.info/stdlib/core/Integer:chr, this method uses ASCII encoding if no arguments provided, and really, it gives different results when called with and without arguments:
irb(main):002:0* 255.chr
=> "\xFF"
irb(main):003:0' 255.chr 'utf-8'
=> "ÿ"
Why does this happen? Isn't Ruby supposed to use UTF-8 everywhere by default? At least all strings seem to be encoded with UTF-8:
irb(main):005:0> "".encoding
=> #<Encoding:UTF-8>
Why does this happen?
For characters from U+0000 to U+007F (127), the vast majority of single-octet and variable-length character encodings agree on the encoding. In particular, they all agree on being strict supersets of ASCII.
In other words: for characters up to and including U+007F, ASCII, the entire ISO8859 family, the entire DOS codepage family, the entire Windows family, as well as UTF-8 are actually identical. So, for characters between U+0000 and U+007F, ASCII is the logical choice:
0.chr.encoding
#=> #<Encoding:US-ASCII>
127.chr.encoding
#=> #<Encoding:US-ASCII>
However, for anything above 127, more or less no two character encodings agree. In fact, the overwhelming majority of characters above 127 don't even exist in the overwhelming majority of characters sets, thus don't have an encoding in the vast majority of character encodings.
In other words: it is practically impossible to find a single default encoding for characters above 127.
Therefore, the encoding that is chosen by Ruby is Encoding::BINARY, which is basically a pseudo-encoding that means "this isn't actually text, this is unstructured unknown binary data". (For hysterical raisins, this encoding is also aliased to ASCII-8BIT, which I find absolutely horrible, because ASCII is 7 bit, period, and anything using the 8th bit is by definition not ASCII.)
128.chr.encoding
#=> #<Encoding:ASCII-8BIT>
255.chr.encoding
#=> #<Encoding:ASCII-8BIT>
Note also that Integer#chr is limited to a single octet, i.e. to a range from 0 to 255, so multi-octet or variable-length encodings are not really required here.
Isn't Ruby supposed to use UTF-8 everywhere by default?
Which encoding are you talking about? Ruby has about a half dozen of them.
For the vast majority of encodings, your statement is incorrect.
the locale encoding is the default encoding of the environment
the filesystem encoding is the encoding that is used for file paths: the value is determined by the file system
the external encoding of an IO object is the encoding that text that this read is assumed to be in and text that is written is transcoded to: the default is the locale encoding
the internal encoding of an IO object is the encoding that Strings that are written to the IO object must be in and that Strings that are read from the IO object are transcoded into: the default is the default internal encoding, whose default value, in turn, is nil, meaning no transcoding occurs
the script encoding is the encoding that a Ruby script is read, and also String literals in the script will inherit this encoding: it is set with a magic comment at the beginning of the script, and the default is UTF-8
So, as you can see, there are many different encodings, and many different defaults, and only one of them is UTF-8. And none of those encodings are actually relevant to your question, because 128.chr is neither a String literal nor an IO object. It is a String object that is created by the Integer#chr method using whatever encoding it sees fit.

How do I filter out invisible characters without affecting Japanese character set?

I noticed that some of my input is getting U+2028. I don't know what this is, but how can I prevent this with consideration of UTF-8 and English/Japanese characters?
The character U+2028 is LINE SEPARATOR and is one of space characters.
To select only the Japanese characters is (I am afraid) quite tricky in the Unicode space, because CJK characters spread all over across so many planes, even though Ruby supports an extensive Unicode category format in Regexp like \p{Hiragana}. However, if your only interest is Japanese and ASCII, the NKF library is useful. Here is an example:
require 'nkf'
orig = "b2αÇ()あ相〜\u2028\u3000_━●★】"
p orig
p NKF.nkf('-w -E', NKF.nkf('-e', orig))
# =>
# "b2αÇ()あ相〜\u2028 _━●★】"
# "b2α()あ相〜 _━●★】"
As you see, the unicode character U+2028 is filtered out, whereas a Greek character "α" is preserved because it is included in the Japanese JIS-X-0208 code. Note the accented alphabets like "Ç" are filtered out, because they are not included. The set of so-called hankaku-kana is filtered out (Edited-from) converted into zenkaku-kana (Edited-to) in this formula. The JIS-X-0212 character set is not supported, either.
A solution for your specific case.
I have come up with other solutions (for Ruby 2) in addition to the solution with the NKF library. The comparison as described below is in a way interesting, as they are slightly different from one another. This is a major revision, and so I am posting it as a separate answer. I am also describing the background about this at the end of this post.
I am assuming the original input is in UTF-8 encoding except for the first section (if not, convert it to the UTF-8 first to apply any of the examples).
Solutions to filtering out illegitimate characters
"illegitimate" means the character code that is not included in the encoding defined for a String instance.
In Ruby 2, such String usually should have the encoding ASCII-8BIT. However, some may wrongly have UTF-8 encoding.
If it has the encoding ASCII-8BIT, but if you want to get a legitimate UTF-8 String,
s1 = String.new("あ\x99", encoding: 'ASCII-8BIT') # An example ASCII-8BIT
# => "\xE3\x81\x82\x99"
s1.encoding # => #<Encoding:ASCII-8BIT>
s1.valid_encoding? # => true because 'ASCII-8BIT' accepts anything.
s1.force_encoding('UTF-8')
# => s1=="あ\x99"
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
If it has wrongly the encoding UTF-8, and if you want to filter out the illegitimate codepoints,
s1 = String.new("あ\x99", encoding: 'UTF-8') # An example 'UTF-8'
# => "あ\x99"
s1.encoding # => #<Encoding:UTF-8>
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
Solutions to filtering out "non-Japanese" characters
All the following methods are to filter out "non-Japanese" characters.
Basically, "non-Japanese" characters are those that are not included in one or more of the traditional standard of the Japanese character set.
See the next section for the detailed background of the definition of the "non-Japanese" characters.
The strategy here is to convert the encoding of the original String to a Japanese JIS encoding (referred to as ISO-2022-JP or EUC-JP; basically JIS-X-0208) and to convert back to UTF-8.
Use String#encode
Ruby-2 built-in String#encode does the exact job.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
print "Orig:"; p orig
print "Enc: "; p orig.encode('ISO-2022-JP', undef: :replace, replace: '').encode('UTF-8')
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": filtered out
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: filtered out
Circled Digit: filtered out
Unicode Line Separator: filtered out
Use NKF library
The NKF library is one of the standard libraries that come with the official Ruby release.
The library is traditional and has been used for decades; cf., NKF stands for Network Kanji Filter.
It does a very similar, though slightly different, job to/from Ruby Encoding.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'nkf'
print "NKF: "; p NKF.nkf('-w -E', NKF.nkf('-e', orig))
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": converted into "zenkaku" (aka full-width)
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: preserved
Circled Digit: preserved
Unicode Line Separator: filtered out
Use iconv Gem
Ruby Gem iconv does not come with the standard Ruby anymore (I think it used to, up to Ruby 2.1 or something). But you can easily install it with the gem command like gem install iconv .
It can handle ISO-2022-JP-2, unlike the above-mentioned 2 methods, which may be handy (n.b., the encoding ISO-2022-JP-2 is actually defined in Ruby Encoding, but no conversion is defiend for or from it in Ruby in default). Once installed, the following is an example.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'iconv'
output = ''
Iconv.open('iso-2022-jp-2', 'utf-8') do |cd|
cd.discard_ilseq=true
output = cd.iconv orig << cd.iconv(nil)
end
s2 = Iconv.conv('utf-8', 'iso-2022-jp-2', output)
print "Icon:"; p s2
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": preserved
Euro-sign: preserved
Latin1: preserved
JISX0212: preserved
CJK Compatibility: filtered out
Circled Digit: preserved
Unicode Line Separator: filtered out
Summary
Here are the outputs of the above-mentioned three methods:
Orig:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡㌔③\u2028ハンカク"
Enc: "b2◇〒α()あ相〜 _8D━●★】$£"
NKF: "b2◇〒α()あ相〜 _8D━●★】$£㌔③ハンカク"
Icon:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡③ハンカク"
All the code snippets above here are available as a gist in Github for convenience — download or git clone and run it.
Background
What is an invalid character? The character U+2028, for example as in the question, is a legitimate UTF-8 character (Line Separator). So, there is no general reason to filter such characters out, though some individual situations may require to.
What is an English character? The lower- and upper-case alphabets (52 in total) probably are. Then, how about the dollar sign ($)? Pound sign (£)? Euro sign (€)? The dollar sign is an ASCII character, whereas neither of the pound and Euro signs is not. The pound sign is included in the traditional Latin-1 (ISO-8859-1) character set, whereas the Euro sign is not. As such, what is an English character is not a trivial question.
You may define ASCII (or Latin-1, or whatever) is the only English character set in your definition, but it is somewhat arbitrary.
What is a Japanese character? OK, Hiragana and Katakana are unique to Japanese. How about Kanji? Do you accept simplified Chinese characters, which are not used in Japan, as Kanji? How about symbols? OK, a few symbols, such as 。 (U+3002; Ideographic Full Stop) and 「 (U+300c; Left Corner Bracket) are essential punctuations in the Japanese text. But, is there any reason to regard characters like ▼ (Black Down-Pointing Triangle), which has been used widely among Japanese-language computer users for decades, as Japanese specific? Perhaps not. They are just symbols that can be used anywhere in the world. And worse, it is not a clear cut; for example, although it is perhaps fair to argue Postal Mark 〒 is Japanese specific, it is not an essential punctuation like the full stop but just a symbol fairly popularly used in Japan. I would not be surprised if the very similar symbol is actually used elsewhere in the world, unknown to me.
Being similar to the argument of ASCII and Latin-1 for English characters, you could define the traditionally used characters included in the JIS (X 0208) character set are the valid Japanese characters. Again, it is inevitably arbitrary. For example, the Pound sign (£) is included in it, whereas Euro sign is not. The diamond mark ◇ (White Diamond) is included, whereas the heart mark ♡ (White Heart Suit) is not. Or, what about those so-called "zenkaku" (aka full-width) characters, which are just duplications of alphabets and Arabic numerals of 0 to 9 of ASCII?
After all, the Unicode is the unified set of the characters used in the world regardless of the languages (— well, ideally at least, though you may argue the real Unicode is not quite idealistic). In this sense there is no definite answer to filter out non-English or non-Japanese characters. Consequently, the original question about filtering out U+2028 is one of those arbitrary demands coming from some specific situations, even though it can well be a popular demand in fact (and hence my answer).
Only the definitive thing you could do is to filter out illegitimate characters for the chosen character encoding, such as UTF-8, as described in the first section of this answer. The rest is, really, up to each individual's need in their specific situations.
Background of the "Japanese" character sets
The Japanese character set was traditionally defined in the JIS standards in the official term. Specifically, JIS-X-0208 and much less popular JIS-X-0212 (often casually called "補助漢字") are the two standards (n.b., they have their specific details like 1983 and 1990). Unfortunately, in practice, NEC, Microsoft and Apple adopted their own variations (called broadly Shift_JIS or SJIS, though each has their own variation). Due to the popularity of their OSs, they were (and to some extent still are(!)) more widely used in Japan in reality than the strict official ones before the era where the UTF-8 is widely accepted.
Note that all of them accept the ASCII at least. So, it has been always safe to use ASCII in pretty much any situations (excepting some in early 80s or before).
The Unicode is very inclusive, containing pretty much any of the characters that have been defined in any of these character codesets. That means any of the characters that have once stirred hot debate (whether you should not use or you may) can be legitimately used in (any of) the Unicode encoding now – I mean legitimate as far as the character encoding is concerned.
I presume this confused practical situation has lead to the results as shown above that slightly differ from one another, depending which method you use. Pick your favourite, depending on your need!

Convert UTF-8 to CP1252 ruby 2.2

How to keep all characters converting from UTF-8 to CP1252 on ruby 2.2
this code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
Give this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application need to be cp1252, but I can't find any way to keep all the characters.
I can't replace this characters, because later I will use this info to read the file from file system.
puts file.encode('cp1252', undef: :replace, replace: '')
> d:/1 descricao.txt
ps: It is a ruby script not a ruby on rails application
UTF-8 covers the entire range of unicode, but CP1252 only includes a subset of them. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252. This is the problem you are facing.
In your example it looks like the string only contains characters that should work in CP1252, but clearly it doesn’t.
The character in the error message, U+0327 is a combining character, and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
One option might be normalisation, which will convert the string into a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.

Ruby character encoding issue with scraped HTML

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join("\n") on an array of strings that have been pulled from some HTML, which causes this error:
./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
In my logs, I can see Café showing up for some of the strings that would be included in the join operation.
Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8 and ruby can't combine them? Do I need to convert or sanitize my strings after parsing them with Nokogiri (into UTF-8)?.
I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier when it called to_s on a string containing Café.
Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?
Edit:
I was doing something similar in Perl recently and used a module called Text::Unidecode and was able to pass my strings to a function that translates any problematic characters e.g. the letter a with an acute to the plain letter a. Is there anything similar for ruby? (This isn't necessarily what I'm aiming for though, if I can keep the a with acute then that's preferable I think.
Edit2:
I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:
[CODE REMOVED]
Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.
Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).
I hate character encoding.
Presumably, Café is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the Café that you're seeing; for example:
> s = 'Café'
=> "Café"
> s.encoding
=> #<Encoding:UTF-8>
> s.force_encoding('iso-8859-1').encode('utf-8')
=> "Café"
So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.
You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.
Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.

Ruby hexacode to unicode conversion

I crawled a website which contains unicode, an the results look something like, if in code
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
May I know how do I do it in Ruby to convert it back to the original Unicode text which is in UTF-8 format?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
short answer: you should be able to 'puts a', and see the string printed out. for me, at least, I can print out that string in both 1.8.7 and 1.9.2
long answer:
First thing: it depends on if you're using ruby 1.8.7, or 1.9.2, since the way strings and encodings were handled changed.
in 1.8.7:
strings are just lists of bytes. when you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. if you do a[0], you'll get the first byte. if you want to get each character, things are pretty darn tricky.
in 1.9.2
strings are lists of bytes, with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. if not, you'll have to set it (as per Mike Lewis's answer). if you do a[0], you'll get the first character (the heart). if you want each byte, you can do a.bytes.
If your OS, for whatever reason, is giving you those literal ascii characters,my previous answer is obviously invalid, disregard it. :P
here's what you can do:
a.gsub(/\\u([a-z0-9]+)/){|p| [$1.to_i(16)].pack("U")}
this will scan for the ascii string '\u' followed by a hexadecimal number, and replace it with the correct unicode character.
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.

Resources