How to tell if a UTF-8 file has Asian characters? - ruby

Question: Is there a simple way to discover whether a given UTF file contains Asian characters? It would be great if it worked with both UTF-8 and UTF-16, and better yet if it were done in Ruby rather than as a generic algorithm.
EDIT:
From the comments I learned about CJK, which is most likely what I'm looking for.
So, is there a way to test whether a UTF file has CJK characters?

This may be reinventing the wheel, but you can use unpack('U*') to get the Unicode codepoints from any string, e.g.
codepoints = '㌂'.unpack('U*')
=> [13058]
Then you can use .any? to test them:
codepoints.any?{|c| overlaps_cjk?(c)}
You can derive the overlaps_cjk? function by collecting all the codepoint blocks you consider "Asian characters" from http://graphemica.com/blocks
For instance:
# Illustrative range only; list the real blocks you need.
CJK_CODEPOINTS = [(13000..13500)]
def overlaps_cjk?(codepoint)
  CJK_CODEPOINTS.any?{|range| range.cover?(codepoint)}
end
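As a rough sketch of the same idea without hand-maintained ranges, Ruby's regular expressions understand Unicode script properties such as \p{Han}. The helper names cjk? and file_has_cjk? are made up for illustration, and the choice of scripts (Han, Hiragana, Katakana, Hangul) is an assumption about what counts as CJK. The file is read as raw bytes and tagged with the encoding the caller names, so UTF-16 works as long as the caller knows which variant it is:

```ruby
# Scripts treated as "CJK" here; adjust the list to your own definition.
CJK_PATTERN = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

# True if the string contains at least one character from those scripts.
# The string is transcoded to UTF-8 first so UTF-16 input can be matched.
def cjk?(text)
  text.encode(Encoding::UTF_8).match?(CJK_PATTERN)
end

# Reads the file as bytes, tags them with the given encoding, then tests.
def file_has_cjk?(path, encoding: 'UTF-8')
  cjk?(File.binread(path).force_encoding(encoding))
end

puts cjk?('abc 漢字')
puts cjk?('résumé')
```

For a UTF-16 file the call would look like file_has_cjk?('some_file.txt', encoding: 'UTF-16LE'), where the file name is only a placeholder.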

Related

How do I filter out invisible characters without affecting Japanese character set?

I noticed that some of my input contains U+2028. I don't know what this is, but how can I filter it out while still handling UTF-8 English and Japanese characters correctly?
The character U+2028 is LINE SEPARATOR, one of the Unicode whitespace characters.
Selecting only the Japanese characters is (I am afraid) quite tricky in Unicode, because CJK characters are spread across many blocks and planes, even though Ruby's Regexp supports an extensive set of Unicode properties such as \p{Hiragana}. However, if your only interest is Japanese and ASCII, the NKF library is useful. Here is an example:
require 'nkf'
orig = "b2αÇ()あ相〜\u2028\u3000_━●★】"
p orig
p NKF.nkf('-w -E', NKF.nkf('-e', orig))
# =>
# "b2αÇ()あ相〜\u2028 _━●★】"
# "b2α()あ相〜 _━●★】"
As you can see, the Unicode character U+2028 is filtered out, whereas the Greek character "α" is preserved because it is included in the Japanese JIS-X-0208 character set. Note that accented letters like "Ç" are filtered out because they are not included. The so-called hankaku-kana are converted into zenkaku-kana in this formula. The JIS-X-0212 character set is not supported, either.
A solution for your specific case.
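If the only goal is to drop U+2028 itself, a narrower sketch than a round trip through a JIS encoding is to delete that code point directly. Including U+2029 (PARAGRAPH SEPARATOR) as well is an assumption, and the helper name is hypothetical:

```ruby
# U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR
SEPARATORS = /[\u2028\u2029]/

# Removes only those two code points; every other character is untouched.
def strip_unicode_separators(text)
  text.gsub(SEPARATORS, '')
end

p strip_unicode_separators("あ相\u2028abc")
```

This leaves accented letters, the ideographic space U+3000, and everything else alone, which may or may not be what a given application wants.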
I have come up with other solutions (for Ruby 2) in addition to the solution with the NKF library. The comparison as described below is in a way interesting, as they are slightly different from one another. This is a major revision, and so I am posting it as a separate answer. I am also describing the background about this at the end of this post.
I am assuming the original input is in UTF-8 encoding, except in the first section (if it is not, convert it to UTF-8 first before applying any of the examples).
Solutions to filtering out illegitimate characters
"Illegitimate" here means a byte sequence that is not valid in the encoding defined for the String instance.
In Ruby 2, such a String should usually have the encoding ASCII-8BIT. However, some may wrongly have the UTF-8 encoding.
If it has the encoding ASCII-8BIT and you want to get a legitimate UTF-8 String,
s1 = String.new("あ\x99", encoding: 'ASCII-8BIT') # An example ASCII-8BIT
# => "\xE3\x81\x82\x99"
s1.encoding # => #<Encoding:ASCII-8BIT>
s1.valid_encoding? # => true because 'ASCII-8BIT' accepts anything.
s1.force_encoding('UTF-8')
# => s1=="あ\x99"
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
If it wrongly has the encoding UTF-8 and you want to filter out the illegitimate bytes,
s1 = String.new("あ\x99", encoding: 'UTF-8') # An example 'UTF-8'
# => "あ\x99"
s1.encoding # => #<Encoding:UTF-8>
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
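As an alternative sketch, Ruby 2.1 and later also provide String#scrub, which replaces or removes invalid byte sequences without a transcoding step. The variable names are illustrative only:

```ruby
broken = "あ\x99"            # tagged UTF-8, but \x99 is not valid on its own
p broken.valid_encoding?

# With an argument, scrub uses it as the replacement; '' simply drops the bytes.
# Without an argument it would substitute U+FFFD instead.
cleaned = broken.scrub('')
p cleaned
p cleaned.valid_encoding?
```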
Solutions to filtering out "non-Japanese" characters
All the following methods are to filter out "non-Japanese" characters.
Basically, "non-Japanese" characters are those that are not included in one or more of the traditional standards for the Japanese character set.
See the next section for the detailed background of the definition of the "non-Japanese" characters.
The strategy here is to convert the encoding of the original String to a Japanese JIS encoding (referred to as ISO-2022-JP or EUC-JP; basically JIS-X-0208) and to convert back to UTF-8.
Use String#encode
Ruby-2 built-in String#encode does the exact job.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
print "Orig:"; p orig
print "Enc: "; p orig.encode('ISO-2022-JP', undef: :replace, replace: '').encode('UTF-8')
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": filtered out
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: filtered out
Circled Digit: filtered out
Unicode Line Separator: filtered out
Use NKF library
The NKF library is one of the standard libraries that come with the official Ruby release.
The library is traditional and has been used for decades; cf., NKF stands for Network Kanji Filter.
It does a job very similar to, though slightly different from, Ruby's built-in encoding conversion.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'nkf'
print "NKF: "; p NKF.nkf('-w -E', NKF.nkf('-e', orig))
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": converted into "zenkaku" (aka full-width)
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: preserved
Circled Digit: preserved
Unicode Line Separator: filtered out
Use iconv Gem
The iconv gem no longer ships with standard Ruby (Iconv was removed from the standard library in Ruby 2.0), but you can easily install it with gem install iconv.
It can handle ISO-2022-JP-2, unlike the two methods above, which may be handy (n.b., the encoding ISO-2022-JP-2 is defined in Ruby's Encoding, but by default Ruby defines no conversion to or from it). Once it is installed, the following works as an example.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'iconv'
output = ''
Iconv.open('iso-2022-jp-2', 'utf-8') do |cd|
  cd.discard_ilseq = true
  # Convert the input, then flush the converter so any final escape sequence is emitted.
  output = cd.iconv(orig) << cd.iconv(nil)
end
s2 = Iconv.conv('utf-8', 'iso-2022-jp-2', output)
print "Icon:"; p s2
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": preserved
Euro-sign: preserved
Latin1: preserved
JISX0212: preserved
CJK Compatibility: filtered out
Circled Digit: preserved
Unicode Line Separator: filtered out
Summary
Here are the outputs of the above-mentioned three methods:
Orig:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡㌔③\u2028ハンカク"
Enc: "b2◇〒α()あ相〜 _8D━●★】$£"
NKF: "b2◇〒α()あ相〜 _8D━●★】$£㌔③ハンカク"
Icon:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡③ハンカク"
All the code snippets above here are available as a gist in Github for convenience — download or git clone and run it.
Background
What is an invalid character? The character U+2028, as in the question, is a legitimate Unicode character (Line Separator) and can be encoded in UTF-8. So there is no general reason to filter such characters out, though some individual situations may require it.
What is an English character? The lower- and upper-case letters (52 in total) probably are. Then, how about the dollar sign ($)? The pound sign (£)? The Euro sign (€)? The dollar sign is an ASCII character, whereas neither the pound sign nor the Euro sign is. The pound sign is included in the traditional Latin-1 (ISO-8859-1) character set, whereas the Euro sign is not. As such, what counts as an English character is not a trivial question.
You may decide that ASCII (or Latin-1, or whatever) is the only English character set in your definition, but that is somewhat arbitrary.
What is a Japanese character? OK, Hiragana and Katakana are unique to Japanese. How about Kanji? Do you accept simplified Chinese characters, which are not used in Japan, as Kanji? How about symbols? A few symbols, such as 。 (U+3002; Ideographic Full Stop) and 「 (U+300C; Left Corner Bracket), are essential punctuation in Japanese text. But is there any reason to regard characters like ▼ (Black Down-Pointing Triangle), which has been used widely among Japanese-language computer users for decades, as Japanese-specific? Perhaps not. They are just symbols that can be used anywhere in the world. Worse, the line is not clear-cut; for example, although it is perhaps fair to argue that the Postal Mark 〒 is Japanese-specific, it is not essential punctuation like the full stop but just a symbol that is fairly popular in Japan. I would not be surprised if a very similar symbol is actually used elsewhere in the world, unknown to me.
Similarly to the argument about ASCII and Latin-1 for English characters, you could decide that the traditionally used characters included in the JIS (X 0208) character set are the valid Japanese characters. Again, this is inevitably arbitrary. For example, the pound sign (£) is included in it, whereas the Euro sign is not. The diamond mark ◇ (White Diamond) is included, whereas the heart mark ♡ (White Heart Suit) is not. And what about the so-called "zenkaku" (aka full-width) characters, which are just duplicates of the ASCII letters and the Arabic numerals 0 to 9?
After all, Unicode is the unified set of the characters used in the world regardless of language (well, ideally at least, though you may argue that the real Unicode falls short of the ideal). In this sense there is no definitive way to filter out non-English or non-Japanese characters. Consequently, the original question about filtering out U+2028 is one of those arbitrary demands that come from specific situations, even though it may well be a popular demand (hence my answer).
The only definitive thing you can do is to filter out illegitimate characters for the chosen character encoding, such as UTF-8, as described in the first section of this answer. The rest is really up to each individual's needs in their specific situation.
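One way to make such an arbitrary choice explicit is a whitelist built from Unicode script properties, as mentioned earlier with \p{Hiragana}. The following is only a sketch under assumed rules: the set of allowed characters (printable ASCII, newline, kana, kanji, and a handful of ideographic punctuation marks) and the helper name are inventions for illustration, not a standard definition of "Japanese".

```ruby
# Characters to keep: printable ASCII, newline, Hiragana, Katakana, Han,
# the middle dot and prolonged sound mark (U+30FB, U+30FC, which Unicode
# classifies as Common rather than Katakana), and some CJK punctuation.
NOT_ALLOWED = /[^\x20-\x7E\n\p{Hiragana}\p{Katakana}\p{Han}\u30FB\u30FC\u3000-\u3004\u3008-\u3011]/

# Deletes every character that is not on the whitelist above.
def keep_japanese_and_ascii(text)
  text.gsub(NOT_ALLOWED, '')
end

p keep_japanese_and_ascii("b2αÇあ相\u2028カー。")
# => "b2あ相カー。"
```

Script properties follow Unicode's classification rather than JIS, so the results will not match the three conversion-based methods exactly; for instance \p{Han} also accepts ideographs that are not used in Japan.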
Background of the "Japanese" character sets
The Japanese character set was traditionally and officially defined in the JIS standards. Specifically, JIS-X-0208 and the much less popular JIS-X-0212 (often casually called "補助漢字") are the two standards (n.b., each has specific revisions, such as 1983 and 1990). Unfortunately, in practice, NEC, Microsoft and Apple adopted their own variations (broadly called Shift_JIS or SJIS, though each vendor has its own variant). Due to the popularity of their operating systems, these variants were (and to some extent still are!) more widely used in Japan than the strict official standards, before the era in which UTF-8 became widely accepted.
Note that all of them accept ASCII at least. So it has always been safe to use ASCII in pretty much any situation (except for some in the early 80s or before).
Unicode is very inclusive, containing pretty much every character that has been defined in any of these character sets. That means any of the characters that once stirred hot debate (over whether you may use them or not) can now be legitimately used in any of the Unicode encodings. I mean legitimate as far as the character encoding is concerned.
I presume this confused practical situation has led to the results shown above, which differ slightly from one another depending on which method you use. Pick your favourite, depending on your need!

How to use internal/external encoding when importing a YAML file?

How can I load a YAML file regardless of its encoding?
My YAML file can be encoded in UTF-8 or ANSI (that's what Notepad++ calls it; I guess it's Windows-1252):
:key1:
:key2: "ä"
utf8.yml is encoded in UTF-8, ansi.yml is encoded in ANSI. I load the files as follows:
# encoding: utf-8
Encoding.default_internal = "utf-8"
utf8_load = YAML::load(File.open('utf8.yml'))
utf8_load_file = YAML::load_file('utf8.yml')
ansi_load = YAML::load(File.open('ansi.yml'))
ansi_load_file = YAML::load_file('ansi.yml')
It seems like Ruby doesn't recognize the encoding correctly:
utf8_load [:key1][:key2].encoding #=> "UTF-8"
utf8_load_file [:key1][:key2].encoding #=> "UTF-8"
ansi_load [:key1][:key2].encoding #=> "UTF-8"
ansi_load_file [:key1][:key2].encoding #=> "UTF-8"
because the bytes aren't the same:
utf8_load [:key1][:key2].bytes #=> [195, 164]
utf8_load_file [:key1][:key2].bytes #=> [195, 164]
ansi_load [:key1][:key2].bytes #=> [239, 191, 189]
ansi_load_file [:key1][:key2].bytes #=> [239, 191, 189]
If I omit Encoding.default_internal = "utf-8", the bytes are also different:
utf8_load [:key1][:key2].bytes #=> [195, 131, 194, 164]
utf8_load_file [:key1][:key2].bytes #=> [195, 164]
ansi_load [:key1][:key2].bytes #=> [195, 164]
ansi_load_file [:key1][:key2].bytes #=> [239, 191, 189]
What happens actually when I don't set the default_internal to utf-8?
Which encodings do the strings in both examples have?
How can I load a file even if I don't know its encoding?
I believe officially YAML only supports UTF-8 (and maybe UTF-16). There have historically been all sorts of encoding confusions in YAML libraries. I think you are going to run into trouble trying to have YAML in something other than a Unicode encoding.
What happens actually when I don't set the default_internal to utf-8?
Encoding.default_internal controls the encoding your input is converted to when it is read in, at least by those operations that respect Encoding.default_internal (not everything does). Rails seems to set it to UTF-8, so even if you don't set Encoding.default_internal yourself, it might already be UTF-8.
If Encoding.default_internal is nil, the operations that respect it do no conversion on read: they leave the input in the encoding it was believed to originate in.
If you set it to something else, say "WINDOWS-1252", Ruby would automatically convert your data to WINDOWS-1252 when reading it with File.open, which could confuse YAML::load when you pass it a string that is now encoded and tagged as WINDOWS-1252. Generally there's no good reason to do this, so leave Encoding.default_internal alone.
Note: The Ruby docs say:
"You should not set ::default_internal in Ruby code as strings created before changing the value may have a different encoding from strings created after the change. Instead you should use ruby -E to invoke Ruby with the correct default_internal."
See also: http://ruby-doc.org/core-1.9.3/Encoding.html#method-c-default_internal
Which encodings do the strings in both examples have?
I don't really know. One would have to look at the bytes and try to figure out whether they are legal bytes for various plausible encodings and, beyond being legal, whether they mean something likely to be intended.
For example take: "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". That's a perfectly legal UTF-8 string, but as humans we know it's probably not intended, and is probably garbage, quite likely from the result of an encoding misinterpretation. But a computer has no way to know that, it's perfectly legal UTF-8, and, hey, maybe someone actually did mean to write "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ", after all, I just did, when writing this post!
So you can try to interpret the bytes according to various encodings and see if any of them make sense.
You're really just guessing at this point. Which means...
How can I load a file even if I don't know its encoding?
Generally, you cannot. You need to know and keep track of encodings. There's no real way to know what the bytes mean without knowing their encoding.
If you have some legacy data for which you've lost this, you've got to try to figure it out, either manually or with code that tries to guess likely encodings based on heuristics. One Ruby gem, Charlock Holmes, tries to guess using the ICU library's heuristics (this particular gem only works on MRI).
What Ruby says in response to string.encoding is just the encoding the string is tagged with. The string can be tagged with the wrong encoding, so that its bytes don't actually mean what was intended in the encoding it's tagged with, in which case you'll get garbage.
Ruby will do the right thing with your string, instead of creating garbage, only if the string's encoding tag is correct. For most input operations the tag is determined by Encoding.default_external by default (which usually starts out as UTF-8, or as ASCII-8BIT, which really means the null encoding: binary data not tagged with an encoding), or by an argument passed to File.open: File.open("something", "r:UTF-8") or, equivalently, File.open("something", "r", :encoding => "UTF-8"). The actual bytes are whatever is in the file. It's up to you to tell Ruby the correct encoding so that it interprets those bytes as the text they were intended to mean.
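To make the guessing concrete, here is a minimal sketch of one common heuristic: treat the bytes as UTF-8 if they form valid UTF-8, otherwise assume a single fallback encoding. The helper name read_as_utf8 and the Windows-1252 fallback are assumptions based on the question, not a general solution; a file in some third encoding would still come out as garbage.

```ruby
require 'yaml'

# Hypothetical helper: returns the file's text as a UTF-8 string.
# Tries UTF-8 first; if the bytes are not valid UTF-8, assumes `fallback`.
def read_as_utf8(path, fallback: 'Windows-1252')
  bytes = File.binread(path)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Re-tag the raw bytes with the fallback encoding, then transcode.
  bytes.force_encoding(fallback).encode(Encoding::UTF_8)
end

# Usage with the file names from the question:
#   YAML.load(read_as_utf8('utf8.yml'))
#   YAML.load(read_as_utf8('ansi.yml'))
```

When the encoding is known in advance, passing it to File.open with both an external and an internal encoding, as in File.open('ansi.yml', 'r:Windows-1252:UTF-8'), avoids the guess entirely.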
There were a couple posts recently to reddit /r/ruby that try to explain how to troubleshoot and workaround encoding issues that you may find helpful:
http://www.justinweiss.com/articles/how-to-get-from-theyre-to-theyre/
http://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/
Also, this is my favorite article on understanding encoding generally: http://kunststube.net/encoding/
For YAML files in particular, if I were you, I'd just make sure they are all in UTF-8. Life will be much easier and you won't have to worry about it. If you have some legacy ones that have become corrupted, it's going to be a pain to fix them, but that's what you've got to do, unless you can just rewrite them from scratch. Try to fix them to be in valid and correct UTF-8, and from here on out keep all your YAML in UTF-8.
The YAML specification says in "5.1. Character Set":
To ensure readability, YAML streams use only the printable subset of the Unicode character set. The allowed character range explicitly excludes the C0 control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF.
This means that the Windows-1252 or ISO-8859-1 encodings are acceptable as long as the characters being output are within the defined range. Windows users tend to use the "C1 control block #x80-#x9F" range for diacritical and accented characters, so if those are present in a YAML file, the file does not meet the spec and the YAML generator didn't do its job correctly. And that explains why "ä" isn't acceptable.
On output, a YAML processor must only produce acceptable characters. Any excluded characters must be presented using escape sequences. In addition, any allowed characters known to be non-printable should also be escaped. This isn’t mandatory since a full implementation would require extensive character property tables.
These days, by default, Ruby uses UTF-8, however YAML isn't limited to that. The spec goes on to say in "5.2. Character Encodings":
On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported.
If a character stream begins with a byte order mark, the character encoding will be taken to be as indicated by the byte order mark. Otherwise, the stream must begin with an ASCII character. This allows the encoding to be deduced by the pattern of null (#x00) characters.
So, UTF-8, 16 and 32 are supported, but Ruby will assume UTF-8. If the BOM is present you'll see it when you view the file in an editor. I haven't tried loading a UTF-16 or 32 file to see what Ruby's YAML does, so that's left as an experiment.
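For the byte order mark specifically, Ruby's file modes accept a "BOM|" prefix that checks for a BOM and strips it when present. A small sketch, where the helper name is made up for illustration:

```ruby
# Reads a file as UTF-8, removing a leading byte order mark if there is one.
# If the file starts with a UTF-16 or UTF-32 BOM, Ruby uses that encoding
# as the external encoding instead of UTF-8.
def read_without_bom(path)
  File.open(path, 'r:BOM|UTF-8', &:read)
end
```

Whether the YAML library itself handles UTF-16 or UTF-32 input correctly is a separate matter that this sketch does not address.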

Ruby character transliteration

What's the current best way to transliterate characters to 7-bit ASCII in Ruby? Most of the questions I've seen on SO are 3 or 4 years old and the solutions don't fully work.
I want a method that will work for a wide range of Latin alphabets and, for example, convert
Your résumé’s a non–encyclopædia
to
Your resume's a non-encyclopaedia
but I cannot find a way that does that, particularly for folding 8-bit ASCII to 7-bit ASCII.
s = "Your r\u00e9sum\u00e9\u2019s a non\u2013encyclop\u00e6dia"
puts Iconv.iconv('ascii//ignore//translit', 'utf-8', s)
# => Your r'esum'e's a non-encyclopaedia
puts s.encode('ascii//ignore//translit', 'utf-8')
# => Encoding::ConverterNotFoundError: code converter not found (UTF-8 to ascii//ignore//translit)
puts s.encode('ascii', 'utf-8')
# Encoding::UndefinedConversionError: U+00E9 from UTF-8 to US-ASCII
puts s.encode('ascii', 'utf-8', invalid: :replace, undef: :replace)
# Your r?sum??s a non?encyclop?dia
puts I18n.transliterate(s)
# Your resume?s a non?encyclopaedia
Since Iconv is deprecated I'd rather not use that if I don't have to, but I'd do it if that is the only thing that works. Obviously I could put in custom 8-bit ASCII to 7-bit ASCII translations, but I'd prefer to use a supported solution that has been thoroughly tested.
The translation is handled fine by International Components for Unicode with its Latin-ASCII translation, but that is only available for Java and C.
UPDATE
What I ended up doing was writing my own character translation routines to take care of punctuation and whitespace, after which I could use I18n.transliterate to do the rest. I'd still prefer finding and using a well-maintained library function to handle the stuff I18n does not.
If you're willing to add a somewhat heavy dependency (unless you're already on Rails), ActiveSupport has support (pun not intended) for this:
ActiveSupport::Multibyte::Chars.new("Your r\u00e9sum\u00e9\u2019s not an encyclop\u00e6dia").mb_chars.normalize(:kd).chars.to_a.delete_if {|c| !c.ascii_only?}.join('')
This works for all of the letters. It doesn't handle the apostrophe right yet though.
I guess the removeaccents script is just what you want.
Maybe UnicodeUtils gem can be useful, but only to remove the accents (not to convert things like æ AFAIK).
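The approach described in the question's update (hand-written replacements for punctuation, then accent stripping) can also be sketched with core Ruby only, using String#unicode_normalize (Ruby 2.2+) in place of I18n. The replacement table below is a small illustrative sample, not a complete or tested mapping, and the method name to_ascii is hypothetical:

```ruby
# Hand-written replacements for characters that do not decompose into
# an ASCII base letter plus combining accents. Extend as needed.
PUNCTUATION_AND_LIGATURES = {
  "\u2018" => "'",  "\u2019" => "'",   # single quotation marks
  "\u201C" => '"',  "\u201D" => '"',   # double quotation marks
  "\u2013" => '-',  "\u2014" => '-',   # en dash, em dash
  "\u00E6" => 'ae', "\u00C6" => 'AE'   # ae ligature
}.freeze
REPLACEABLE = Regexp.union(PUNCTUATION_AND_LIGATURES.keys)

def to_ascii(text)
  mapped = text.gsub(REPLACEABLE, PUNCTUATION_AND_LIGATURES)
  # NFD splits "é" into "e" plus a combining accent; \p{Mn} removes the accent.
  stripped = mapped.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  # Anything still outside ASCII becomes "?".
  stripped.encode('US-ASCII', undef: :replace, replace: '?')
end

puts to_ascii("Your r\u00e9sum\u00e9\u2019s a non\u2013encyclop\u00e6dia")
# => Your resume's a non-encyclopaedia
```

Characters missing from the table and without a decomposition (ø or ß, for example) end up as "?", so this does not replace a maintained transliteration library.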

Convert unicode codepoint to string character in Ruby

I have these values from a unicode database but I'm not sure how to translate them into the human readable form. What are these even called?
Here they are:
U+2B71F
U+2A52D
U+2A68F
U+2A690
U+2B72F
U+2B4F7
U+2B72B
How can I convert these to their readable symbols?
How about:
# Using pack
puts ["2B71F".hex].pack("U")
# Using chr
puts (0x2B71F).chr(Encoding::UTF_8)
In Ruby 1.9+ you can also do:
puts "\u{2B71F}"
I.e. the \u{} escape sequence can be used to decode Unicode codepoints.
Unicode values like U+2B71F are referred to as code points.
The unicode system defines a unique codepoint for each character in a multitude of world languages, scientific symbols, currencies etc. This character set is steadily growing.
For example, U+221E is infinity.
The codepoints are hexadecimal numbers. There is always exactly one number defined per character.
There are many ways to arrange this in memory. This is known as an encoding of which the common ones are UTF-8 and UTF-16. The conversion to and fro is well defined.
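A short illustration of the difference between a code point and its encodings, using the U+221E example from above:

```ruby
infinity = "\u221E"                     # U+221E INFINITY

p infinity.ord.to_s(16)                 # the code point in hex: "221e"
p infinity.bytes                        # UTF-8 bytes: [226, 136, 158]
p infinity.encode('UTF-16BE').bytes     # UTF-16 big-endian bytes: [34, 30]
```

The code point stays the same in every case; only the byte sequence used to store it changes.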
Here you are most probably looking for converting the unicode codepoint to UTF-8 characters.
codepoint = "U+2B71F"
You need to extract the hex part that comes after U+ to get only 2B71F. This will be the first capture group.
codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
And your UTF-8 character will be:
utf_8_character = [$1.hex].pack("U")
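Put together, the two steps might look like the sketch below, applied to the list from the question. The helper name codepoint_to_char is made up for illustration, and the pattern accepts four to six hex digits, which is slightly looser than the expression above:

```ruby
# Hypothetical helper: "U+2B71F" -> the corresponding character as a UTF-8 string.
def codepoint_to_char(label)
  match = /\AU\+(\h{4,6})\z/.match(label)
  raise ArgumentError, "not a code point label: #{label}" unless match

  [match[1].hex].pack('U')
end

labels = %w[U+2B71F U+2A52D U+2A68F U+2A690 U+2B72F U+2B4F7 U+2B72B]
puts labels.map { |label| codepoint_to_char(label) }.join(' ')
```

Whether the resulting characters display properly depends on the terminal and on having a font that covers those CJK extension blocks.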
References:
Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
Tim Bray on the goodness of unicode.
Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Dissecting the Unicode regular expression

ruby from any encoding to ascii

I mainly have to deal with English letters and all the punctuation marks; I don't have to worry about European accents. So my only concern is when a user pastes something copied from the web that includes, for instance, an apostrophe. When I do a puts in the console (on Win7), it outputs
"ItΓÇÖs" # whereas it actually is " It's "
So my main question is: is there an end-it-all conversion method I can use in Ruby that properly replaces all the ,.;?!"'~` _- with their ASCII counterparts?
I really understand very little about encodings. If you think this is the wrong question to ask, which may very likely be the case, please do advise as to what I should look for instead.
Thank you
I work in publishing where we deal with this a lot. We have had success with stringex https://github.com/rsl/stringex. They have a to_ascii method that normalizes unicode dashes etc.
And in ruby 2.0:
"ItΓÇÖs".encode("ASCII", invalid: :replace, undef: :replace, replace: '')
=> "Its"
For programmatically handling multibyte encodings, iconv is your friend. Also, James Gray wrote a series of blog articles on how to take the problem apart and convert encodings.
The problem gets more complicated when dealing with text that has been pasted in, because some characters could be in one multibyte encoding and other characters could be in another. You might have to walk the string checking for multibyte characters, then ask Ruby what the encoding is and, if it's not what you expect, convert it to the expected or desired encoding, then move on to the next character. Gray's articles cover it all nicely and are good reading.
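As a diagnostic sketch, the "ItΓÇÖs" in the question is what the UTF-8 bytes of a typographic apostrophe (U+2019) look like when displayed in code page 437. That the console was using CP437 is an assumption; the question does not say so, although the three garbled characters match it. Under that assumption the text itself is intact and only the display is wrong:

```ruby
original = "It\u2019s"    # "It's" written with U+2019 RIGHT SINGLE QUOTATION MARK

# Simulate the console: take the UTF-8 bytes and interpret them as CP437.
garbled = original.b.force_encoding('IBM437').encode('UTF-8')
p garbled

# Reverse the mistake: back to CP437 bytes, then re-tag those bytes as UTF-8.
repaired = garbled.encode('IBM437').force_encoding('UTF-8')
p repaired == original
```

If the stored string really is fine, stripping non-ASCII characters would lose the apostrophe rather than fix anything; mapping U+2019 to ' before output is the gentler option.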

Resources