How to read only English characters - ruby

I am reading a file that sometimes has Chinese and characters of languages other than English.
How can I write a regex that only reads English words/letters?
Should it just be /^[a-zA-Z]+/?
If I do the above, then words like "eété" will still be picked but I don't want that:
"été".match(/^[a-zA-Z]+/) => #nil good I didn't want that word
"eété".match(/^[a-zA-Z]+/) => #not nil tricked into picking something I did not want

The only truly English letter that comes to mind is wynn ƿ.
One could make an argument for eth ð and thorn þ, but it would be a much weaker argument than can be made for wynn.
Other than those, English typically uses the Latin alphabet, albeit with certain modifications. Wynn possibly excepted, there is no such thing as an English letter, only a Latin letter.
There certainly exist regular expressions that demand that base characters be in the Latin or Common scripts, such as for example
(?:[\p{Script=Latin}\p{Script=Common}]\pM*+)+
but as you haven’t specified whether you’re using a 7- or 8-bit version of Ruby or a 21-bit version, I’m not sure what to tell you.

You need $, which means end of line:
/^[a-zA-Z]+$/
Or if you use such filtration:
strings.select { |s| /^[a-zA-Z]+$/ =~ s }
# which is equal to strings.grep /^[a-zA-Z]+$/
you can use negative filtration method with slightly simplier regex:
strings.reject { |s| /[^a-zA-Z]/ =~ s }
where [^a-zA-Z] means any non-english character.

Sometimes it's useful to use the Iconv library to deal with non-ASCII:
require 'iconv'
utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8") # !> encoding option isn't portable: TRANSLIT//IGNORE
utf8_to_ascii_translit = Iconv.new("ASCII//TRANSLIT", "UTF8") # !> encoding option isn't portable: TRANSLIT
utf8_to_ascii_ignore = Iconv.new("ASCII//IGNORE", "UTF8") # !> encoding option isn't portable: IGNORE
resume = "Résumé"
utf8_to_latin1.iconv(resume) # => "R\xE9sum\xE9"
utf8_to_ascii_translit.iconv(resume) # => "R'esum'e"
utf8_to_ascii_ignore.iconv(resume) # => "Rsum"
Notice that Ruby is warning that the option choices are not portable. That means there might be some damage to the string being processed; The "//TRANSLIT" and "//IGNORE" options can degrade the string but for our purpose it's OK.
James Gray wrote a nice article about Encoding Conversion With iconv, which is useful for understanding what Iconv can do, along with dealing with UTF-8 and Unicode characters.

Related

How do I filter out invisible characters without affecting Japanese character set?

I noticed that some of my input is getting U+2028. I don't know what this is, but how can I prevent this with consideration of UTF-8 and English/Japanese characters?
The character U+2028 is LINE SEPARATOR and is one of space characters.
To select only the Japanese characters is (I am afraid) quite tricky in the Unicode space, because CJK characters spread all over across so many planes, even though Ruby supports an extensive Unicode category format in Regexp like \p{Hiragana}. However, if your only interest is Japanese and ASCII, the NKF library is useful. Here is an example:
require 'nkf'
orig = "b2αÇ()あ相〜\u2028\u3000_━●★】"
p orig
p NKF.nkf('-w -E', NKF.nkf('-e', orig))
# =>
# "b2αÇ()あ相〜\u2028 _━●★】"
# "b2α()あ相〜 _━●★】"
As you see, the unicode character U+2028 is filtered out, whereas a Greek character "α" is preserved because it is included in the Japanese JIS-X-0208 code. Note the accented alphabets like "Ç" are filtered out, because they are not included. The set of so-called hankaku-kana is filtered out (Edited-from) converted into zenkaku-kana (Edited-to) in this formula. The JIS-X-0212 character set is not supported, either.
A solution for your specific case.
I have come up with other solutions (for Ruby 2) in addition to the solution with the NKF library. The comparison as described below is in a way interesting, as they are slightly different from one another. This is a major revision, and so I am posting it as a separate answer. I am also describing the background about this at the end of this post.
I am assuming the original input is in UTF-8 encoding except for the first section (if not, convert it to the UTF-8 first to apply any of the examples).
Solutions to filtering out illegitimate characters
"illegitimate" means the character code that is not included in the encoding defined for a String instance.
In Ruby 2, such String usually should have the encoding ASCII-8BIT. However, some may wrongly have UTF-8 encoding.
If it has the encoding ASCII-8BIT, but if you want to get a legitimate UTF-8 String,
s1 = String.new("あ\x99", encoding: 'ASCII-8BIT') # An example ASCII-8BIT
# => "\xE3\x81\x82\x99"
s1.encoding # => #<Encoding:ASCII-8BIT>
s1.valid_encoding? # => true because 'ASCII-8BIT' accepts anything.
s1.force_encoding('UTF-8')
# => s1=="あ\x99"
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
If it has wrongly the encoding UTF-8, and if you want to filter out the illegitimate codepoints,
s1 = String.new("あ\x99", encoding: 'UTF-8') # An example 'UTF-8'
# => "あ\x99"
s1.encoding # => #<Encoding:UTF-8>
s1.valid_encoding? # => false
s2 = s1.encode('UTF-8', invalid: :replace, replace: '')
# => "あ"
s2.valid_encoding? # => true
Solutions to filtering out "non-Japanese" characters
All the following methods are to filter out "non-Japanese" characters.
Basically, "non-Japanese" characters are those that are not included in one or more of the traditional standard of the Japanese character set.
See the next section for the detailed background of the definition of the "non-Japanese" characters.
The strategy here is to convert the encoding of the original String to a Japanese JIS encoding (referred to as ISO-2022-JP or EUC-JP; basically JIS-X-0208) and to convert back to UTF-8.
Use String#encode
Ruby-2 built-in String#encode does the exact job.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
print "Orig:"; p orig
print "Enc: "; p orig.encode('ISO-2022-JP', undef: :replace, replace: '').encode('UTF-8')
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": filtered out
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: filtered out
Circled Digit: filtered out
Unicode Line Separator: filtered out
Use NKF library
The NKF library is one of the standard libraries that come with the official Ruby release.
The library is traditional and has been used for decades; cf., NKF stands for Network Kanji Filter.
It does a very similar, though slightly different, job to/from Ruby Encoding.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'nkf'
print "NKF: "; p NKF.nkf('-w -E', NKF.nkf('-e', orig))
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": converted into "zenkaku" (aka full-width)
Euro-sign: filtered out
Latin1: filtered out
JISX0212: filtered out
CJK Compatibility: preserved
Circled Digit: preserved
Unicode Line Separator: filtered out
Use iconv Gem
Ruby Gem iconv does not come with the standard Ruby anymore (I think it used to, up to Ruby 2.1 or something). But you can easily install it with the gem command like gem install iconv .
It can handle ISO-2022-JP-2, unlike the above-mentioned 2 methods, which may be handy (n.b., the encoding ISO-2022-JP-2 is actually defined in Ruby Encoding, but no conversion is defiend for or from it in Ruby in default). Once installed, the following is an example.
orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
require 'iconv'
output = ''
Iconv.open('iso-2022-jp-2', 'utf-8') do |cd|
cd.discard_ilseq=true
output = cd.iconv orig << cd.iconv(nil)
end
s2 = Iconv.conv('utf-8', 'iso-2022-jp-2', output)
print "Icon:"; p s2
Characteristics
"zenkaku-alnum": preserved
"hankaku-kana": preserved
Euro-sign: preserved
Latin1: preserved
JISX0212: preserved
CJK Compatibility: filtered out
Circled Digit: preserved
Unicode Line Separator: filtered out
Summary
Here are the outputs of the above-mentioned three methods:
Orig:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡㌔③\u2028ハンカク"
Enc: "b2◇〒α()あ相〜 _8D━●★】$£"
NKF: "b2◇〒α()あ相〜 _8D━●★】$£㌔③ハンカク"
Icon:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡③ハンカク"
All the code snippets above here are available as a gist in Github for convenience — download or git clone and run it.
Background
What is an invalid character? The character U+2028, for example as in the question, is a legitimate UTF-8 character (Line Separator). So, there is no general reason to filter such characters out, though some individual situations may require to.
What is an English character? The lower- and upper-case alphabets (52 in total) probably are. Then, how about the dollar sign ($)? Pound sign (£)? Euro sign (€)? The dollar sign is an ASCII character, whereas neither of the pound and Euro signs is not. The pound sign is included in the traditional Latin-1 (ISO-8859-1) character set, whereas the Euro sign is not. As such, what is an English character is not a trivial question.
You may define ASCII (or Latin-1, or whatever) is the only English character set in your definition, but it is somewhat arbitrary.
What is a Japanese character? OK, Hiragana and Katakana are unique to Japanese. How about Kanji? Do you accept simplified Chinese characters, which are not used in Japan, as Kanji? How about symbols? OK, a few symbols, such as 。 (U+3002; Ideographic Full Stop) and 「 (U+300c; Left Corner Bracket) are essential punctuations in the Japanese text. But, is there any reason to regard characters like ▼ (Black Down-Pointing Triangle), which has been used widely among Japanese-language computer users for decades, as Japanese specific? Perhaps not. They are just symbols that can be used anywhere in the world. And worse, it is not a clear cut; for example, although it is perhaps fair to argue Postal Mark 〒 is Japanese specific, it is not an essential punctuation like the full stop but just a symbol fairly popularly used in Japan. I would not be surprised if the very similar symbol is actually used elsewhere in the world, unknown to me.
Being similar to the argument of ASCII and Latin-1 for English characters, you could define the traditionally used characters included in the JIS (X 0208) character set are the valid Japanese characters. Again, it is inevitably arbitrary. For example, the Pound sign (£) is included in it, whereas Euro sign is not. The diamond mark ◇ (White Diamond) is included, whereas the heart mark ♡ (White Heart Suit) is not. Or, what about those so-called "zenkaku" (aka full-width) characters, which are just duplications of alphabets and Arabic numerals of 0 to 9 of ASCII?
After all, the Unicode is the unified set of the characters used in the world regardless of the languages (— well, ideally at least, though you may argue the real Unicode is not quite idealistic). In this sense there is no definite answer to filter out non-English or non-Japanese characters. Consequently, the original question about filtering out U+2028 is one of those arbitrary demands coming from some specific situations, even though it can well be a popular demand in fact (and hence my answer).
Only the definitive thing you could do is to filter out illegitimate characters for the chosen character encoding, such as UTF-8, as described in the first section of this answer. The rest is, really, up to each individual's need in their specific situations.
Background of the "Japanese" character sets
The Japanese character set was traditionally defined in the JIS standards in the official term. Specifically, JIS-X-0208 and much less popular JIS-X-0212 (often casually called "補助漢字") are the two standards (n.b., they have their specific details like 1983 and 1990). Unfortunately, in practice, NEC, Microsoft and Apple adopted their own variations (called broadly Shift_JIS or SJIS, though each has their own variation). Due to the popularity of their OSs, they were (and to some extent still are(!)) more widely used in Japan in reality than the strict official ones before the era where the UTF-8 is widely accepted.
Note that all of them accept the ASCII at least. So, it has been always safe to use ASCII in pretty much any situations (excepting some in early 80s or before).
The Unicode is very inclusive, containing pretty much any of the characters that have been defined in any of these character codesets. That means any of the characters that have once stirred hot debate (whether you should not use or you may) can be legitimately used in (any of) the Unicode encoding now – I mean legitimate as far as the character encoding is concerned.
I presume this confused practical situation has lead to the results as shown above that slightly differ from one another, depending which method you use. Pick your favourite, depending on your need!

In Ruby, how to convert special characters like ë,à,é,ä all to e,a,e,a?

I want to convert characters like ë to just plain e. I am looking to convert with regards to language and how people type cities. For example, most people actually type Brasilia when searching for it, instead of Brasília. And when news agencies like Rueters report on Brasília, they usually spell it Brasilia. So again, just looking for any gem (or character encoding math/method is probably better since that answer can be used, for reference, in other languages).
This is just to handle the typical "extended ASCII" character sets. Note: I am working with standard Unicode strings.
You may be looking for I18n#transliterate.
Gem is here, install with gem install i18n.
Example:
irb(main):001:0> require 'i18n'
=> true
irb(main):002:0> I18n.enforce_available_locales = false
=> false
irb(main):003:0> I18n.transliterate("ë,à,é,ä")
=> "e,a,e,a"
Starting with Ruby 2.2, there is String#unicode_normalize to normalize unicode strings. The NFKD form separates character and punctuation:
'ë'.unicode_normalize(:nfkd).chars
#=> ["e", "̈"]
# ^ ^
# char punctuation
Since the character is a valid ASCII codepoint and the punctuation is not, this can be used to remove the latter:
'ë,à,é,ä'.unicode_normalize(:nfkd).encode('ASCII', replace: '')
#=> "e,a,e,a"

Ruby character transliteration

What's the current best way to transliterate characters to 7-bit ASCII in Ruby? Most of questions I've seen on SO are 3 or 4 years old and the solutions don't fully work.
I want a method that will work for a wide range of Latin alphabets and, for example, convert
Your résumé’s a non–encyclopædia
to
Your resume's a non-encyclopaedia
but I cannot find a way that does that, particularly for folding 8-bit ASCII to 7-bit ASCII.
s = "Your r\u00e9sum\u00e9\u2019s a non\u2013encyclop\u00e6dia"
puts Iconv.iconv('ascii//ignore//translit', 'utf-8', s)
# => Your r'esum'e's a non-encyclopaedia
puts s.encode('ascii//ignore//translit', 'utf-8')
# => Encoding::ConverterNotFoundError: code converter not found (UTF-8 to ascii//ignore//translit)
puts s.encode('ascii', 'utf-8')
# Encoding::UndefinedConversionError: U+00E9 from UTF-8 to US-ASCII
puts s.encode('ascii', 'utf-8', invalid: :replace, undef: :replace)
# Your r?sum??s a non?encyclop?dia
puts I18n.transliterate(s)
# Your resume?s a non?encyclopaedia
Since Iconv is deprecated I'd rather not use that if I don't have to, but I'd do it if that is the only thing that works. Obviously I could put in custom 8-bit ASCII to 7-bit ASCII translations, but I'd prefer to use a supported solution that has been thoroughly tested.
The translation is handled fine by International Components for Unicode with its Latin-ASCII translation, but that is only available for Java and C.
UPDATE
What I ended up doing was writing my own character translation routines to take care of punctuation and whitespace, after which I could use I18n.transliterate to do the rest. I'd still prefer finding and using a well-maintained library function to handle the stuff I18n does not.
If you're willing to add a somewhat heavy dependency (unless your already on Rails), ActiveSupport has support (pun not intended) for this:
ActiveSupport::Multibyte::Chars.new("Your r\u00e9sum\u00e9\u2019s not an encyclop\u00e6dia").mb_chars.normalize(:kd).chars.to_a.delete_if {|c| !c.ascii_only?}.join('')
This works for all of the letters. It doesn't handle the apostrophe right yet though.
I guess the removeaccents script is just right what your want.
Maybe UnicodeUtils gem can be useful, but only to remove the accents (not to convert things like æ AFAIK).

Remove all but some special characters

I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
The way I tried to do is, match anything which is not an ascii character (00 - 7F) and not special character I want and replace it with blank.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is ruby 1.8)
Update: I am trying to make the question a little more clear. The string is utf-8 encoded. And I am trying to whitelist the ascii characters plus ó and í and blacklist everything else.
Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
The question is a bit vague. There is not a word about encoding of the string. Also, you want to white-list characters or black list? Which ones?
But you get the idea, decide what you want, and then use proper ranges as colleagues here already proposed. Some examples:
if str = "subscripción gustaría♥" is utf-8
then you can blacklist all char above the range (excl. whitespaces):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
if string is in ISO-8859-1 codepage you can try to match all quirky characters like the "heart" from the beginning of ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem is here with regex, has nothing to do with Ruby. You probably will need to experiment more.
It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')
I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my mac but works on linux.

Ruby 1.9: Regular Expressions with unknown input encoding

Is there an accepted way to deal with regular expressions in Ruby 1.9 for which the encoding of the input is unknown? Let's say my input happens to be UTF-16 encoded:
x = "foo<p>bar</p>baz"
y = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/
x.match(re)
=> #<MatchData "<p>bar</p>" 1:"bar">
y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)
My current approach is to use UTF-8 internally and re-encode (a copy of) the input if necessary:
if y.methods.include?(:encode) # Ruby 1.8 compatibility
if y.encoding.name != 'UTF-8'
y = y.encode('UTF-8')
end
end
y.match(/<p>(.*)<\/p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">
However, this feels a little awkward to me, and I wanted to ask if there's a better way to do it.
As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?
Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.
# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding='ASCII', options=0)
Regexp.new(pattern.encode(encoding),options)
end
# Inside code looping through lines of input.
# The variables 'regex' and 'line_encoding' should be initialized previously, to
# persist across loops.
if line.methods.include?(:encoding) # Ruby 1.8 compatibility
if line.encoding != last_encoding
regex = get_regex('<p>(.*)<\/p>',line.encoding,16) # //u = 00010000 option bit set = 16
last_encoding = line.encoding
end
end
line.match(regex)
In the pathological case (where the input encoding changes every line) this would be just as slow, since you're re-encoding the regex every single time through the loop. But in 99.9% of situations where the encoding is constant for an entire file of hundreds or thousands of lines, this will result in a vast reduction in re-encoding.
Follow the advice of this page: http://gnuu.org/2009/02/02/ruby-19-common-problems-pt-1-encoding/ and add
# encoding: utf-8
to the top of your rb file.

Resources