Transliteration in ruby - ruby

What is the simplest way to transliterate non-English characters in Ruby? That is, a conversion such as:
translit "Gévry"
#=> "Gevry"

Ruby has an Iconv library in its stdlib, which converts encodings in much the same way as the usual iconv command.

Use the UnicodeUtils gem. This works in 1.9 and 2.0. Iconv has been deprecated in these releases.
gem install unicode_utils
Then try this in IRB:
2.0.0p0 :001 > require 'unicode_utils' #=> true
2.0.0p0 :002 > r = "Résumé" #=> "Résumé"
2.0.0p0 :003 > r.encoding #=> #<Encoding:UTF-8>
2.0.0p0 :004 > UnicodeUtils.nfkd(r).gsub(/(\p{Letter})\p{Mark}+/,'\\1')
#=> "Resume"
Now an explanation of how this works!
First you have to normalize the string to NFKD (Normalization Form Compatibility Decomposition; the K stands for compatibility). The "é" Unicode character, named "latin small letter e with acute", can be represented in two ways:
é = U+00E9
é = (e = U+0065) + (acute = U+0301)
The first form, a single code point, is the more common one. The second form is the decomposed format, which separates the grapheme (what appears as "é" on your screen) into its two code points: the ASCII "e" and the combining acute accent mark. Unicode can compose a grapheme from many code points, which is useful in some Asian writing systems.
Note that you typically want to normalize your data to one standard form for comparison, sorting, etc. In Ruby the two forms of "é" shown here are NOT equal. In IRB, do this:
> "\u00e9" #=> "é"
> "\u0065\u0301" #=> "é"
> "\u00e9" == "\u0065\u0301" #=> false
> "\u00e9" > "\u0065\u0301" #=> true
> "\u00e9" >= "f" #=> true (composed é > f)
> "\u0065\u0301" > "f" #=> false (decomposed é < f)
> "Résumé".chars.count #=> 6
> decomposed = UnicodeUtils.nfkd("Résumé")
#=> "Résumé"
> decomposed.chars.count #=> 8
> decomposed.length #=> 8
> decomposed.gsub(/(\p{Letter})\p{Mark}+/,'\\1')
#=> "Resume"
Now that we have the string in NFKD format, we can apply a regular expression using the "property name" syntax (\p{property_name}) to match a letter followed by one or more diacritic "marks". By capturing the matching letter, we can use gsub to replace the letter+diacritics by the captured letter throughout the string.
This technique only removes diacritic marks from their base letters; it will not transliterate other scripts, such as Greek or Cyrillic, into equivalent ASCII letters.
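On Ruby 2.2 and later, the same idea can probably be sketched without the gem, assuming the core method String#unicode_normalize is available. The helper name strip_diacritics is hypothetical and used only for illustration.

```ruby
# Minimal sketch, assuming Ruby 2.2+ where String#unicode_normalize is built in.
# strip_diacritics is a hypothetical helper name, not part of any library.
def strip_diacritics(text)
  # Decompose to NFKD, then keep each letter and drop the marks that follow it.
  text.unicode_normalize(:nfkd).gsub(/(\p{Letter})\p{Mark}+/, '\\1')
end

puts strip_diacritics("Gévry")
puts strip_diacritics("Résumé")

# Normalizing the decomposed form to NFC makes the two spellings of "é" equal.
composed   = "\u00e9"
decomposed = "\u0065\u0301"
puts composed == decomposed
puts composed == decomposed.unicode_normalize(:nfc)
```

As with the gem version, letters from other scripts keep their base letter and are not converted to ASCII.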

Try taking a look at this script from TechniConseils which replaces accented characters in a string. Example of usage:
"Gévry".removeaccents #=> Gevry

Related

How to count instances of any Unicode letter in my string

Using Ruby 2.4, how do I count the number of instances of a Unicode letter in my string? I'm trying:
2.4.0 :009 > string = "a"
=> "a"
2.4.0 :010 > string.count('\p{L}')
=> 0
but it's displaying 0, and it should be returning 1.
I want to use the above expression rather than "a-z" because "a-z" won't cover things like accented e's.
You could try using String#scan, passing your \p{L} regex, and then chain the count method:
string = "aá"
p string.scan(/\p{L}/).count
# 2
This is a way that does not create a temporary array.
str = "Même temps l'année prochaine."
str.size - str.gsub(/[[:alpha:]]/, '').size
#=> 24
The POSIX bracket expression [[:alpha:]] is the same as \p{Alpha}; it matches every \p{L} letter (and, strictly speaking, a few other alphabetic characters as well).
Note that
str.gsub(/[[:alpha:]]/, '')
#=> "  ' ."
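A further variant, sketched here under the assumption of Ruby 2.4+ (for String#match?), walks the characters one at a time and so builds neither a temporary array of matches nor a stripped copy of the string.

```ruby
# Minimal sketch, assuming Ruby 2.4+ for String#match?.
str = "Même temps l'année prochaine."

# each_char without a block returns an enumerator; count tallies the letters.
letters = str.each_char.count { |ch| ch.match?(/\p{L}/) }
puts letters
```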

Why is RegExp.escape not working in my Ruby expression?

I'm using Ruby 2.4. I have some strings that contain characters that have special meaning in regular expressions. To eliminate any possibility of those characters being interpreted as regexp characters, I use Regexp.escape to escape them. However, I still can't make the regular expression below work:
2.4.0 :005 > tokens = ["a", "b?", "c"]
=> ["a", "b?", "c"]
2.4.0 :006 > line = "1\ta\tb?\tc\t3"
=> "1\ta\tb?\tc\t3"
2.4.0 :009 > /#{Regexp.escape(tokens.join(" ")).gsub(" ", "\\s+")}/.match(line)
=> nil
How can I properly escape the characters before substituting the space with a "\s+" expression, which I do want interpreted as a regexp character?
When Regexp.escape(tokens.join(" ")).gsub(" ", "\\s+") is executed, tokens.join(" ") yields a b? c, then the string is escaped to a\ b\?\ c (the spaces are escaped too), and then the gsub runs, resulting in a\\s+b\?\\s+c. Meanwhile, line is 1 a b? c 3, with tabs between the fields. So each \\ now matches a literal backslash, and the s+ after it no longer forms the special regex escape that matches whitespace.
You need to escape the tokens, and join with \s+, or join with space and later replace the space with \s+:
/#{tokens.map { |n| Regexp.escape(n) }.join("\\s+")}/.match(line)
OR
/#{tokens.map { |n| Regexp.escape(n) }.join(" ").gsub(" ", "\\s+")}/.match(line)
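The difference between the two orderings can be checked with a short sketch that reuses tokens and line from the question; broken and pattern are illustrative variable names.

```ruby
tokens = ["a", "b?", "c"]
line   = "1\ta\tb?\tc\t3"

# Escaping the joined string also escapes the spaces, so the later gsub
# leaves a stray backslash in front of every s+.
broken = Regexp.escape(tokens.join(" ")).gsub(" ", "\\s+")
puts broken
puts Regexp.new(broken).match(line).inspect

# Escaping each token first keeps \s+ as a real whitespace pattern.
pattern = Regexp.new(tokens.map { |t| Regexp.escape(t) }.join('\s+'))
puts pattern.source
match = pattern.match(line)
puts match[0].inspect
```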

Percent encoding in Ruby

In Ruby, I get the percent-encoding of 'ä' by
require 'cgi'
CGI.escape('ä')
=> "%C3%A4"
The same with
'ä'.unpack('H2' * 'ä'.bytesize)
=> ["c3", "a4"]
I have two questions:
What is the reverse of the first operation? Shouldn't it be
["c3", "a4"].pack('H2' * 'ä'.bytesize)
=> "\xC3\xA4"
For my application I need 'ä' to be encoded as "%E4" which is the hex-value of 'ä'.ord. Is there any Ruby-method for it?
As I mentioned in my comment, equating the character ä as the codepoint 228 (0xE4) implies that you're dealing with the ISO 8859-1 character encoding.
So, you need to tell Ruby what encoding you want for your string.
str1 = "Hullo ängstrom" # uses whatever encoding is current, generally utf-8
str2 = str1.encode('iso-8859-1')
Then you can encode it as you like:
require 'cgi'
s2c = CGI.escape str2
#=> "Hullo+%E4ngstrom"
require 'uri' # note: URI.escape and URI.unescape were removed in Ruby 3.0
s2u = URI.escape str2
#=> "Hullo%20%E4ngstrom"
Then, to reverse it, you must first (a) unescape the value, and then (b) turn the encoding back into what you're used to (likely UTF-8), telling Ruby what character encoding it should interpret the codepoints as:
s3a = CGI.unescape(s2c) #=> "Hullo \xE4ngstrom"
puts s3a.encode('utf-8','iso-8859-1')
#=> "Hullo ängstrom"
s3b = URI.unescape(s2u) #=> "Hullo \xE4ngstrom"
puts s3b.encode('utf-8','iso-8859-1')
#=> "Hullo ängstrom"
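The same round trip can also be sketched without CGI or URI, working on the ISO-8859-1 bytes directly. The two helper names below are hypothetical, and the set of characters left unescaped is an assumption chosen for illustration.

```ruby
# Hypothetical helpers for illustration; they are not part of Ruby's stdlib.
def percent_encode_latin1(text)
  # Transcode to ISO-8859-1, then work on raw bytes so "ä" becomes 0xE4.
  # encode raises if the text has characters ISO-8859-1 cannot represent.
  text.encode('ISO-8859-1').b.gsub(/[^A-Za-z0-9_.~-]/) do |byte|
    format('%%%02X', byte.ord)
  end
end

def percent_decode_latin1(escaped)
  # Turn each %XX back into one byte, then reinterpret the bytes as ISO-8859-1.
  bytes = escaped.b.gsub(/%(\h\h)/) { [$1.hex].pack('C') }
  bytes.force_encoding('ISO-8859-1').encode('UTF-8')
end

escaped = percent_encode_latin1("Hullo ängstrom")
puts escaped
puts percent_decode_latin1(escaped)
```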

Ruby trying to dynamically create unicode string throws "invalid Unicode escape" error

I need to dynamically create a Unicode string using interpolation. For example, see the following code tried out in irb:
2.1.2 :016 > hex = 0x0905
=> 2309
2.1.2 :017 > b = "\u#{hex}"
SyntaxError: (irb):17: invalid Unicode escape
b = "\u#{hex}"
The code point 0x0905 is the Unicode independent vowel DEVANAGARI LETTER A.
I am unable to figure how to achieve the desired result.
You can pass an encoding to Integer#chr:
hex = 0x0905
hex.chr('UTF-8') #=> "अ"
The parameter can be omitted, if Encoding::default_internal is set to UTF-8:
$ ruby -E UTF-8:UTF-8 -e "p 0x0905.chr"
"अ"
You can also append codepoints to other strings:
'' << hex #=> "अ"
String interpolation happens after Ruby decodes the escapes, so Ruby sees what you wrote as an incomplete \u escape.
To create a unicode character from a number, you need to pack it:
hex = 0x0905
[hex].pack("U")
=> "अ"
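The approaches from these answers can be compared side by side in a small sketch; the variable names are illustrative, and the last line assumes the code point arrives as hex text rather than as an Integer.

```ruby
codepoint = 0x0905

via_chr    = codepoint.chr(Encoding::UTF_8)            # Integer#chr with an encoding
via_pack   = [codepoint].pack('U')                     # pack as a UTF-8 character
via_append = String.new(encoding: 'UTF-8') << codepoint # append the code point

# If the code point is given as hex text, convert it to an Integer first.
from_text = "0905".hex.chr(Encoding::UTF_8)

puts [via_chr, via_pack, via_append, from_text].uniq.inspect
```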

How to capitalize the first letter in a String in Ruby

The upcase method capitalizes the entire string, but I need to capitalize only the first letter.
Also, I need to support several popular languages, like German and Russian.
How do I do it?
It depends on which Ruby version you use:
Ruby 2.4 and higher:
It just works, since Ruby 2.4.0 and later support Unicode case mapping:
"мария".capitalize #=> Мария
Ruby 2.3 and lower:
"maria".capitalize #=> "Maria"
"мария".capitalize #=> мария
The problem is that it doesn't do what you want: it outputs мария instead of Мария.
If you're using Rails there's an easy workaround:
"мария".mb_chars.capitalize.to_s # requires ActiveSupport::Multibyte
Otherwise, you'll have to install the unicode gem and use it like this:
require 'unicode'
Unicode::capitalize("мария") #=> Мария
Ruby 1.8:
Be sure to use the coding magic comment:
#!/usr/bin/env ruby
puts "мария".capitalize
gives invalid multibyte char (US-ASCII), while:
#!/usr/bin/env ruby
#coding: utf-8
puts "мария".capitalize
works without errors, but also see the "Ruby 2.3 and lower" section for real capitalization.
Capitalize the first letter of the first word of a string:
"kirk douglas".capitalize
#=> "Kirk douglas"
Capitalize the first letter of each word:
In rails:
"kirk douglas".titleize
=> "Kirk Douglas"
OR
"kirk_douglas".titleize
=> "Kirk Douglas"
In ruby:
"kirk douglas".split(/ |\_|\-/).map(&:capitalize).join(" ")
#=> "Kirk Douglas"
OR
require 'active_support/core_ext'
"kirk douglas".titleize
Rails 5+
As of Active Support and Rails 5.0.0.beta4 you can use either of two methods: String#upcase_first or ActiveSupport::Inflector#upcase_first.
"my API is great".upcase_first #=> "My API is great"
"мария".upcase_first #=> "Мария"
"NASA".upcase_first #=> "NASA"
"MHz".upcase_first #=> "MHz"
"sputnik".upcase_first #=> "Sputnik"
Check "Rails 5: New upcase_first Method" for more info.
Well, just so we know how to capitalize only the first letter and leave the rest of them alone, because sometimes that is what is desired:
['NASA', 'MHz', 'sputnik'].collect do |word|
  letters = word.split('')
  letters.first.upcase!
  letters.join
end
=> ["NASA", "MHz", "Sputnik"]
Calling capitalize would result in ["Nasa", "Mhz", "Sputnik"].
Unfortunately, it is impossible for a machine to upcase/downcase/capitalize properly. It needs way too much contextual information for a computer to understand.
That's why Ruby's String class (before Ruby 2.4) only supports capitalization for ASCII characters, because there it's at least somewhat well-defined.
What do I mean by "contextual information"?
For example, to capitalize i properly, you need to know which language the text is in. English, for example, has only two is: capital I without a dot and small i with a dot. But Turkish has four is: capital I without a dot, capital İ with a dot, small ı without a dot, small i with a dot. So, in English 'i'.upcase # => 'I' and in Turkish 'i'.upcase # => 'İ'. In other words: since 'i'.upcase can return two different results, depending on the language, it is obviously impossible to correctly capitalize a word without knowing its language.
But Ruby doesn't know the language, it only knows the encoding. Therefore it is impossible to properly capitalize a string with Ruby's built-in functionality.
It gets worse: even with knowing the language, it is sometimes impossible to do capitalization properly. For example, in German, 'Maße'.upcase # => 'MASSE' (Maße is the plural of Maß meaning measurement). However, 'Masse'.upcase # => 'MASSE' (meaning mass). So, what is 'MASSE'.capitalize? In other words: correctly capitalizing requires a full-blown Artificial Intelligence.
So, instead of sometimes giving the wrong answer, Ruby chooses to sometimes give no answer at all, which is why non-ASCII characters simply get ignored in downcase/upcase/capitalize operations. (Which of course also leads to wrong results, but at least it's easy to check.)
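On Ruby 2.4 and later, where case mapping is Unicode-aware, the ambiguities described in this answer can be explored directly. The sketch below assumes such a Ruby version and uses the :turkic option that String#upcase and String#downcase accept there; the caller still has to know the language and pass the option.

```ruby
# Minimal sketch, assuming Ruby 2.4+ with Unicode case mapping.
puts 'i'.upcase              # default mapping gives the dotless capital I
puts 'i'.upcase(:turkic)     # Turkic mapping gives the dotted capital (U+0130)
puts 'I'.downcase(:turkic)   # Turkic mapping gives the dotless small i (U+0131)

# The German example: both words upcase to the same string,
# so the original spelling cannot be recovered from it.
puts 'Maße'.upcase
puts 'Masse'.upcase
puts 'MASSE'.capitalize
```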
Use capitalize. From the String documentation:
Returns a copy of str with the first character converted to uppercase and the remainder to lowercase.
"hello".capitalize #=> "Hello"
"HELLO".capitalize #=> "Hello"
"123ABC".capitalize #=> "123abc"
My version:
class String
  def upcase_first
    return self if empty?
    dup.tap {|s| s[0] = s[0].upcase }
  end
  def upcase_first!
    replace upcase_first
  end
end
['NASA title', 'MHz', 'sputnik'].map &:upcase_first #=> ["NASA title", "MHz", "Sputnik"]
Check also:
https://www.rubydoc.info/gems/activesupport/5.0.0.1/String%3Aupcase_first
https://www.rubydoc.info/gems/activesupport/5.0.0.1/ActiveSupport/Inflector#upcase_first-instance_method
You can use mb_chars. This respects umlaute:
class String
  # Only capitalize first letter of a string
  def capitalize_first
    self[0] = self[0].mb_chars.upcase
    self
  end
end
Example:
"ümlaute".capitalize_first
#=> "Ümlaute"
Below is another way to capitalize each word in a string. \w doesn't match Cyrillic characters or Latin characters with diacritics but [[:word:]] does. upcase, downcase, capitalize, and swapcase didn't apply to non-ASCII characters until Ruby 2.4.0 which was released in 2016.
"aAa-BBB ä мария _a a_a".gsub(/\w+/,&:capitalize)
=> "Aaa-Bbb ä мария _a A_a"
"aAa-BBB ä мария _a a_a".gsub(/[[:word:]]+/,&:capitalize)
=> "Aaa-Bbb Ä Мария _a A_a"
[[:word:]] matches characters in these categories:
Ll (Letter, Lowercase)
Lu (Letter, Uppercase)
Lt (Letter, Titlecase)
Lo (Letter, Other)
Lm (Letter, Modifier)
Nd (Number, Decimal Digit)
Pc (Punctuation, Connector)
[[:word:]] matches all 10 of the characters in the "Punctuation, Connector" (Pc) category:
005F _ LOW LINE
203F ‿ UNDERTIE
2040 ⁀ CHARACTER TIE
2054 ⁔ INVERTED UNDERTIE
FE33 ︳ PRESENTATION FORM FOR VERTICAL LOW LINE
FE34 ︴ PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
FE4D ﹍ DASHED LOW LINE
FE4E ﹎ CENTRELINE LOW LINE
FE4F ﹏ WAVY LOW LINE
FF3F ＿ FULLWIDTH LOW LINE
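The difference between \w and [[:word:]] can be checked character by character with a short sketch, assuming Ruby 2.4+ for String#match?; the sample characters are chosen here for illustration.

```ruby
# Sample characters: ASCII letter, Latin letter with diacritic, Cyrillic letter,
# digit, underscore, UNDERTIE (a Pc character), and a hyphen.
samples = ["a", "ä", "м", "5", "_", "\u203F", "-"]

samples.each do |ch|
  ascii_word   = ch.match?(/\w/)          # ASCII-only word characters
  unicode_word = ch.match?(/[[:word:]]/)  # Unicode-aware word characters
  puts format('%-10s \\w: %-5s [[:word:]]: %s', ch.inspect, ascii_word, unicode_word)
end
```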
This is another way to only convert the first character of a string to uppercase:
"striNG".sub(/./,&:upcase)
=> "StriNG"
