Difference in Unicode behaviour between `\w` and `[[:alpha:]]` in Ruby

(For this question, ignore the number and underscore matching of \w, which is irrelevant to the discussion here.)
According to the Oniguruma docs, both the shorthand character classes like \w and the POSIX classes like [:alpha:] behave similarly with regard to Unicode: they have plain ASCII behaviour for the "Not Unicode Case" (I assume that means the string's encoding is not a Unicode one), and a different behaviour that uses Unicode properties for the "Unicode Case".
From that documentation, it sounds as if whenever one of them uses Unicode properties, the other will too. In practice, however, they differ: the POSIX classes use Unicode properties automatically, whereas the \w-style classes have to be explicitly marked with ?u to get Unicode property based matching:
$ ruby -e 'print("~café.".encoding)'
UTF-8
$ ruby -e 'print(/[[:alpha:]]+/.match("~café."))'
café
$ ruby -e 'print(/\w+/.match("~café."))'
caf
$ ruby -e 'print(/(?u)\w+/.match("~café."))'
café
$ ruby -v
ruby 2.3.6p384
Is this a bug, or is my interpretation of the docs wrong? (And what exactly does ?u do? Could someone link to where it is documented?)

Since version 2.0, Ruby has used Onigmo, an Oniguruma fork that adds several features introduced in Perl 5.10.
If you compare the doc you linked (Oniguruma) with Onigmo's doc you can see a difference between the \w descriptions:
Oniguruma:
\w word character
Not Unicode: alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
Onigmo:
\w word character
Not Unicode: alphanumeric and "_".
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
It depends on ONIG_OPTION_ASCII_RANGE option that non-ASCII char includes or not.
As you can see, the "and multibyte char." wording, which doesn't make sense (at least to me) and was probably a typo, is gone. Even so, the description remains very unclear.
The u modifier switches the shorthand character classes from "Not Unicode" (default) to "Unicode".
That's why you get only caf without it and café with it when you match using the character class \w.
On the other hand, the character class [[:alpha:]] seems to be extended to Unicode characters by default, since it matches "café" without the u modifier. The beginning of an explanation can be found in the doc:
It depends on ONIG_OPTION_ASCII_RANGE option and
ONIG_OPTION_POSIX_BRACKET_ALL_RANGE option that POSIX brackets
match non-ASCII char or not.
But you can force it back to ASCII using the (?a) modifier, as shown below.
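For instance, in the same transcript style as the question (a hedged sketch: (?a) is an Onigmo modifier that Ruby accepts but does not document prominently, so treat the exact semantics as an assumption):
$ ruby -e 'print(/(?a)[[:alpha:]]+/.match("~café."))'
caf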

Related

Removing diacritical marks from a Greek text in an automatic way

I have a decompiled stardict dictionary in the form of a tab file
κακός <tab> bad
where <tab> signifies a tabulation.
Unfortunately, the way the words are defined requires the query to include all diacritical marks. So if I want to search for ζῷον, I need to have all the iotas and circumflexes correct.
Thus I'd like to convert the whole file so that the keyword has its diacritics removed. So the line would become
κακος <tab> <h3>κακός</h3> <br/> bad
I know I could read the file line by line in bash, as described here [1]
while read line
do
command
done <file
But is there any way to automate the conversion of each line? I heard about iconv [2] but didn't manage to achieve the desired conversion with it. I'd prefer to use a bash script.
Besides, is there an automatic way of transliterating Greek, e.g. using the method Perseus has?
/edit: Maybe we could use the Unicode code points? We can notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work. I'd accept a C++ solution as well.
[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all of the diacritics from a file?
You can remove diacritics from a string relatively easily using Perl:
$_=NFKD($_);s/\p{InDiacriticals}//g;
for example:
$ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
ωωωωωωω Ω
This works as follows:
The -CS enables UTF-8 for Perl's stdin/stdout
The -MUnicode::Normalize loads a library for Unicode normalisation
-e executes the script from the command line; -n automatically loops over lines in the input; -p prints the output automatically
NFKD() translates the line into one of the Unicode normalisation forms; this means that accents and diacritics are decomposed into separate characters, which makes it easier to remove them in the next step
s/\p{InDiacriticals}//g removes all characters that Unicode denotes as diacritical marks
This should in fact work for removing diacritics etc for all scripts/languages that have good Unicode support, not just Greek.
I'm not as familiar with Ancient Greek as I am with Modern Greek (which only really uses two diacritics). However, I went through the vowels and worked out which ones combine with diacritics. This gave me the following list:
ἆἂᾶὰάἀἄ
ἒὲέἐἔ
ἦἢῆὴήἠἤ
ἶἲῖὶίἰἴ
ὂὸόὀὄ
ὖὒῦὺύὐὔ
ὦὢῶὼώὠὤ
I saved this list as a file (test.txt) and passed it through this sed command:
cat test.txt | sed -e 's/[ἆἂᾶὰάἀἄ]/α/g;s/[ἒὲέἐἔ]/ε/g;s/[ἦἢῆὴήἠἤ]/η/g;s/[ἶἲῖὶίἰἴ]/ι/g;s/[ὂὸόὀὄ]/ο/g;s/[ὖὒῦὺύὐὔ]/υ/g;s/[ὦὢῶὼώὠὤ]/ω/g'
Credit to hungnv
It's a simple sed: it takes each of the accented variants and replaces it with the unmarked character. The result of the above command is:
ααααααα
εεεεε
ηηηηηηη
ιιιιιιι
οοοοο
υυυυυυυ
ωωωωωωω
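The same normalize-then-strip idea from the Perl answer above can also be written in Ruby, avoiding hand-built vowel lists entirely (a hedged sketch assuming Ruby >= 2.2 for String#unicode_normalize):
# Decompose to NFKD, then drop combining marks (Unicode category Mn).
line = "ὦὢῶὼώὠὤ ᾪ"
puts line.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "")
#=> ωωωωωωω Ω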
Regarding transliterating the Greek: the image from your post is intended to help the user type in Greek on the site you took it from, using similar glyphs, not always similar sounds. Those are poor transliterations: e.g. β is most often transliterated as v, ψ as ps, φ as ph, etc.

Loading Text File in Ruby in Windows

Aptana is returning:
Invalid escape character syntax
File.open("C:\Users\C*****\Documents\RubyProjects\text.txt
What do I do?
\ is an escape character in most languages, so the parser expects an escaped character after it; in this case that's also \, so you just need to use two of them:
File.open("C:\\Users\\C*****\\Documents\\RubyProjects\\text.txt
Ruby doesn't need you to use backslashes. In your string
"C:\Users\C*****\Documents\RubyProjects\text.txt"
you're confusing Ruby because backslashes denote escapes in a double-quoted (interpreted) string, which makes Ruby choke. Instead use:
"C:/Users/C*****/Documents/RubyProjects/text.txt"
From the IO documentation:
Ruby will convert pathnames between different operating system conventions if possible. For instance, on a Windows system the filename "/gumby/ruby/test.rb" will be opened as "\gumby\ruby\test.rb". When specifying a Windows-style filename in a Ruby string, remember to escape the backslashes:
"c:\\gumby\\ruby\\test.rb"
Our examples here will use the Unix-style forward slashes; File::ALT_SEPARATOR can be used to get the platform-specific separator character.
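Putting the alternatives side by side (a hedged sketch; the directory names below are placeholders, not the asker's actual path):
# Three ways to spell the same Windows path in Ruby.
path_escaped = "C:\\Users\\someuser\\Documents\\RubyProjects\\text.txt"
path_forward = "C:/Users/someuser/Documents/RubyProjects/text.txt"
path_joined  = File.join("C:", "Users", "someuser", "Documents", "RubyProjects", "text.txt")
File.open(path_forward) { |f| puts f.read }  # forward slashes work on Windows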

How to split a file containing non-ascii characters into words, in bash?

For example, I have a file with normal text, like:
"Word1 Kuͦn, buͤtten; word4:"
I want to get a file with one word per line, keeping the punctuation, and sorted:
,
:
;
Word1
Kuͦn
buͤtten
word4
The code I use:
grep -Eo '\w+|[^\w ]' input.txt | sort -f >> output.txt
This code works almost perfectly, except for one thing: it splits diacritical characters apart from the letters they belong to, as if they were separate words:
,
:
;
Word1
Ku
ͦ
n
bu
ͤ
tten
word4
The letters uͦ, uͤ and others with the same diacritics are not in the ASCII table. How can I split my file correctly without deleting or replacing these characters?
Edit:
locale output:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Unfortunately, U+0366 (COMBINING LATIN SMALL LETTER O) is not an alphabetic character. It is a non-spacing mark, Unicode category Mn, which generally maps to the POSIX ctype cntrl.
Roughly speaking, an alphabetic grapheme is an alphabetic character possibly followed by one or more combining characters. It's possible to write that as a regex pattern if you have a regex library which implements Unicode general categories. GNU grep is usually compiled with an interface to the popular pcre (Perl-compatible regular expression) library, which has reasonably good Unicode support. So if you have GNU grep, you're in luck.
To enable "perl-like" regular expressions, you need to invoke grep with the -P option (or use the standalone pcregrep tool). However, that is not quite enough, because by default grep will use an 8-bit encoding even if the locale specifies a UTF-8 encoding. So you need to put the regex system into "UTF-8" mode in order to get it to recognize your character encoding.
Putting all that together, you might end up with something like the following:
grep -Po '(*UTF8)(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]'
-P patterns are "perl-compatible"
-o output each substring matched
(*UTF8) If the pattern starts with exactly this sequence, pcre is put into UTF-8 mode.
\p{...} Select a character in a specified Unicode general category
\P{...} Select a character not in a specified Unicode general category
\p{L} General category L: letters
\p{N} General category N: numbers
\p{M} General category M: combining marks
\p{P} General category P: punctuation
\p{S} General category S: symbols
\p{L}\p{M}* A letter possibly followed by various combining marks
\p{L}\p{M}*|\p{N} ... or a number
More information on Unicode general categories and Unicode regular expression matching in general can be found in Unicode Technical Report 18 on regular expression matching. But beware that the syntax described in that TR is a recommendation and is not exactly implemented by most regex libraries. In particular, pcre does not support the useful notation \p{L|N} (letter or number). Instead, you need to use [\p{L}\p{N}].
Documentation about pcre is probably available on your system (man pcre); if not, have a link on me.
If you don't have GNU grep, or in the unlikely case that your version was compiled without pcre support, you might be able to use Perl, Python or other languages with regex capabilities. However, doing so is surprisingly difficult. After some experimentation, I found the following Perl incantation which seems to work:
perl -CIO -lne 'print $& while /(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]/g'
Here, -CIO tells Perl that input and output are in UTF-8, and -lne is a standard incantation: -l automatically outputs a newline after each print, -n loops through every line of the input, and -e executes the following script inside that loop.
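If Ruby happens to be available, its regex engine (Onigmo) supports the same Unicode properties, so a rough equivalent is possible there too (a sketch assuming a UTF-8 encoded input.txt; sort_by(&:downcase) only approximates sort -f):
# Scan for letter+marks sequences, numbers, punctuation and symbols.
words = File.read("input.txt", encoding: "UTF-8")
            .scan(/(?:\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]/)
puts words.sort_by(&:downcase)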

Regular expression "empty range in char class error"

I have a regex in my code that is meant to match URL patterns, and it threw an error:
/^(http|https):\/\/([\w-]+\.)+[\w-]+([\w- .\/?%&=]*)?$/
The error was "empty range in char class". I found that the cause is the ([\w- .\/?%&=]*)? part: Ruby seems to recognize the - in \w- . as a range operator instead of a literal -. After escaping the dash, the problem was solved.
But the original regular expression ran fine on my co-workers' machines. We use the same versions of OS X, Rails and Ruby: the Ruby version is 1.9.3p194, Rails is 3.1.6 and OS X is 10.7.5. And after we deployed the code to our Heroku server, everything worked fine there too. Why did only my environment raise an error for this regex? How does Ruby's regex interpretation work?
I can replicate this error on Ruby 1.9.3p194 (2012-04-20 revision 35410) [i686-linux], installed on Ubuntu 12.04.1 LTS using rvm 1.13.4. However, this should not be a version-specific error. In fact, I'm surprised it worked on the other machines at all.
A simpler demonstration that fails just as well:
"abcd" =~ /[\w- ]/
This is because [\w- ] is interpreted as "a range beginning with any word character up to space (or blank)", rather than a character class containing word characters, a hyphen, and a space, which is what you had intended.
Per Ruby's regular expression documentation:
Within a character class the hyphen (-) is a metacharacter denoting an inclusive range of characters. [abcd] is equivalent to [a-d]. A range can be followed by another range, so [abcdwxyz] is equivalent to [a-dw-z]. The order in which ranges or individual characters appear inside a character class is irrelevant.
As you saw, prepending a backslash escaped the hyphen, changing the nature of the regexp from a range back to a character class and removing the error. However, escaping a hyphen in the middle of a character class is not recommended, since it's easy to misread the intended meaning of the hyphen in such cases. As m.buettner pointed out, always place hyphens either at the beginning or the end of a character class:
"abcd" =~ /[-\w ]/

Scanning for Unicode Numbers in a string with \d

According to the Oniguruma documentation, the \d character type matches:
decimal digit char
Unicode: General_Category -- Decimal_Number
However, scanning for \d in a string with all the Decimal_Number characters results in only the Latin digits 0-9 being matched:
#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')
puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…
p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
Am I misreading the documentation? Why doesn't \d match other Unicode numerals, and/or is there a way to make it do so?
Noted by Brian Candler on ruby-talk:
\w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
\d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.
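A quick check of the second pair (a hedged sketch; ٤ is ARABIC-INDIC DIGIT FOUR, general category Nd):
p "42٤".scan(/\d/)           #=> ["4", "2"]
p "42٤".scan(/[[:digit:]]/)  #=> ["4", "2", "٤"]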
The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:
\w word character
Not Unicode: alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.
This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.
p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)
Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:
[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."
Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc
Try the Unicode character class \p{N} instead. That matches all Unicode digits. No idea why \d isn't working.
\d will only match for ASCII numbers by default. You can manually turn on Unicode matching in a regex using the (counter-intuitive) (?u) syntax:
"𝟛".match(/(?u)\d/) # => #<MatchData "𝟛">
Alternatively, you can use "posix" or "unicode property" style in your regex, which don't require you to manually turn on Unicode matching:
/[[:digit:]]/ # posix style
/\p{Nd}/ # unicode property/category style
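For example, both alternatives match the double-struck digit from above without any modifier (assuming a UTF-8 source encoding):
"𝟛".match(/[[:digit:]]/) #=> #<MatchData "𝟛">
"𝟛".match(/\p{Nd}/)      #=> #<MatchData "𝟛">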
You can find more detailed information about how to do advanced matching for Unicode characters in Ruby in this blog post:
https://idiosyncratic-ruby.com/30-regex-with-class.html
