Capitalized words with UTF-8 in Hunspell - utf-8

I'm trying to use Hunspell for spellchecking in Polish.
I did convert encoding of a dictionary from http://wiki.openoffice.org/wiki/Dictionaries to UTF-8. However, capitalized words (i.e. with first letter in uppercase) with a non-latin character are considered misspelled (but the spellchecker suggests the same word). Other words are fine.
Such issue was raised on their bug-tracking a few years ago False negatives with leading capital letter and UTF-8 - ID: 1432866. However, either this solution does not work, or I miss something.
Is there a solution for this problem?
In my pl_PL.aff file the leading lines are:
SET UTF-8
TRY aioeznrwcysptkmdłuljągbhęśćóżfńźvqxAIOEZNRWCYSPTKMDŁULJĄGBHĘŚĆÓŻFŃŹVQX
(Just in case, I'm using it in Sublime Text 2, but I doubt whether it makes a difference.)

I've found Sublime specific dictionaries on github: https://github.com/SublimeText/Dictionaries. They somehow have fixed this error.

Related

How to strip vowels/nekudot/diacritics from hebrew utf-8 in ruby?

Hebrew prints vowels as nekudot/points around the letters.
with vowels: " יִתְגַּבֵּר כַּאֲרִי לַעֲמֹד בַּבֹּקֶר לַעֲבוֹדַת בּוֹרְאוֹ שֶׁיְּהֵא הוּא מְעוֹרֵר הַשַּׁחַר"
without vowels: "יתגב כארי לעמוד בבקר לעבודת בוראו שיהא הוא מעורר השחר"
I need a way to strip these vowels from a string. As in convert the string, with vowels, to the string, without vowels. Any suggestions?
-p.s.
I have tried "hebrew.gsub(/[^א-ת]/, '')"
but this has two problems: a: this will remove all punctuation, english, etc. I only want to remove the vowels. b: some letters get removed as well. (my understanding is limited, but it seems some letters/vowel combinations become "multibyte" in utf-8 and will not match "א-ת".
I found this:https://gist.github.com/yakovsh/345a71d841871cc3d375 online, but the ruby suggestion only works with rails, (assuming it works at all).
However, perhaps that page can be helpful in finding a solution.
Please help, thanks in advance.
The vowels are all between U+0591 and U+05C7, so you can just do
hebrew.gsub(/[\u0591-\u05c7]/,"")
e.g.
" יִתְגַּבֵּר כַּאֲרִי לַעֲמֹד בַּבֹּקֶר לַעֲבוֹדַת בּוֹרְאוֹ שֶׁיְּהֵא הוּא מְעוֹרֵר הַשַּׁחַר".gsub(/[\u0591-\u05c7]/,"")
# => " יתגבר כארי לעמד בבקר לעבודת בוראו שיהא הוא מעורר השחר"
However, that only works if the vowels are all separate characters in the string - or to say the same thing in Unicode-speak, if the text is in Normalization Form D. You can ensure that is the case by calling String#unicode_normalize on it first:
hebrew.unicode_normalize(:nfd).gsub(/[\u0591-\u05c7]/,"")
This step is necessary because Unicode includes several individual characters that combine a letter with nekuddot in a single code point, for round-trip compatibility with older character sets that did not support combining diacritics. Those characters mean you can't tell just by looking whether the string "בּ" consists of the two-code-point sequence U+05D1 HEBREW LETTER BET followed by U+05BC HEBREW POINT DAGESH OR MAPIQ, or just the single character U+FB31 HEBREW LETTER BET WITH DAGESH. Putting the string into Normalization Form D replaces the latter with the former, and also splits up any other "precomposed" characters into their component parts.

Why 'аnd' == 'and' is false?

I tagged character-encoding and text because I know if you type 'and' == 'and' into the rails console, or most any other programming language, you will get true. However, I am having the issue when one of my users pastes his text into my website, I can't spell check it properly or verify it's originality via copyscape because of some issue with the text. (or maybe my understanding of text encoding?)
EXAMPLE:
If you copy and paste the following line into the rails console you will get false.
'аnd' == 'and' #=> false
If you copy and paste the following line into the rails console you will get true even though they appear exactly the same in the browser.
'and' == 'and' #=> true
The difference is, in the first example, the first 'аnd' is copied and pasted from my user's text that is causing the issues. All the other instances of 'and' are typed into the browser.
Is this an encoding issue?
How to fix my issue?
This isn’t really an encoding problem, in the first case the strings compare as false simply because they are different.
The first character of the first string isn’t a ”normal“ a, it is actually U+0430 CYRILLIC SMALL LETTER A — the first two bytes (208 and 176, or 0xD0 and 0xB0 in hex) are the UTF-8 encoding for this character. It just happens to look exactly like a “normal” Latin a, which is U+0061 LATIN SMALL LETTER A.
Here’s the “normal” a: a, and this is the Cyrillic a: а, they appear pretty much identical.
The fix for this really depends on what you want your application to do. Ideally you would want to handle all languages, and so you might want to just leave it and rely on users to provide reasonable input.
You could replace the character in question with a latin a using e.g. gsub. The problem with that is there are many other characters that have similar appearance to the more familiar ones. If you choose this route you would be better looking for a library/gem that did it for you, and you might find you’re too strict about conversions.
Another option could be to choose a set of Unicode scripts that your application supports and refuse any characters outside those scripts. You can check fairly easily for this with Ruby‘s regular expression script support, e.g. /\p{Cyrillic}/ will match all Cyrillic characters.
The problem is not with encodings. A single file or a single terminal can only have a single encoding. If you copy and paste both strings into the same source file or the same terminal window, they will get inserted with the same encoding.
The problem is also not with normalization or folding.
The first string has 4 octets: 0xD0 0xB0 0x6E 0x64. The first two octets are a two-octet UTF-8 encoding of a single Unicode codepoint, the third and fourth octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0430 U+006E U+0064.
These three codepoints resolve to the following three characters:
CYRILLIC SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
The second string has 3 octets: 0x61 0x6E 0x64. All three octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0061 U+006E U+0064.
These three codepoints resolve to the following three characters:
LATIN SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
Really, there is no problem at all! The two strings are different. With the font you are using, a cyrillic a looks the same as a latin a, but as far as Unicode is concerned, they are two different characters. (And in a different font, they might even look different!) There's really nothing you can do from an encoding or Unicode perspective, because the problem is not with encodings or Unicode.
This is called a homoglyph, two characters that are different but have the same (or very similar) glyphs.
What you could try to do is transliterate all strings into Latin (provided that you can guarantee that nobody ever wants to enter non-Latin characters), but really, the questions are:
Where does that cyrillic a come from?
Maybe it was meant to be a cyrillic a and really should be treated not-equal to a latin a?
And depending on the answers to those questions, you might either want to fix the source, or just do nothing at all.
This is a very hot topic for browser vendors, BTW, because nowadays someone could register the domain google.com (with one of the letters switched out for a homoglpyh) and you wouldn't be able to spot the difference in the address bar. This is called a homograph attack. That's why they always display the Punycode domain in addition to the Unicode domain name.
I think it is eccoding issue, you can have a try like this.
irb(main):010:0> 'and'.each_byte {|b| puts b}
97
110
100
=> "and"
irb(main):011:0> 'аnd'.each_byte {|b| puts b} #copied and
208
176
110
100
=> "аnd"

Any ruby gems to do (Chinese) Transliterate (Romanization), especially for URL?

Generally spoken, it takes Unicode text and tries to represent it in
US-ASCII characters (universally displayable, unaccented characters)
by attempting to transliterate the pronunciation expressed by the text
in some other writing system to Roman letters.
ex,
"一二三".ooxx => "e-er-san"
After doing http://rubygems.org/search?utf8=%E2%9C%93&query=pinyin I got some rubygems, but none of them are robustly workable for this issue.
Doing this perfectly is almost impossible, since some Chinese characters have two or more pronunciations, for example 银行 = yin hang, 不行 = bu xing (the last character is identical, pronounced hang in one context and xing in the other)... Other than that, you could probably roll your own using the unicode database, which I think has pronunciation info as well. If you want to be more fancy, I think there are some open source input methods which have the mappings, and they'll have them for words too, so that if you find 银行 together, it will know that the second character is hang, not xing. OpenVanilla might have databases you can work with (OSS).

Remove all but some special characters

I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
The way I tried to do is, match anything which is not an ascii character (00 - 7F) and not special character I want and replace it with blank.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is ruby 1.8)
Update: I am trying to make the question a little more clear. The string is utf-8 encoded. And I am trying to whitelist the ascii characters plus ó and í and blacklist everything else.
Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
The question is a bit vague. There is not a word about encoding of the string. Also, you want to white-list characters or black list? Which ones?
But you get the idea, decide what you want, and then use proper ranges as colleagues here already proposed. Some examples:
if str = "subscripción gustaría♥" is utf-8
then you can blacklist all char above the range (excl. whitespaces):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
if string is in ISO-8859-1 codepage you can try to match all quirky characters like the "heart" from the beginning of ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem is here with regex, has nothing to do with Ruby. You probably will need to experiment more.
It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')
I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my mac but works on linux.

Removing Junk Characters from Ruby String

I am trying to clean up a string that I get from a website using mechanize
here is an except from the string with the junk characters
"Mountain</b></a><br>ΓÇÄ1hr 39minΓÇÄΓÇÄ - Rated PGΓÇÄΓÇÄ - Action/Adventure/Science fictionΓÇÄΓÇÄ - EnglishΓÇÄ - <a href="
Does any one know where they characters come from and how I can replace them with spaces? How does ruby handle the encoding of characters?
Those characters look like they might be appearing as the result of a UTF-8 encoding problem. I recommend reading Joel's excellent article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) which will explain UTF-8 encoding and how to handle it in your code.

Resources