Dealing with the Cyrillic alphabet used in place of Latin characters - UTF-8

We've recently had a user enter English text, but it appears to have been typed on a computer set up for Cyrillic, as some of the letters such as "a" are actually CYRILLIC SMALL LETTER A as opposed to LATIN SMALL LETTER A.
I thought that normalising would convert the Cyrillic to the Latin equivalent, but it does not (I guess the characters are only equivalent in how they are displayed, not in meaning).
Is this a common problem - users whose computers are set up for Cyrillic writing English, but with the Cyrillic alphabet instead?
What would be a safe way to spot this in general, and to convert it appropriately?

To detect Cyrillic, just use the regex match [\p{IsCyrillic}]. A more generic approach would be to search for any characters which are non-Latin.
Once you've got a match, you'll need to replace the characters with their Latin equivalents.
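A minimal sketch in Python of both steps, detection then replacement. The homoglyph table here is illustrative only (an assumption, not a complete list); a real one would need to cover every Cyrillic/Latin look-alike pair you care about:

    import unicodedata

    # Illustrative homoglyph table -- only a few common look-alikes.
    CYRILLIC_TO_LATIN = {
        "\u0430": "a",  # CYRILLIC SMALL LETTER A
        "\u0435": "e",  # CYRILLIC SMALL LETTER IE
        "\u043e": "o",  # CYRILLIC SMALL LETTER O
        "\u0440": "p",  # CYRILLIC SMALL LETTER ER
        "\u0441": "c",  # CYRILLIC SMALL LETTER ES
        "\u0445": "x",  # CYRILLIC SMALL LETTER HA
        "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
        "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    }

    def contains_cyrillic(text: str) -> bool:
        # Check each character's Unicode name instead of a \p{IsCyrillic}
        # regex, which Python's built-in re module does not support.
        return any(unicodedata.name(ch, "").startswith("CYRILLIC") for ch in text)

    def latinize(text: str) -> str:
        return text.translate(str.maketrans(CYRILLIC_TO_LATIN))

    suspicious = "W\u0430rning"      # the 'a' is CYRILLIC SMALL LETTER A
    if contains_cyrillic(suspicious):
        print(latinize(suspicious))  # -> Warning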

Related

What are Unicode codepoint types for?

I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
However, some modern languages that use the UTF-8 default have built-in codepoint types, such as rune in Go and char in Rust.
What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?
Text has many different meanings and uses, so the question is difficult to answer.
First, about the term codepoint: we use it because it is convenient, it implies a number (a code), and it is not easily confused with other terms. Unicode tells us that it does not use the terms codepoint and character in a consistent way, but also that this is not a problem: context makes it clear, and they are often interchangeable (except for the few codepoints which are not characters, like surrogates, and a few reserved codepoints). Note that Unicode is mostly about characters, while ISO 10646 was mostly about codepoints: the original ISO standard was a table of numbers (codepoints) and names, and Unicode adds the properties of characters. So we may say codepoint where Unicode character would be more precise, but character is too easily confused with the C char type and with font glyphs/graphemes.
Codepoints are a basic unit, so they are useful for most programs: storing text in databases, exchanging it with other programs, saving files, sorting, etc. For exactly these reasons programming languages use the codepoint as a type. UTF-8 code units would be an alternative, but they are harder to navigate: think of UTF-8 as a tape, which you must read sequentially, and of codepoint text as a hard disk, where you can jump straight into the middle of the text. (The analogy is not 100% accurate, because you may still need some context bytes.) If your program just takes user text and stores it in a database, it probably does not need to split graphemes, handle ligatures, etc. The codepoint is really low level, and therefore fast, for most operations.
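A small Python illustration of that difference between the code-point view and the UTF-8 code-unit view of the same text:

    text = "héllo"                  # 'é' is U+00E9, two bytes in UTF-8

    # Code-point view: one unit per code point, random access works.
    print(len(text))                # 5
    print(text[1])                  # é

    # UTF-8 code-unit view: a different length, and slicing is only safe
    # on sequence boundaries -- a byte may sit mid-character.
    raw = text.encode("utf-8")
    print(len(raw))                 # 6
    print(raw[1:3].decode("utf-8")) # é -- works only because 1:3 is a boundary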
The other part of text handling is displaying it (or speech). This part is very complex, because we have many different scripts with very different rules, and then different languages with their own special cases. So we need a series of libraries: text layout (word separation etc., like Pango), a shaping engine (to find which glyph to use, combine characters, and decide where to put the next character, e.g. HarfBuzz), and a font library which renders the glyphs (Cairo plus FreeType). It is complex, but most programmers do not need special handling: they just read text from a database and send it to the screen, so they use the relevant library (which depends on the operating system) and move on. It is too complex for a language specification (and also a moving target; maybe in 30 years things will be more standardized). Because there are so many operations, these libraries may use richer structures (an array of arrays of codepoints, i.e. an array of graphemes) without much of a slowdown. Note that fonts have codepoint tables used for various operations before the glyph index is found, and various APIs take Unicode strings (as codepoint arrays, UTF-16, UTF-8, etc.).
Naturally things get more complex, and require a lot of knowledge of different parts of Unicode, if you are trying to program an editor (WYSIWYG, but also terminal-based): you mix both worlds, and you need much more information (e.g. for text selection). In that case you must create your own structures.
And really, things are complex. Do you want to show just the first x characters on your blog (say, for a preview)? Or split at word boundaries (some languages are not so linear, so the interpretation may be very wrong)? For now only humans can do a good job across all languages, so there is no need yet for a supporting type in programming languages.
The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
Where? It merely outlines advantages and disadvantages of code points. Two examples are:
Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.
Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
In other words: code points just index which graphemes Unicode supports.
Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.
Sometimes the same character has multiple code points depending on context: the dollar sign, for example, exists as:
﹩ = U+FE69 (SMALL DOLLAR SIGN)
＄ = U+FF04 (FULLWIDTH DOLLAR SIGN)
💲 = U+1F4B2 (HEAVY DOLLAR SIGN)
Storage-wise, when searching for one variant you might want to match all 3 variants instead of relying on the exact code point only.
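For the first two variants, NFKC normalization already performs that folding, since they carry compatibility decompositions to U+0024; the heavy dollar sign does not decompose, so it would still need an explicit mapping. A quick Python check:

    import unicodedata

    for ch in ["\uFE69", "\uFF04", "\U0001F4B2"]:
        folded = unicodedata.normalize("NFKC", ch)
        print(f"U+{ord(ch):04X} -> {folded!r}")
    # U+FE69 -> '$'
    # U+FF04 -> '$'
    # U+1F4B2 -> '💲'  (no decomposition, must be matched by hand)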
Sometimes multiple code points can be combined to form a single character:
á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) - in a text editor, deleting á (from the right side) will usually delete the acute accent first. Searching for either variant should find both variants.
Storage-wise, you can avoid having to search for both variants by performing Unicode normalization, i.e. NFC, which always favors the precomposed code point over a combining sequence forming one character.
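In Python, for instance, unicodedata handles both NFC composition and the singleton case from above (U+2126 OHM SIGN canonically decomposes to U+03A9):

    import unicodedata

    decomposed = "a\u0301"          # a + COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)
    print(len(decomposed), len(composed))   # 2 1
    print(composed == "\u00E1")             # True: precomposed á

    print(unicodedata.normalize("NFC", "\u2126") == "\u03A9")  # True: Ω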
As for homoglyphs, code points clearly distinguish the contextual meaning:
A = U+0041 (LATIN CAPITAL LETTER A)
Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
А = U+0410 (CYRILLIC CAPITAL LETTER A)
Copy the Greek or Cyrillic character, then search this website for that letter - it will never find the other letters, no matter how similar they look. Likewise, the Latin letter A won't find the Greek or Cyrillic one.
Writing-system-wise, code points can be shared by multiple alphabets: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible - Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese:
今 = U+4ECA
入 = U+5165
才 = U+624D
Dealing with code points as a programmer has valid reasons. Programming languages which support them may (or may not) handle the encodings correctly (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogate pairs for UTF-16. Text-wise, users should not have to be concerned with code points, although knowing about them would help in distinguishing homographs.

Implementing sample code for the Unicode Collation Algorithm

I have the following requirement in my project.
I need to sort strings based on an order of characters provided by the client.
For example:
Order provided by the user: d, a, A, D, z, p, P, Z
So if we have some strings like AaP, aAp, PpZ, pPz,
after sorting, the output should be aAp, AaP, pPz, PpZ, as a > A > p > P according to the initial order given by the user.
Now I am thinking of picking the Unicode Collation Algorithm (http://unicode.org/reports/tr10/) for implementing the above requirement.
Can someone suggest data structures for the following things, for better performance?
1.) Mapping the ASCII values of the characters to the order given by the user - I am thinking of using a map, but access is O(log n). I could not use a hash map as I code in C++.
2.) What sorting techniques can be used for comparing the sort keys after generating them? Can something like radix sort be used here?
Please share your thoughts.
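A minimal sketch of the rank-table idea (in Python for brevity; in C++ a flat 256-entry array indexed by the character value would give O(1) lookup instead of std::map's O(log n)). The fallback for characters outside the custom order is an assumption, since the question leaves that case unspecified:

    # Rank table: each character's position in the client's custom order.
    custom_order = "daADzpPZ"
    rank = {ch: i for i, ch in enumerate(custom_order)}

    def sort_key(s: str):
        # Characters not in the custom order sort after it, by ASCII value.
        return [rank.get(ch, len(custom_order) + ord(ch)) for ch in s]

    strings = ["AaP", "aAp", "PpZ", "pPz"]
    print(sorted(strings, key=sort_key))   # ['aAp', 'AaP', 'pPz', 'PpZ']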
Though the following is not needed for my project, I just want to know:
how are collation elements actually created from the Unicode or ASCII values, as in this table from the link above for the Unicode Collation Algorithm?
Character Collation Element Name
0300 "`" [.0000.0021.0002] COMBINING GRAVE ACCENT
0061 "a" [.06D9.0020.0002] LATIN SMALL LETTER A
0062 "b" [.06EE.0020.0002] LATIN SMALL LETTER B
0063 "c" [.0706.0020.0002] LATIN SMALL LETTER C
0043 "C" [.0706.0020.0008] LATIN CAPITAL LETTER C
0064 "d" [.0712.0020.0002] LATIN SMALL LETTER D

Glyph to Unicode string translation

Given a glyph index for a specific font, I need to get the Unicode translation of the glyph. In order to build a glyph-to-Unicode translation I'm using GetGlyphIndices for the whole Unicode range, and from the result I build the reverse translation (a glyph-to-Unicode character map). However, this gives me a translation from a single glyph to a single Unicode character, and I can see that in Hindi, for example, two Unicode characters can be represented by one glyph.
For example, in the word namaste (नमस्ते) there are 6 Unicode characters which are represented by 5 glyphs (the middle two Unicode characters are represented by one glyph). I can see this by attaching to notepad.exe, setting a breakpoint in ExtTextOut, and printing this word from Notepad.
Is there any way I can translate a glyph to a Unicode string (in case the glyph represents more than one Unicode character)?
1) For all but very simple cases, you should use Uniscribe functions (not GetGlyphIndices) for converting a string (sequence of Unicodes) into glyphs. This is noted in the documentation for GetGlyphIndices: http://msdn.microsoft.com/en-us/library/windows/desktop/dd144890(v=vs.85).aspx
2) There is no way to reliably do what you want to do for all cases. Even for most cases. This is the result of something known as complex script shaping, which translates a sequence of input Unicodes into a sequence of output glyphs. This is done using a number of tables in the font data. The two of most interest are the cmap and the GSUB.
The cmap maps Unicode values to font-specific glyphs. The cmap may specify multiple Unicodes mapping to a single glyph (multi-mapping). This is a commonly-used scheme in many fonts. Also, many glyphs in the font may not even be mapped in the cmap. Thus with this alone, you cannot reliably reverse-map a glyph to a single Unicode.
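The multi-mapping is easy to observe by reversing a font's cmap. A sketch using the third-party fontTools library (the font path is a placeholder):

    from collections import defaultdict
    from fontTools.ttLib import TTFont   # pip install fonttools

    font = TTFont("SomeFont.ttf")        # placeholder path
    cmap = font.getBestCmap()            # dict: code point -> glyph name

    # Reverse the mapping; a glyph may collect several code points,
    # and glyphs absent from the cmap will not appear here at all.
    glyph_to_codepoints = defaultdict(list)
    for codepoint, glyph_name in cmap.items():
        glyph_to_codepoints[glyph_name].append(codepoint)

    for glyph_name, cps in sorted(glyph_to_codepoints.items()):
        if len(cps) > 1:                 # glyphs reachable from multiple Unicodes
            print(glyph_name, [f"U+{cp:04X}" for cp in cps])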
But it gets even more difficult: the GSUB may specify numerous rules and may convert one input glyph to many output glyphs, or a series of input glyphs into one output glyph. It can even specify contexts under which the conversion will occur (for example, it could say something like "convert 'A' to 'B' but only when the 'A' is preceded by a 'C'", so CA -> CB but DA -> DA). In some cases, specifically with Hindi and other Indic languages, the output glyph sequence may even be in a different order than the logical Unicode input sequence. The net result is that the output sequence of glyphs may map back to a single Unicode, or multiple Unicodes, or none at all. It may be possible to decode the rules of the GSUB + the logic of the script-shaping engine to narrow things down a bit (an adventure not suitable for the weak of spirit!), but the problem is still that multiple input Unicodes could end up resolving to the same output glyph.
Bottom line: it's best to view the process of converting a string -> font-specific glyphs as a one-way trip.
For a better understanding of these concepts, I strongly recommend that you read up on complex script shaping as implemented in Windows: http://www.microsoft.com/typography/otspec/TTOCHAP1.htm . As for coding in an application, the Uniscribe reference is also very informative: http://msdn.microsoft.com/en-us/library/windows/desktop/dd374091(v=vs.85).aspx

Is there a language(s) which will require three or more bytes per character when encoded using UTF-8? Which ones?

Commonly used, of course - Klingon doesn't count :-)
Thanks, guys, let me run the willItFit() test cases.
OK, now I've figured out that saving bytes with UTF-8 causes more problems than it solves. Thanks again.
Three bytes are required for every character from U+0800 upward (and four from U+10000 upward), so that's a HUGE number of potential characters. This includes the major East and Southeast Asian scripts, such as Japanese, Chinese, Korean, and Thai.
For a complete list of script ranges, you can refer to Unicode's block data. Only these blocks can be represented with 1 or 2 bytes; characters from all other blocks require 3 or 4 bytes (see the quick check after the list):
0000..007F Basic Latin
0080..00FF Latin-1 Supplement
0100..017F Latin Extended-A
0180..024F Latin Extended-B
0250..02AF IPA Extensions
02B0..02FF Spacing Modifier Letters
0300..036F Combining Diacritical Marks
0370..03FF Greek and Coptic
0400..04FF Cyrillic
0500..052F Cyrillic Supplement
0530..058F Armenian
0590..05FF Hebrew
0600..06FF Arabic
0700..074F Syriac
0750..077F Arabic Supplement
0780..07BF Thaana
07C0..07FF NKo
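The byte count follows directly from the code point value, so it is easy to verify. A quick Python check, comparing a hand-rolled calculation against the real encoder:

    def utf8_len(cp: int) -> int:
        # Number of bytes UTF-8 needs for one code point.
        if cp <= 0x7F:
            return 1
        if cp <= 0x7FF:
            return 2
        if cp <= 0xFFFF:
            return 3
        return 4

    for ch in "aя€𝄞":   # Latin, Cyrillic, Euro sign, musical symbol
        assert utf8_len(ord(ch)) == len(ch.encode("utf-8"))
        print(f"U+{ord(ch):04X} -> {utf8_len(ord(ch))} bytes")
    # U+0061 -> 1 bytes
    # U+044F -> 2 bytes
    # U+20AC -> 3 bytes
    # U+1D11E -> 4 bytes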
Here we go:
So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
More details: http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes - see the Basic Multilingual Plane, codes from U+0800.
Some examples: Indic scripts, Thai, Philippine scripts, Hiragana, Katakana. So all East Asian scripts and some others.
You even need three bytes just for English. For example, the typographically correct apostrophe is encoded in UTF-8 as 0xE2 0x80 0x99, opening quote marks are 0xE2 0x80 0x9C and closing quote marks are 0xE2 0x80 0x9D. The ellipsis is 0xE2 0x80 0xA6. And that's not even talking about all the different dashes, spaces or the inch and feet signs.
“It’s kinda hard to write English without the apostrophe’s help …”
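Easy to confirm by encoding those characters, e.g. in Python:

    for ch in "’“”…":   # apostrophe, quotes, ellipsis
        print(f"U+{ord(ch):04X}:", ch.encode("utf-8").hex(" "))
    # U+2019: e2 80 99
    # U+201C: e2 80 9c
    # U+201D: e2 80 9d
    # U+2026: e2 80 a6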
Many Asian scripts require more than 2 bytes per character. While other encodings can represent them more compactly, Japanese and Korean (at least) take at least three bytes per character in UTF-8.

String typeface comparison algorithm

I believe there is an algorithm which can match two strings whose characters have similar shapes but are different symbols (digits, Cyrillic, Latin, or other alphabets). For example:
"hello" (Latin symbols) equals to "he11o" (digits and Latin symbols)
"HELLO" (Latin symbols) equals to "НЕLLО" (Cyrillic and Latin symbols)
"really" (Latin symbols) equals to "геа11у" (digits and Cyrillic symbols)
You may be thinking of the algorithm that Paul E. Black developed for ICANN that determines whether two TLDs are "confusingly similar", though it currently does not work with mixed-script input (e.g. Latin and Cyrillic). See "Algorithm Helps ICANN Manage Top-level Domains" and the ICANN Similarity Assessment Tool.
Also, if you are interested in extending this algorithm, then you might want to incorporate information from the Unicode code charts, which commonly list similar glyphs and sequences of code points that render similarly.
I am not exactly sure what you are asking for.
If you want to know whether two characters look the same under a given typeface then you need to render each character in the chosen fonts into bitmaps and compare them to see if they are close to being identical.
If you just want to always consider the lowercase Latin 'l' to be the same as the digit '1' regardless of the font used, then you can simply define a character mapping table. Probably the easiest way to do this is to pick a canonical value for each set of characters that look the same and map all members of the set to that character. When you compare the strings, compare the canonical instance of each character from the table.
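A minimal sketch of that canonical-mapping approach in Python. The table here is illustrative only; a production version would want Unicode's confusables data (UTS #39):

    # Map every member of a look-alike set to one canonical character.
    CANONICAL = str.maketrans({
        "1": "l", "|": "l",   # digit one, vertical bar -> letter l
        "0": "o",             # digit zero -> letter o
        "\u0430": "a",        # CYRILLIC SMALL LETTER A
        "\u0435": "e",        # CYRILLIC SMALL LETTER IE
        "\u0433": "r",        # CYRILLIC SMALL LETTER GHE (looks like r)
        "\u0443": "y",        # CYRILLIC SMALL LETTER U (looks like y)
    })

    def skeleton(s: str) -> str:
        return s.lower().translate(CANONICAL)

    # Compare the canonical forms rather than the raw strings:
    print(skeleton("really") == skeleton("геа11у"))   # True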
