String typeface comparison algorithm - algorithm

I believe, there is an algorithm, which can equal two strings with similar typefaces of a characters, but different symbols (digits, Cyrillic, Latin or other alphabets). For example:
"hello" (Latin symbols) equals to "he11o" (digits and Latin symbols)
"HELLO" (Latin symbols) equals to "НЕLLО" (Cyrillic and Latin symbols)
"really" (Latin symbols) equals to "геа11у" (digits and Cyrillic symbols)

You may be thinking of the algorithm that Paul E. Black developed for ICANN that determines whether two TLDs are "confusingly similar", though it currently does not work with mixed-script input (e.g. Latin and Cyrillic). See "Algorithm Helps ICANN Manage Top-level Domains" and the ICANN Similarity Assessment Tool.
Also, if you are interested in extending this algorithm, then you might want to incorporate information from the Unicode code charts, which commonly list similar glyphs and sequences of code points that render similarly.

I am not exactly sure what you are asking for.
If you want to know whether two characters look the same under a given typeface then you need to render each character in the chosen fonts into bitmaps and compare them to see if they are close to being identical.
If you just want to always consider lower-case latin 'l' to be the same as the digit '1' regardless of the font used, then you can simply define a character mapping table. Probably the easiest way to do this would be to pick a canonical value for each set of characters that looks the same and map all members of the set to that character. When you compare the strings, compare the canonical instance of each character from the table.

Related

What are Unicode codepoint types for?

I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
However, some modern languages that use the UTF-8 default have built-in codepoint types, such as rune in Go and char in Rust.
What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?
Texts have many different meaning and usages, so the question is difficult to answer.
First: about codepoint. We uses the term codepoint because it is easy, it implies a number (code), and not really confuseable with other terms. Unicode tell us that it doesn't use the term codepoint and character in a consistent way, but also that it is not a problem: context is clear, and they are often interchangeable (but for few codepoints which are not characters, like surrogates, and few reserved codepoints). Note: Unicode is mostly about characters, and ISO 10646 was most about codepoints. So original ISO was about a table with numbers (codepoint) and names, and Unicode about properties of characters. So we may use codepoints where Unicode character should be better, but character is easy confuseable with C char, and with font glyphs/graphemes.
Codepoints are one basic unit, so useful for most of programs, e.g. to store in databases, to exchange to other programs, to save files, for sorting, etc. For this exact reasons program languages uses the codepoint as type. UTF-8 code units may be an alternative, but it would be more difficult to navigate (see a UTF-8 as a tape disk where you should read sequentially, and codepoint text as an hard disk where you can just in middle of a text). Not a 100% appropriate, because you may need some context bytes. If you are getting user text, your program probably do not need to split in graphemes, to do liguatures, etc. if it will just store the data in a database. Codepoint is really low level and so fast for most operations.
The other part of text: displaying (or speech). This part is very complex, because we have many different scripts with very different rules, and then different languages with own special cases. So we needs a series of libraries, e.g. text layout (so word separation, etc. like pango), sharper engine (to find which glyph to use, combining characters, where to put next characters, e.g. HarfBuzz), and a font library which display the font (cairo plus freetype). it is complex, but most programmers do not need special handling: just reading text from database and sent to screen, so we just uses the relevant library (and it depends on operating system), and just going on. It is too complex for a language specification (and also a moving target, maybe in 30 years things are more standardized). So it is complex, and with many operation, so we may use complex structures (array of array of codepoint: so array of graphemes): not much a slow down. Note: fonts have codepoint tables to perform various operation before to find the glyph index. Various API uses Unicode strings (as codepoint array, UTF-16, UTF-8, etc.).
Naturally things are more complex, and it requires a lot of knowledge of different part of Unicode, if you are trying to program an editor (WYSIWYG, but also with terminals): you mix both worlds, and you need much more information (e.g. for selection of text). But in this case you must create your own structures.
And really: things are complex: do you want to just show first x characters on your blog? (maybe about assessment), or split at words (some language are not so linear, so the interpretation may be very wrong). For now just humans can do a good job for all languages, so also not yet need to a supporting type in different languages.
The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
Where? It merely outlines advantages and disadvantages of code points. Two examples are:
Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.
Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
In other words: code points just index which graphemes Unicode supports.
Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.
Sometimes the same character has multiple code-points as per context: the dollar sign exists as:
﹩ = U+FE69 (SMALL DOLLAR SIGN)
$ = U+FF04 (FULLWIDTH DOLLAR SIGN)
💲 = U+1F4B2 (HEAVY DOLLAR SIGN)
Storage wise when searching for one variant you might want to match all 3 variants instead on relying on the exact code point only.
Sometimes multiple code points can be combined to form up a single character:
á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) - in a text editor trying to delete á (from the right side) will mostly result in actually deleting the acute accent first. Searching for either variant should find both variants.
Storage wise you avoid to need searching for both variants by performing Unicode normalization, i.e. NFC to always favor precombined code points over two combined code points to form one character.
As for homoglyphs code points clearly distinguish the contextual meaning:
A = U+0041 (LATIN CAPITAL LETTER A)
Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
А = U+0410 (CYRILLIC CAPITAL LETTER A)
Copy the greek or cyrillic character, then search this website for that letter - it will never find the other letters, no matter how similar they look. Likewise the latin letter A won't find the greek or cyrillic one.
Writing system wise code points can be used by multiple alphabets: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible - Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese:
今 = U+4ECA
入 = U+5165
才 = U+624D
Dealing as a programmer with code points has valid reasons. Programming languages which support these may (or may not) support correct encodings (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogates for UTF-16. Text wise users should not be concerned about code points, although it would help them distinguishing homographs.

Does it exist some kind of sorting convention?

Does it exist some established convention of sorting lines (characters)? Some convention which should play the similar role as PCRE for regular expressions.
For example, if you try to sort 0A1b-a2_B (each character on its own line) with Sublime Text (Ctrl-F9) and Vim (:%sort), the result will be the same (see below). However, I'm not sure it will be the same with another editors and IDEs.
-
0
1
2
A
B
_
a
b
Generally, characters are sorted based on their numeric value. While this used to only be applied to ASCII characters, this has also been adopted by unicode encodings as well. http://www.asciitable.com/
If no preference is given to the contrary, this is the de facto standard for sorting characters. Save for the actual alphabetical characters, the ordering is somewhat arbitrary.
There are two main ways of sorting character strings:
Lexicographic: numeric value of either the codepoint values or the code unit values or the serialized code unit values (bytes). For some character encodings, they would all be the same. The algorithm is very simple but this method is not human-friendly.
Culture/Locale-specific: an ordinal database for each supported culture is used. For the Unicode character set, it's called the CLDR. Also, in applying sorting for Unicode, sorting can respect grapheme clusters. A grapheme cluster is a base codepoint followed by a sequence of zero or more non-spacing (applied as extensions of the previous glyph) marks.
For some older character sets with one encoding, designed for only one or two scripts, the two methods might amount to the same thing.
Sometimes, people read a format into strings, such as a sequence of letters followed by a sequence of digits, or one of several date formats. These are very specialized sorts that need to be applied where users expect. Note: The ISO 8601 date format for the Julian calendar sorts correctly regardless of method (for all? character encodings).

Does Unicode have a code point for a garbled digit?

I'm looking for a placeholder glyph to display "insert any digit here", to tersly communicate in limited GUI space that a range of numbers is meant.
For decimal numbers I would use x, e.g.
1xx - room numbers on first floor
2xx - room numbers on second floor
but my ranges are hexadecimal, so
0x00xx - IDs reserved for future use
0x01xx - IDs reserved for development
0x02xx - IDs managed by team Bravo
looks a bit odd, as the x would have two different meanings.
There is no Unicode character that simply means "any digit here". Unicode does offer an extensive range of symbols to choose from though, which will not be confused with 'x'. An underscore has the benefit that, in many fonts, it has the same width as a numeral. If you choose something more exotic, like ◌ DOTTED CIRCLE or ⯑ UNCERTAINTY SIGN, just ensure that it will be present in the font used for your interface.

Glyph to unicode string translation

Given a glyph index for a specific font, I need to get the unicode translation of the glyph. in order to build a glyph-to-unicode translation I'm using GetGlyphIndices for the whole unicode range and from the result I build the reverse translation (glyph to unicode character map). However, this gives me a translation between a single glyph to a single unicode character, and I can see that in Hindi for example, two unicode characters can be represented by one glyph.
For example, in the word namaste (नमस्ते) there are 6 unicode characters which are represented by 5 glyphs (the middle two unicode characters are represented by one glyph). I can see this by attaching to notepad.exe, inserting a breakpoint in ExtTextOut and printing this word from notepad.
Is there any way I can translate a glyph to a unicode string (in case the glyph represents more than one unicode character)?
1) For all but very simple cases, you should use Uniscribe functions (not GetGlyphIndices) for converting a string (sequence of Unicodes) into glyphs. This is noted in the documentation for GetGlyphIndices: http://msdn.microsoft.com/en-us/library/windows/desktop/dd144890(v=vs.85).aspx
2) There is no way to reliably do what you want to do for all cases. Even for most cases. This is the result of something known as complex script shaping, which translates a sequence of input Unicodes into a sequence of output glyphs. This is done using a number of tables in the font data. The two of most interest are the cmap and the GSUB.
The cmap maps Unicode values to font-specific glyphs. The cmap may specify multiple Unicodes mapping to a single glyph (multi-mapping). This is a commonly-used scheme in many fonts. Also, many glyphs in the font may not even be mapped in the cmap. Thus with this alone, you cannot reliably reverse-map a glyph to a single Unicode.
But it gets even more difficult: the GSUB may specify numerous rules and may convert one input glyph to many output glyphs, or a series of input glyphs into one output glyph. It can even specify contexts under which the conversion will occur (for example, it could say something like "convert 'A' to 'B' but only when the 'A' is preceded by a 'C'", so CA -> CB but DA -> DA). In some cases, specifically with Hindi and other Indic languages, the output glyph sequence may even be in a different order than the logical Unicode input sequence. The net result is that the output sequence of glyphs may map back to a single Unicode, or multiple Unicodes, or none at all. It may be possible to decode the rules of the GSUB + the logic of the script-shaping engine to narrow things down a bit (an adventure not suitable for the weak of spirit!), but the problem is still that multiple input Unicodes could end up resolving to the same output glyph.
Bottom line: it's best to view the process of converting a string -> font-specific glyphs as a one-way trip.
For a better understanding of these concepts, I strongly recommend that you read up on complex script shaping as implemented in Windows: http://www.microsoft.com/typography/otspec/TTOCHAP1.htm . As for coding in an application, the Uniscribe reference is also very informative: http://msdn.microsoft.com/en-us/library/windows/desktop/dd374091(v=vs.85).aspx

Is there any logic behind ASCII codes' ordering?

I was teaching C to my younger brother studying engineering. I was explaining him how different data-types are actually stored in the memory. I explained him the logistics behind having signed/unsigned numbers and floating point bit in decimal numbers. While I was telling him about char type in C, I also took him through the ASCII code system and also how char is also stored as 1 byte number.
He asked me why 'A' has been given ASCII code 65 and not anything else? Similarly why 'a' is given the code 97 specifically? Why is there a gap of 6 ASCII codes between the range of capital letters and small letters? I had no idea of this. Can you help me understand this, since this has created a great curiosity to me as well. I've never found any book so far that has discussed this topic.
What is the reason behind this? Are ASCII codes logically organized?
There are historical reasons, mainly to make ASCII codes easy to convert:
Digits (0x30 to 0x39) have the binary prefix 110000:
0 is 110000
1 is 110001
2 is 110010
etc.
So if you wipe out the prefix (the first two '1's), you end up with the digit in binary coded decimal.
Capital letters have the binary prefix 1000000:
A is 1000001
B is 1000010
C is 1000011
etc.
Same thing, if you remove the prefix (the first '1'), you end up with alphabet-indexed characters (A is 1, Z is 26, etc).
Lowercase letters have the binary prefix 1100000:
a is 1100001
b is 1100010
c is 1100011
etc.
Same as above. So if you add 32 (100000) to a capital letter, you have the lowercase version.
This chart shows it quite well from wikipedia: Notice the two columns of control 2 of upper 2 of lower, and then gaps filled in with misc.
Also bear in mind that ASCII was developed based on what had passed before. For more detail on the history of ASCII, see this superb article by Tom Jennings, which also includes the meaning and usage of some of the stranger control characters.
Here is very detailed history and description of ASCII codes: http://en.wikipedia.org/wiki/ASCII
In short:
ASCII is based on teleprinter encoding standards
first 30 characters are "nonprintable" - used for text formatting
then they continue with printable characters, roughly in order they are placed on keyboard. Check your keyboard:
space,
upper case sign on number caps: !, ", #, ...,
numbers
signs usually placed at the end of keyboard row with numbers - upper case
capital letters, alphabetically
signs usually placed at the end of keyboard rows with letters - upper case
small letters, alphabetically
signs usually placed at the end of keyboard rows with letters - lower case
The distance between A and a is 32. That's quite round number, isn't it?
The gap of 6 characters between capital letters and small letters is because (32 - 26) = 6. (Note: there are 26 letters in the English alphabet).
If you look at the binary representations for 'a' and 'A', you'll see that they only differ by 1 bit, which is pretty useful (turning upper case to lower case or vice-versa is just a matter of flipping a bit). Why start there specifically, I have no idea.
'A' is 0x41 in hexidecimal.
'a' is 0x61 in hexidecimal.
'0' thru '9' is 0x30 - 0x39 in hexidecimal.
So at least it is easy to remember the numbers for A, a and 0-9. I have no idea about the symbols. See The Wikipedia article on ASCII Ordering.
Wikipedia:
The code itself was structured so that
most control codes were together, and
all graphic codes were together. The
first two columns (32 positions) were
reserved for control characters.[14]
The "space" character had to come
before graphics to make sorting
algorithms easy, so it became position
0x20.[15] The committee decided it was
important to support upper case
64-character alphabets, and chose to
structure ASCII so it could easily be
reduced to a usable 64-character set
of graphic codes.[16] Lower case
letters were therefore not interleaved
with upper case. To keep options open
for lower case letters and other
graphics, the special and numeric
codes were placed before the letters,
and the letter 'A' was placed in
position 0x41 to match the draft of
the corresponding British
standard.[17] The digits 0–9 were
placed so they correspond to values in
binary prefixed with 011, making
conversion with binary-coded decimal
straightforward.

Resources