Implementing sample code for the Unicode Collation Algorithm

I have the following requirement in my project.
I need to sort strings according to a character order provided by the client.
For example:
Order provided by the user: d, a, A, D, z, p, P, Z
Given the strings AaP, aAp, PpZ, pPz,
the sorted output should be aAp, AaP, pPz, PpZ, because a comes before A, A before p, and p before P in the order given by the user.
I am thinking of using the Unicode Collation Algorithm (http://unicode.org/reports/tr10/) to implement this.
Can someone suggest data structures to use for the following, for better performance?
1.) Mapping the ASCII values of the characters to the order given by the user. I am thinking of using a map, but access would be O(log n). I cannot use a hash map as I code in C++.
2.) What sorting techniques can be used to compare the sort keys once they are generated? Could something like radix sort be used here?
Please share your thoughts.
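A minimal sketch of the mapping-plus-sort-key idea, written in Python for brevity. The names `USER_ORDER`, `RANK` and `custom_sort_key` are made up for this illustration, and it assumes every character of every string appears in the user's order (an unknown character raises `KeyError`).

```python
# Order supplied by the user in the example above.
USER_ORDER = "daADzpPZ"

# Map each character to its rank in the user's order.
# A dict gives average O(1) lookups; for plain ASCII input a fixed
# 128-entry array indexed by character code would serve the same purpose.
RANK = {ch: position for position, ch in enumerate(USER_ORDER)}

def custom_sort_key(text):
    """Translate a string into the list of ranks of its characters."""
    return [RANK[ch] for ch in text]

words = ["AaP", "aAp", "PpZ", "pPz"]
result = sorted(words, key=custom_sort_key)
print(result)
```

Because each key is a sequence of small integers, any comparison sort works on the keys directly, and a radix sort over key positions is also possible when the keys have bounded length.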
Although it is not needed for my project, I would also like to know
how collation elements such as the following, taken from the Unicode Collation Algorithm link above, are actually created from the Unicode or ASCII values:
Character   Collation Element   Name
0300 "`"    [.0000.0021.0002]   COMBINING GRAVE ACCENT
0061 "a"    [.06D9.0020.0002]   LATIN SMALL LETTER A
0062 "b"    [.06EE.0020.0002]   LATIN SMALL LETTER B
0063 "c"    [.0706.0020.0002]   LATIN SMALL LETTER C
0043 "C"    [.0706.0020.0008]   LATIN CAPITAL LETTER C
0064 "d"    [.0712.0020.0002]   LATIN SMALL LETTER D
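In the Unicode Collation Algorithm the collation elements are not computed from the code point; they are looked up in a weight table (the default table or a tailored one). The sketch below hard-codes only the six rows quoted above and shows how a multi-level sort key can be formed from them. It is a simplified illustration, not a conforming implementation: it skips normalization, contractions, expansions and variable weighting, and the names `COLLATION_TABLE` and `uca_sort_key` are invented here.

```python
# (primary, secondary, tertiary) weights copied from the table excerpt above.
COLLATION_TABLE = {
    "\u0300": (0x0000, 0x0021, 0x0002),  # COMBINING GRAVE ACCENT
    "a": (0x06D9, 0x0020, 0x0002),
    "b": (0x06EE, 0x0020, 0x0002),
    "c": (0x0706, 0x0020, 0x0002),
    "C": (0x0706, 0x0020, 0x0008),
    "d": (0x0712, 0x0020, 0x0002),
}

def uca_sort_key(text):
    """Build a simplified sort key: all primary weights, then all
    secondary weights, then all tertiary weights, with a 0000 separator
    between levels. Zero weights are skipped at each level."""
    elements = [COLLATION_TABLE[ch] for ch in text]
    key = []
    for level in range(3):
        if level > 0:
            key.append(0x0000)  # level separator
        key.extend(e[level] for e in elements if e[level] != 0)
    return tuple(key)

words = ["cad", "Cab", "ca\u0300b", "cab"]
print(sorted(words, key=uca_sort_key))
```

With these weights, a difference in base letters outranks an accent difference, which in turn outranks a case difference, because the levels are compared in that order.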

Related

Is there some kind of sorting convention?

Is there an established convention for sorting lines (characters)? Something that plays a role similar to the one PCRE plays for regular expressions.
For example, if you try to sort 0A1b-a2_B (each character on its own line) with Sublime Text (Ctrl-F9) and Vim (:%sort), the result will be the same (see below). However, I'm not sure it will be the same with other editors and IDEs.
-
0
1
2
A
B
_
a
b
Generally, characters are sorted by their numeric value. This originally applied only to ASCII characters, but it has been carried over to Unicode encodings as well. http://www.asciitable.com/
If no preference is given to the contrary, this is the de facto standard for sorting characters. Save for the actual alphabetical characters, the ordering is somewhat arbitrary.
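As a quick check of the numeric-value ordering, here is a small Python sketch that sorts the characters from the question by code point (Python's default string comparison). The variable names are illustrative only.

```python
# The nine characters from the question, one per "line".
chars = list("0A1b-a2_B")

# Python compares strings by code point, i.e. by numeric value.
by_numeric_value = sorted(chars)

for ch in by_numeric_value:
    print(ch, ord(ch))
```

The printed order matches the editor output shown in the question, with the underscore landing between the upper-case and lower-case letters.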
There are two main ways of sorting character strings:
Lexicographic: numeric value of either the codepoint values or the code unit values or the serialized code unit values (bytes). For some character encodings, they would all be the same. The algorithm is very simple but this method is not human-friendly.
Culture/Locale-specific: an ordinal database for each supported culture is used. For the Unicode character set, it's called the CLDR. Also, in applying sorting for Unicode, sorting can respect grapheme clusters. A grapheme cluster is a base codepoint followed by a sequence of zero or more non-spacing (applied as extensions of the previous glyph) marks.
For some older character sets with one encoding, designed for only one or two scripts, the two methods might amount to the same thing.
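To illustrate why the choice between code points, code units and bytes can matter for the lexicographic method, here is a small Python sketch. The two sample characters are picked for this example: one BMP code point above the surrogate range and one supplementary code point.

```python
bmp_char = "\uE000"            # U+E000, a single UTF-16 code unit
supplementary = "\U00010000"   # U+10000, a surrogate pair in UTF-16

pair = [supplementary, bmp_char]

# Order by code point value (Python's default string comparison).
by_code_point = sorted(pair)

# Order by UTF-16 code units (big-endian bytes preserve code unit order).
by_utf16_units = sorted(pair, key=lambda s: s.encode("utf-16-be"))

# Order by UTF-8 bytes.
by_utf8_bytes = sorted(pair, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)   # UTF-8 byte order follows code point order
print(by_code_point == by_utf16_units)  # UTF-16 code unit order can differ
```

U+10000 is encoded as the surrogate pair D800 DC00, which compares lower than the single unit E000, so the UTF-16 ordering swaps the two characters.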
Sometimes, people read a format into strings, such as a sequence of letters followed by a sequence of digits, or one of several date formats. These are very specialized sorts that should be applied wherever users expect them. Note: The ISO 8601 date format, which is based on the Gregorian calendar, sorts correctly regardless of method (for all? character encodings).
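The note about ISO 8601 can be checked with a short Python sketch: sorting the date strings as plain text gives the same order as sorting them as parsed dates. The sample dates are arbitrary, and the sketch assumes the fixed-width YYYY-MM-DD form.

```python
from datetime import date

stamps = ["2021-03-15", "2020-12-31", "2021-01-02"]

# Plain lexicographic sort of the strings.
as_text = sorted(stamps)

# Chronological sort using the parsed dates as keys.
as_dates = sorted(stamps, key=date.fromisoformat)

print(as_text == as_dates)
```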

Why is the Soundex code for Burroughs B622 instead of B620?

Steps of Soundex Algorithm
Site For Checking Soundex code
If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
Soundex Algo
Dropping the vowels and the letters 'h' and 'w', Burroughs converts to the numeric form B6622. Applying the adjacent-same-number rule gives the code B62. Since we need three digits, we append a zero at the end.
So the final code should be B620?
Yes, the Soundex code for Burroughs is B620.
Confirmed by manually working through the "alternate" version of the American Soundex algorithm listed on Wikipedia.
https://en.wikipedia.org/wiki/Soundex
Burroughs:
Burrougs
1u66ou22
1u6ou2
162
B62
B620
Also verified with another soundex calculator.
http://resources.rootsweb.ancestry.com/cgi-bin/soundexconverter
The original Soundex function checker web site you used does seem to be wrong - it is not applying the "h/w" part of the algorithm you highlighted in your question. That web site gives the same value "B622" for "Burroughs" and "Burrouges", which it should not.
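The manual steps traced above can be sketched in Python as follows. This follows the "alternate" step order from the Wikipedia article as used in the trace; the function name `soundex` and the `SOUNDEX_CODES` table are written for this illustration, and the sketch assumes a non-empty name made of ASCII letters.

```python
# Consonant groups and their Soundex digits.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(name):
    name = name.lower()
    first = name[0]
    # Step 1: drop 'h' and 'w' everywhere except the first letter.
    kept = first + "".join(ch for ch in name[1:] if ch not in "hw")
    # Step 2: replace consonants (including the first letter) with digits;
    # vowels stay in place for now so they can separate equal digits.
    coded = [SOUNDEX_CODES.get(ch, ch) for ch in kept]
    # Step 3: collapse runs of the same adjacent digit.
    collapsed = [coded[0]]
    for symbol in coded[1:]:
        if not (symbol.isdigit() and symbol == collapsed[-1]):
            collapsed.append(symbol)
    # Step 4: remove vowels (and 'y') after the first position.
    digits = [collapsed[0]] + [s for s in collapsed[1:] if s.isdigit()]
    # Step 5: put the saved first letter back in the first position.
    digits[0] = first.upper()
    # Step 6: pad with zeros and keep the letter plus three digits.
    return ("".join(digits) + "000")[:4]

print(soundex("Burroughs"), soundex("Burrouges"))
```

Under these rules the 'g' and 's' in "Burroughs" are separated only by 'h' and are coded once, while in "Burrouges" they are separated by a vowel and are coded twice.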

Dealing with Cyrillic alphabet used in place of Latin characters

We've recently had a user enter English text, but it appears to have been typed on a computer set up for Cyrillic, as some of the letters such as "a" are actually CYRILLIC SMALL LETTER A rather than LATIN SMALL LETTER A.
I thought that normalising would convert the Cyrillic to the Latin equivalent, but it does not (I guess that they are only equivalent in how they are displayed rather than in their meaning).
Is this a common problem: users whose computers are set up for Cyrillic writing English, but with Cyrillic letters instead?
What would be a safe way to spot this in general, and convert it appropriately?
To detect Cyrillic, just use the regex match [\p{IsCyrillic}]. A more generic approach would be to search for any non-Latin characters.
Once you've got a match, you'll need to replace the characters with their Latin equivalents.
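A rough sketch of the detect-then-replace approach in Python. Python's built-in `re` module does not support `\p{IsCyrillic}`, so this sketch matches the basic Cyrillic block (U+0400 to U+04FF) by range instead. The `CYRILLIC_TO_LATIN` table is a small hand-picked set of look-alike letters assumed for this example, not a complete or authoritative list.

```python
import re

# Basic Cyrillic block, used here in place of \p{IsCyrillic}.
CYRILLIC_PATTERN = re.compile(r"[\u0400-\u04FF]")

# Hand-picked look-alikes (assumed for illustration, far from complete).
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043E": "o", "\u0440": "p",
    "\u0441": "c", "\u0443": "y", "\u0445": "x",
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041A": "K",
    "\u041C": "M", "\u041D": "H", "\u041E": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
})

def contains_cyrillic(text):
    """True if any character falls in the basic Cyrillic block."""
    return CYRILLIC_PATTERN.search(text) is not None

def replace_lookalikes(text):
    """Replace known Cyrillic look-alikes with their Latin counterparts."""
    return text.translate(CYRILLIC_TO_LATIN)

sample = "p\u0430ssw\u043Erd"  # Cyrillic 'а' and 'о' inside a Latin word
if contains_cyrillic(sample):
    print(replace_lookalikes(sample))
```

Letters with no Latin look-alike are left untouched by the replacement, so running `contains_cyrillic` again afterwards shows whether the text was genuinely Cyrillic rather than mistyped English.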

String typeface comparison algorithm

I believe there is an algorithm that can treat two strings as equal when their characters look alike but are different symbols (digits, Cyrillic, Latin or other alphabets). For example:
"hello" (Latin symbols) equals "he11o" (digits and Latin symbols)
"HELLO" (Latin symbols) equals "НЕLLО" (Cyrillic and Latin symbols)
"really" (Latin symbols) equals "геа11у" (digits and Cyrillic symbols)
You may be thinking of the algorithm that Paul E. Black developed for ICANN that determines whether two TLDs are "confusingly similar", though it currently does not work with mixed-script input (e.g. Latin and Cyrillic). See "Algorithm Helps ICANN Manage Top-level Domains" and the ICANN Similarity Assessment Tool.
Also, if you are interested in extending this algorithm, then you might want to incorporate information from the Unicode code charts, which commonly list similar glyphs and sequences of code points that render similarly.
I am not exactly sure what you are asking for.
If you want to know whether two characters look the same under a given typeface then you need to render each character in the chosen fonts into bitmaps and compare them to see if they are close to being identical.
If you just want to always consider the lower-case Latin 'l' to be the same as the digit '1' regardless of the font used, then you can simply define a character mapping table. Probably the easiest way to do this would be to pick a canonical value for each set of characters that look the same and map all members of the set to that character. When you compare the strings, compare the canonical instance of each character from the table.
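The mapping-table idea might look like this in Python. The `CANONICAL` table only covers the characters needed for the three examples in the question and is an assumption made for this sketch; a real table would have to be much larger.

```python
# Map each look-alike to one canonical character per visual class.
# Only the characters from the question's examples are covered here.
CANONICAL = {
    "1": "l",        # digit one -> Latin small l
    "\u0433": "r",   # Cyrillic small ghe
    "\u0435": "e",   # Cyrillic small ie
    "\u0430": "a",   # Cyrillic small a
    "\u0443": "y",   # Cyrillic small u
    "\u041D": "H",   # Cyrillic capital en
    "\u0415": "E",   # Cyrillic capital ie
    "\u041E": "O",   # Cyrillic capital o
}

def canonical_form(text):
    """Replace every character by the canonical member of its class."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)

def looks_equal(left, right):
    """Compare two strings by their canonical forms."""
    return canonical_form(left) == canonical_form(right)

print(looks_equal("hello", "he11o"))
print(looks_equal("HELLO", "\u041D\u0415LL\u041E"))
print(looks_equal("really", "\u0433\u0435\u043011\u0443"))
```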

Is there any logic behind ASCII codes' ordering?

I was teaching C to my younger brother, who is studying engineering. I was explaining to him how different data types are actually stored in memory. I explained the logic behind signed/unsigned numbers and the floating-point representation of decimal numbers. While telling him about the char type in C, I also took him through the ASCII code system and how a char is stored as a 1-byte number.
He asked me why 'A' has been given ASCII code 65 and not something else. Similarly, why is 'a' given the code 97 specifically? Why is there a gap of 6 ASCII codes between the range of capital letters and the range of small letters? I had no idea. Can you help me understand this, since it has made me very curious as well? I've never found a book that discusses this topic.
What is the reason behind this? Are ASCII codes logically organized?
There are historical reasons, mainly to make ASCII codes easy to convert:
Digits (0x30 to 0x39) start with the bits 11, followed by the digit's value in four bits:
0 is 110000
1 is 110001
2 is 110010
etc.
So if you wipe out the prefix (the first two '1's), you end up with the digit in binary coded decimal.
Capital letters start with the bits 10, followed by the letter's position in the alphabet in five bits:
A is 1000001
B is 1000010
C is 1000011
etc.
Same thing, if you remove the prefix (the first '1'), you end up with alphabet-indexed characters (A is 1, Z is 26, etc).
Lowercase letters start with the bits 11, followed by the letter's position in the alphabet in five bits:
a is 1100001
b is 1100010
c is 1100011
etc.
Same as above. So if you add 32 (100000) to a capital letter, you have the lowercase version.
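The bit patterns listed above can be checked with a few lines of Python. The helper names are made up for this sketch, and the functions assume their argument is an ASCII digit or letter as appropriate.

```python
def digit_value(ch):
    """Keep the low four bits of a digit character: '7' -> 7."""
    return ord(ch) & 0b0001111

def alphabet_index(ch):
    """Keep the low five bits of a letter: 'A' or 'a' -> 1, 'Z' or 'z' -> 26."""
    return ord(ch) & 0b0011111

def to_lowercase(ch):
    """Add 32 (binary 100000) to a capital letter to get the lowercase one."""
    return chr(ord(ch) + 32)

print(format(ord("2"), "07b"), digit_value("2"))
print(format(ord("C"), "07b"), alphabet_index("C"))
print(format(ord("c"), "07b"), alphabet_index("c"))
print(to_lowercase("C"))
```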
The ASCII chart on Wikipedia shows this quite well: notice the two columns of control characters, two of upper case and two of lower case, with the gaps filled in with miscellaneous symbols.
Also bear in mind that ASCII was developed based on what had passed before. For more detail on the history of ASCII, see this superb article by Tom Jennings, which also includes the meaning and usage of some of the stranger control characters.
Here is very detailed history and description of ASCII codes: http://en.wikipedia.org/wiki/ASCII
In short:
ASCII is based on teleprinter encoding standards
first 32 characters are "nonprintable" - used for text formatting
then they continue with printable characters, roughly in order they are placed on keyboard. Check your keyboard:
space,
the shifted signs on the number keys: !, ", #, ...,
numbers
signs usually placed at the end of keyboard row with numbers - upper case
capital letters, alphabetically
signs usually placed at the end of keyboard rows with letters - upper case
small letters, alphabetically
signs usually placed at the end of keyboard rows with letters - lower case
The distance between A and a is 32. That's quite a round number, isn't it?
The gap of 6 characters between capital letters and small letters is because (32 - 26) = 6. (Note: there are 26 letters in the English alphabet).
If you look at the binary representations for 'a' and 'A', you'll see that they only differ by 1 bit, which is pretty useful (turning upper case to lower case or vice-versa is just a matter of flipping a bit). Why start there specifically, I have no idea.
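The one-bit difference can be demonstrated with a short Python sketch. The names `CASE_BIT` and `flip_case` are invented for this example, and the trick is only meaningful for the ASCII letters A-Z and a-z.

```python
CASE_BIT = 0x20  # the single bit in which 'A' (0x41) and 'a' (0x61) differ

def flip_case(letter):
    """Toggle the case of an ASCII letter by flipping one bit."""
    return chr(ord(letter) ^ CASE_BIT)

print(format(ord("A"), "07b"))
print(format(ord("a"), "07b"))
print(flip_case("a"), flip_case("Q"))
```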
'A' is 0x41 in hexadecimal.
'a' is 0x61 in hexadecimal.
'0' through '9' is 0x30 - 0x39 in hexadecimal.
So at least it is easy to remember the numbers for A, a and 0-9. I have no idea about the symbols. See the Wikipedia article on ASCII ordering.
Wikipedia:
The code itself was structured so that most control codes were together, and all graphic codes were together. The first two columns (32 positions) were reserved for control characters.[14] The "space" character had to come before graphics to make sorting algorithms easy, so it became position 0x20.[15] The committee decided it was important to support upper case 64-character alphabets, and chose to structure ASCII so it could easily be reduced to a usable 64-character set of graphic codes.[16] Lower case letters were therefore not interleaved with upper case. To keep options open for lower case letters and other graphics, the special and numeric codes were placed before the letters, and the letter 'A' was placed in position 0x41 to match the draft of the corresponding British standard.[17] The digits 0–9 were placed so they correspond to values in binary prefixed with 011, making conversion with binary-coded decimal straightforward.

Resources