Efficient Means of Implementing Collation & Sorting?

I'm writing lexicography software, which may theoretically need to sort tens of thousands of strings with arbitrary (dictionary-project-specific) collations. There are two means of specifying custom collations:
1. a map of graphemes to Unicode-style multi-level collation keys, or
2. an array of alphabetic graphemes (possibly including digraphs, etc.) in sorting order, which can be internally transformed into a map of collation keys.
The naive method of comparing strings is to check grapheme-by-grapheme until you find a mismatch, and then look up the collation keys for the mismatched graphemes to compare, but I'm hoping there's a more efficient way of doing it.
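In code, the naive comparison might look something like the following sketch, with the graphemes pre-split and the multi-level keys flattened to a single integer per grapheme for illustration:

    import java.util.List;
    import java.util.Map;

    class NaiveCollator {
        private final Map<String, Integer> collationKeys; // grapheme -> collation key

        NaiveCollator(Map<String, Integer> collationKeys) {
            this.collationKeys = collationKeys;
        }

        // Compare two strings that have already been split into graphemes.
        int compare(List<String> a, List<String> b) {
            int n = Math.min(a.size(), b.size());
            for (int i = 0; i < n; i++) {
                if (!a.get(i).equals(b.get(i))) {
                    // First mismatch: fall back to the collation keys.
                    return Integer.compare(collationKeys.get(a.get(i)),
                                           collationKeys.get(b.get(i)));
                }
            }
            return Integer.compare(a.size(), b.size()); // a prefix sorts first
        }
    }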
The best idea I've got so far depends on noticing that strings of equal length can be treated as big-endian base-n numbers (first grapheme most significant), so I can pre-compute an integer key for each string which turns collation into cheap integer comparison. But this breaks for strings of different lengths (a big deal when sorting a dictionary), and there's no bound on the size of the integers that could be generated.
To account for length differences, I thought I could compute a list of keys for all prefixes of each string, and then just compare the keys for prefixes of length equal to the shorter string being compared. That seems to do pretty well, but key sizes are still unbounded, and storing the keys could use a lot of memory.
Is there a way to improve that approach? Or am I just going about it entirely wrong, and there's a much better means of sorting strings with arbitrary collations?

How about a grapheme-by-grapheme radix sort? You get O(n * m) sorting, where n is the number of words and m is the length of the longest word. The idea is fairly simple: put all the words that start with A in the A bucket, the Bs in the B bucket, and so on down the characters in each word, as in the sketch below.
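A minimal sketch of that idea, assuming plain lowercase ASCII words; a custom alphabet would bucket on a collation-key lookup rather than on w.charAt(depth) - 'a':

    import java.util.ArrayList;
    import java.util.List;

    class MsdBucketSort {
        // Sort words by repeatedly bucketing on the character at position `depth`.
        static void sort(List<String> words, int depth) {
            if (words.size() <= 1) return;
            List<String> exhausted = new ArrayList<>(); // words shorter than depth
            List<List<String>> buckets = new ArrayList<>();
            for (int i = 0; i < 26; i++) buckets.add(new ArrayList<>());
            for (String w : words) {
                if (w.length() <= depth) exhausted.add(w);
                else buckets.get(w.charAt(depth) - 'a').add(w);
            }
            words.clear();
            words.addAll(exhausted); // words that ended first sort first
            for (List<String> bucket : buckets) {
                sort(bucket, depth + 1); // recurse on the next character
                words.addAll(bucket);
            }
        }
    }

Calling sort(words, 0) sorts the whole list; each word is visited at most once per character position, which gives the O(n * m) bound above.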

I'm no expert, but I might suggest a hybrid between the naive approach and your approach: look at a fixed number of bytes in each string, treat them as a big-endian number, and compare using pre-calculated collation keys. If the chunks are equal, move on to the next chunk of the same length and do the same. The tricky part is dealing with variable-length graphemes (such as UTF-8 sequences or digraphs). The simplest solution would be to use a fixed-width representation in the dictionary, but there might be another, more sophisticated solution, which I can't think of right now.
Once you reach the end of the shorter string, you zero-extend it to the next chunk boundary and then do the comparison, as in the sketch below.
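A minimal sketch of that chunked comparison, assuming the per-character collation elements have already been looked up and reduced to single bytes starting at 1 (so the zero padding makes a prefix sort first):

    class ChunkedCompare {
        // Compare two strings of collation elements 8 bytes at a time.
        static int compare(byte[] a, byte[] b) {
            int max = Math.max(a.length, b.length);
            for (int i = 0; i < max; i += 8) {
                long ka = pack(a, i);
                long kb = pack(b, i);
                if (ka != kb) return Long.compareUnsigned(ka, kb);
            }
            return 0;
        }

        // Pack up to 8 bytes starting at `offset` into a big-endian long,
        // zero-extending past the end of the array.
        static long pack(byte[] s, int offset) {
            long k = 0;
            for (int j = 0; j < 8; j++) {
                int idx = offset + j;
                k = (k << 8) | (idx < s.length ? (s[idx] & 0xFF) : 0);
            }
            return k;
        }
    }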
You could also look at open-source implementations of collations, and see if they do something more sophisticated (for instance the GNU implementation of the strcoll C function).

Related

Data structure to store objects identified by unique 8-digit hexadecimals for fast insertion and lookup

I have a bunch of objects with unique 8-digit hexadecimal identifiers, e.g. fd4786ac, that I need to construct and look up quickly. Deletion is not a priority. These hexadecimal values are currently being stored as strings.
I've considered a trie (or some variation of one), a skip list, and some variation of a hash table. A skip list would be preferable to an AVL tree, since these strings are likely (but not guaranteed) to be sequential and tree re-balancing would be frequent. However, I'm open to other data structures if they better suit my needs.
A good choice would be to convert your keys into 32-bit integers, and then use a hash table.
If you want to write your own just for this use case, then:
Instead of hashing keys all the time or storing hash values, use a bijective hash function and use the hashes instead of the keys.
Since your keys are very small you should probably use open addressing -- it will save space and it's a little faster. Wikipedia will give you lots of choices for probing schemes. I currently like robin hood hashing: https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/
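A minimal sketch of the first suggestion (parse the identifier into a 32-bit key once, then use an ordinary hash table); HexIndex is a hypothetical wrapper, not a library class:

    import java.util.HashMap;
    import java.util.Map;

    class HexIndex<V> {
        private final Map<Integer, V> table = new HashMap<>();

        // "fd4786ac" -> 0xFD4786AC; parseUnsignedInt accepts the full
        // 32-bit range, which Integer.parseInt would reject.
        private static int key(String hexId) {
            return Integer.parseUnsignedInt(hexId, 16);
        }

        void put(String hexId, V value) { table.put(key(hexId), value); }

        V get(String hexId) { return table.get(key(hexId)); }
    }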
Your 8-digit hexadecimal identifiers represent a 4-byte (32-bit) integer, so you could use that integer directly as an index into a (quite large) array with 2^32 entries.
If the array contains 64-bit pointers, that costs 32GB, most likely too much to keep in RAM.
So if the number of elements is orders of magnitude below 2^32, use a hash map or a sorted list (O(log n) access).

Can anyone provide the algorithm of Bead Sort with respect to a string array?

I'm unable to adapt the integer examples to a String array in Java. Could anyone explain how? Java would be preferred.
According to the Wikipedia article, the algorithm can only sort lists of positive integers and in the best case requires O(n^2) extra space.
If you want to sort a list of strings with bead sort, you need some way to convert those strings to positive integers such that the relationships of the numbers exactly match the relationships among the strings. So if you have the strings "ABC", "JKL" and "XYZ", then the number you generate for "XYZ" has to be greater than the number you generate for "JKL", which has to be greater than the number for the string "ABC".
That isn't a particularly difficult thing to do if your strings are all four bytes long or shorter (or even eight bytes, if you want to sort long integers): you map each byte of the string to a byte in the integer, left-justified so that shorter strings still compare correctly. So "ABC" would become 0x41424300, as in the sketch below.
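A hypothetical packing function along those lines:

    class BeadKey {
        // Pack a string of up to 4 bytes into an int, left-justified and
        // zero-padded on the right, so numeric order matches string order
        // even when lengths differ ("B" must sort after "AZ").
        static int packKey(String s) {
            if (s.length() > 4) throw new IllegalArgumentException("longer than 4 bytes");
            int key = 0;
            for (int i = 0; i < 4; i++) {
                key = (key << 8) | (i < s.length() ? (s.charAt(i) & 0xFF) : 0);
            }
            return key; // packKey("ABC") == 0x41424300
        }
    }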
But in the general case, if your strings are longer than 8 bytes, coming up with that mapping would be more expensive than using a different sorting algorithm. Well, unless you wanted to use some special BigInteger class.
Even then, the O(n^2) extra space is going to be a deal killer even for modest inputs: sorting an array of 32,000 integers would require four gigabytes of extra space.
In short, with the description you've given, what you're asking to do is not reasonably possible.

Transform two strings in such a way that the distance between the input strings is 'reflected' in the distance between the output strings?

I have a list of user identifiers which are pretty long. The identifiers may not be exactly identical each time they arrive with an HTTP request, so I use fuzzy string comparison to authenticate the user. For that reason I can't hash the identifiers: my fuzzy comparison algorithm won't work with hashed values, since even slightly different plaintexts yield completely different values when hashed. Is there some algorithm algx such that distance(s1, s1') is in some way proportional to distance(algx(s1), algx(s1'))? Or is there another way to go about the problem?
Note: distance in this sense means the amount of editing needed to transform one text into another one.
Sounds like you are looking for locality-sensitive hashing.
You could use something like Levenshtein distance, which measures the edit distance between two strings. There's also a PHP function of the same name.
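For reference, a minimal sketch of the standard two-row dynamic-programming computation of that distance:

    class Levenshtein {
        static int distance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j; // "" -> prefix of b
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i; // prefix of a -> ""
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                                prev[j] + 1),     // deletion
                                       prev[j - 1] + cost);       // substitution
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }
    }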
One solution is to keep a count of each letter and compare the count arrays. A bad match between the counts means the strings are definitely not similar, so this works as a cheap pre-filter; see the sketch below.
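A minimal sketch of that filter, assuming plain ASCII text. Each edit operation changes at most two counts, so the edit distance is at least half the total count difference; a large value lets you reject a pair without running the full comparison:

    class CountFilter {
        static int countDifference(String a, String b) {
            int[] counts = new int[128]; // assumes ASCII input
            for (int i = 0; i < a.length(); i++) counts[a.charAt(i)]++;
            for (int i = 0; i < b.length(); i++) counts[b.charAt(i)]--;
            int diff = 0;
            for (int n : counts) diff += Math.abs(n);
            return diff; // 0 means same character multiset, not equality
        }
    }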

How to use distribution sort (radix sort, etc) to sort strings?

I know how to use radix sort to sort integers.
But how do you use it to sort strings? Or floating-point numbers?
Radix sort or any other distribution sort may be used to sort floating point numbers if you ignore some peculiarities of them like infinity, not-a-number values and two different representations of zero. IEEE 754-2008 floating point numbers have binary representations, compatible in sorting order with integer numbers. So, if you exclude not-a-numbers and reinterpret float or double as int32 or int64, you can directly apply any distribution sort to them. Edit: Negative floating point numbers need special treatment (as pointed out by AShelly), because their sorting order is opposite to the sorting order of integer numbers.
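For example, the usual bit trick in Java form; the resulting keys order like the original floats when compared as unsigned integers (e.g. with Integer.compareUnsigned), which is exactly how a radix sort consumes them:

    class FloatKey {
        // Negatives get all bits flipped; positives get just the sign bit
        // flipped. NaNs are excluded, as noted above.
        static int sortableKey(float f) {
            int bits = Float.floatToIntBits(f);
            // bits >> 31 is all ones for negatives, zero otherwise.
            return bits ^ ((bits >> 31) | 0x80000000);
        }
    }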
With strings it is more difficult because of their variable length. Another kind of distribution sort (bucket sort) can be used, and often is used, for strings: the first several characters of the string are used for bucket indexing, then any comparison sort is used to sort the strings within each bucket.
If all strings have almost equal length, and/or some technique is used to amplify the differences between strings (like the one described in chapter 6 of "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs"), then radix sort may be used as well: split each string into groups of characters (or better, groups of bits) of equal length, reinterpret those groups as integers, and continue as if it were radix sort for integers.
Edit: All kinds of distribution sort are guaranteed to work properly only for ASCII strings. Other string encodings may require different sort order or may depend on the "collate" parameter of the locale.
Yes, it is possible.
See Radix Sort, Sorting a float data for floats. It uses the fact that the bit patterns of floats, reinterpreted as integers, compare correctly (once negatives are corrected for). See this article for details.
For strings, you can solve the variable-length problem by doing an MSD radix sort and stopping the descent when you hit the end of a string (a NUL in C). See Radix sort implemented in c++ for string.

When is the appropriate time to use Radix Sort?

What are the constraints on your data for you to be able to use Radix sort?
If I'm sorting a large list of integers, would it be appropriate to use Radix sort? Why is Radix sort not used more?
It's great when you have a large set of data with keys that are somehow constrained. For example, when you need to order an array of a million 64-bit numbers, you can sort by the 8 least significant bits, then by the next 8, and so on (8 passes in all). That way the array can be sorted in 8*1M operations rather than 1M*log(1M); a sketch follows.
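A minimal sketch of that LSD scheme; it orders keys as unsigned 64-bit values (flip the sign bit first if you need signed order):

    class LsdRadixSort {
        static void sort(long[] a) {
            long[] src = a, dst = new long[a.length];
            for (int shift = 0; shift < 64; shift += 8) {
                int[] count = new int[256];
                for (long v : src) count[(int) ((v >>> shift) & 0xFF)]++;
                int pos = 0;
                for (int i = 0; i < 256; i++) { // turn counts into start offsets
                    int c = count[i]; count[i] = pos; pos += c;
                }
                for (long v : src) dst[count[(int) ((v >>> shift) & 0xFF)]++] = v;
                long[] tmp = src; src = dst; dst = tmp; // ping-pong buffers
            }
            // 8 passes is an even number, so the sorted data ends up back in `a`.
        }
    }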
If you know the range of the integer values, and it's not too large, maybe counting sort would be a better choice in your case, for instance as sketched below.
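A minimal counting sort sketch, assuming the values are known to lie in [0, range):

    class CountingSort {
        static void sort(int[] a, int range) {
            int[] count = new int[range];
            for (int v : a) count[v]++; // tally each value
            int idx = 0;
            for (int v = 0; v < range; v++)      // rewrite the array in order
                for (int c = 0; c < count[v]; c++)
                    a[idx++] = v;
        }
    }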
One reason you might not see it as often as you'd think you would is that Radix sort is not as general purpose as comparison based sorts (quicksort/mergesort/heapsort). It requires that you can represent the items to be sorted as an integer, or something like an integer. When using a standard library, it is easy to define a comparison function that compares arbitrary objects. It might be harder to define an encoding that properly maps your arbitrary data type into an integer.
Bucket sorting is useful in situations where the number of discrete key values is small relative to the number of data items, and where the goal is to produce a re-sorted copy of a list without disturbing the original (so needing to maintain both the old and new versions of the list simultaneously is not a burden). If the number of possible keys is too large to handle in a single pass, one can extend bucket sort into radix sort by making multiple passes, but one loses much of the speed advantage that bucket sort could offer for small keys.
In some external-sorting scenarios, especially when the number of distinct key values is very small (e.g. two), a stable sort is required, and the I/O device can only operate efficiently with one sequential data stream, it may be useful to make K passes through the source data stream, where K is the number of key values. On the first pass, copy all the items whose key is the minimum legitimate value and skip the rest; then copy all the items with the next higher key, skipping the rest; and so on. This approach will obviously be horribly inefficient if there are many different key values, but quite good if there are two. The sketch below shows the idea in memory.
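A minimal in-memory sketch of those K passes, with a list standing in for the sequential stream and `key` a hypothetical key extractor:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    class KPassSort {
        static <T> List<T> sort(List<T> source, List<Integer> keysInOrder,
                                Function<T, Integer> key) {
            List<T> out = new ArrayList<>();
            for (int k : keysInOrder)          // one pass per key value
                for (T item : source)
                    if (key.apply(item) == k)  // copy matches, skip the rest
                        out.add(item);
            return out;                        // stable: equal keys keep source order
        }
    }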
