Natural sorting of UTF-8 strings in DynamoDB

I'm storing file names (with extension) and directory names as UTF-8 strings in DynamoDB as sort keys.
As far as I know, file names + ext and directory names are unique within a directory, so I can use those strings as unique IDs within the parent directory.
Being UTF-8, these strings will be sorted lexicographically by byte value: 10 will come before 2, uppercase before lowercase, and so on.
As I try to represent a file hierarchy, I would like to retrieve the items sorted in a natural order instead.
I could do some magic on the strings to have them sort naturally before I use them as sort keys, but then I would need to keep an attribute with the original name and those are bytes I would like to save, if possible.
If it matters, this is part of a single table design.
Are there any design patterns, hashing algorithms or other approaches I could use to solve this?

I don't know what "magic" you intend to do. Usually people zero-pad the numbers to some arbitrary maximum length so that string-sorting the numbers matches the numeric sort, for positive integers anyway. If you do that, you can remove the padding on display.
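A minimal sketch of that padding idea in Python (the width of 10 digits and the helper names are my own assumptions; pick a width that covers your longest expected number):

import re

def natural_sort_key(name, width=10):
    # zero-pad every digit run so lexicographic order matches numeric order:
    # 'file2.txt' -> 'file0000000002.txt', 'file10.txt' -> 'file0000000010.txt'
    return re.sub(r'\d+', lambda m: m.group().zfill(width), name)

def display_name(key):
    # strip the padding again for display; note this also strips leading
    # zeros that were in the original name, so it only round-trips cleanly
    # if you never store names like 'file007'
    return re.sub(r'\d+', lambda m: str(int(m.group())), key)

This still loses the original spelling in edge cases (leading zeros), which is exactly why you might end up keeping the original name in an attribute after all.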

Related

Hash Table and Substring Matching

I have hundreds of keys, for example:
redapple
maninred
foraman
blueapple
I have data related to these keys; the data is a string with the related key at the end:
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
foraman: they-bought-the-present-foraman
blueapple: it-was-surprising-but-it-was-a-blueapple
I am expected to use a hash table and hash function to record the data according to the keys, and to be able to retrieve the data from the table.
I know how to use a hash function and hash table; there is no problem there.
But I am also expected to give the program a string which occurs as a substring, and retrieve the data for the matching keys.
For example:
i must give "red" and must be able to get
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
as output.
or
i must give "apple" and must be able to get
redapple: the-tree-has-redapple
blueapple: it-was-surprising-but-it-was-a-blueapple
as output.
The only thing I can think of is to search all keys for a matching substring, but is there some other solution? If I search all the key strings for every query, the hashing is unneeded and meaningless, isn't it?
Searching all keys for a substring is O(N), and I am expected to solve the problem in O(1).
With hashing I can hash a key, e.g. "redapple" to 943, and "maninred" to 332.
But when a query gives the string "red", how can I find out from 943 and 332 that those keys contain the substring "red"? It is beyond my CS thinking skills.
Thanks for any advice or ideas.
Possibly you should use an inverted index of n-grams; the same approach is used for spell correction. For the word redapple you will have the following set of 3-grams: red, eda, dap, app, ppl, ple. For each n-gram you keep a list of the strings that contain it. For example, for red it will be
red -> maninred, redapple
The words in each list must be ordered. When you want to find all strings that contain a given substring, you divide the substring into n-grams and intersect the lists of words for those n-grams.
This algorithm is not O(1), but in practice it is fast enough.
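A minimal sketch of that inverted n-gram index, assuming in-memory Python dicts and 3-grams (queries shorter than n need separate handling):

from collections import defaultdict

def ngrams(word, n=3):
    return {word[i:i+n] for i in range(len(word) - n + 1)}

def build_index(words, n=3):
    index = defaultdict(set)
    for w in words:
        for g in ngrams(w, n):
            index[g].add(w)
    return index

def lookup(index, substring, n=3):
    grams = ngrams(substring, n)
    if not grams:
        return set()   # substring shorter than n: fall back to a full scan
    # intersect the posting lists, then verify: sharing all n-grams does
    # not strictly guarantee the substring occurs contiguously
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return {w for w in candidates if substring in w}

idx = build_index(["redapple", "maninred", "foraman", "blueapple"])
print(lookup(idx, "red"))    # {'redapple', 'maninred'}
print(lookup(idx, "apple"))  # {'redapple', 'blueapple'}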
It cannot be nicely done in a hash table. Given a substring, you cannot predict the hashed result of the entire string (1).
A reasonable alternative is a suffix tree. Each terminal in the suffix tree holds a list of references to the complete strings that its suffix belongs to.
Given a substring t, if it is indeed a substring of some s in your collection, then there is a suffix x of s such that t is a prefix of x. So traverse the suffix tree while reading t, then collect all the terminals reachable from the node you end up at. These terminals contain all the needed strings.
(1) Assuming a reasonable hash function; if hashCode() == 0 for each element, you can obviously predict the hash value.
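A true suffix tree is compressed and built in linear time; as a sketch of the idea, here is a naive (quadratic-space) suffix trie in Python, where every node keeps the set of source strings passing through it:

class SuffixTrie:
    def __init__(self, words):
        self.root = {}
        for w in words:
            self._add(w)

    def _add(self, word):
        # insert every suffix; each node on the path records the source word
        for i in range(len(word)):
            node = self.root
            for ch in word[i:]:
                node = node.setdefault(ch, {'': set()})
                node[''].add(word)

    def find(self, substring):
        # walk the trie along the substring; the node reached knows
        # every word containing that substring
        node = self.root
        for ch in substring:
            if ch not in node:
                return set()
            node = node[ch]
        return node['']

trie = SuffixTrie(["redapple", "maninred", "foraman", "blueapple"])
print(trie.find("red"))    # {'redapple', 'maninred'}
print(trie.find("apple"))  # {'redapple', 'blueapple'}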
I have researched this problem recently and I'm sure it cannot be done with a hash table alone. I hoped a hash table would help me improve search speed too, but it left me disappointed.

String comparison algorithm, relevancy, how much "alike" 2 strings are

I have 2 sources of information for the same data (companies), which I can join together via a unique ID (contract number). The second, different source exists because the 2 sources are updated manually and independently. So what I have is an ID and a company name in 2 tables.
I need to come up with an algorithm that would compare the Name in the 2 tables for the same ID, and order all the companies by a variable which indicates how different the strings are (to highlight the most different ones, to be placed at the top of the list).
I looked at the simple Levenshtein distance calculation algorithm, but it's at the letter level, so I am still looking for something better.
The reason why Levenshtein doesn't really do the job is this: companies have a name, prefixed or postfixed by the organizational form (LTD, JSC, co., etc.). So we may have a lot of names like JSC "Foo", which will differ a lot from Foo JSC., but what I am really looking for in the database is pairs of genuinely different strings like SomeLongCompanyName JSC and JSC OtherName.
Are there any good ways to do this? (I don't really like the idea of using a regex to separate words in each string and then finding matches for every word in the other string using Levenshtein distance, so I am searching for other ideas.)
How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
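A quick Python sketch of those four steps (the 4-character cutoff is the one proposed above; tune it to your data):

import re

def levenshtein(a, b):
    # standard dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalize(name):
    # 1. punctuation -> whitespace; 2. split into words
    words = re.sub(r'[^\w\s]', ' ', name).split()
    # 3. short words (likely LTD, JSC, co, ...) go to the end, sorted
    long_words = [w for w in words if len(w) > 4]
    short_words = sorted(w for w in words if len(w) <= 4)
    return ' '.join(long_words + short_words)

def company_distance(a, b):
    return levenshtein(normalize(a), normalize(b))  # 4. Levenshtein

print(company_distance('JSC "Foo"', 'Foo JSC.'))  # 0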
Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
Thank you both for ideas.
I used 4 indices, each a Levenshtein distance divided by the sum of the lengths of both strings (a relative distance), computed on the following:
Just the 2 strings
The string obtained by separating the word sequences, eliminating the non-word characters, ordering the words ascending, and joining them with a space separator.
The string which is contained between quotes (if no such string is present, the original string is taken)
The string composed of alphabetically ordered first characters of each word.
Each of these in turn is an integer value between 1 and 1000. The resulting value is the product
X1^E1 * X2^E2 * X3^E3 * X4^E4
where X1..X4 are the indices, and E1..E4 are user-provided preferences for how significant each index is. To keep the result inside reasonable values of 1..1000, the vector (E1..E4) is normalized.
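One way to read that normalization (an assumption on my part: scaling the exponents to sum to 1 turns the product into a weighted geometric mean, which stays within the 1..1000 range of the inputs):

import math

def combined_index(x, e):
    # x: the four relative-distance indices, each in 1..1000
    # e: user-provided significance weights, normalized to sum to 1
    total = sum(e)
    return math.prod(xi ** (ei / total) for xi, ei in zip(x, e))

print(combined_index([765, 300, 120, 40], [2, 1, 1, 1]))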
The results are impressive. The whole thing works much faster than I expected (I built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) over non-null values in the whole database is 765. Down to about 300 there is virtually no matching company name. Around 200 there are companies with somewhat similar names, and some are the same names written in very different ways, with abbreviations, additional words, etc. At 100 and below, practically all the records contain names that are the same but written with slight differences, and by 30, only the order or the punctuation may differ.
It totally works; the result is better than I expected.
I wrote a post on my blog to share this library in case someone else needs it.

file names based on file content

So, in other words, some algorithm to generate a unique, reasonable-length filename based on binary file content. Two files that have the same binary content should get the same name. Obviously there are limits to this, as presumably you couldn't have unique reasonable-length filenames for each of a large set of large files differing only at a handful of bit positions. But presumably there is some heuristic, a best approximation, that for example exploits known attributes of typical image files. If I had the name of some algorithm that does this, I could google it and find other approaches as well.
Use an MD5 hash of the contents of the file.
I guess MD5 is worth checking out. Of course it will give you the same result if the content is the same, and in the unlikely case two different files collide, I guess you can increment the name until you get a unique one.
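A short Python sketch of the MD5 approach, streaming the file in chunks so large files don't need to fit in memory (the function and parameter names are mine):

import hashlib

def content_based_name(path, ext=''):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    # byte-identical files always get the same 32-hex-digit name
    return h.hexdigest() + ext

Two files differing at a single bit will get completely different names, which matches the requirement that only byte-identical content shares a name.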

calculating a hash of a data structure?

Let's say I want to calculate a hash of a data structure, using a hash algorithm like MD5 which accepts a serial stream, for the purposes of equivalence checking. (I want to record the hash, then recalculate the hash on the same or an equivalent data structure later, and check the hashes to gauge equivalence with high probability.)
Are there standard methods of doing this?
Issues I can see that are problematic are
if the data structure contains an array of binary strings, I can't just concatenate them since ["abc","defg"] and ["ab","cdefg"] are not equivalent arrays
if the data structure contains a collection that isn't guaranteed to enumerate in the same order, e.g. a key-value dictionary {a: "bc", d: "efg", h: "ijkl"}, which should be considered equivalent to the dictionary {d: "efg", h: "ijkl", a: "bc"}.
For the first issue, also hash the lengths of the strings. This will differentiate their hashes.
For the second, sort the keys.
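A minimal Python sketch combining both fixes, length-prefixing the strings and sorting the dictionary keys (the type tags are my own addition, to keep e.g. an empty list and an empty string distinct):

import hashlib

def stable_hash(value, h=None):
    h = h or hashlib.md5()
    if isinstance(value, str):
        data = value.encode('utf-8')
        h.update(b's' + len(data).to_bytes(8, 'big') + data)  # length-prefixed
    elif isinstance(value, list):
        h.update(b'l' + len(value).to_bytes(8, 'big'))
        for item in value:
            stable_hash(item, h)
    elif isinstance(value, dict):
        h.update(b'd' + len(value).to_bytes(8, 'big'))
        for key in sorted(value):          # fixed enumeration order
            stable_hash(key, h)
            stable_hash(value[key], h)
    else:
        raise TypeError('unsupported type: %r' % type(value))
    return h.hexdigest()

assert stable_hash(['abc', 'defg']) != stable_hash(['ab', 'cdefg'])
assert stable_hash({'a': 'bc', 'd': 'efg'}) == stable_hash({'d': 'efg', 'a': 'bc'})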
A "standard" way of doing this is to define a serialized form of the data structure, and digest the resulting byte stream.
For example, a TBSCertificate is a data structure comprising a subject name, extensions, and other information. This is converted to a string of octets in a deterministic way and hashed as part of a digital signature operation to produce a certificate.
There is also another problem with structs: the alignment of data members differs across platforms.
If you want a stable and portable solution, you can solve this by implementing a "serialize" method for your data structure, in such a way that serialize produces a byte stream (or, more commonly, writes to an output byte stream).
Then you can run the hash algorithm over the serialized stream. This way you solve the problems you mentioned by explicit traversal of your data. As an additional benefit you gain the ability to save your data to disk or send it over the network.
For the strings, you can use Pascal-style storage, where the length comes first.
If the strings can't contain any NUL characters, you can use C strings to guarantee uniqueness, e.g. "abc\0defg\0" is distinct from "ab\0cdefg\0".
For dictionaries, maybe you can sort before hashing.
This also reminds me of an issue I heard of once. I don't know what language you are using, but if you are hashing C structs directly without filtering them in any way, be careful about the padding between fields that the compiler might have introduced for alignment reasons. Sometimes that padding is not zeroed out.

Algorithm that Generates Unique Serial Number for Each English Word

For an application I need to generate unique serial numbers for each English word.
What would be the best approach?
One constraint: the serial number generation algorithm should be very efficient on an ordinary desktop computer.
Thanks
Do you have a list of all possible words? If yes, start from 0 at the first word and increment the serial by 1 for each word.
If not then a simple way to guarantee they are unique is to use the word itself as the serial. For example, ABC = 0x41 0x42 0x43 = 4276803.
As suggested in the comments there are other ways (that however require more work), such as compressing the words first with, for example, Huffman.
This of course gets awkward with long words: The serial of Pneumonoultramicroscopicsilicovolcanoconiosis would require around 100 digits, for example.
Otherwise you can use a hash, but there is no guarantee it will be unique for all English words.
You appear to be asking about a perfect hashing function. If so, take a look at this Wikipedia article, and at the gperf utility.
Here is an algorithm (in Python) that allows you to encode and decode any combination of lowercase letters:
def encode(s):
    # bijective base-26: the leading 1 acts as a length marker,
    # so words of different lengths never collide
    r = 1
    for ch in s:
        r = r * 26 + (ord(ch) - ord('a'))
    return r
Using 64 bits you can code words of up to 12 letters. You can use the remaining unused serials as an index into a table containing low-frequency very long words.
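The answer mentions decoding as well; a matching decode under the same scheme might look like this (the leading 1 from encode acts as a sentinel):

def decode(r):
    s = []
    while r > 1:
        r, d = divmod(r, 26)        # peel off base-26 digits
        s.append(chr(d + ord('a')))
    return ''.join(reversed(s))

assert decode(encode('redapple')) == 'redapple'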
Just use a 64-bit hash function, like Fowler-Noll-Vo. You're not likely to get collisions using a 64-bit integer, as this gives you 2^64 possible values, and there are certainly way less than that many words in the English language. You'd need to normalize each word, of course, (convert to lower-case, etc.)
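A sketch of 64-bit FNV-1a with the normalization this answer mentions (the constants are the published FNV offset basis and prime):

def fnv1a_64(word):
    h = 0xcbf29ce484222325             # FNV-1a 64-bit offset basis
    for byte in word.lower().encode('utf-8'):
        h ^= byte
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF   # FNV prime, mod 2^64
    return h

print(fnv1a_64('colour'))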
Do you really need it to be 'serial'? If not, did you try the various hash algorithms? Several are built into .NET (MD5 and SHA1, if I remember correctly). I am not sure which one will be good enough, especially with short strings.
Are you looking for every word, or every word in the English dictionary? Are you using standard words, i.e. from the Oxford English Dictionary, or are slang words included too? I guess what I'm getting at is: how big is your dictionary? You could use an MD5 hash, which has a theoretical possibility of collisions, albeit perhaps 1 in billions of hashes, although I can't say I understand the purpose of using a hash over using the actual word. Unless perhaps you want to calculate the serial client-side so that it references the correct dictionary item on the server side without having to parse the dictionary looking for its serial. Of course, the word obviously has to be sufficiently unique in order for us to understand it as humans, and we're way more efficient at parsing the meaning of words than a computer is at doing the same.
Are you looking to separate words that look the same but are pronounced differently? Words that look and sound the same but have different meanings? If so, then you're going to come unstuck with a hash, as the same spelling with a different semantic will produce the same hash, so it won't work for this scenario. In this case you'd need some kind of incremental system. If you add words after the fact to the dictionary, will they be added at the end and just given the next serial number in sequence? What if that word is spelled the same as another word but sounds different or sounds the same but has a different semantic? What then?
I guess it depends on the purpose of the serialization as to what would be the most suitable output for your serial number and hence what would be the most efficient algorithm.
The most efficient algorithm would probably be to split your dictionary into the same number of chunks as you have processors, have a thread on each processor serialize the words in its chunk, and recombine the output from each thread at the end. This (in theory) would run slightly slower than O(n/number of processors) in real-world performance; however, I think for mathematical correctness it is still O(n), because you still have to parse the whole dictionary once to serialize each word.
I think the safest way to go is:
Worry about what you've got now
Order them in the most logical sequence (alphabetically?)
Number them in sequence
Add new words (whether spelled the same or not and having different semantics) at the end; give them the next number in the sequence, regardless of their rightful place in the dictionary alphabetically.
This way you don't have to worry about leaving spaces in the serial numbers to account for insertions between words, you don't have to worry about reindexing any dependent data to account for changes in indexes when words are inserted, you just carry on as normal. You don't have to worry about collisions, and you still get the most efficient indexing mechanism for storage purposes meaning you're not storing MD5 hashes that are potentially longer than the original word - which makes no sense for real world use.
If you need to access the dictionary alphabetically, just sort by the word, otherwise, don't.
I still think I'm at a loss as to the necessity of serializing the word - except for storage purposes where you can store your dictionary and link tables by the word's key.
I wonder if an answer is even possible.
Are color and colour the same word? Do they get one serial number or two?
Are polish and Polish the same word?
Are watch (noun) and watch (verb) the same word?
Are multiply (verb) and multiply (adverb) the same word?
Analysis (singular noun) and analyses (plural noun) are not the same word. Are analyse (plural verb) and analyze (plural verb) the same word? Are analyses (singular verb) and analyzes (singular verb) the same word? Are analyses (singular verb) and analyses (plural noun) the same word?
Are wont and won't the same word?
Are Beijing and Peking the same word? Or maybe they aren't English, since Londres and Frankreich aren't English, but then what is the English word for the capital of the Middle Country?
What about the MD5 hash algorithm? Do something like this:
serialNumber = MD5( ToLower ( english word ) )
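In Python terms, that suggestion is simply the following (with the caveat raised above: uniqueness is only probabilistic, though collisions among English words are astronomically unlikely):

import hashlib

def serial_number(word):
    return hashlib.md5(word.lower().encode('utf-8')).hexdigest()

print(serial_number('Apple') == serial_number('apple'))  # True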