Best way to implement partial key hashing - algorithm

I appeared for an interview where I was asked to write an algorithm for partial key hashing i.e; if ABCBC is inserted in the hash then searching for any of the sub strings should return the value stored.
My answer was to create a collection of all possible substrings of a given key and maintain a mapping between each substring to its one or more parent string. Then maintain a BST of the collection of all substrings. Each node will point to a collection of actual values which that substring might match to.
For eg.
A, AB, ABC, ABCB, ABCBC, B, BC, BCB, BCBC, C, CB, CBC are possible substrings for this string. There may be other strings also like BAB of which, AB and B are substring of.
So given AB, it will map to two strings BAB and ABCBC.
Is there any other more efficient way ?
Thanks

Store each substring in the hash, with a note for whether it is final, and the possible next characters and previous characters. Store previous characters for all words that could have this substring in the middle, and next characters for all words that have this substring as their start.
Thus the entry for a does not need to have all words with a in it. But it is easy enough to build that list if you want to. Also during an insert as soon as you are going down in size on substrings and you find that you already have the current substring with the current continuation, you can stop.
Assuming that you have many words with the same letters, this will save some on space and inserts, at the cost of making actually generating the list slower. Worst case is still O(n*n) for an n letter string though.
To delete you can follow a similar procedure, stopping deletes at any substring that has other substrings coming into it.

Related

Using primes to determine anagrams faster than looping through?

I had a telephone recently for a SE role and was asked how I'd determine if two words were anagrams or not, I gave a reply that involved something along the lines of getting the character, iterating over the word, if it exists exit loop and so on. I think it was a N^2 solution as one loop per word with an inner loop for the comparing.
After the call I did some digging and wrote a new solution; one that I plan on handing over tomorrow at the next stage interview, it uses a hash map with a unique prime number representing each character of the alphabet.
I'm then looping through the list of words, calculating the value of the word and checking to see if it compares with the word I'm checking. If the values match we have a winner (the whole mathematical theorem business).
It means one loop instead of two which is much better but I've started to doubt myself and am wondering if the additional operations of the hashmap and multiplication are more expensive than the original suggestion.
I'm 99% certain the hash map is going to be faster but...
Can anyone confirm or deny my suspicions? Thank you.
Edit: I forgot to mention that I check the size of the words first before even considering doing anything.
An anagram contains all the letters of the original word, in a different order. You are on the right track to use a HashMap to process a word in linear time, but your prime number idea is an unnecessary complication.
Your data structure is a HashMap that maintains the counts of various letters. You can add letters from the first word in O(n) time. The key is the character, and the value is the frequency. If the letter isn't in the HashMap yet, put it with a value of 1. If it is, replace it with value + 1.
When iterating over the letters of the second word, subtract one from your count instead, removing a letter when it reaches 0. If you attempt to remove a letter that doesn't exist, then you can immediately state that it's not an anagram. If you reach the end and the HashMap isn't empty, it's not an anagram. Else, it's an anagram.
Alternatively, you can replace the HashMap with an array. The index of the array corresponds to the character, and the value is the same as before. It's not an anagram if a value drops to -1, and it's not an anagram at the end if any of the values aren't 0.
You can always compare the lengths of the original strings, and if they aren't the same, then they can't possibly be anagrams. Including this check at the beginning means that you don't have to check if all the values are 0 at the end. If the strings are the same length, then either something will produce a -1 or there will be all 0s at the end.
The problem with multiplying is that the numbers can get big. For example, if letter 'c' was 11, then a word with 10 c's would overflow a 32bit integer.
You could reduce the result modulo some other number, but then you risk having false positives.
If you use big integers, then it will go slowly for long words.
Alternative solutions are to sort the two words and then compare for equality, or to use a histogram of letter counts as suggested by chrylis in the comments.
The idea is to have an array initialized to zero containing the number of times each letter appears.
Go through the letters in the first word, incrementing the count for each letter. Then go through the letters in the second word, decrementing the count.
If the counts reach zero at the end of this process, then the words are anagrams.

Make palindrome from given word

I have given word like abca. I want to know how many letters do I need to add to make it palindrome.
In this case its 1, because if I add b, I get abcba.
First, let's consider an inefficient recursive solution:
Suppose the string is of the form aSb, where a and b are letters and S is a substring.
If a==b, then f(aSb) = f(S).
If a!=b, then you need to add a letter: either add an a at the end, or add a b in the front. We need to try both and see which is better. So in this case, f(aSb) = 1 + min(f(aS), f(Sb)).
This can be implemented with a recursive function which will take exponential time to run.
To improve performance, note that this function will only be called with substrings of the original string. There are only O(n^2) such substrings. So by memoizing the results of this function, we reduce the time taken to O(n^2), at the cost of O(n^2) space.
The basic algorithm would look like this:
Iterate over the half the string and check if a character exists at the appropriate position at the other end (i.e., if you have abca then the first character is an a and the string also ends with a).
If they match, then proceed to the next character.
If they don't match, then note that a character needs to be added.
Note that you can only move backwords from the end when the characters match. For example, if the string is abcdeffeda then the outer characters match. We then need to consider bcdeffed. The outer characters don't match so a b needs to be added. But we don't want to continue with cdeffe (i.e., removing/ignoring both outer characters), we simply remove b and continue with looking at cdeffed. Similarly for c and this means our algorithm returns 2 string modifications and not more.

Checking if a word is made up of one or more concatenated dictionary words

Here's the scenario:
I have an array of millions of random strings of letters of length 3-32, and an array of words (the dictionary).
I need to test if a random string can be made up by concatenating 1, 2, or 3 different dictionary words or not.
As the dictionary words would be somewhat fixed, I can do any kind of pre-processing on them.
Ideally, I'd like something that optimizes lookup speeds by doing some kind of pre-processing on the dictionary.
What kind of data structures / algorithms should I be looking at to implement this?
First, Build a B-Tree like Trie structure from your dict. Each root would map to a letter. Each 2nd level subtree would then have all of the words that could be made with two letters, and so on.
Then take your word and start with the first letter and walk down the B-Tree Trie until you find a match and then recursively apply this algorithm to the rest of the word. If you don't find a match at any point you know you can't form the word via concats.
Store the dictionary strings in a hashed set data structure. Iterate through all possible splits of the string you want to check in 1, 2 or 3 parts, and for each such split look up all parts in the hash set.
Make a regex matching every word in your dictionary.
Put parentheses around it.
Put a + on the end.
Compile it with any correct (DFA-based) regex engine.

Searching strings with . wildcard

I have an array with so much strings and want to search for a pattern on it.
This pattern can have some "." wildcard who matches (each) 1 character (any).
For example:
myset = {"bar", "foo", "cya", "test"}
find(myset, "f.o") -> returns true (matches with "foo")
find(myset, "foo.") -> returns false
find(myset, ".e.t") -> returns true (matches with "test")
find(myset, "cya") -> returns true (matches with "cya")
I tried to find a way to implement this algorithm fast because myset actually is a very big array, but none of my ideas has satisfactory complexity (for example O(size_of(myset) * lenght(pattern)))
Edit:
myset is an huge array, the words in it aren't big.
I can do a slow preprocessing. But I'll have so much find() queries, so find() I want find() to be as fast as possible.
You could build a suffix tree of the corpus of all possible words in your set (see this link)
Using this data structure your complexity would include a one time cost of O(n) to build the tree, where n is the sum of the lengths of all your words.
Once the tree is built finding if a string matches should take just O(n) where n is length of the string.
If the set is fixed, you could pre-calculate frequencies of a character c being at position p (for as many p values as you consider worth-while), then search through the array once, for each element testing characters at specific positions in an order such that you are most likely to exit early.
First, divide the corpus into sets per word length. Then your find algorithm can search over the appropriate set, since the input to find() always requires the match to have a specific length, and the algorithm can be designed to work well with all words of the same length.
Next (for each set), create a hash map from a hash of character x position to a list of matching words. It is quite ok to have a large amount of hash collision. You can use delta and run-length encoding to reduce the size of the list of matching words.
To search, pick the appropriate hash map for the find input length, and for each non . character, calculate the hash for that character x position, and AND together the lists of words, to get a much reduced list.
Brute force search through that much smaller list.
If you are sure that the length of the words in your set are not large. You could probably create a table which holds the following:
List of Words which have first Character 'a' , List of Words which have first Character 'b', ..
List of Words which have second Character 'a', List of words which have second Character 'b', ..
and so on.
When you are searching for the word. You can look for the list of words which have the first character same as the search strings' first character. With this refined list, look for the words which have the second character same as the search strings' second character and so on. You can ignore '.' whenever you encounter them.
I understand that building the table may take a large amount of space but the time taken will come down significantly.
For example, if you have myset = {"bar", "foo", "cya", "test"} and you are searching for 'f.o'
The moment you check for list of words starting with f, you eliminate the rest of the set. Just an idea.. Hope it helps.
I had this same question, and I wasn't completely happy with most of the ideas/solutions I found on the internet. I think the "right" way to do this is to use a Directed Acyclic Word Graph. I didn't quite do that, but I added some additional logic to a Trie to get a similar effect.
See my isWord() implementation, analogous to your desired find() interface. It works by recursing down the Trie, branching on wildcard, and then collecting results back into a common set. (See findNodes().)
getMatchingWords() is similar in spirit, except that it returns the set of matching words, instead of just a boolean as to whether or not the query matches anything.

Efficient data structure for word lookup with wildcards

I need to match a series of user inputed words against a large dictionary of words (to ensure the entered value exists).
So if the user entered:
"orange" it should match an entry "orange' in the dictionary.
Now the catch is that the user can also enter a wildcard or series of wildcard characters like say
"or__ge" which would also match "orange"
The key requirements are:
* this should be as fast as possible.
* use the smallest amount of memory to achieve it.
If the size of the word list was small I could use a string containing all the words and use regular expressions.
however given that the word list could contain potentially hundreds of thousands of enteries I'm assuming this wouldn't work.
So is some sort of 'tree' be the way to go for this...?
Any thoughts or suggestions on this would be totally appreciated!
Thanks in advance,
Matt
Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.
The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.
Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.
Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.
I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)
However if that wasn't good enough I would probably use a prefix tree for this.
The basic structure is a tree where:
The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z assuming you are using a full dictionary...).
The next level down contains all the possible second letters for each given first letter
And so on until you reach an "end of word" marker for each word
Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm where you either have a direct match for each character position, or in the case of the wildcard you check each of the possible branches.
In the worst case (all wildcards but only one word with the right number of letters right at the end of the dictionary), you would traverse the entire tree but this is still only O(n) in the size of the dictionary so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists since large branches of the search tree are "pruned" with each successive letter.
No matter which algorithm you choose, you have a tradeoff between speed and memory consumption.
If you can afford ~ O(N*L) memory (where N is the size of your dictionary and L is the average length of a word), you can try this very fast algorithm. For simplicity, will assume latin alphabet with 26 letters and MAX_LEN as the max length of word.
Create a 2D array of sets of integers, set<int> table[26][MAX_LEN].
For each word in you dictionary, add the word index to the sets in the positions corresponding to each of the letters of the word. For example, if "orange" is the 12345-th word in the dictionary, you add 12345 to the sets corresponding to [o][0], [r][1], [a][2], [n][3], [g][4], [e][5].
Then, to retrieve words corresponding to "or..ge", you find the intersection of the sets at [o][0], [r][1], [g][4], [e][5].
You can try a string-matrix:
0,1: A
1,5: APPLE
2,5: AXELS
3,5: EAGLE
4,5: HELLO
5,5: WORLD
6,6: ORANGE
7,8: LONGWORD
8,13:SUPERLONGWORD
Let's call this a ragged index-matrix, to spare some memory. Order it on length, and then on alphabetical order. To address a character I use the notation x,y:z: x is the index, y is the length of the entry, z is the position. The length of your string is f and g is the number of entries in the dictionary.
Create list m, which contains potential match indexes x.
Iterate on z from 0 to f.
Is it a wildcard and not the latest character of the search string?
Continue loop (all match).
Is m empty?
Search through all x from 0 to g for y that matches length. !!A!!
Does the z character matches with search string at that z? Save x in m.
Is m empty? Break loop (no match).
Is m not empty?
Search through all elements of m. !!B!!
Does not match with search? Remove from m.
Is m empty? Break loop (no match).
A wildcard will always pass the "Match with search string?". And m is equally ordered as the matrix.
!!A!!: Binary search on length of the search string. O(log n)
!!B!!: Binary search on alphabetical ordering. O(log n)
The reason for using a string-matrix is that you already store the length of each string (because it makes it search faster), but it also gives you the length of each entry (assuming other constant fields), such that you can easily find the next entry in the matrix, for fast iterating. Ordering the matrix isn't a problem: since this has only be done once the dictionary updates, and not during search-time.
If you are allowed to ignore case, which I assume, then make all the words in your dictionary and all the search terms the same case before anything else. Upper or lower case makes no difference. If you have some words that are case sensitive and others that are not, break the words into two groups and search each separately.
You are only matching words, so you can break the dictionary into an array of strings. Since you are only doing an exact match against a known length, break the word array into a separate array for each word length. So byLength[3] is the array off all words with length 3. Each word array should be sorted.
Now you have an array of words and a word with potential wild cards to find. Depending on wether and where the wildcards are, there are a few approaches.
If the search term has no wild cards, then do a binary search in your sorted array. You could do a hash at this point, which would be faster but not much. If the vast majority of your search terms have no wildcards, then consider a hash table or an associative array keyed by hash.
If the search term has wildcards after some literal characters, then do a binary search in the sorted array to find an upper and lower bound, then do a linear search in that bound. If the wildcards are all trailing then finding a non empty range is sufficient.
If the search term starts with wild cards, then the sorted array is no help and you would need to do a linear search unless you keep a copy of the array sorted by backwards strings. If you make such an array, then choose it any time there are more trailing than leading literals. If you do not allow leading wildcards then there is no need.
If the search term both starts and ends with wildcards, then you are stuck with a linear search within the words with equal length.
So an array of arrays of strings. Each array of strings is sorted, and contains strings of equal length. Optionally duplicate the whole structure with the sorting based on backwards strings for the case of leading wildcards.
The overall space is one or two pointers per word, plus the words. You should be able to store all the words in a single buffer if your language permits. Of course, if your language does not permit, grep is probably faster anyway. For a million words, that is 4-16MB for the arrays and similar for the actual words.
For a search term with no wildcards, performance would be very good. With wildcards, there will occasionally be linear searches across large groups of words. With the breakdown by length and a single leading character, you should never need to search more than a few percent of the total dictionary even in the worst case. Comparing only whole words of known length will always be faster than generic string matching.
Try to build a Generalized Suffix Tree if the dictionary will be matched by sequence of queries. There is linear time algorithm that can be used to build such tree (Ukkonen Suffix Tree Construction).
You can easily match (it's O(k), where k is the size of the query) each query by traversing from the root node, and use the wildcard character to match any character like typical pattern finding in suffix tree.

Resources