Matching against longest leading substring - algorithm

I'm looking to return the best match of a list against a collection of lists.
A query x matches against a list of length n in the collection if that list equals the first n elements of x.
e.g. [1,2,3] matches against [1,2] but [1,2] does not match against [1,2,3].
I want the function to return the "best" match, that is, the match that is the longest.
e.g.
bestMatch [1,2,3,3] [[1],[1,2,3],[1,2],[1,2,3,2],[1,2,3,4]] == Just [1,2,3]
Obviously a list here isn't the best data structure, and I'd rather use a standard structure and search rather than roll my own. Any ideas what I should be using, and how?
I don't think hash tables will work because the matches aren't exact. I then thought about searching against an ordered tree, but it has the problem that if I search for [1,2,100], I'll get [1,2,99], [1,2,98], ... etc. before getting the correct answer, [1,2]. I could use a hash of hashes (and so on down the tree), but that seems like a lot of overhead.
(A linear search list based implementation is here)

A trie would be a good solution. In your case, values would be just (), marking that a given node corresponds to an end of a list. Then, given a list, you'll just traverse the trie as far down as possible, and the last encountered value will mark the longest matched list.
A ByteString-based trie in Data.Trie offers match, which seems to be exactly what you're looking for (if keys made of 8-bit characters are sufficient for you):
-- | Given a query, find the longest prefix with an associated value in
-- the trie, returning that prefix, its value, and the remaining string.
match :: Trie a -> ByteString -> Maybe (ByteString, a, ByteString)
There is also another package, list-tries, which has more generic keys. I'm not sure whether it has an exact analogue of match above, but it would certainly be possible to implement one.
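For illustration, here is a minimal sketch of the same longest-prefix idea over plain list keys (the Trie type, emptyTrie, and insert below are made up for this sketch; bestMatch just mirrors the signature from the question, and in practice Data.Trie or list-tries would do the heavy lifting):
import qualified Data.Map.Strict as M

-- A list-keyed trie; the Bool marks that a stored list ends at this node.
data Trie k = Trie Bool (M.Map k (Trie k))

emptyTrie :: Trie k
emptyTrie = Trie False M.empty

insert :: Ord k => [k] -> Trie k -> Trie k
insert []     (Trie _ cs) = Trie True cs
insert (x:xs) (Trie e cs) =
  Trie e (M.insert x (insert xs (M.findWithDefault emptyTrie x cs)) cs)

-- Walk the query as far down as the trie allows, remembering the last node
-- that ended a stored list; that gives the longest matching prefix.
bestMatch :: Ord k => [k] -> Trie k -> Maybe [k]
bestMatch query trie = go query trie [] Nothing
  where
    go xs (Trie e cs) acc best =
      let best' = if e then Just (reverse acc) else best
      in case xs of
           []     -> best'
           (y:ys) -> case M.lookup y cs of
                       Nothing -> best'
                       Just t  -> go ys t (y : acc) best'

-- bestMatch [1,2,3,3] (foldr insert emptyTrie [[1],[1,2,3],[1,2],[1,2,3,2],[1,2,3,4]])
--   == Just [1,2,3]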

Related

Fast prefix search with ordered dictionary

Given a dictionary of strings D and an input string S. I'm trying to find a certain string p from D that is a prefix of S.
For an unordered dictionary the fastest way seems to be building a trie for D and traversing the trie along with the initial characters of S. As the strings in D are unordered, the most natural search algorithm here would be the one that finds the longest prefix p.
However, I need to preserve a special input order for the strings in D. For example, for D = [bar, foo, foobar] and S = foobariously, the above search would yield p = foobar, as it is the longest prefix. But instead I would like to get p = foo, because foo occurs earlier in the input list.
What is the fastest algorithm for that kind of prefix search? I presume that the basic approach still involves a trie, but I don't know how to integrate the original ordering into it.
Just build a trie, but when adding an element, if you find one already there on the way, drop the new one, because the one already present is better.
That is, when trying to add 'foobar' you'd traverse the trie to 'foo' and realize that you'll never want 'foobar' so drop it.
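A sketch of that insertion rule, assuming String keys and that (as in the example) the earlier word is a prefix of the later one; insertPreferEarlier and firstPrefix are illustrative names:
import qualified Data.Map.Strict as M

data Trie = Node Bool (M.Map Char Trie)

emptyTrie :: Trie
emptyTrie = Node False M.empty

-- Insert in input order; if a previously stored word already ends somewhere
-- along the path, the new (longer, later) word is dropped.
insertPreferEarlier :: String -> Trie -> Trie
insertPreferEarlier _ t@(Node True _) = t
insertPreferEarlier []     (Node _ m) = Node True m
insertPreferEarlier (c:cs) (Node e m) =
  Node e (M.insert c (insertPreferEarlier cs (M.findWithDefault emptyTrie c m)) m)

-- The shallowest stored word along the query path is then the answer.
firstPrefix :: String -> Trie -> Maybe String
firstPrefix = go []
  where
    go acc _      (Node True _) = Just (reverse acc)
    go _   []     _             = Nothing
    go acc (c:cs) (Node _ m)    = M.lookup c m >>= go (c : acc) cs

-- firstPrefix "foobariously" (foldl (flip insertPreferEarlier) emptyTrie ["bar","foo","foobar"])
--   == Just "foo"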

How to Modify a Suffix Array to search multiple strings?

I've recently been updating my knowledge of algorithms and have been reading up on suffix arrays. Every text I've read has defined them as an array of suffixes over a single search string, but some articles have mentioned it's 'trivial' to generalize to an entire list of search strings, though I can't see how.
Assume I'm trying to implement a simple substring search over a word list and wish to return a list of words matching a given substring. The naive approach would appear to be to insert the lexicographic end character '$' between words in my list, concatenate them all together, and produce a suffix tree from the result. But this would seem to generate large numbers of irrelevant entries. If I create a source string of 'banana$muffin' then I'll end up generating suffixes for 'ana$muffin' which I'll never use.
I'd appreciate any hints as to how to do this right, or better yet, a pointer to some algorithm texts that handle this case.
With suffix arrays you usually don't work with several strings, just one: the concatenated version of your strings, with an end token after each (a different one for every string). For the suffix array you store pointers (or array indices) that reference each suffix; only the position of its first token/character is needed.
So the space required is the array plus one pointer per suffix. (That is just a very simple implementation; you would do more to get better performance.)
You can also optimise the sorting of the suffixes: each suffix only needs to be compared up to its end token, so everything behind the end token can be ignored by the sorting algorithm.
After having now read through most of the book Algorithms on Strings, Trees and Sequences by Dan Gusfield, the answer seems clear.
If you start with a multi-string suffix tree, one of the standard conversion algorithms will still work. However, instead of getting an array of integers, you end up with an array of lists. Each list contains one or more pairs of a string identifier and a starting offset in that string.
The resulting structure is still useful, but not as efficient as a normal suffix array.
From Iowa State University, taken from Prefix.pdf:
Suffix trees and suffix arrays can be generalized to multiple strings.
The generalized suffix tree of a set of strings S = {s1, s2, ..., sk},
denoted GST(S) or simply GST, is a compacted trie of all suffixes
of each string in S. We assume that the unique termination character $
is appended to the end of each string. A leaf label now consists of a
pair of integers (i, j), where i denotes the suffix is from string si
and j denotes the starting position of the suffix in si. Similarly,
an edge label in a GST is a substring of one of the strings. An edge
label is represented by a triplet of integers (i, j, l), where i
denotes the string number, and j and l denote the starting and ending
positions of the substring in si. For convenience of understanding,
we will continue to show the actual edge labels. Note that two strings
may have identical suffixes. This is compensated by allowing leaves in
the tree to have multiple labels. If a leaf is multiply labelled, each
suffix should come from a different string. If N is the total number
of characters (including the $ in each string) of all strings in S,
the GST has at most N leaf nodes and takes up O(N) space. The
generalized suffix array of S, denoted GSA(S) or simply GSA, is a
lexicographically sorted array of all suffixes of each string in S.
Each suffix is represented by an integer pair (i, j) denoting suffix
starting from position j in si. If suffixes from different strings
are identical, they occupy consecutive positions in the GSA. For
convenience, we make an exception for the suffix $ by listing it only
once, though it occurs in each string. The GST and GSA of strings
apple and maple are shown in Figure 1.2.
Here you have an article about an algorithm to construct a GSA:
Generalized enhanced suffix array construction in external memory
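To make the quoted definition concrete, here is a naive, quadratic-time sketch that just enumerates and sorts every suffix (fine for illustration, far too slow for real inputs; genSuffixArray is a name made up here):
import Data.List (sortOn, tails)

-- Every suffix of every string, tagged with (string index, start position),
-- sorted lexicographically. A '$' end marker is appended to each suffix,
-- following the convention in the definition above.
genSuffixArray :: [String] -> [(Int, Int)]
genSuffixArray strs =
  map snd . sortOn fst $
    [ (suffix ++ "$", (i, j))
    | (i, s) <- zip [0 ..] strs
    , (j, suffix) <- zip [0 ..] (init (tails s))   -- drop the empty suffix
    ]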

Programming: find the first unique string in a file in just 1 pass

Given a very long list of product names, find the first product name which is unique (occurs exactly once). You can only iterate over the file once.
I am thinking of taking a hash map and storing the (key, count) pairs in a doubly linked list, basically a linked hash map.
Can anyone optimize this or suggest a better approach?
Since you can only iterate the list once, you have to store
* each string that occurs exactly once, because it could be the output
* their relative positions within the list
* each string that occurs more than once (or their hashes, if you're not afraid of collisions)
Notably, you don't have to store the relative positions of strings that occur more than once.
You need
* efficient storage of the set of strings. A hash set is a good candidate, but a trie could offer better compression depending on the set of strings.
* efficient lookup by value. This rules out a bare list. A hash-set is the clear winner, but a trie also performs well. You can store the leaves of the trie in a hash set.
* efficient lookup of the minimum. This asks for a linked list.
Conclusion:
Use a linked hash-set for the set of strings, and a flag indicating if they're unique. If you're fighting for memory, use a linked trie. If a linked trie is too slow, store the trie leaves in a hash map for look-up. Include only the unique strings in the linked list.
In total, your nodes could look like: Node:{Node[] trieEdges, Node trieParent, String inEdge, Node nextUnique, Node prevUnique}; Node firstUnique, Node[] hashMap
If you strive for ease of implementation, you can have two hash-sets instead (one linked).
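For comparison, a single-pass sketch that trades the linked structure for a plain map from name to (first position, count); firstUnique is a name made up here, and it keeps every distinct name in memory:
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- One pass over the names records each name's first position and its count;
-- the earliest name seen exactly once is then reported.
firstUnique :: [String] -> Maybe String
firstUnique names = case uniques of
    [] -> Nothing
    _  -> Just (snd (minimum uniques))
  where
    counts  = foldl' step M.empty (zip [0 :: Int ..] names)
    step m (i, s) = M.insertWith (\_ (j, c) -> (j, c + 1)) s (i, 1 :: Int) m
    uniques = [ (i, s) | (s, (i, c)) <- M.toList counts, c == 1 ]

-- firstUnique ["apple", "banana", "apple", "cherry"] == Just "banana"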
The following algorithm solves it in O(N+M) time.
where
N=number of strings
M=total number of characters put together in all strings.
The steps are as follows:
1. Create a hash value for each string.
2. Xor the hashes together and find the one which didn't have a pair.
Xor has this useful property that if you do a xor a=0 and b xor 0=b.
Tips to generate the hash value for a string:
Use a base-27 number system: give 'a' a value of 1, 'b' a value of 2, and so on up to 'z', which gets 26. So if the string is "abc", we compute the hash value as:
H = 3*(27^0) + 2*(27^1) + 1*(27^2) = 786
You can use the modulus operator to make hash values small enough to fit in 32-bit integers. If you do that, keep an eye out for collisions, which are two different strings that end up with the same hash value because of the modulus operation. Most likely you won't need it.
So compute the hash for each string, then start from the first hash and keep xor-ing; the result will hold the hash value of the string which didn't have a pair.
Caution: this is useful only when strings occur in pairs. Still, it's a good idea to start with, which is why I answered it.
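A sketch of that hash and of the xor fold, with the modulus step left out (hash27 and unpairedHash are illustrative names):
import Data.Bits (xor)
import Data.Char (ord)

-- Base-27 hash from the worked example: 'a' = 1 ... 'z' = 26, with the first
-- character in the highest position.
hash27 :: String -> Int
hash27 = foldl (\acc c -> acc * 27 + (ord c - ord 'a' + 1)) 0
-- hash27 "abc" == 3*27^0 + 2*27^1 + 1*27^2 == 786, matching the example above

-- Xor all hashes together; as cautioned above, this only isolates the odd one
-- out when every other string occurs an even number of times.
unpairedHash :: [String] -> Int
unpairedHash = foldr xor 0 . map hash27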
Using a linked hashmap is obvious enough. Otherwise, you could use a TreeMap style data structure where the strings are ordered by count. So as soon as you are done reading the input, the root of your tree is unique if a unique string exists. Unlike a linked hash map, insertion takes at most O(log n) as opposed to O(n). You can read up on TreeMaps for insight on how to augment a basic TreeMap into what you need. Also in your linked hashmap you may have to travel O(n) to find your first unique key. With a TreeMap style data structure, your look up is O(1) -- the root. Even if more unique keys exist, the first one you encountered will be the root. The subsequent ones will be children of the root.

Searching strings with . wildcard

I have an array with very many strings and want to search for a pattern in it.
The pattern can contain "." wildcards, each of which matches exactly one arbitrary character.
For example:
myset = {"bar", "foo", "cya", "test"}
find(myset, "f.o") -> returns true (matches with "foo")
find(myset, "foo.") -> returns false
find(myset, ".e.t") -> returns true (matches with "test")
find(myset, "cya") -> returns true (matches with "cya")
I tried to find a way to implement this algorithm fast, because myset is actually a very big array, but none of my ideas has satisfactory complexity (for example O(size_of(myset) * length(pattern))).
Edit:
myset is a huge array, but the words in it aren't long.
I can do slow preprocessing, but I'll have very many find() queries, so I want find() to be as fast as possible.
You could build a suffix tree of the corpus of all possible words in your set (see this link).
Using this data structure, your complexity would include a one-time cost of O(n) to build the tree, where n is the sum of the lengths of all your words.
Once the tree is built, finding whether a string matches should take just O(n), where n is the length of the string.
If the set is fixed, you could pre-calculate the frequency of a character c appearing at position p (for as many p values as you consider worthwhile), then search through the array once, for each element testing characters at specific positions in an order such that you are most likely to exit early.
First, divide the corpus into sets per word length. Then your find algorithm can search over the appropriate set, since the input to find() always requires the match to have a specific length, and the algorithm can be designed to work well with all words of the same length.
Next (for each set), create a hash map from a hash of character x position to a list of matching words. It is quite ok to have a large number of hash collisions. You can use delta and run-length encoding to reduce the size of the lists of matching words.
To search, pick the appropriate hash map for the find input length, and for each non-'.' character, calculate the hash for that character x position, and AND together the lists of words to get a much reduced list.
Brute force search through that much smaller list.
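A simplified sketch of that index, skipping the length bucketing and the hash/run-length compression: the keys are exact (character, position) pairs, so the only residual check is the word length (buildIndex is an illustrative name; find mirrors the question's interface):
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Index = M.Map (Char, Int) (S.Set Int)

-- Index every word under each of its (letter, position) pairs.
buildIndex :: [String] -> Index
buildIndex ws = M.fromListWith S.union
  [ ((c, p), S.singleton i) | (i, w) <- zip [0 ..] ws, (p, c) <- zip [0 ..] w ]

-- Intersect the candidate sets for the pattern's non-wildcard positions,
-- then keep only words of exactly the pattern's length.
find :: Index -> [String] -> String -> Bool
find idx ws pat = any (\i -> length (ws !! i) == length pat) (S.toList cands)
  where
    keys  = [ (c, p) | (p, c) <- zip [0 ..] pat, c /= '.' ]
    sets  = map (\k -> M.findWithDefault S.empty k idx) keys
    cands = case sets of
              [] -> S.fromList [0 .. length ws - 1]   -- pattern is all wildcards
              _  -> foldr1 S.intersection sets

-- find (buildIndex myset) myset "f.o" == True, for myset = ["bar","foo","cya","test"]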
If you are sure that the words in your set are not long, you could create a table which holds the following:
List of Words which have first Character 'a' , List of Words which have first Character 'b', ..
List of Words which have second Character 'a', List of words which have second Character 'b', ..
and so on.
When you are searching for a word, you can look for the list of words which have the same first character as the search string's first character. With this refined list, look for the words which have the same second character as the search string's second character, and so on. You can ignore '.' whenever you encounter it.
I understand that building the table may take a large amount of space but the time taken will come down significantly.
For example, if you have myset = {"bar", "foo", "cya", "test"} and you are searching for 'f.o':
the moment you check the list of words starting with 'f', you eliminate the rest of the set. Just an idea... hope it helps.
I had this same question, and I wasn't completely happy with most of the ideas/solutions I found on the internet. I think the "right" way to do this is to use a Directed Acyclic Word Graph. I didn't quite do that, but I added some additional logic to a Trie to get a similar effect.
See my isWord() implementation, analogous to your desired find() interface. It works by recursing down the Trie, branching on wildcard, and then collecting results back into a common set. (See findNodes().)
getMatchingWords() is similar in spirit, except that it returns the set of matching words, instead of just a boolean as to whether or not the query matches anything.
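Since that code isn't reproduced here, a minimal trie sketch of the branch-on-wildcard recursion it describes (isWord below is a stand-in, not the poster's implementation):
import qualified Data.Map.Strict as M

data Trie = Node Bool (M.Map Char Trie)

emptyTrie :: Trie
emptyTrie = Node False M.empty

insert :: String -> Trie -> Trie
insert []     (Node _ m) = Node True m
insert (c:cs) (Node e m) =
  Node e (M.insert c (insert cs (M.findWithDefault emptyTrie c m)) m)

-- Literal characters follow one edge; '.' branches into every edge.
-- A match has to end exactly on an end-of-word node.
isWord :: String -> Trie -> Bool
isWord []       (Node e _) = e
isWord ('.':cs) (Node _ m) = any (isWord cs) (M.elems m)
isWord (c:cs)   (Node _ m) = maybe False (isWord cs) (M.lookup c m)

-- isWord ".e.t" (foldr insert emptyTrie ["bar","foo","cya","test"]) == True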

Efficient data structure for word lookup with wildcards

I need to match a series of user inputed words against a large dictionary of words (to ensure the entered value exists).
So if the user entered:
"orange" it should match an entry "orange' in the dictionary.
Now the catch is that the user can also enter a wildcard or series of wildcard characters like say
"or__ge" which would also match "orange"
The key requirements are:
* this should be as fast as possible.
* use the smallest amount of memory to achieve it.
If the size of the word list were small, I could use a string containing all the words and use regular expressions.
However, given that the word list could contain potentially hundreds of thousands of entries, I'm assuming this wouldn't work.
So would some sort of 'tree' be the way to go for this...?
Any thoughts or suggestions on this would be totally appreciated!
Thanks in advance,
Matt
Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobson's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.
The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.
Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query from beginning with too many wildcards, or else creating multiple dawgs, e.g., a dawg for the mirror image, a dawg for the words rotated left three characters, and so on.
Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.
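The set-of-pointers traversal can be sketched over a plain trie (a real DAWG shares common suffixes and is far smaller, but the traversal is the same; '_' is used as the wildcard here, and matchWild is an illustrative name):
import qualified Data.Map.Strict as M

data Trie = Node Bool (M.Map Char Trie)

emptyTrie :: Trie
emptyTrie = Node False M.empty

insert :: String -> Trie -> Trie
insert []     (Node _ m) = Node True m
insert (c:cs) (Node e m) =
  Node e (M.insert c (insert cs (M.findWithDefault emptyTrie c m)) m)

-- Keep a set (here, a list) of live nodes: a letter makes a deterministic
-- transition in each node, a wildcard fans out to all children.
matchWild :: String -> Trie -> Bool
matchWild pat root = any (\(Node e _) -> e) (foldl step [root] pat)
  where
    step nodes '_' = [ t | Node _ m <- nodes, t <- M.elems m ]
    step nodes c   = [ t | Node _ m <- nodes, Just t <- [M.lookup c m] ]

-- matchWild "or__ge" (foldr insert emptyTrie ["orange","apple"]) == True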
I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)
However, if that isn't good enough, I would probably use a prefix tree for this.
The basic structure is a tree where:
* The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z, assuming you are using a full dictionary...).
* The next level down contains all the possible second letters for each given first letter.
* And so on until you reach an "end of word" marker for each word.
Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm where you either have a direct match for each character position, or in the case of the wildcard you check each of the possible branches.
In the worst case (all wildcards but only one word with the right number of letters right at the end of the dictionary), you would traverse the entire tree but this is still only O(n) in the size of the dictionary so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists since large branches of the search tree are "pruned" with each successive letter.
No matter which algorithm you choose, you have a tradeoff between speed and memory consumption.
If you can afford ~O(N*L) memory (where N is the size of your dictionary and L is the average length of a word), you can try this very fast algorithm. For simplicity, we will assume a Latin alphabet with 26 letters and MAX_LEN as the maximum length of a word.
Create a 2D array of sets of integers, set<int> table[26][MAX_LEN].
For each word in your dictionary, add the word's index to the sets at the positions corresponding to each of its letters. For example, if "orange" is the 12345-th word in the dictionary, you add 12345 to the sets corresponding to [o][0], [r][1], [a][2], [n][3], [g][4], [e][5].
Then, to retrieve words corresponding to "or..ge", you find the intersection of the sets at [o][0], [r][1], [g][4], [e][5].
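A sketch of that table, assuming lowercase a-z words, with '_' treated as the wildcard (the answer above writes it as '.'); buildTable and lookupWild are illustrative names:
import Data.Array (Array, accumArray, bounds, (!))
import Data.Char (ord)
import qualified Data.Set as S

-- table ! (letter, position) is the set of indices of words having that
-- letter at that position, as in set<int> table[26][MAX_LEN] above.
-- Assumes every word is lowercase a-z.
buildTable :: [String] -> Array (Int, Int) (S.Set Int)
buildTable ws = accumArray (flip S.insert) S.empty ((0, 0), (25, maxLen - 1))
    [ ((ord c - ord 'a', p), i) | (i, w) <- zip [0 ..] ws, (p, c) <- zip [0 ..] w ]
  where
    maxLen = maximum (1 : map length ws)   -- stands in for MAX_LEN

-- Intersect the sets for the pattern's literal positions ('_' is skipped),
-- then keep words of exactly the right length.
lookupWild :: Array (Int, Int) (S.Set Int) -> [String] -> String -> [String]
lookupWild table ws pat
  | length pat > maxPos + 1 = []           -- longer than any indexed word
  | otherwise = [ ws !! i | i <- S.toList cands, length (ws !! i) == length pat ]
  where
    (_, (_, maxPos)) = bounds table
    sets  = [ table ! (ord c - ord 'a', p) | (p, c) <- zip [0 ..] pat, c /= '_' ]
    cands = foldr S.intersection (S.fromList [0 .. length ws - 1]) sets

-- lookupWild (buildTable ["orange","apple"]) ["orange","apple"] "or__ge" == ["orange"]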
You can try a string-matrix:
0,1: A
1,5: APPLE
2,5: AXELS
3,5: EAGLE
4,5: HELLO
5,5: WORLD
6,6: ORANGE
7,8: LONGWORD
8,13:SUPERLONGWORD
Let's call this a ragged index-matrix, to spare some memory. Order it on length and then alphabetically. To address a character I use the notation x,y:z: x is the index, y is the length of the entry, and z is the position. The length of your search string is f, and g is the number of entries in the dictionary.
* Create list m, which will contain the indexes x of potential matches.
* Iterate on z from 0 to f.
  * Is it a wildcard and not the last character of the search string? Continue the loop (everything matches).
  * Is m empty?
    * Search through all x from 0 to g for a y that matches the length. !!A!!
    * Does the character at position z match the search string at z? Save x in m.
    * Is m empty? Break the loop (no match).
  * Is m not empty?
    * Search through all elements of m. !!B!!
    * Does it not match the search string? Remove it from m.
    * Is m empty? Break the loop (no match).
A wildcard will always pass the "does it match the search string?" test, and m has the same ordering as the matrix.
!!A!!: Binary search on length of the search string. O(log n)
!!B!!: Binary search on alphabetical ordering. O(log n)
The reason for using a string-matrix is that you already store the length of each string (because it makes searching faster), but it also gives you the length of each entry (assuming the other fields are constant), so that you can easily find the next entry in the matrix, for fast iterating. Ordering the matrix isn't a problem: it only has to be done when the dictionary updates, not during search time.
If you are allowed to ignore case, which I assume, then make all the words in your dictionary and all the search terms the same case before anything else. Upper or lower case makes no difference. If you have some words that are case sensitive and others that are not, break the words into two groups and search each separately.
You are only matching words, so you can break the dictionary into an array of strings. Since you are only doing an exact match against a known length, break the word array into a separate array for each word length. So byLength[3] is the array of all words with length 3. Each word array should be sorted.
Now you have an array of words and a word with potential wildcards to find. Depending on whether and where the wildcards are, there are a few approaches.
If the search term has no wild cards, then do a binary search in your sorted array. You could do a hash at this point, which would be faster but not much. If the vast majority of your search terms have no wildcards, then consider a hash table or an associative array keyed by hash.
If the search term has wildcards after some literal characters, then do a binary search in the sorted array to find an upper and lower bound, then do a linear search within those bounds. If the wildcards are all trailing, then finding a non-empty range is sufficient.
If the search term starts with wild cards, then the sorted array is no help and you would need to do a linear search unless you keep a copy of the array sorted by backwards strings. If you make such an array, then choose it any time there are more trailing than leading literals. If you do not allow leading wildcards then there is no need.
If the search term both starts and ends with wildcards, then you are stuck with a linear search within the words with equal length.
So an array of arrays of strings. Each array of strings is sorted, and contains strings of equal length. Optionally duplicate the whole structure with the sorting based on backwards strings for the case of leading wildcards.
The overall space is one or two pointers per word, plus the words. You should be able to store all the words in a single buffer if your language permits. Of course, if your language does not permit, grep is probably faster anyway. For a million words, that is 4-16MB for the arrays and similar for the actual words.
For a search term with no wildcards, performance would be very good. With wildcards, there will occasionally be linear searches across large groups of words. With the breakdown by length and a single leading character, you should never need to search more than a few percent of the total dictionary even in the worst case. Comparing only whole words of known length will always be faster than generic string matching.
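A sketch of the per-length sorted buckets, with '_' as the wildcard; the binary search for the literal-prefix bounds is replaced by a filter over the sorted bucket to keep the example short (buildDict and findWild are illustrative names):
import qualified Data.Map.Strict as M
import Data.List (sort, isPrefixOf)

-- Words grouped by length, each bucket sorted once up front.
type Dict = M.Map Int [String]

buildDict :: [String] -> Dict
buildDict ws = M.map sort (M.fromListWith (++) [ (length w, [w]) | w <- ws ])

-- Narrow to the right length bucket, then to the words sharing the literal
-- leading prefix (a real implementation would binary-search the sorted bucket
-- for those bounds), then check the remaining positions one by one.
findWild :: Dict -> String -> Bool
findWild dict pat =
  case M.lookup (length pat) dict of
    Nothing -> False
    Just ws ->
      let lit        = takeWhile (/= '_') pat
          candidates = filter (lit `isPrefixOf`) ws
          matches w  = and (zipWith (\p c -> p == '_' || p == c) pat w)
      in any matches candidates

-- findWild (buildDict ["orange","apple"]) "or__ge" == True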
Try building a generalized suffix tree if the dictionary will be matched by a sequence of queries. There is a linear-time algorithm that can be used to build such a tree (Ukkonen's suffix tree construction).
You can easily match each query (in O(k), where k is the size of the query) by traversing from the root node, and use the wildcard character to match any character, as in typical pattern finding in a suffix tree.
