Fast prefix search with ordered dictionary - algorithm

Given a dictionary of strings D and an input string S, I'm trying to find a string p from D that is a prefix of S.
For an unordered dictionary, the fastest way seems to be building a trie for D and traversing it along the initial characters of S. As the strings in D are unordered, the most natural search algorithm here is the one that finds the longest prefix p.
However, I need to preserve a special input order for the strings in D. For example, for D = [bar, foo, foobar] and S = foobariously, the above search would yield p = foobar, as it is the longest prefix. But instead I would like to get p = foo, because foo occurs earlier in the input list.
What is the fastest algorithm for that kind of prefix search? I presume that the basic approach still involves a trie, but I don't know how to integrate the original ordering into it.

Just build a trie, but when adding an element, if you find one already there along the way, drop the new element, because the existing one is better.
That is, when trying to add 'foobar' you'd traverse the trie to 'foo' and realize that you'll never want 'foobar' so drop it.
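A minimal sketch of that idea in Rust (PrefixDict and the HashMap-per-node layout are just my choices, not anything from the question). The invariant that makes it work: if two surviving words are both prefixes of S, the longer one must have come earlier in the input, otherwise the shorter, earlier word would have caused the longer one to be dropped; so an ordinary longest-prefix search returns the earliest match.

use std::collections::HashMap;

#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    terminal: bool, // a surviving dictionary word ends here
}

#[derive(Default)]
struct PrefixDict {
    root: Node,
}

impl PrefixDict {
    // Insert words in their original input order. Passing a terminal
    // node on the way down means an earlier word is a prefix of this
    // one, so this word can never win and is dropped.
    fn insert(&mut self, word: &str) {
        let mut node = &mut self.root;
        for c in word.chars() {
            if node.terminal {
                return; // dominated by an earlier, shorter word
            }
            node = node.children.entry(c).or_default();
        }
        node.terminal = true;
    }

    // Ordinary longest-prefix search; by the invariant above, the
    // longest surviving prefix of s is also the earliest in input order.
    fn find<'a>(&self, s: &'a str) -> Option<&'a str> {
        let mut node = &self.root;
        let mut best = None;
        let mut len = 0;
        for c in s.chars() {
            if node.terminal {
                best = Some(&s[..len]);
            }
            match node.children.get(&c) {
                Some(next) => node = next,
                None => return best,
            }
            len += c.len_utf8();
        }
        if node.terminal {
            best = Some(&s[..len]);
        }
        best
    }
}

fn main() {
    let mut d = PrefixDict::default();
    for w in ["bar", "foo", "foobar"] {
        d.insert(w); // "foobar" is dropped: it passes "foo"'s terminal
    }
    assert_eq!(d.find("foobariously"), Some("foo"));
}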

Related

Data structure for fast searching words consisting of given letters

Given a random string, I want to find every word in a dictionary that consists of only those letters. Input characters can be ignored, so for the string "ccta" we could find "act" or "cat".
How should I implement a data structure to accomplish this goal?
It could be just a plain text file, but that would be slow and not interesting. My thoughts are to first build a frequency map for the given string:
use std::collections::BTreeMap;

pub trait FreqMap {
    type Content;
    type Count;
    fn frequency_map(&self) -> BTreeMap<Self::Content, Self::Count>;
}

impl FreqMap for str {
    type Content = char;
    type Count = usize;
    fn frequency_map(&self) -> BTreeMap<char, usize> {
        let mut freqmap = BTreeMap::new();
        // Count every occurrence of each character.
        for c in self.chars() {
            *freqmap.entry(c).or_insert(0) += 1;
        }
        freqmap
    }
}
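For reference, a quick check of the trait above against the question's "ccta" string:

fn main() {
    let freq = "ccta".frequency_map();
    assert_eq!(freq[&'c'], 2);
    assert_eq!(freq[&'t'], 1);
    assert_eq!(freq[&'a'], 1);
}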
Then I would build some data structure which could be "indexed" by such frequency maps. I could convert a dictionary into such structure and searching will be very fast.
What is the best way for indexing a dictionary by such a frequency map?
For the dictionary part, I think you can use the Trie data structure.
You can learn more about it here, and there is a good implementation (in C, though) and tutorial here.
It is essentially a search tree which stores strings, or rather string prefixes, making it perfect for implementing dictionaries.
You can first build tries for the words in your dictionary, for instance one trie per initial letter, so that all the words starting with that letter are stored together.
For the searching part, a (somewhat inefficient) solution might be to generate all the permutations of your given string and search for them in the created tries. If a match is found for any prefix of a permuted string, it can be returned as well.
1) Sort the unique letters in each word. Then make a dictionary that maps each sorted-letter-string to the list of words containing exactly the same letters.
2) Make a Patricia trie (https://en.wikipedia.org/wiki/Radix_tree) containing all the sorted-letter-strings.
To do a search, first make a set of the valid letters. Then you can do a depth-first search on the Patricia trie to find all the entries containing only those letters, and expand the associated lists of words. This is a normal depth-first search, except you stop following a path when it contains a letter that's not in the valid set.
When you sort the word strings, use an ordering that puts the least-common letters first. That way the trie will be shallower and your DFS will have to search fewer branches on average.
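A sketch of this approach in Rust, with two simplifications (all names are mine): a plain trie stands in for the Patricia trie, and the keys use plain sorted order rather than the least-common-letters-first ordering suggested above.

use std::collections::{BTreeSet, HashMap};

#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    words: Vec<String>, // words whose sorted-unique-letter key ends here
}

// Key = the word's distinct letters in sorted order ("cat" -> "act").
fn key(word: &str) -> Vec<char> {
    word.chars().collect::<BTreeSet<_>>().into_iter().collect()
}

fn insert(root: &mut Node, word: &str) {
    let mut node = root;
    for c in key(word) {
        node = node.children.entry(c).or_default();
    }
    node.words.push(word.to_string());
}

// Depth-first search, abandoning any edge whose letter is not in the
// valid set built from the input string.
fn search<'a>(node: &'a Node, valid: &BTreeSet<char>, out: &mut Vec<&'a String>) {
    out.extend(&node.words);
    for (c, child) in &node.children {
        if valid.contains(c) {
            search(child, valid, out);
        }
    }
}

fn main() {
    let mut root = Node::default();
    for w in ["act", "cat", "dog"] {
        insert(&mut root, w);
    }
    let valid: BTreeSet<char> = "ccta".chars().collect();
    let mut found = Vec::new();
    search(&root, &valid, &mut found);
    // "act" and "cat" share the key ['a','c','t'] and are both found.
    println!("{found:?}");
}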
In terms of algorithms, having a way to reduce each word into a key which:
equals the key of another word with the same letters (shuffled)
differs from the key of another word with at least one different letter
and then using a dictionary key -> [word] seems like a reasonable choice.
For the key I would propose using a sorted Vec<char>, as it is likely more efficient than a BTreeMap. Most notably, the vector needs only a single memory allocation, and its comparison is a straightforward memcmp.
For the dictionary, I would propose using a HashMap: HashMap<FreqMap, Vec<String>>.
How to go from actt to act and cat?
Search for actt, find tact (and maybe others).
Search for act, att and ctt (removing one letter each time) and find act, cat and tat.
...
Not the most efficient, but you have no way to store every possible input in memory anyway.
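A sketch of the whole scheme in Rust (the Key alias and helper names are mine). It performs one level of letter removal, as in the example above; deeper removals would just recurse on the shorter keys.

use std::collections::HashMap;

type Key = Vec<char>; // the word's letters, sorted, repeats kept

fn key(word: &str) -> Key {
    let mut k: Vec<char> = word.chars().collect();
    k.sort_unstable();
    k
}

fn build_index(dictionary: &[&str]) -> HashMap<Key, Vec<String>> {
    let mut index: HashMap<Key, Vec<String>> = HashMap::new();
    for w in dictionary {
        index.entry(key(w)).or_default().push(w.to_string());
    }
    index
}

// Look up the input's key, then every key obtained by removing one
// letter (the "actt" -> "act", "att", "ctt" step above).
fn search(index: &HashMap<Key, Vec<String>>, input: &str) -> Vec<String> {
    let full = key(input);
    let mut found = Vec::new();
    let mut lookup = |k: &Key| {
        if let Some(words) = index.get(k) {
            found.extend(words.iter().cloned());
        }
    };
    lookup(&full);
    for i in 0..full.len() {
        // Removing either of two equal adjacent letters yields the
        // same key, so skip the duplicate removal.
        if i + 1 < full.len() && full[i] == full[i + 1] {
            continue;
        }
        let mut shorter = full.clone();
        shorter.remove(i);
        lookup(&shorter);
    }
    found
}

fn main() {
    let index = build_index(&["act", "cat", "tat", "tact"]);
    // Exact key finds "tact"; the one-removal keys find "tat",
    // then "act" and "cat".
    println!("{:?}", search(&index, "actt"));
}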
Remark: chars are Unicode code points, not graphemes. Depending on the languages/strings you process this may matter; for example, if the letters á and é are encoded as a + ´ and e + ´ respectively, then aé and áe would both yield the same key (a: 1, e: 1, ´: 1) even though they differ.

Matching against longest leading substring

I'm looking to return the best match of a list against a collection of lists.
x matches against a list in the collection if the list in the collection of length n matches the first n elements of x.
e.g. [1,2,3] matches against [1,2] but [1,2] does not match against [1,2,3].
I want the function to return the "best" match, that is, the match that is the longest.
e.g.
bestMatch [1,2,3,3] [[1],[1,2,3],[1,2],[1,2,3,2],[1,2,3,4]] == Just [1,2,3]
Obviously a list here isn't the best data structure, and I'd rather use a standard structure and search rather than roll my own. Any ideas what I should be using, and how?
I don't think hash tables will work because the matches aren't exact. I then thought about searching against an ordered tree, but it has the problem that if I search for [1,2,100], I'll get [1,2,99], [1,2,98], ... etc before getting the correct answer, [1,2]. Could use a hash of hashes (and so-on down the tree) but that seems like a lot of overhead.
(A linear search list based implementation is here)
A trie would be a good solution. In your case, values would be just (), marking that a given node corresponds to an end of a list. Then, given a list, you'll just traverse the trie as far down as possible, and the last encountered value will mark the longest matched list.
A ByteString-based trie in Data.Trie offers match, which seems to be exactly what you're looking for (if 8-bit char keys are sufficient for you):
-- | Given a query, find the longest prefix with an associated value in
-- the trie, returning that prefix, its value, and the remaining string.
match :: Trie a -> ByteString -> Maybe (ByteString, a, ByteString)
There is also another package, list-tries, which has more generic keys. I'm not sure whether it has an exact equivalent of match above, but it would certainly be possible to implement one.
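If neither package fits exactly, the traversal is easy to hand-roll. Here is a sketch in Rust using the question's integer lists, with a boolean end flag playing the role of the () value (the type and method names are mine):

use std::collections::HashMap;

#[derive(Default)]
struct Trie {
    children: HashMap<i32, Trie>,
    end: bool, // a stored list terminates at this node
}

impl Trie {
    fn insert(&mut self, list: &[i32]) {
        let mut node = self;
        for &x in list {
            node = node.children.entry(x).or_default();
        }
        node.end = true;
    }

    // Walk as far down as the query allows; the last node marked
    // `end` gives the longest stored prefix.
    fn best_match<'a>(&self, query: &'a [i32]) -> Option<&'a [i32]> {
        let mut node = self;
        let mut best = None;
        for (i, &x) in query.iter().enumerate() {
            if node.end {
                best = Some(&query[..i]);
            }
            match node.children.get(&x) {
                Some(next) => node = next,
                None => return best,
            }
        }
        if node.end {
            best = Some(query);
        }
        best
    }
}

fn main() {
    let mut t = Trie::default();
    for l in [vec![1], vec![1, 2, 3], vec![1, 2], vec![1, 2, 3, 2], vec![1, 2, 3, 4]] {
        t.insert(&l);
    }
    // Mirrors the question: bestMatch [1,2,3,3] ... == Just [1,2,3].
    assert_eq!(t.best_match(&[1, 2, 3, 3]), Some(&[1, 2, 3][..]));
}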

How to Modify a Suffix Array to search multiple strings?

I've recently been updating my knowledge of algorithms and have been reading up on suffix arrays. Every text I've read has defined them as an array of suffixes over a single search string, but some articles have mentioned that it's 'trivial' to generalize to an entire list of search strings, and I can't see how.
Assume I'm trying to implement a simple substring search over a word list and wish to return a list of words matching a given substring. The naive approach would appear to be to insert the lexicographic end character '$' between words in my list, concatenate them all together, and produce a suffix tree from the result. But this would seem to generate large numbers of irrelevant entries. If I create a source string of 'banana$muffin' then I'll end up generating suffixes for 'ana$muffin' which I'll never use.
I'd appreciate any hints as to how to do this right, or better yet, a pointer to some algorithm texts that handle this case.
With suffix arrays you usually don't work with multiple strings, just one string: the concatenated version of the several strings, with an end token (a different one for every string) between them. For the suffix array, you use pointers (or array indices) to reference each suffix; only the position of its first token/character is needed.
So the space required is the array plus one pointer per suffix. (That is just a pretty simple implementation; you should do more to get better performance.)
In that case you can optimize the sorting of the suffixes, since each suffix only needs to be compared up to its end token. Everything behind the end token does not need to be considered by the sorting algorithm.
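As a concrete illustration, here is a naive generalized suffix array in Rust that represents each suffix as a (string index, offset) pair and never compares past a string's end, so the end token stays implicit. The function name is mine; byte offsets assume ASCII input, and a real implementation would use a proper suffix array construction algorithm rather than comparison sorting.

// O(N^2 log N) worst case, for illustration only.
fn generalized_suffix_array(strings: &[&str]) -> Vec<(usize, usize)> {
    let mut suffixes: Vec<(usize, usize)> = Vec::new();
    for (i, s) in strings.iter().enumerate() {
        for j in 0..s.len() {
            suffixes.push((i, j));
        }
    }
    // Compare the suffix text itself; ties between identical suffixes
    // of different strings are broken by string index.
    suffixes.sort_by(|&(i1, j1), &(i2, j2)| {
        strings[i1][j1..].cmp(&strings[i2][j2..]).then(i1.cmp(&i2))
    });
    suffixes
}

fn main() {
    // The apple/maple pair from the passage quoted below.
    let strings = ["apple", "maple"];
    for (i, j) in generalized_suffix_array(&strings) {
        println!("({i}, {j}): {}", &strings[i][j..]);
    }
}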
After having now read through most of the book Algorithms on Strings, Trees and Sequences by Dan Gusfield, the answer seems clear.
If you start with a multi-string suffix tree, one of the standard conversion algorithms will still work. However, instead of getting an array of integers, you end up with an array of lists. Each list contains one or more pairs of a string identifier and a starting offset in that string.
The resulting structure is still useful, but not as efficient as a normal suffix array.
From Iowa State University, taken from Prefix.pdf:
Suffix trees and suffix arrays can be generalized to multiple strings. The generalized suffix tree of a set of strings S = {s1, s2, ..., sk}, denoted GST(S) or simply GST, is a compacted trie of all suffixes of each string in S. We assume that the unique termination character $ is appended to the end of each string. A leaf label now consists of a pair of integers (i, j), where i denotes that the suffix is from string s_i and j denotes the starting position of the suffix in s_i. Similarly, an edge label in a GST is a substring of one of the strings. An edge label is represented by a triplet of integers (i, j, l), where i denotes the string number, and j and l denote the starting and ending positions of the substring in s_i. For convenience of understanding, we will continue to show the actual edge labels. Note that two strings may have identical suffixes. This is compensated by allowing leaves in the tree to have multiple labels. If a leaf is multiply labelled, each suffix should come from a different string. If N is the total number of characters (including the $ in each string) of all strings in S, the GST has at most N leaf nodes and takes up O(N) space.
The generalized suffix array of S, denoted GSA(S) or simply GSA, is a lexicographically sorted array of all suffixes of each string in S. Each suffix is represented by an integer pair (i, j) denoting the suffix starting from position j in s_i. If suffixes from different strings are identical, they occupy consecutive positions in the GSA. For convenience, we make an exception for the suffix $ by listing it only once, though it occurs in each string. The GST and GSA of strings apple and maple are shown in Figure 1.2.
Here is an article about an algorithm to construct a GSA:
Generalized enhanced suffix array construction in external memory

Algorithm to match sequential subset from a list

I am trying to remember the right algorithm to find a subset within a set that matches an element of a list of possible subsets. For example, given the input:
aehfaqptpzzy
and the subset list:
{ happy, sad, indifferent }
we can see that the word "happy" is a match because it is inside the input:
a e h f a q p t p z z y
I am pretty sure there is a specific algorithm to find all such matches, but I cannot remember what it is called.
UPDATE
The above example is not very good because it has letter repetitions, in fact in my problem both the dictionary entries and the input string are sortable sets. For example,
input: acegimnrqvy
dictionary:
{ cgn,
dfr,
lmr,
mnqv,
eg }
So in this example the algorithm would return cgn, mnqv and eg as matches. Also, I would like to find the best set of complementary matches, where "best" means longest. So, in the example above the "best" answer would be "cgn mnqv"; eg would not be included because it conflicts with cgn, which is a longer match.
I realize that the problem can be solved by a brute-force scan, but that is undesirable because there could be thousands of entries in the dictionary and thousands of values in the input string. If we are trying to find the best set of matches, tractability becomes an issue.
You can use the Aho-Corasick algorithm with more than one current state. For each of the input letters, each "actor" either stays (skips the letter) or moves along the appropriate edge. If two or more actors meet at the same place, just merge them into one (if you're interested only in presence and not counts).
About the complexity: this could be as slow as the naive O(MN) approach, because there can be up to (size of the dictionary) actors. In practice, however, we can make good use of the fact that many words are substrings of others: there will never be more than (size of the trie) actors, and the trie, compared to the size of the dictionary, tends to be much smaller.
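A sketch of that multi-actor sweep in Rust (all names are mine). It uses a plain trie; the failure links of Aho-Corasick proper aren't needed for the stay-or-advance sweep. It only enumerates the matches; picking the best set of non-conflicting matches is a separate optimization step.

use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct Node {
    children: HashMap<char, usize>,
    word: Option<String>, // set when a dictionary word ends at this node
}

struct Trie {
    nodes: Vec<Node>,
}

impl Trie {
    fn new() -> Self {
        Trie { nodes: vec![Node::default()] }
    }

    fn insert(&mut self, word: &str) {
        let mut cur = 0;
        for c in word.chars() {
            let next = self.nodes[cur].children.get(&c).copied();
            cur = match next {
                Some(n) => n,
                None => {
                    self.nodes.push(Node::default());
                    let n = self.nodes.len() - 1;
                    self.nodes[cur].children.insert(c, n);
                    n
                }
            };
        }
        self.nodes[cur].word = Some(word.to_string());
    }
}

// One pass over the input, keeping a set of "actors" (trie node ids).
// On each letter every actor either stays put (skips the letter) or
// advances along the matching edge; the set merges actors that meet.
fn matches(trie: &Trie, input: &str) -> Vec<String> {
    let mut active: HashSet<usize> = HashSet::new();
    active.insert(0);
    let mut found = Vec::new();
    for c in input.chars() {
        let mut next = active.clone(); // staying put is always allowed
        for &node in &active {
            if let Some(&child) = trie.nodes[node].children.get(&c) {
                if next.insert(child) {
                    if let Some(w) = &trie.nodes[child].word {
                        found.push(w.clone());
                    }
                }
            }
        }
        active = next;
    }
    found
}

fn main() {
    let mut trie = Trie::new();
    for w in ["cgn", "dfr", "lmr", "mnqv", "eg"] {
        trie.insert(w);
    }
    // The update's example: cgn, mnqv and eg match acegimnrqvy.
    let mut found = matches(&trie, "acegimnrqvy");
    found.sort();
    assert_eq!(found, ["cgn", "eg", "mnqv"]);
}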

Searching strings with . wildcard

I have an array with a great many strings and want to search for a pattern in it.
The pattern can contain the "." wildcard, which matches exactly one arbitrary character.
For example:
myset = {"bar", "foo", "cya", "test"}
find(myset, "f.o") -> returns true (matches with "foo")
find(myset, "foo.") -> returns false
find(myset, ".e.t") -> returns true (matches with "test")
find(myset, "cya") -> returns true (matches with "cya")
I tried to find a way to implement this search efficiently because myset is actually a very big array, but none of my ideas has satisfactory complexity (for example, O(size_of(myset) * length(pattern))).
Edit:
myset is a huge array; the words in it aren't big.
I can afford slow preprocessing, but I'll have a great many find() queries, so I want find() to be as fast as possible.
You could build a suffix tree of the corpus of all possible words in your set (see this link)
Using this data structure, your complexity would include a one-time cost of O(n) to build the tree, where n is the sum of the lengths of all your words.
Once the tree is built, finding whether a string matches should take just O(n), where n is the length of the string.
If the set is fixed, you could pre-calculate the frequency of each character c appearing at position p (for as many values of p as you consider worthwhile), then search through the array once, testing each element's characters at specific positions in an order chosen so that you are most likely to exit early.
First, divide the corpus into sets by word length. Then your find algorithm can search the appropriate set, since the input to find() always requires the match to have a specific length, and the algorithm can be designed to work well with all words of the same length.
Next (for each set), create a hash map from a hash of character × position to a list of matching words. It is quite OK to have a large number of hash collisions. You can use delta and run-length encoding to reduce the size of the lists of matching words.
To search, pick the appropriate hash map for the find input's length, and for each non-'.' character, calculate the hash for that character × position and AND together the lists of words to get a much reduced list.
Brute force search through that much smaller list.
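A sketch of this scheme in Rust (type and function names are mine), simplified to exact (position, character) keys rather than a lossy hash, which also makes the final brute-force verification unnecessary; the delta/run-length encoding of the id lists is left out.

use std::cmp::Ordering;
use std::collections::HashMap;

// Index for words of one fixed length; the caller routes each find()
// to the index matching the pattern's length.
struct FixedLenIndex {
    words: Vec<String>,
    by_pos_char: HashMap<(usize, char), Vec<usize>>, // sorted word ids
}

impl FixedLenIndex {
    fn new(words: &[&str]) -> Self {
        let mut by_pos_char: HashMap<(usize, char), Vec<usize>> = HashMap::new();
        for (id, w) in words.iter().enumerate() {
            for (pos, c) in w.chars().enumerate() {
                by_pos_char.entry((pos, c)).or_default().push(id);
            }
        }
        let words = words.iter().map(|w| w.to_string()).collect();
        FixedLenIndex { words, by_pos_char }
    }

    // AND together the id lists of every non-'.' pattern position.
    fn find(&self, pattern: &str) -> bool {
        let mut candidates: Option<Vec<usize>> = None; // None = all words
        for (pos, c) in pattern.chars().enumerate() {
            if c == '.' {
                continue;
            }
            let Some(list) = self.by_pos_char.get(&(pos, c)) else {
                return false;
            };
            candidates = Some(match candidates {
                None => list.clone(),
                Some(prev) => intersect(&prev, list),
            });
            if candidates.as_ref().is_some_and(|v| v.is_empty()) {
                return false;
            }
        }
        // An all-'.' pattern matches any word of this length.
        candidates.map_or(!self.words.is_empty(), |v| !v.is_empty())
    }
}

// Linear merge of two sorted id lists.
fn intersect(a: &[usize], b: &[usize]) -> Vec<usize> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
        }
    }
    out
}

fn main() {
    let len3 = FixedLenIndex::new(&["bar", "foo", "cya"]);
    assert!(len3.find("f.o"));
    assert!(!len3.find("x.a"));
}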
If you are sure that the words in your set are not long, you could create a table which holds the following:
List of Words which have first Character 'a' , List of Words which have first Character 'b', ..
List of Words which have second Character 'a', List of words which have second Character 'b', ..
and so on.
When you are searching for a word, you can look at the list of words which have the same first character as the search string's first character. Within this refined list, look for the words which have the same second character as the search string's second character, and so on. You can ignore '.' whenever you encounter it.
I understand that building the table may take a large amount of space, but the time taken will come down significantly.
For example, if you have myset = {"bar", "foo", "cya", "test"} and you are searching for 'f.o', the moment you check the list of words starting with f, you eliminate the rest of the set. Just an idea; hope it helps.
I had this same question, and I wasn't completely happy with most of the ideas/solutions I found on the internet. I think the "right" way to do this is to use a Directed Acyclic Word Graph. I didn't quite do that, but I added some additional logic to a Trie to get a similar effect.
See my isWord() implementation, analogous to your desired find() interface. It works by recursing down the Trie, branching on wildcard, and then collecting results back into a common set. (See findNodes().)
getMatchingWords() is similar in spirit, except that it returns the set of matching words, instead of just a boolean as to whether or not the query matches anything.
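isWord() and findNodes() belong to the poster's linked code, which isn't shown here; the sketch below reconstructs the described recursion on a plain trie in Rust (without the DAWG's node sharing), branching into every child on a '.' wildcard.

use std::collections::HashMap;

#[derive(Default)]
struct Trie {
    children: HashMap<char, Trie>,
    end: bool,
}

impl Trie {
    fn insert(&mut self, word: &str) {
        let mut node = self;
        for c in word.chars() {
            node = node.children.entry(c).or_default();
        }
        node.end = true;
    }

    // Recurse down the trie; '.' branches into every child, a concrete
    // character follows at most one edge.
    fn is_word(&self, pattern: &str) -> bool {
        let mut chars = pattern.chars();
        match chars.next() {
            None => self.end,
            Some('.') => self
                .children
                .values()
                .any(|child| child.is_word(chars.as_str())),
            Some(c) => self
                .children
                .get(&c)
                .is_some_and(|child| child.is_word(chars.as_str())),
        }
    }
}

fn main() {
    // The question's four find() examples.
    let mut t = Trie::default();
    for w in ["bar", "foo", "cya", "test"] {
        t.insert(w);
    }
    assert!(t.is_word("f.o"));
    assert!(!t.is_word("foo."));
    assert!(t.is_word(".e.t"));
    assert!(t.is_word("cya"));
}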
