Minimum number of deletions for a given word to become a dictionary word - algorithm

Given a dictionary as a hashtable. Find the minimum # of
deletions needed for a given word in order to make it match any word in the
dictionary.
Is there some clever trick to solve this problem in less than exponential complexity (trying all possible combinations)?

For starters, suppose that you have a single word w in the the hash table and that your word is x. You can delete letters from x to form w if and only if w is a subsequence of x, and in that case the number of letters you need to delete from x to form w is given by |x - w|. So certainly one option would be to just iterate over the hash table and, for each word, to see if x is a subsequence of that word, taking the best match you find across the table.
To analyze the runtime of this operation, let's suppose that there are n total words in your hash table and that their total length is L. Then the runtime of this operation is O(L), since you'll process each character across all the words at most once. The complexity of your initial approach is O(|x| · 2|x|) because there are 2|x| possible words you can make by deleting letters from x and you'll spend O(|x|) time processing each one. Depending on the size of your dictionary and the size of your word, one algorithm might be better than the other, but we can say that the runtime is O(min{L, |x|·2|x|) if you take the better of the two approaches.

You can build a trie and then see where your given word fits into it. The difference in the depth of your word and the closest existing parent is the number of deletions required.

Related

scrabble solving with maximum score

I was asked a question
You are given a list of characters, a score associated with each character and a dictionary of valid words ( say normal English dictionary ). you have to form a word out of the character list such that the score is maximum and the word is valid.
I could think of a solution involving a trie made out of dictionary and backtracking with available characters, but could not formulate properly. Does anyone know the correct approach or come up with one?
First iterate over your letters and count how many times do you have each of the characters in the English alphabet. Store this in a static, say a char array of size 26 where first cell corresponds to a second to b and so on. Name this original array cnt. Now iterate over all words and for each word form a similar array of size 26. For each of the cells in this array check if you have at least as many occurrences in cnt. If that is the case, you can write the word otherwise you can't. If you can write the word you compute its score and maximize the score in a helper variable.
This approach will have linear complexity and this is also the best asymptotic complexity you can possibly have(after all the input you're given is of linear size).
Inspired by Programmer Person's answer (initially I thought that approach was O(n!) so I discarded it). It needs O(nr of words) setup and then O(2^(chars in query)) for each question. This is exponential, but in Scrabble you only have 7 letter tiles at a time; so you need to check only 128 possibilities!
First observation is that the order of characters in query or word doesn't matter, so you want to process your list into a set of bag of chars. A way to do that is to 'sort' the word so "bac", "cab" become "abc".
Now you take your query, and iterate all possible answers. All variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all, 1110 to discard the last letter...
Then check if each possibility exists in your dictionary (hash map for simplicity), then return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product
scores = {c:s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)):w for w in nltk.corpus.words.words()}
while True:
query = input("What do you have?")
if not query: break
# make it look like our preprocessed word list
query = anagram(sanitize(query))
results = {}
# all variants for our query
for mask in product((True, False), repeat=len(query)):
# get the variant given the mask
masked = "".join(c for i, c in enumerate(query) if mask[i])
# check if it's valid
if masked in anagrams:
# score it, also getting the word back would be nice
results[sum(scores[c] for c in masked)] = anagrams[masked]
print(*max(results.items()))
Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.
By sorted anagram I mean: if the word is eat you represent it as aet. It the word is tea, you represent it as aet, bubble is represent as bbbelu etc
Since this is scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to maximum check 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles, and lookup in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: For each word, create a 32 bit mask. Preprocess the data: Set one bit if the letter a/b/c/.../z is used. For the six most common English characters etaoin set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap for the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if the are needed twice. You'll still have to check if a word can be formed in case you have a word like "bubble" and the first test only tells you that you have letters b,u,l,e but not necessarily 3 b's.
By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).
To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
Here is a brute force approach in python, using an english dictionary containing 58,109 words. This approach is actually quite fast timing at about .3 seconds on each run.
from random import shuffle
from string import ascii_lowercase
import time
def getValue(word):
return sum(map( lambda x: key[x], word))
if __name__ == '__main__':
v = range(26)
shuffle(v)
key = dict(zip(list(ascii_lowercase), v))
with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
wordDict = f.read().splitlines()
f.close()
valued = map(lambda x: (getValue(x), x), wordDict)
print max(valued)
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed and the score are fixed and that only the letters available will change (as in scrabble) ? Otherwise, I think there is no better than looking up each word of the dictionnary as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the costs of letters. For instance Q > Z > J > X > K > .. > A >E >I .. > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), and also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D' where the children nodes are ordered by the value of their letters (so the first child of the root would be Q, the second Z, .., the last child one U). On each node of the trie, also store the maximal value of a word going through this node.
Given a set of available characters, you can explore the trie in a depth first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node's value is larger than you current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, exploring any branch that start with a one point letter as A is discarded, because it will score at most 8x1 which is less than the value of Z). I bet that you will explore only a very few branches each time.

Weighted unordered string edit distance

I need an efficient way of calculating the minimum edit distance between two unordered collections of symbols. Like in the Levenshtein distance, which only works for sequences, I require insertions, deletions, and substitutions with different per-symbol costs. I'm also interested in recovering the edit script.
Since what I'm trying to accomplish is very similar to calculating string edit distance, I figured it might be called unordered string edit distance or maybe just set edit distance. However, Google doesn't turn up anything with those search terms, so I'm interested to learn if the problem is known by another name?
To clarify, the problem would be solved by
def unordered_edit_distance(target, source):
return min(edit_distance(target, source_perm)
for source_perm in permuations(source))
So for instance, the unordered_edit_distance('abc', 'cba') would be 0, whereas edit_distance('abc', 'cba') is 2. Unfortunately, the number of permutations grows large very quickly and is not practical even for moderately sized inputs.
EDIT Make it clearer that operations are associated with different costs.
Sort them (not necessary), then remove items which are same (and in equal numbers!) in both sets.
Then if the sets are equal in size, you need that numer of substitutions; if one is greater, then you also need some insertions or deletions. Anyway you need the number of operations equal the size of the greater set remaining after the first phase.
Although your observation is kind of correct, but you are actually make a simple problem more complex.
Since source can be any permutation of the original source, you first need check the difference in character level.
Have two map each map count the number of individual characters in your target and source string:
for example:
a: 2
c: 1
d: 100
Now compare two map, if you missing any character of course you need to insert it, and if you have extra character you delete it. Thats it.
Let's ignore substitutions for a moment.
Now it becomes a fairly trivial problem of determining the elements only in the first set (which would count as deletions) and those only in the second set (which would count as insertions). This can easily be done by either:
Sorting the sets and iterating through both at the same time, or
Inserting each element from the first set into a hash table, then removing each element from the second set from the hash table, with each element not found being an insertion and each element remaining in the hash table after we're done being a deletion
Now, to include substitutions, all that remains is finding the optimal pairing of inserted elements to deleted elements. This is actually the stable marriage problem:
The stable marriage problem (SMP) is the problem of finding a stable matching between two sets of elements given a set of preferences for each element. A matching is a mapping from the elements of one set to the elements of the other set. A matching is stable whenever it is not the case that both:
Some given element A of the first matched set prefers some given element B of the second matched set over the element to which A is already matched, and
B also prefers A over the element to which B is already matched
Which can be solved with the Gale-Shapley algorithm:
The Gale–Shapley algorithm involves a number of "rounds" (or "iterations"). In the first round, first a) each unengaged man proposes to the woman he prefers most, and then b) each woman replies "maybe" to her suitor she most prefers and "no" to all other suitors. She is then provisionally "engaged" to the suitor she most prefers so far, and that suitor is likewise provisionally engaged to her. In each subsequent round, first a) each unengaged man proposes to the most-preferred woman to whom he has not yet proposed (regardless of whether the woman is already engaged), and then b) each woman replies "maybe" to her suitor she most prefers (whether her existing provisional partner or someone else) and rejects the rest (again, perhaps including her current provisional partner). The provisional nature of engagements preserves the right of an already-engaged woman to "trade up" (and, in the process, to "jilt" her until-then partner).
We just need to get the cost correct. To pair an insertion and deletion, making it a substitution, we'll lose both the cost of the insertion and the deletion, and gain the cost of the substitution, so the net cost of the pairing would be substitutionCost - insertionCost - deletionCost.
Now the above algorithm guarantees that all insertion or deletions gets paired - we don't necessarily want this, but there's an easy fix - just create a bunch of "stay-as-is" elements (on both the insertion and deletion side) - any insertion or deletion paired with a "stay-as-is" element would have a cost of 0 and would result in it remaining an insertion or deletion and nothing would happen for two "stay-as-is" elements ending up paired.
Key observation: you are only concerned with how many 'a's, 'b's, ..., 'z's or other alphabet characters are in your strings, since you can reorder all the characters in each string.
So, the problem boils down to the following: having s['a'] characters 'a', s['b'] characters 'b', ..., s['z'] characters 'z', transform them into t['a'] characters 'a', t['b'] characters 'b', ..., t['z'] characters 'z'. If your alphabet is short, s[] and t[] can be arrays; generally, they are mappings from the alphabet to integers, like map <char, int> in C++, dict in Python, etc.
Now, for each character c, you know s[c] and t[c]. If s[c] > t[c], you must remove s[c] - t[c] characters c from the first unordered string (s). If s[c] < t[c], you must add t[c] - s[c] characters c to the second unordered string (t).
Take X, the sum of s[c] - t[c] for all c such that s[c] > t[c], and you will get the number of characters you have to remove from s in total. Take Y, the sum of t[c] - s[c] for all c such that s[c] < t[c], and you will get the number of characters you have to remove from t in total.
Now, let Z = min (X, Y). We can have Z substitutions, and what's left is X - Z insertions and Y - Z deletions. Thus the total number of operations is Z + (X - Z) + (Y - Z), or X + Y - min (X, Y).

Algorithm to match sequential subset from a list

I am trying to remember the right algorithm to find a subset within a set that matches an element of a list of possible subsets. For example, given the input:
aehfaqptpzzy
and the subset list:
{ happy, sad, indifferent }
we can see that the word "happy" is a match because it is inside the input:
a e h f a q p t p z z y
I am pretty sure there is a specific algorithm to find all such matches, but I cannot remember what it is called.
UPDATE
The above example is not very good because it has letter repetitions, in fact in my problem both the dictionary entries and the input string are sortable sets. For example,
input: acegimnrqvy
dictionary:
{ cgn,
dfr,
lmr,
mnqv,
eg }
So in this example the algorithm would return cgn, mnqv and eg as matches. Also, I would like to find the best set of complementary matches where "best" means longest. So, in the example above the "best" answer would be "cgn mnqv", eg would not be a match because it conflicts with cgn which is a longer match.
I realize that the problem can be done by brute force scan, but that is undesirable because there could be thousands of entries in the dictionary and thousands of values in the input string. If we are trying to find the best set of matches, computability will become an issue.
You can use the Aho - Corrasick algorithm with more than one current states. For each of the input letters, you either stay (skip the letter) or move using the appropriate edge. If two or more "actors" meet at the same place, just merge them to one (if you're interested just in the presence and not counts).
About the complexity - this could be as slow as the naive O(MN) approach, because there can be up to size of dictionary actors. However, in practice, we can make a good use of the fact that many words are substrings of others, because there never won't be more than size of the trie actors, which - compared to the size of the dictionary - tends to be much smaller.

Anagram generation - Isnt it kind of subset sum?

Anagram:
An anagram is a type of word play, the result of rearranging the
letters of a word or phrase to produce a new word or phrase, using
all the original letters exactly once;
Subset Sum problem:
The problem is this: given a set of integers, is there a non-empty
subset whose sum is zero?
For example, given the set { −7, −3, −2, 5, 8}, the answer is yes
because the subset { −3, −2, 5} sums to zero. The problem is
NP-complete.
Now say we have a dictionary of n words. Now Anagram Generation problem can be stated as to find a set of words in dictionary(of n words) which use up all letters of the input. So does'nt it becomes a kind of subset sum problem.
Am I wrong?
The two problems are similar but are not isomorphic.
In an anagram the order of the letters matters. In a subset sum, the order does not matter.
In an anagram, all the letters must be used. In a subset sum, any subset will do.
In an anagram, the subgroups must form words taken from a comparatively small dictionary of allowable words (the dictionary). In a subset sum, the groups are unrestricted (no dictionary of allowable groupings).
If you'd prove that solving anagram finding (not more than polynomial number of times) solves subset sum problem - it would be a revolution in computer science (you'd prove P=NP).
Clearly finding anagrams is polynomial-time problem:
Checking if two records are anagrams of each other is as simple as sorting letters and compare the resulting strings (that is C*s*log(s) time, where s - number of letters in a record). You'll have at most n such checks, where n - number of records in a dictionary. So obviously the running time ~ C*s*log(s)*n is limited by a polynomial of input size - your input record and dictionary combined.
EDIT:
All the above is valid only if the anagram finding problem is defined as finding anagram of the input phrase in a dictionary of possible complete phrases.
While the wording of the anagram finding problem in the original question above...
Now say we have a dictionary of n words. Now Anagram Generation problem can be stated as to find a set of words in dictionary(of n words) which use up all letters of the input.
...seems to imply something different - e.g. a possibility that some sort of composition of more than one entry in a dictionary is also a valid choice for a possible anagram of the input.
This however seems immediately problematic and unclear because (1) usually phrase is not just sequence of random words (it should make sense as a whole phrase), (2) usually words in a phrase require separators that are also symbols - so it is not clear if the separators (whitespace characters) are required in the input to allow the separate entries in a dictionary and if separators are allowed in a single dictionary entry.
So in my initial answer above I applied a "semantic razor" by interpreting the problem definition the only way it is unambiguous and makes sense as an "anagram finding".
But also we might interpret the authors definition like this:
Given the dictionary of n letter sequences (separate dictionary entries may contain same sequences) and one target letter sequence - find any subset of the dictionary entries that if concatenated together would be exact rearrangement of the target letter sequence OR determine that such subset does not exist.
^^^- Even though this problem no longer really makes perfect sense as an "anagram finding problem" still it is interesting. It is very different problem to what I considered above.
One more thing remains unclear - the alphabet flexibility. To be specific the problem definition must also specify whether set of letters is fixed OR it is allowed to redefine it for each new solution of the problem when specifying dictionary and target sequence of said letters. That's important - capabilities and complexity depends on that.
The variant of this problem with the ability to define the alphabet (available number of letters) for each solution individually actually is equivalent to a subset sum problem. That makes it NP-complete.
I can prove the equivalence of our problem to a natural number variant of subset sum problem defined as
Given the collection (multiset) of natural numbers (repeated numbers allowed) and the target natural number - find any sub-collection that sums exactly to the target number OR determine that such sub-collection does not exist.
It is not hard to see that mostly linear number of steps is enough to translate one problem input to another and vice versa. So the solution of one problem translates to exactly one solution of another problem plus mostly linear overhead.
This positive-only variant of subset-sum is equivalent to zero-sum subset-sum variant given by the author (see e.g. Subset Sum Wikipedia article).
I think you are wrong.
Anagram Generation must be simpler than Subset Sum, because I can devise a trivial O(n) algorithm to solve it (as defined):
initialize the list of anagrams to an empty list
iterate the dictionary word by word
if all the input letters are used in the ith word
add the word to the list of anagrams
return the list of anagrams
Also, anagrams consist of valid words that are permutations of the input word (i.e. rearrangements) whereas subsets have no concept of order. They may actually include less elements than the input set (hence sub set) but an anagram must always be the same length as the input word.
It isn't NP-Complete because given a single set of letters, the set of anagrams remains identical regardless.
There is always a single mapping that transforms the letters of the input L to a set of anagrams A. so we can say that f(L) = A for any execution of f. I believe, if I understand correctly, that this makes the function deterministic. The order of a Set is irrelevant, so considering a differently ordered solution non-deterministic is invalid, it is also invalid because all entries in a dictionary are unique, and thus can be deterministically ordered.

Efficient data structure for word lookup with wildcards

I need to match a series of user inputed words against a large dictionary of words (to ensure the entered value exists).
So if the user entered:
"orange" it should match an entry "orange' in the dictionary.
Now the catch is that the user can also enter a wildcard or series of wildcard characters like say
"or__ge" which would also match "orange"
The key requirements are:
* this should be as fast as possible.
* use the smallest amount of memory to achieve it.
If the size of the word list was small I could use a string containing all the words and use regular expressions.
however given that the word list could contain potentially hundreds of thousands of enteries I'm assuming this wouldn't work.
So is some sort of 'tree' be the way to go for this...?
Any thoughts or suggestions on this would be totally appreciated!
Thanks in advance,
Matt
Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.
The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.
Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.
Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.
I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)
However if that wasn't good enough I would probably use a prefix tree for this.
The basic structure is a tree where:
The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z assuming you are using a full dictionary...).
The next level down contains all the possible second letters for each given first letter
And so on until you reach an "end of word" marker for each word
Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm where you either have a direct match for each character position, or in the case of the wildcard you check each of the possible branches.
In the worst case (all wildcards but only one word with the right number of letters right at the end of the dictionary), you would traverse the entire tree but this is still only O(n) in the size of the dictionary so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists since large branches of the search tree are "pruned" with each successive letter.
No matter which algorithm you choose, you have a tradeoff between speed and memory consumption.
If you can afford ~ O(N*L) memory (where N is the size of your dictionary and L is the average length of a word), you can try this very fast algorithm. For simplicity, will assume latin alphabet with 26 letters and MAX_LEN as the max length of word.
Create a 2D array of sets of integers, set<int> table[26][MAX_LEN].
For each word in you dictionary, add the word index to the sets in the positions corresponding to each of the letters of the word. For example, if "orange" is the 12345-th word in the dictionary, you add 12345 to the sets corresponding to [o][0], [r][1], [a][2], [n][3], [g][4], [e][5].
Then, to retrieve words corresponding to "or..ge", you find the intersection of the sets at [o][0], [r][1], [g][4], [e][5].
You can try a string-matrix:
0,1: A
1,5: APPLE
2,5: AXELS
3,5: EAGLE
4,5: HELLO
5,5: WORLD
6,6: ORANGE
7,8: LONGWORD
8,13:SUPERLONGWORD
Let's call this a ragged index-matrix, to spare some memory. Order it on length, and then on alphabetical order. To address a character I use the notation x,y:z: x is the index, y is the length of the entry, z is the position. The length of your string is f and g is the number of entries in the dictionary.
Create list m, which contains potential match indexes x.
Iterate on z from 0 to f.
Is it a wildcard and not the latest character of the search string?
Continue loop (all match).
Is m empty?
Search through all x from 0 to g for y that matches length. !!A!!
Does the z character matches with search string at that z? Save x in m.
Is m empty? Break loop (no match).
Is m not empty?
Search through all elements of m. !!B!!
Does not match with search? Remove from m.
Is m empty? Break loop (no match).
A wildcard will always pass the "Match with search string?". And m is equally ordered as the matrix.
!!A!!: Binary search on length of the search string. O(log n)
!!B!!: Binary search on alphabetical ordering. O(log n)
The reason for using a string-matrix is that you already store the length of each string (because it makes it search faster), but it also gives you the length of each entry (assuming other constant fields), such that you can easily find the next entry in the matrix, for fast iterating. Ordering the matrix isn't a problem: since this has only be done once the dictionary updates, and not during search-time.
If you are allowed to ignore case, which I assume, then make all the words in your dictionary and all the search terms the same case before anything else. Upper or lower case makes no difference. If you have some words that are case sensitive and others that are not, break the words into two groups and search each separately.
You are only matching words, so you can break the dictionary into an array of strings. Since you are only doing an exact match against a known length, break the word array into a separate array for each word length. So byLength[3] is the array off all words with length 3. Each word array should be sorted.
Now you have an array of words and a word with potential wild cards to find. Depending on wether and where the wildcards are, there are a few approaches.
If the search term has no wild cards, then do a binary search in your sorted array. You could do a hash at this point, which would be faster but not much. If the vast majority of your search terms have no wildcards, then consider a hash table or an associative array keyed by hash.
If the search term has wildcards after some literal characters, then do a binary search in the sorted array to find an upper and lower bound, then do a linear search in that bound. If the wildcards are all trailing then finding a non empty range is sufficient.
If the search term starts with wild cards, then the sorted array is no help and you would need to do a linear search unless you keep a copy of the array sorted by backwards strings. If you make such an array, then choose it any time there are more trailing than leading literals. If you do not allow leading wildcards then there is no need.
If the search term both starts and ends with wildcards, then you are stuck with a linear search within the words with equal length.
So an array of arrays of strings. Each array of strings is sorted, and contains strings of equal length. Optionally duplicate the whole structure with the sorting based on backwards strings for the case of leading wildcards.
The overall space is one or two pointers per word, plus the words. You should be able to store all the words in a single buffer if your language permits. Of course, if your language does not permit, grep is probably faster anyway. For a million words, that is 4-16MB for the arrays and similar for the actual words.
For a search term with no wildcards, performance would be very good. With wildcards, there will occasionally be linear searches across large groups of words. With the breakdown by length and a single leading character, you should never need to search more than a few percent of the total dictionary even in the worst case. Comparing only whole words of known length will always be faster than generic string matching.
Try to build a Generalized Suffix Tree if the dictionary will be matched by sequence of queries. There is linear time algorithm that can be used to build such tree (Ukkonen Suffix Tree Construction).
You can easily match (it's O(k), where k is the size of the query) each query by traversing from the root node, and use the wildcard character to match any character like typical pattern finding in suffix tree.

Resources