I'm sure there is a post on this, but I couldn't find one asking this exact question. Consider the following:
We have a word dictionary available
We are fed many paragraphs of words, and I wish to be able to predict the next word in a sentence given this input.
Say we have a few sentences such as "Hello my name is Tom", "His name is jerry", "He goes where there is no water". For each word, we check a hash table to see whether it already exists; if it does not, we assign it a unique id and put it in the hash table. This way, instead of storing a "chain" of words as a bunch of strings, we can just store a list of unique ids.
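For concreteness, here is a minimal sketch of the id assignment I have in mind (Python; the names are just illustrative):

from collections import defaultdict

ids = {}

def encode(sentence):
    chain = []
    for w in sentence.lower().split():
        if w not in ids:
            ids[w] = len(ids)      # assign the next unused id
        chain.append(ids[w])
    return chain

print(encode("Hello my name is Tom"))  # [0, 1, 2, 3, 4]
print(encode("His name is jerry"))     # [5, 2, 3, 6]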
Above, we would have for instance (0, 1, 2, 3, 4), (5, 2, 3, 6), and (7, 8, 9, 10, 3, 11, 12). Note that 3 is "is" and we added new unique ids as we discovered new words. So say we are given a sentence "her name is"; this would be (13, 2, 3). We want to know, given this context, what the next word should be. This is the algorithm I thought of, but I don't think it's efficient:
We have a list of N chains (observed sentences) where a chain may be ex. 3,6,2,7,8.
Each chain is on average size M, where M is the average sentence length
We are given a new chain of size S, ex. 13, 2, 3, and we wish to know what is the most probable next word?
Algorithm:
First scan the entire list of chains for those that contain the full S input (13, 2, 3 in this example). Since we have to scan N chains, each of length M, and compare up to S ids at a time, it's O(N*M*S).
If no chains in our scan contain the full S, scan again after removing the least significant word (i.e. the first one, so remove 13). Now scan for (2, 3) as in step 1, which in the worst case is O(N*M*(S-1)).
Continue scanning this way until we get results > 0 (if ever).
Tally the next words in all of the remaining chains we have gathered. We can use a hash table which counts every time we add, and keeps track of the most added word. O(N) worst case build, O(1) to find max word.
The max word found is the most likely, so return it.
Each scan takes O(M*N*S) worst case. This is because there are N chains, each chain has M numbers, and we must check up to S numbers when looking for a match. We scan S times in the worst case (13,2,3, then 2,3, then 3: 3 scans = S). Thus, the total complexity is O(S^2 * M * N).
So if we have 100,000 chains and an average sentence length of 10 words, we're looking at 1,000,000*S^2 to get the optimal word. Clearly, N >> M, since sentence length does not scale with number of observed sentences in general, so M can be a constant. We can then reduce the complexity to O(S^2 * N). O(S^2 * M * N) may be more helpful for analysis though, since M can be a sizeable "constant".
This could be the complete wrong approach to take for this type of problem, but I wanted to share my thoughts instead of just blatantly asking for assistance. The reason I'm scanning the way I do is because I only want to scan as much as I have to. If nothing has the full S, just keep pruning S until some chains match. If they never match, we have no idea what to predict as the next word! Any suggestions on a less time/space complex solution? Thanks!
This is the problem of language modeling. For a baseline approach, the only thing you need is a hash table mapping fixed-length chains of words, say of length k, to the most probable following word. (*)
At training time, you break the input into (k+1)-grams using a sliding window. So if you encounter
The wrath sing, goddess, of Peleus' son, Achilles
you generate, for k=2,
START START the
START the wrath
the wrath sing
wrath sing goddess
sing goddess of
goddess of peleus
of peleus son
peleus son achilles
This can be done in linear time. For each 3-gram, tally (in a hash table) how often the third word follows the first two.
Finally, loop through the hash table and for each key (2-gram) keep only the most commonly occurring third word. Linear time.
At prediction time, look only at the k (2) last words and predict the next word. This takes only constant time since it's just a hash table lookup.
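Here is a minimal sketch of that baseline in Python (assuming whitespace tokenisation and lowercasing; for brevity it keeps full counts and picks the most frequent continuation at prediction time rather than pre-pruning the table to a single word):

from collections import defaultdict, Counter

k = 2
counts = defaultdict(Counter)

def train(sentences):
    for s in sentences:
        words = ["START"] * k + s.lower().split()
        for i in range(len(words) - k):
            context = tuple(words[i:i + k])
            counts[context][words[i + k]] += 1   # tally the following word

def predict(context_words):
    context = tuple(w.lower() for w in context_words[-k:])
    following = counts.get(context)
    return following.most_common(1)[0][0] if following else None

train(["Hello my name is Tom", "His name is jerry"])
print(predict(["her", "name", "is"]))  # "tom" or "jerry" (a tie, broken by insertion order)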
If you're wondering why you should keep only short subchains instead of full chains, then look into the theory of Markov windows. If your model were to remember all the chains of words that it has seen in its input, then it would badly overfit its training data and only reproduce its input at prediction time. How badly depends on the training set (more data is better), but for k>4 you'd really need smoothing in your model.
(*) Or to a probability distribution, but this is not needed for your simple example use case.
Yee Whye Teh also has some interesting recent work that addresses this problem. The "Sequence Memoizer" extends the traditional prediction-by-partial-matching scheme to take into account arbitrarily long histories.
Here is a link to the original paper: http://www.stats.ox.ac.uk/~teh/research/compling/WooGasArc2011a.pdf
It is also worth reading some of the background work, which can be found in the paper "A Bayesian Interpretation of Interpolated Kneser-Ney"
Related
I am trying to find a dynamic approach to multiply each element in a linear sequence by the following element, do the same with each subsequent pair of elements, and find the sum of all of the products. Note that not just any two elements can be multiplied: it must be the first with the second, the third with the fourth, and so on. All I know about the linear sequence is that it has an even number of elements.
I assume I have to store the numbers being multiplied, and their product each time, then check some other "multipliable" pair of elements to see if the product has already been calculated (perhaps they possess opposite signs compared to the current pair).
However, by my understanding of a linear sequence, the values must be increasing or decreasing by the same amount each time. But since there is an even number of elements, I don't believe it is possible for two "multipliable" pairs to produce the same product (e.g. by having opposite signs), due to the issue shown in the following example:
Sequence: { -2, -1, 0, 1, 2, 3 }
Pairs: -2*-1, 0*1, 2*3
Clearly, since there is an even number of pairs, the only case in which the same multiplication may occur more than once is if the elements are increasing/decreasing by 0 each time.
I fail to see how this is a dynamic programming question, and if anyone could clarify, it would be greatly appreciated!
A quick google for define linear sequence gave
A number pattern which increases (or decreases) by the same amount each time is called a linear sequence. The amount it increases or decreases by is known as the common difference.
In your case the common difference is 1. And you are not considering any other case.
The same multiplication may occur in the following sequence
Sequence = {-3, -1, 1, 3}
Pairs = -3 * -1 , 1 * 3
with a common difference of 2.
However, this does not necessarily have to be solved by dynamic programming. You can just iterate over the numbers, store the product of each pair in a set (as a set contains unique values), and then find the sum.
Probably not what you are looking for, but I've found a closed solution for the problem.
Suppose we observe the first two numbers. Denote the first number by a and the difference between the numbers by d, and suppose there are 2n numbers in the whole sequence. Then the sum you defined is:
sum = na^2 + n(2n-1)ad + (4n^2 - 3n - 1)nd^2/3
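A quick brute-force sanity check of the formula (a Python sketch; a and d are integers here, and the d^2 term is always divisible by 3, so integer division is safe):

def brute(a, d, n):
    seq = [a + i * d for i in range(2 * n)]          # the 2n terms
    return sum(seq[2 * k] * seq[2 * k + 1] for k in range(n))

def closed(a, d, n):
    return n * a * a + n * (2 * n - 1) * a * d + (4 * n * n - 3 * n - 1) * n * d * d // 3

for a, d, n in [(-2, 1, 3), (-3, 2, 2), (5, -4, 6)]:
    assert brute(a, d, n) == closed(a, d, n)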
That aside, I also failed to see how this is a dynamic programming problem, or at least this seems to be a problem where a dynamic programming approach really doesn't do much. It is not likely that the sequence will go from negative to positive at all, and even then the chance of seeing repeated products decreases as the difference between two numbers grows. Furthermore, multiplication is so fast that fetching cached products from a data structure might cost more than recomputing them (a mul instruction is probably faster than a lw).
An autogram is a sentence which describes the characters it contains, usually enumerating each letter of the alphabet, but possibly also the punctuation it contains. Here is the example given in the wiki page.
This sentence employs two a’s, two c’s, two d’s, twenty-eight e’s, five f’s, three g’s, eight h’s, eleven i’s, three l’s, two m’s, thirteen n’s, nine o’s, two p’s, five r’s, twenty-five s’s, twenty-three t’s, six v’s, ten w’s, two x’s, five y’s, and one z.
Coming up with one is hard, because you don't know how many letters it contains until you finish the sentence. Which is what prompts me to ask: is it possible to write an algorithm which could create an autogram? For example, a given parameter would be the start of the sentence, e.g. "This sentence employs", assuming that it uses the same format as the above: "x a's, ... y z's".
I'm not asking for you to actually write an algorithm, although by all means I'd love to see if you know one to exist or want to try and write one; rather I'm curious as to whether the problem is computable in the first place.
You are asking two different questions.
"is it possible to write an algorithm which could create an autogram?"
There are algorithms to find autograms. As far as I know, they use randomization, which means that such an algorithm might find a solution for a given start text, but if it doesn't find one, then this doesn't mean that there isn't one. This takes us to the second question.
"I'm curious as to whether the problem is computable in the first place."
Computable would mean that there is an algorithm which for a given start text either outputs a solution, or states that there isn't one. The above-mentioned algorithms can't do that, and an exhaustive search is not workable. Therefore I'd say that this problem is not computable. However, this is rather of academic interest. In practice, the randomized algorithms work well enough.
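For reference, the randomized search mentioned above might look something like the following sketch (Python; number_word is an assumed helper that spells out an integer, e.g. 28 -> "twenty-eight", and the update rule and constants are arbitrary choices, not a canonical algorithm):

import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def render(prefix, counts, number_word):
    # ignores the final "and" and the singular "one z" form, for brevity
    parts = [number_word(counts[c]) + " " + c + "'s" for c in LETTERS if counts[c]]
    return prefix + " " + ", ".join(parts) + "."

def search(prefix, number_word, max_iters=100000):
    counts = {c: random.randint(1, 30) for c in LETTERS}
    for _ in range(max_iters):
        sentence = render(prefix, counts, number_word).lower()
        actual = {c: sentence.count(c) for c in LETTERS}
        if actual == counts:
            return render(prefix, counts, number_word)   # fixed point: a true autogram
        # move toward the recount, randomly keeping some old guesses to escape cycles
        counts = {c: actual[c] if random.random() < 0.7 else counts[c] for c in LETTERS}
    return None   # no luck; note this does not prove that no autogram exists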
Let's assume for the moment that all counts are less than or equal to some maximum M, with M < 100. As mentioned in the OP's link, this means that we only need to decide counts for the 16 letters that appear in these number words, as counts for the other 10 letters are already determined by the specified prefix text and can't change.
One property that I think is worth exploiting is the fact that, if we take some (possibly incorrect) solution and rearrange the number-words in it, then the total letter counts don't change. In other words, if we ignore the letters spent "naming themselves" (e.g. the c in two c's) then the total letter counts only depend on the multiset of number-words that are actually present in the sentence. What that means is that instead of having to consider all possible ways of assigning one of M number-words to each of the 16 letters, we can enumerate just the (much smaller) set of all multisets of number-words of size 16 or less, having elements taken from the ground set of number-words of size M, and for each multiset, look to see whether we can fit the 16 letters to its elements in a way that uses each multiset element exactly once.
Note that a multiset of numbers can be uniquely represented as a nondecreasing list of numbers, and this makes them easy to enumerate.
What does it mean for a letter to "fit" a multiset? Suppose we have a multiset W of number-words; this determines total letter counts for each of the 16 letters (for each letter, just sum the counts of that letter across all the number-words in W; also add a count of 1 for the letter "S" for each number-word besides "one", to account for the pluralisation). Call these letter counts f["A"] for the frequency of "A", etc. Pretend we have a function etoi() that operates like C's atoi(), but returns the numeric value of a number-word. (This is just conceptual; of course in practice we would always generate the number-word from the integer value (which we would keep around), and never the other way around.) Then a letter x fits a particular number-word w in W if and only if f[x] + 1 = etoi(w), since writing the letter x itself into the sentence will increase its frequency by 1, thereby making the two sides of the equation equal.
This does not yet address the fact that if more than one letter fits a number-word, only one of them can be assigned it. But it turns out that it is easy to determine whether a given multiset W of number-words, represented as a nondecreasing list of integers, simultaneously fits any set of letters:
Calculate the total letter frequencies f[] that W implies.
Sort these frequencies.
Skip past any zero-frequency letters. Suppose there were k of these.
For each remaining letter, check whether its frequency is equal to one less than the numeric value of the number-word in the corresponding position. I.e. check that f[k] + 1 == etoi(W[0]), f[k+1] + 1 == etoi(W[1]), etc.
If and only if all these frequencies agree, we have a winner!
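A rough Python sketch of that fit test (NUMBER_WORDS is truncated here and the prefix letter counts are passed in, restricted to the 16 relevant letters; this only illustrates the sorted-frequency comparison, not a full solver):

from collections import Counter

NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
                6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten"}  # ...extend up to M

def fits(multiset, prefix_counts):
    # multiset: nondecreasing list of number values; prefix_counts: Counter of
    # letters contributed by the fixed prefix text
    f = Counter(prefix_counts)
    for v in multiset:
        f.update(NUMBER_WORDS[v])       # letters spelled by the number-word itself
        if v != 1:
            f.update("s")               # pluralisation, e.g. "two c's"
    freqs = sorted(c for c in f.values() if c > 0)   # skip zero-frequency letters
    if len(freqs) != len(multiset):
        return False                    # can't use every multiset element exactly once
    # writing the letter itself into the sentence adds 1 to its own count
    return all(c + 1 == v for c, v in zip(freqs, multiset))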
The above approach is naive in that it assumes that we choose words to put in the multiset from a size M ground set. For M > 20 there is a lot of structure in this set that can be exploited, at the cost of slightly complicating the algorithm. In particular, instead of enumerating straight multisets of this ground set of all allowed numbers, it would be much better to enumerate multisets of {"one", "two", ..., "nineteen", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"}, and then allow the "fit detection" step to combine the number-words for multiples of 10 with the single-digit number-words.
I'm looking for an algorithm with which it is possible to derive a key from a shuffling process that has already happened.
Assume we've got the string "Hello" which was shuffled:
"hello" -> "loelh"
Now I would like to derive a key k from it which I could use to undo the shuffling. So if we use k as an input parameter for a deterministic shuffling algorithm, for example Fisher-Yates, and shuffle "loelh" again, we would restore the initial string "hello".
What I do not mean is to simply use one and the same deterministic shuffling algorithm to shuffle and de-shuffle. That's because in my case the first string would not really have been shuffled in the classical sense. Actually there would be two sets of data (byte or bit arrays) which are just given, and we want to get from the first to the second one with just a key which has been derived beforehand.
I hope it's clear what I want to achieve and I would appreciate all hints or proposed solutions.
Regards,
Merrit
UPDATE:
Another attempt:
basically, one could also call it deterministic transformation of a bunch of data e.g. a byte-array, but I will stick with the "hello"-string example.
Assume we've got a transformation algorithm transform(data, "unknown seed") where data is "hello" and the unknown seed is what we are looking for. The result of transform is "loelh". We are looking for this "unknown seed" which we could use to reverse the process. At the time the "unknown seed" is generated, both the input data AND the result are known, of course.
Later on I want to use the "unknown seed" (which should be known already ;-) to get the original string again: so this transform("loelh", seed) should lead to "hello" again.
So you could also see it as a form of equation like data*["unknown value"]=resultdata and we are trying to find the unknown value (the operator * could be any kind of operation).
First of all, let's simplify the problem greatly. Instead of permuting "hello", let's assume that you are always permuting "abcde", as that will make it easier to understand.
A shuffle is the random generation of a permutation. How the shuffle generates the permutation is irrelevant; shuffles generate permutations, that's all we need to know.
Let's state a permutation as a string containing the numbers 1 through 5. Suppose the shuffle produces permutation "21453". That is, we take the first letter and put it in position 2: _a___. We take the second letter and put it in position 1, ba___. We take the 3rd letter and put it in position 5: ba__c. We take the fourth letter and put it in position 3, bad_c, and we take the fifth letter and put it in position 4, badec.
Now you wish to deduce a "key" which allows you to "unpermute" the permutation. Well, that's just another permutation, called the inverse permutation. To compute the inverse permutation of "21453" you do the following:
find "1". It's in the 2nd spot.
find "2". It's in the 1st spot.
find "3". It's in the 5th spot.
find "4". It's in the 3rd spot.
Find "5". It's in the 4th spot.
And now read down the second column; the inverse permutation of "21453" is "21534". We are unpermuting "badec". We put the first letter in position 2: _b___. We put the second letter in position 1: ab___. We put the third letter in position 4: ab_d_. We put the fourth letter in position 5: ab_de. And we put the fifth letter in position 3: abcde.
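In code (a Python sketch using 1-based positions to match the description above, not production code):

def inverse(perm):
    # "find each value": record, for every value 1..n, the spot where it occurs
    inv = [0] * len(perm)
    for spot, value in enumerate(perm, start=1):
        inv[value - 1] = spot
    return inv

def apply_perm(perm, s):
    # the i-th character of s goes to the spot where the value i appears in perm
    out = [""] * len(s)
    for i, ch in enumerate(s, start=1):
        out[perm.index(i)] = ch
    return "".join(out)

print(inverse([2, 1, 4, 5, 3]))              # [2, 1, 5, 3, 4]
print(apply_perm([2, 1, 4, 5, 3], "abcde"))  # badec
print(apply_perm([2, 1, 5, 3, 4], "badec"))  # abcde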
Shuffling is just creating a random permutation of a given sequence. The typical way to do that is something like the Fisher-Yates Shuffle that you pointed out. The problem is that the shuffle program generates multiple random numbers based on a seed, and unless you implement the random number generator there's no easy way to reverse the sequence of random numbers.
There is another way to do it. What if you could generate the nth permutation of a sequence directly? That is, given the string "Fast", you define the first few permutations as:
0 Fast
1 Fats
2 Fsat
3 Fsta
... etc. for all 24 permutations
You want a random permutation of those four characters. Select a random number from 0 to 23 and then call a function to generate that permutation.
If you know the key, you can call a different function, again passing that key, to have it reverse the permutation back to the original.
In the fourth article in his series on permutations, Eric Lippert showed how to generate the nth permutation without having to generate all of the permutations that come before it. He doesn't show how to reverse the process, but doing so shouldn't be difficult if you understand how the generator works. It's well worth the time to study the entire series of articles.
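For illustration, here is one way to do it in Python using the factorial number system; this particular ordering happens to agree with the first few "Fast" permutations listed above, though the article's numbering may differ, so treat it as a sketch:

from math import factorial

def nth_permutation(s, n):
    # build the n-th (0-based) permutation digit by digit; n is assumed to be
    # in the range [0, len(s)! - 1]
    items, out = list(s), []
    for i in range(len(items) - 1, -1, -1):
        idx, n = divmod(n, factorial(i))
        out.append(items.pop(idx))
    return "".join(out)

def permutation_index(s, perm):
    # the reverse direction: recover n (the "key") from a permutation;
    # assumes the symbols in s are distinct
    items, n = list(s), 0
    for ch in perm:
        idx = items.index(ch)
        n += idx * factorial(len(items) - 1)
        items.pop(idx)
    return n

print(nth_permutation("Fast", 3))         # Fsta
print(permutation_index("Fast", "Fsta"))  # 3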
If you don't know what the key (i.e. the random number used) is, then deriving the sequence of swaps required to get to the original order is expensive.
Edit
Upon reflection, it just might be possible to derive the key if you're given the original sequence and the transformed sequence. Since you know how far each symbol has moved, you should be able to derive the key. Consider the possible permutations of two letters:
0. ab 1. ba
Now, assign the letter b the value of 0, and the letter a the value of 1. What permutation number is ba? Find a in the string, swap to the left until it gets to the proper position, and multiply the number of swaps by one.
That's too easy. Consider the next one:
0. abc 1. acb 2. bac
3. cab 4. bca 5. cba
a is now 2, b is 1, and c is 0. Given cab:
swap a left one space. 1x2 = 2. Result is `acb`
swap b left one space. 1x1 = 1. Result is `abc`
So cab is permutation #3.
This does assume that your permutation generator numbers the permutations in the same way. It's also not a terribly efficient way of doing things. The worst case will require n(n-1)/2 swaps. You can optimize the swaps by moving things in an array, but it's still an O(n^2) algorithm, where n is the length of the sequence. Not terrible for 100 or maybe even 1,000 items. Pretty bad after that, though.
Suppose you are given a range and a few numbers in the range (exceptions). Now you need to generate a random number in the range except the given exceptions.
For example, if range = [1..5] and exceptions = {1, 3, 5} you should generate either 2 or 4 with equal probability.
What logic should I use to solve this problem?
If you have no constraints at all, I guess this is the easiest way: create an array containing the valid values, a[0]...a[m]. Return a[rand(0,...,m)].
If you don't want to create an auxiliary array, but you can count the number of exceptions e and of elements n in the original range, you can simply generate a random number r=rand(0 ... n-e), and then find the valid element with a counter that doesn't tick on exceptions, and stops when it's equal to r.
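A small sketch of that second idea (Python; it assumes the exceptions are distinct and all lie within the range):

import random

def pick_skipping(values, exceptions):
    # draw a rank among the valid values, then walk the range skipping exceptions
    r = random.randrange(len(values) - len(exceptions))
    count = 0
    for x in values:
        if x in exceptions:
            continue
        if count == r:
            return x
        count += 1

print(pick_skipping(range(1, 6), {1, 3, 5}))   # 2 or 4, with equal probability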
Depends on the specifics of the case. For your specific example, I'd return a 2 if a Uniform(0,1) was below 1/2, 4 otherwise. Similarly, if I saw a pattern such as "the exceptions are odd numbers", I'd generate values for half the range and double. In general, though, I'd generate numbers in the range, check if they're in the exception set, and reject and re-try if they were - a technique known as acceptance/rejection for obvious reasons. There are a variety of techniques to make the exception-list check efficient, depending on how big it is and what patterns it may have.
Let's assume, to keep things simple, that arrays are indexed starting at 1, and your range runs from 1 to k. Of course, you can always shift the result by a constant if this is not the case. We'll call the array of exceptions ex_array, and let's say we have c exceptions. These need to be sorted, which shall turn out to be pretty important in a while.
Now, you only have k-c useful numbers to work with, so it'll be meaningful to find a random number in the range 1 to k-c. Say we end up with the number r. Now, we just need to find the r-th valid number in your array. Simple? Not so much. Remember, you can never simply walk over any of your arrays in a linear fashion, because that can really slow down your implementation when you have a lot of numbers. You have to do some sort of binary search, say, to come up with a fast enough algorithm.
So let's try something better. The r-th number would nominally have lied at index r in your original array had you had no exceptions. The number at index r is r, of course, since your range and your array indices start from 1. But, you have a bunch of invalid numbers between 1 and r, and you want to somehow get to the r-th valid number. So, lets do a binary search on the array of exceptions, ex_array, to find how many invalid numbers are equal to or less than r, because we have these many invalid numbers lying between 1 and r. If this number is 0, we're all done, but if it isn't, we have a bit more work to do.
Assume you found there were n invalid numbers between 1 and r after the binary search. Let's advance n indices in your array to the index r+n, and find the number of invalid numbers lying between 1 and r+n, using a binary search to find how many elements in ex_array are less than or equal to r+n. If this number is exactly n, no more invalid numbers were encountered, and you've hit upon your r-th valid number. Otherwise, repeat again, this time for the index r+n', where n' is the number of invalid numbers that lie between 1 and r+n.
Repeat till you get to a stage where no excess exceptions are found. The important thing here is that you never once have to walk over any of the arrays in a linear fashion. You should also optimize the binary searches so they don't always start at index 0. Say you know there are n invalid numbers between 1 and r. Instead of starting your next binary search from 1, you could start it from one index after the index corresponding to n in ex_array.
In the worst case, you'll be doing binary searches for each element in ex_array, which means you'll do c binary searches, the first starting from index 1, the next from index 2, and so on, which gives you a time complexity of O(log(c!)). Now, Stirling's approximation tells us that O(ln(x!)) = O(x ln x), so using the algorithm above only makes sense if c is small enough that O(c ln c) < O(k), since you can achieve O(k) complexity using the trivial method of extracting valid elements from your array first.
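A compact sketch of this search in Python using bisect (ex_sorted is the sorted array of exceptions, assumed distinct and within [1, k]; the variable names are mine):

import bisect
import random

def random_excluding(k, ex_sorted):
    # pick the r-th valid number in [1, k], jumping over exceptions in batches
    r = random.randint(1, k - len(ex_sorted))
    candidate, skipped = r, 0
    while True:
        n = bisect.bisect_right(ex_sorted, candidate)   # exceptions <= candidate
        if n == skipped:
            return candidate
        candidate, skipped = r + n, n

print(random_excluding(5, [1, 3, 5]))   # 2 or 4, with equal probability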
In Python the solution is very simple (given your example):
import random
rng = set(range(1, 6))
ex = {1, 3, 5}
random.choice(list(rng-ex))
To optimize the solution, one needs to know how long the range is and how many exceptions there are. If the number of exceptions is very low, it's possible to generate a number from the range and just check that it's not an exception. If the exceptions dominate, it probably makes sense to gather the remaining numbers into an array and generate a random index to fetch a non-exception.
In this answer I assume that it is known how to get an integer random number from a range.
Here's another approach...just keep on generating random numbers until you get one that isn't excluded.
Suppose your desired range was [0,100) excluding 25, 50, and 75.
Put the excluded values in a hashtable or bitarray for fast lookup.
int randNum = rand(0,100);
while( excludedValues.contains(randNum) )
{
randNum = rand(0,100);
}
The complexity analysis is more difficult, since potentially rand(0,100) could return 25, 50, or 75 every time. However that is quite unlikely (assuming a reasonable random number generator), even if half of the range is excluded.
In the above case, we re-generate a random value for only 3/100 of the original values.
So 3% of the time you regenerate once. Of those 3%, only 3% will need to be regenerated, etc.
Suppose the initial range is [1,n] and the exclusion set's size is x. First generate a map from [1, n-x] to the numbers [1,n] excluding the numbers in the exclusion set. This mapping is 1-1 since there are equal numbers on both sides. In the example given in the question the mapping will be as follows: {1->2, 2->4}.
As another example, suppose the range is [1,10] and the exclusion list is [2,5,8,9]; then the mapping is {1->1, 2->3, 3->4, 4->6, 5->7, 6->10}. This map can be created in a worst-case time complexity of O(n log n).
Now generate a random number between [1, n-x] and map it to the corresponding number using the mapping. Map lookups can be done in O(log n).
You can do it in a versatile way if you have enumerators or set operations. For example using Linq:
void Main()
{
    var exceptions = new[] { 1, 3, 5 };
    RandomSequence(1, 5)
        .Where(n => !exceptions.Contains(n))
        .Take(10)
        .ToList()
        .ForEach(Console.WriteLine);
}

static Random r = new Random();

// Produces an endless stream of random numbers in [min, max].
IEnumerable<int> RandomSequence(int min, int max)
{
    while (true)
        yield return r.Next(min, max + 1);
}
I would like to acknowledge some comments that are now deleted:
It's possible that this program never ends (only theoretically) because there could be a sequence that never contains valid values. Fair point. I think this is something that could be explained to the interviewer, however I believe my example is good enough for the context.
The distribution is fair because each of the elements has the same chance of coming up.
The advantage of answering this way is that you show understanding of modern "functional-style" programming, which may be interesting to the interviewer.
The other answers are also correct. This is a different take on the problem.
For fun I wrote an anagram generator. It takes some input word or phrase and rearranges the letters in different combinations to generate a new word or phrase. For example, if you enter "cat and dog", it will return things like "can dad got" or "ant cog dad".
A friend asked what the running time was, and I realized that I wasn't sure how to calculate it in this case. At startup, I read in a list of words (a dictionary). In my case it's about 200,000 words (it's the standard unix /usr/share/dict/web2 dictionary). That doesn't really factor into the running time as that's a one-time thing at app startup and it takes well under a second to read in and index the dictionary.
When the user enters a word, the application searches the dictionary for a list of candidate words. A word is a candidate if it contains only a subset of letters from the input word or phrase. Generating the candidates is an insignificant part of the process and can be ignored for now.
Then it starts searching. It chooses the first word in the list of candidates. Next it removes the letters of that word from the remaining letters in the input string. It then searches the candidates for any words remaining that contain only a subset of the newly reduced input string. It then recurses with the new reduced input word and reduced candidate list. It repeats this until there are either no candidates left, or the input string is all used up.
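Roughly, the search looks like this (a simplified Python sketch of what I just described, not the actual code):

from collections import Counter

def fits(word_counts, remaining):
    return all(remaining[ch] >= n for ch, n in word_counts.items())

def search(candidates, remaining, phrase, results):
    if not remaining:                          # every input letter used: a full anagram
        results.append(" ".join(phrase))
        return
    for word, wc in candidates:
        if fits(wc, remaining):
            reduced = remaining - wc           # remove this word's letters
            trimmed = [(w, c) for w, c in candidates if fits(c, reduced)]
            search(trimmed, reduced, phrase + [word], results)

results = []
words = ["can", "dad", "got", "ant", "cog"]
search([(w, Counter(w)) for w in words], Counter("catanddog"), [], results)
print(results)   # e.g. ['can dad got', 'can got dad', ...]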
So it may start with 100 candidates it has to search. It chooses one and after removing any others with the same letters, there could be 90 left, or there could be 50 left, or there could be 10 left, so when we recurse, there's a different number left to search each time. This is why I'm having trouble understanding the running time.
If we never removed any words from the list, it would be O(n!) where n is the number of candidates. But since we aggressively trim the list on each iteration, it works out to far less than n!. For example, one phrase I tried generates over 4,000 candidates, and ends up finding over 600,000 combinations. It only takes about 30 seconds to do so on a recent notebook computer (utilizing only a single core), so clearly it's not O(n!).
In order to understand the running time, would I need to have some statistics about how much the list of candidates gets trimmed on average with each iteration or something like that?
I was thinking that if each iteration removed 10 candidates from the list, then we'd have something like this for a 100 candidate list: 100 * 90 * 80 * 70... Or more generally, n * (n - 10) * (n - 20) * (n - 30)... In the case of a 100 candidate list that would work out to O(n^10 - a*n^9 - b*n^8 ...).
Have I calculated that correctly, or is there more to it than that?
First of all, note that the running time depends on the length of the input; call it m. If a user enters a very long phrase that contains all letters of the alphabet many times:
A quick brown fox jumps over the lazy dog; a quick brown fox jumps over the lazy dog; a quick brown fox jumps over the lazy dog, ...
your algorithm will consider the full dictionary (of size n) in the first O(m) iterations, so the running time is n^O(m).
Here, the statement n^O(m) is rather weak even though it's correct: the exact running time might look like n^(0.01m) or n^(0.1m); both fall under n^O(m), but you cannot find exactly which factor sits in the exponent (it depends on the structure of the English language), so n^O(m) here means "exponential running time in the worst case; the algorithm won't finish for large values of m".
Of course, you are probably interested in the running time for small values of m. If you assume m<20, it's pretty clear that the running time is O(n^20); you may consider this a better estimation than O(n!) or O(n^(n/10)).
To get better estimates, one would have to consider the structure of the dictionary; the running time depends really strongly on the dictionary. For example, if all words in the dictionary contain at least 2 letters (not sure about that), the running time can be estimated as O(n^(m/2)).
Anyway, the big-O notation doesn't seem to fit this problem in any useful manner.
You are headed in the right direction. Consider only the highest-degree term of the polynomial you get upon expansion. So in your case:
n*(n-10)*(n-20)*...10
will give roughly n^(n/10).
So the running time of your algorithm is O(n^(n/10)).
Also see this for better understanding of running time.
If the average length of a candidate is k, and the source phrase is such that candidates are removed only one at a time, then the complexity will be O((n/k)!).
If the initial number of candidates is M, and each step removes s words from the list of candidates, then the complexity is O(M * (M-s) * (M-2s) * ...) = O((M/s)! * s^(M/s)).
In the worst case, you still have O(n!).
But, well, n! is what one could expect for such a task. I suppose most optimizations should be performed in the code that searches and removes candidates.