Given a list of words and an alphabet that has at most P letters, how can we choose the optimal alphabet that covers the most words?
For example: Given words "aaaaaa" "bb" "bb" with P=1, the optimal alphabet is "b" since "b" covers two words.
Another example: Given words "abmm" "abaa" "mnab" "bbcc" "mnnn" with P=4, the optimal alphabet is "abmn", since that covers 4 of the 5 words.
Are there any known algorithms, or can someone suggest an algorithm that solves this problem?
This problem is NP-hard by reduction from CLIQUE (it's sort of a densest k-sub(hyper)graph problem). Given a graph, label its vertices with distinct letters and, for each edge, make a two-letter word. There exists a k-clique if and only if we can cover k choose 2 words with k letters.
The algorithm situation even for CLIQUE is grim (running times must be n^Theta(k) under a plausible hypothesis), so I'm not sure what to recommend other than brute force with primitive bit arrays.
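For small instances the brute force mentioned above is entirely workable. Here is a minimal sketch in Python (using frozensets rather than primitive bit arrays, for clarity; the function name and structure are my own, not from the question):

```python
from itertools import combinations

def best_alphabet(words, p):
    """Brute force: try every p-letter alphabet drawn from the letters
    that actually occur, and count how many words each one covers."""
    letter_sets = [frozenset(w) for w in words]
    letters = sorted(set().union(*letter_sets))
    best_cover, best_alpha = -1, None
    for combo in combinations(letters, min(p, len(letters))):
        alpha = set(combo)
        cover = sum(1 for s in letter_sets if s <= alpha)  # s covered iff s subset of alpha
        if cover > best_cover:
            best_cover, best_alpha = cover, alpha
    return best_cover, best_alpha
```

On the two examples from the question, this returns coverage 2 with alphabet {b}, and coverage 4 with alphabet {a, b, m, n}. Replacing the frozensets with 26-bit integers would give the bit-array variant.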
I'm not yet sure that this is correct, but hopefully it's at least close. We consider a dynamic programming solution. Enumerate the words 1 through N and the letters in our alphabet 1 through P. We want to be able to solve (n, p) in terms of all subsolutions. We consider several cases.
The simplest case is where the nth word is already in the dictionary given in the solution to (n-1, p). We then count ourselves lucky, up the words covered by one, and leave the dictionary unchanged (dictionary here refers to some subset of letters).
Suppose instead that the nth word is not in the dictionary given by (n-1, p). Then either the dictionary solving (n-1, p) is also the dictionary for (n, p), or the nth word is in the solution, so we look for solutions that explicitly involve the nth word. We add all the letters of the nth word to the dictionary under consideration, then search through all previous subsolutions of the form (n-1, i), where i is at most p-1, looking for the largest value of i such that |d(n-1, i) U d(n)| <= p. Here d(n-1, i) means the dictionary associated with that subsolution, and d(n) means the set of letters of the nth word. In plain English, we use our subsolutions to find the best solution with a smaller value of p that still allows us to fit the new word. Once we have found that value of i, we take the union of the two dictionaries. If the magnitude of this set is still less than p, we repeat the process. When we have built a dictionary of magnitude p that covers the nth word with this technique (or have iterated through all previous solutions), we compute its coverage, compare it to the coverage we would get by simply using the dictionary from (n-1, p), and pick the better. If there's a tie, we keep both.
I'm not completely convinced of the correctness of this solution, but I think it may be right. Thoughts?
I'd do this:
Use the input words to create a data structure that maps lists of letters (strings) to the number of words that they cover. You can do this by extracting unique letters that make up a word, sorting them and using the result as a hash map key.
Disregard any entries whose keys are longer than P (can't cover those words by our limited alphabet).
For all remaining entries, compute the list of entries they contain (the alphabet 'ab' contains the alphabets 'a' and 'b'), and sum the number of words covered by those entries.
Find the entry with the highest total count.
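The steps above can be sketched as follows (a sketch only; note this heuristic considers only alphabets that are themselves the letter set of some word, so it can miss an optimal alphabet that is not):

```python
from collections import Counter

def best_alphabet(words, p):
    # Step 1: map each word's sorted unique letters to a count of words.
    groups = Counter("".join(sorted(set(w))) for w in words)
    # Step 2: drop keys longer than P.
    groups = {k: v for k, v in groups.items() if len(k) <= p}
    # Steps 3-4: for each candidate key, sum the counts of keys it
    # contains, and keep the candidate with the highest total.
    best_key, best_total = None, -1
    for key in groups:
        total = sum(v for k, v in groups.items() if set(k) <= set(key))
        if total > best_total:
            best_key, best_total = key, total
    return best_key, best_total
```

On the P=4 example from the question, this returns ("abmn", 4).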
As David has shown above (with an excellent proof!), this is NP-hard so you won't get a perfect answer in every situation.
One approach to add to the other answers is to express this as a max-flow problem.
Define a source node S, a sink node D, a node for each word, and a node for each letter.
Add edges from S to each word of capacity 1.
Add edges from each word to the letters it contains of infinite capacity.
Add edges from each letter to D of capacity x (where we will define x in a moment).
Then solve for the min-cut of this graph (by using a maximum flow algorithm from S to D). A cut edge to a letter represents that that letter is not being included in the solution.
This can be thought of as solving the problem where we get a reward of 1 for each word, but it costs us x for each new letter we use.
Then the idea is to vary x (e.g. by bisection) to try and find a value for x where exactly k letter edges are cut. If you manage this then you will have identified the exact solution to your problem.
This approach is reasonably efficient, but it depends on your input data whether or not it will find the answer. For certain examples (e.g. David's construction to find cliques) you will find that as you vary x you will suddenly jump from including fewer than k letters to including more than k letters. However, even in this case you may find that it helps in that it will provide some lower and upper bounds for the maximum number of words in the exact solution.
Given a dictionary as a hashtable, find the minimum number of deletions needed for a given word in order to make it match any word in the dictionary.
Is there some clever trick to solve this problem in less than exponential complexity (trying all possible combinations)?
For starters, suppose that you have a single word w in the hash table and that your word is x. You can delete letters from x to form w if and only if w is a subsequence of x, and in that case the number of letters you need to delete from x to form w is |x| - |w|. So certainly one option would be to just iterate over the hash table and, for each word, check whether that word is a subsequence of x, taking the best match you find across the table.
To analyze the runtime of this operation, let's suppose that there are n total words in your hash table and that their total length is L. Then the runtime of this operation is O(L), since you'll process each character across all the words at most once. The complexity of your initial approach is O(|x| · 2^|x|), because there are 2^|x| possible words you can make by deleting letters from x and you'll spend O(|x|) time processing each one. Depending on the size of your dictionary and the size of your word, one algorithm might be better than the other, but we can say that the runtime is O(min{L, |x| · 2^|x|}) if you take the better of the two approaches.
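The scan-the-table approach can be sketched as follows (a sketch with names of my own choosing; `is_subsequence` uses the iterator-consuming idiom for subsequence tests):

```python
def is_subsequence(w, x):
    """True if w can be formed from x by deleting letters."""
    it = iter(x)
    return all(c in it for c in w)  # `in` advances the iterator, preserving order

def min_deletions(x, dictionary):
    """Fewest deletions turning x into some dictionary word, or None."""
    best = None
    for w in dictionary:
        if is_subsequence(w, x):
            d = len(x) - len(w)
            if best is None or d < best:
                best = d
    return best
```

For example, `min_deletions("abcde", {"ace", "abd", "xyz"})` returns 2, since both "ace" and "abd" are subsequences of "abcde" of length 3.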
You can build a trie and then see where your given word fits into it. The difference in the depth of your word and the closest existing parent is the number of deletions required.
I am trying to remember the right algorithm to find a subset within a set that matches an element of a list of possible subsets. For example, given the input:
aehfaqptpzzy
and the subset list:
{ happy, sad, indifferent }
we can see that the word "happy" is a match because it is inside the input:
a e h f a q p t p z z y
I am pretty sure there is a specific algorithm to find all such matches, but I cannot remember what it is called.
UPDATE
The above example is not very good because it has letter repetitions, in fact in my problem both the dictionary entries and the input string are sortable sets. For example,
input: acegimnrqvy
dictionary:
{ cgn,
dfr,
lmr,
mnqv,
eg }
So in this example the algorithm would return cgn, mnqv and eg as matches. Also, I would like to find the best set of complementary matches where "best" means longest. So, in the example above the "best" answer would be "cgn mnqv", eg would not be a match because it conflicts with cgn which is a longer match.
I realize that the problem can be done by brute force scan, but that is undesirable because there could be thousands of entries in the dictionary and thousands of values in the input string. If we are trying to find the best set of matches, computability will become an issue.
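As a baseline before anything clever, the matching step alone (which dictionary entries fit inside the input at all) can be sketched with set operations, per the update's sorted-set semantics. Selecting the best complementary combination of matches is a separate packing step on top of this; the sketch below does not attempt it:

```python
def find_matches(input_str, dictionary):
    """Return dictionary entries all of whose letters occur in the input.
    Both sides are treated as sets of letters."""
    available = set(input_str)
    return [w for w in dictionary if set(w) <= available]
```

On the example above, `find_matches("acegimnrqvy", ["cgn", "dfr", "lmr", "mnqv", "eg"])` returns the three matches cgn, mnqv and eg.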
You can use the Aho–Corasick algorithm with more than one current state. For each input letter, each "actor" either stays (skipping the letter) or moves along the appropriate edge. If two or more actors meet at the same place, just merge them into one (if you're interested only in presence and not counts).
About the complexity - this could be as slow as the naive O(MN) approach, because there can be up to dictionary-size many actors. In practice, though, we can make good use of the fact that many words are substrings of others: there will never be more than trie-size many actors, and the trie tends to be much smaller than the dictionary.
"Observe that when you cut a character out of a magazine, the character on the reverse side of the page is also removed. Give an algorithm to determine whether you can generate a given string by pasting cutouts from a given magazine. Assume that you are given a function that will identify the character and its position on the reverse side of the page for any given character position."
How can I do it?
I can do some initial pruning so that if a needed character has only one way of getting picked up, its taken initially before turning the sub-problem for dynamic technique, but what after this initial pruning?
What is the time and space complexity?
As @LiKao suggested, this can be solved using max flow. To construct the network we make two "layers" of vertices: one with all the distinct characters in the input string and one with each position on the page. Make an edge with capacity 1 from a character to a position if that position has that character on one side. Make edges of capacity 1 from each position to the sink, and make edges from the source to each character with capacity equal to the multiplicity of that character in the input string.
For example, let's say we're searching for the word "FOO" on a page with four positions:
pos 1 2 3 4
front F C O Z
back O O K Z
We then generate the following network, ignoring position 4 since it does not provide any of the required characters.
Now, we only need to determine whether there is a flow from the source to the sink of value length("FOO") = 3 or more.
You can use dynamic programming directly.
We are given string s with n letters. We are given a set of pieces P = {p_1, ..., p_k}. Each piece has one letter in the front p_i.f and one in the back p_i.b.
Denote with f(j, p) the function that returns true if it is feasible to create substring s_1...s_j using pieces in p \subseteq P, and false otherwise.
The following recurrence holds:
f(n, P) = f(n-1, P-p_1) | f(n-1, P-p_2) | ... | f(n-1, P-p_k)
In plain English the feasibility of s using all pieces in P, depends on the feasibility of the substring s_1...s_n-1 given one less piece, and we try removing all possible pieces (of course in practice we do not have to remove all pieces one by one; we only need to remove those pieces for which p_i.f == s_n || p_i.b == s_n).
The initial condition is that f(1, P-p_1) = f(1, P-p_2) = ... = true, assuming that we have already checked a priori (in linear time) that there are enough letters in P to cover all the letters in s.
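The recurrence above can be memoized over (prefix length, remaining piece set); this is still exponential in the number of pieces in the worst case, matching the state space of the recurrence. A sketch (function names are my own):

```python
from functools import lru_cache

def feasible(s, pieces):
    """Can s be pasted from cutouts, where using a piece's front also
    burns its back (and vice versa)? pieces: list of (front, back)."""
    @lru_cache(maxsize=None)
    def f(j, remaining):
        # f(j, remaining): can s[:j] be built from the remaining pieces?
        if j == 0:
            return True
        need = s[j - 1]
        return any(
            f(j - 1, remaining - {i})
            for i in remaining
            if need in pieces[i]  # only remove pieces whose front or back matches
        )
    return f(len(s), frozenset(range(len(pieces))))
```

For the "FOO" page above (positions (F,O), (C,O), (O,K)), `feasible` returns True; asking for "FOOO" from the same three positions returns False.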
While this problem can be formulated as a maxflow problem as shown in the accepted answer, it is simpler and more efficient to formulate it as a maximum cardinality matching problem in a bipartite graph. Maxflow algorithms like Dinic's are slower than special-case algorithms like the Hopcroft–Karp algorithm.
The bipartite graph is formed by adding two edges from every character of the given string to a cutout, one edge for each side. We then run Hopcroft–Karp. In the end, we simply check whether the cardinality of the matching is equal to the length of the string.
For a working implementation (in Scala) using JGraphT, see my GitHub.
I'd like to come up with a more efficient DP solution, since Skiena's book has this problem in the DP section, but so far haven't found any.
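For illustration only (this is not the JGraphT implementation mentioned above), the matching formulation can be sketched with plain augmenting paths (Kuhn's algorithm) instead of Hopcroft–Karp; the graph construction mirrors the "FOO" example, with position 4 already dropped:

```python
def max_matching(n_left, adj):
    """Maximum bipartite matching via simple augmenting paths (Kuhn's
    algorithm); adj[u] lists right-side vertices reachable from u."""
    match_right = {}  # right vertex -> matched left vertex

    def try_augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_right or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in range(n_left))

# "FOO" on the example page: string characters 0,1,2 need F,O,O;
# page positions offer (front, back) = (F,O), (C,O), (O,K).
page = [("F", "O"), ("C", "O"), ("O", "K")]
target = "FOO"
adj = [[p for p, sides in enumerate(page) if target[i] in sides]
       for i in range(len(target))]
print(max_matching(len(target), adj) == len(target))  # → True
```

The string is spellable exactly when the matching saturates every character of the string.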
First off, this is NOT a homework problem. I haven't had to do homework since 1988!
I have a list of words of length N
I have a max of 13 characters to choose from.
There can be multiples of the same letter
Given the list of words, which 13 characters would spell the most possible words. I can throw out words that make the problem harder to solve, for example:
speedometer has 4 e's in it, something MOST words don't have, so I could toss that word due to a poor-fit characteristic, or it might just go away based on the algorithm
I've looked at letter distributions, and I've built a graph of the words (letter by letter). There is something I'm missing, or this problem is a lot harder than I thought. I'd rather not totally brute force it if that is possible, but I'm down to about that point right now.
Genetic algorithms come to mind, but I've never tried them before....
Seems like I need a way to score each letter based upon its association with other letters in the words it is in....
It sounds like a hard combinatorial problem. You are given a dictionary D of words, and you can select N letters (possibly with repeats) to cover / generate as many of the words in D as possible. I'm 99.9% certain it can be shown to be an NP-complete optimization problem in general (assuming an arbitrary alphabet, i.e. a set of letters that may contain more than 26 items) by reduction of SETCOVER to it, but I'm leaving the actual reduction as an exercise to the reader :)
Assuming it's hard, you have the usual routes:
branch and bound
stochastic search
approximation algorithms
Best I can come up with is branch and bound. Make an "intermediate state" data structure that consists of
Letters you've already used (with multiplicity)
Number of characters you still get to use
Letters still available
Words still in your list
Number of words still in your list (count of the previous set)
Number of words that are not possible in this state
Number of words that are already covered by your choice of letters
You'd start with
Empty set
13
{A, B, ..., Z}
Your whole list
N
0
0
Put that data structure into a queue.
At each step
Pop an item from the queue
Split into possible next states (branch)
Bound & delete extraneous possibilities
From a state, I'd generate possible next states as follows:
For each letter L in the set of letters left
Generate a new state where:
you've added L to the list of chosen letters
the least letter is L
so you remove anything less than L from the allowed letters
So, for example, if your left-over set is {W, X, Y, Z}, I'd generate one state with W added to my choice, {W, X, Y, Z} still possible, one with X as my choice, {X, Y, Z} still possible (but not W), one with Y as my choice and {Y, Z} still possible, and one with Z as my choice and {Z} still possible.
Do all the various accounting to figure out the new states.
Each state has at minimum "Number of words that are already covered by your choice of letters" words, and at maximum that number plus "Number of words still in your list." Of all the states, find the highest minimum, and delete any states with maximum higher than that.
No special handling for speedometer required.
I can't imagine this would be fast, but it'd work.
There are probably some optimizations (e.g., store each word in your list as an array of A-Z occurrence counts, and combine words with the same counts: one A, one B, one T covers both BAT and TAB). How you sort and keep track of minimum and maximum can also probably help things somewhat. Probably not enough to make an asymptotic difference, but maybe enough for a problem this big to run in a reasonable time instead of an extreme time.
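The canonical-form optimization (collapsing words with the same letter counts) can be sketched as follows; the function names are my own:

```python
from collections import Counter

def signature(word):
    """Canonical form: tuple of 26 letter counts, so words with the
    same letter multiset (e.g. BAT and TAB) collapse to one key."""
    c = Counter(word.lower())
    return tuple(c.get(chr(ord("a") + i), 0) for i in range(26))

def group_words(words):
    """Group a word list by signature; each group is solved once."""
    groups = {}
    for w in words:
        groups.setdefault(signature(w), []).append(w)
    return groups
```

For example, `group_words(["BAT", "TAB", "BED"])` yields two groups, with BAT and TAB sharing one signature.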
Total brute forcing should work, although the implementation would become quite confusing.
Instead of throwing words like speedometer out, can't you generate the association graphs considering only whether a character appears in a word or not (irrespective of the number of times it appears, since that should have no bearing on the final best choice of 13 characters)? This would also make it fractionally simpler than total brute force.
Comments welcome. :)
Removing the bounds on each parameter including alphabet size, there's an easy objective-preserving reduction from the maximum coverage problem, which is NP-hard and hard to approximate with a ratio better than (e - 1) / e ≈ 0.632 . It's fixed-parameter tractable in the alphabet size by brute force.
I agree with Nick Johnson's suggestion of brute force; at worst, there are only (13 + 26 - 1) choose (26 - 1) multisets, which is only about 5 billion. If you limit the multiplicity of each letter to what could ever be useful, this number gets a lot smaller. Even if it's too slow, you should be able to recycle the data structures.
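The multiset count above checks out with `math.comb`, and the brute force itself is a one-liner over `itertools.combinations_with_replacement`; here is a toy-scale sketch (tiny alphabet, my own function name) of the idea:

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

# Sanity check: number of 13-letter multisets over a 26-letter alphabet.
assert 5_000_000_000 < math.comb(13 + 26 - 1, 26 - 1) < 6_000_000_000

def best_multiset(words, n, alphabet):
    """Brute force: try every n-letter multiset over the alphabet and
    count words spellable from it (respecting letter multiplicity)."""
    word_counts = [Counter(w) for w in words]
    best_cover, best_combo = -1, None
    for combo in combinations_with_replacement(alphabet, n):
        chosen = Counter(combo)
        cover = sum(1 for wc in word_counts
                    if all(chosen[c] >= wc[c] for c in wc))
        if cover > best_cover:
            best_cover, best_combo = cover, combo
    return best_cover, best_combo
```

For instance, `best_multiset(["ab", "ba", "aa"], 2, "ab")` picks the multiset {a, b}, covering two of the three words.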
I did not understand this completely "I have a max of 13 characters to choose from.". If you have a list of 1000 words, then did you mean you have to reduce that to just 13 chars?!
Some thoughts based on my (mis)understanding:
If you are only handling English lang words, then you can skip vowels because consonants are just as descriptive. Our brains can sort of fill in the vowels - a.k.a SMS/Twitter language :)
Perhaps for 1-3 letter words, stripping off vowels would lose too much info. But still:
spdmtr hs 4 's n t, smthng
MST wrds dn't hv, s cld
tss tht wrd d t pr ft
chrctrstc, r t mght jst g
wy bsd n th lgrthm
Stemming will cut words even shorter. Stemming first, then strip vowels. Then do a histogram....
This is intended to be a more concrete, easily expressable form of my earlier question.
Take a list of words from a dictionary with common letter length.
How to reorder this list to keep as many letters as possible common between adjacent words?
Example 1:
AGNI, CIVA, DEVA, DEWA, KAMA, RAMA, SIVA, VAYU
reorders to:
AGNI, CIVA, SIVA, DEVA, DEWA, KAMA, RAMA, VAYU
Example 2:
DEVI, KALI, SHRI, VACH
reorders to:
DEVI, SHRI, KALI, VACH
The simplest algorithm seems to be: Pick anything, then search for the shortest distance?
However, DEVI->KALI (1 common) is equivalent to DEVI->SHRI (1 common)
Choosing the first match would result in fewer common pairs in the entire list (4 versus 5).
This seems that it should be simpler than full TSP?
What you're trying to do is calculate the shortest Hamiltonian path in a complete weighted graph, where each word is a vertex and the weight of each edge is the number of letters that are different between those two words.
For your example, the graph would have edges weighted as so:
DEVI KALI SHRI VACH
DEVI X 3 3 4
KALI 3 X 3 3
SHRI 3 3 X 4
VACH 4 3 4 X
Then it's just a simple matter of picking your favorite TSP solving algorithm, and you're good to go.
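For lists as small as the examples, "your favorite TSP solving algorithm" can simply be exhaustive search over orderings; a sketch (my own function names, feasible only for small N):

```python
from itertools import permutations

def distance(a, b):
    """Number of positions where two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def best_order(words):
    """Brute-force shortest Hamiltonian path over all orderings;
    a real TSP solver is needed once the list grows."""
    best_cost, best_perm = None, None
    for perm in permutations(words):
        cost = sum(distance(a, b) for a, b in zip(perm, perm[1:]))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, list(perm)
    return best_cost, best_perm
```

On DEVI, KALI, SHRI, VACH (the weight table above), the optimal path costs 9, e.g. DEVI, SHRI, KALI, VACH.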
My pseudo code:
Create a graph of nodes where each node represents a word
Create connections between all the nodes (every node connects to every other node). Each connection has a "value" which is the number of common characters.
Drop connections where the "value" is 0.
Walk the graph by preferring connections with the highest values. If you have two connections with the same value, try both recursively.
Store the output of a walk in a list along with the sum of the distance between the words in this particular result. I'm not 100% sure ATM if you can simply sum the connections you used. See for yourself.
From all outputs, chose the one with the highest value.
This problem is probably NP-complete, which means that the runtime of the algorithm will become unbearable as the dictionaries grow. Right now, I see only one way to optimize it: Cut the graph into several smaller graphs, run the code on each and then join the lists. The result won't be as perfect as when you try every permutation but the runtime will be much better and the final result might be "good enough".
[EDIT] Since this algorithm doesn't try every possible combination, it's quite possible to miss the perfect result. It's even possible to get caught in a local maximum. Say, you have a pair with a value of 7 but if you chose this pair, all other values drop to 1; if you didn't take this pair, most other values would be 2, giving a much better overall final result.
This algorithm trades perfection for speed. When trying every possible combination would take years, even with the fastest computer in the world, you must find some way to bound the runtime.
If the dictionaries are small, you can simply create every permutation and then select the best result. If they grow beyond a certain bound, you're doomed.
Another solution is to mix the two. Use the greedy algorithm to find "islands" which are probably pretty good and then use the "complete search" to sort the small islands.
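The greedy walk described above can be sketched as follows (a single-branch version that breaks ties arbitrarily rather than recursing on both, so it can miss the optimum exactly as discussed):

```python
def greedy_order(words):
    """Greedy walk: start from the first word, repeatedly append the
    remaining word sharing the most letters with the current one."""
    remaining = list(words[1:])
    order = [words[0]]
    while remaining:
        nxt = max(remaining, key=lambda w: len(set(w) & set(order[-1])))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

The result is always a permutation of the input, but its quality depends on the tie-breaking and starting word, which is where the recursive version and the "islands" idea come in.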
This can be done with a recursive approach. Pseudo-code:
Start with one of the words, call it w
FindNext(w, l)   // l = list of words without w
  Get the list m of the words in l nearest to w
  If l contains only one word
    Return that word
  Else
    For every word w' in m do FindNext(w', l')   // l' = l without w'
You can add some score to count common pairs and to prefer "better" lists.
You may want to take a look at BK-Trees, which make finding words with a given distance to each other efficient. Not a total solution, but possibly a component of one.
This problem has a name: n-ary Gray code. Since you're using English letters, n = 26. The Wikipedia article on Gray code describes the problem and includes some sample code.