I ran into this exercise, thought about it for a few hours, and got nowhere.
Our alphabet is Σ = {1, ..., n}, and the language Ln contains every word in Σ* that is missing at least one letter of the alphabet.
For example, if n=5, the word w=111223432 is in the language because '5' is missing from the word. The word w=1352224 is not in the language because all the letters 1...n appear in it.
I need to design an NFA for this language that has n+1 states.
Again, I tried a few things and don't exactly have a good idea.
For simplicity, let's do this for the alphabet {a, b, c}. Imagine that you have a string in your language. That means that it's missing a, or it's missing b, or it's missing c (inclusive or). If we knew which character was missing, it would be really easy to check whether a string never had a copy of that character using a single-state NFA consisting of an accepting state that transitions back to itself on everything except that character.
Since there are only finitely many characters in the alphabet, we can build three one-state NFAs, each of which is designed to check whether a string is missing a particular character.
To build the overall machine, have the start state of the NFA nondeterministically guess which character is missing by adding ε-transitions from the start state to each of the individual one-state NFAs we built earlier. You now have a four-state NFA for this language. (You can see a picture of it here.) Hopefully it's not too hard to see how to generalize this up to larger alphabet sizes!
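To make the construction concrete for general n, here is a minimal Python sketch that simulates the (n+1)-state NFA by tracking the set of live states. The function name `accepts` and the encoding of words as lists of integers are my own assumptions, not part of the exercise.

```python
def accepts(word, n):
    """Simulate the (n+1)-state NFA for L_n on `word`, a sequence of ints in 1..n."""
    # State 0 is the start state; its epsilon-moves lead to states 1..n.
    # State k is accepting and loops on every letter except k.
    # After taking the epsilon-moves, the states that can read input are 1..n.
    live = set(range(1, n + 1))
    for letter in word:
        # State k survives reading `letter` only if letter != k.
        live = {k for k in live if k != letter}
    # Accept if some guess survived, i.e. some letter never appeared.
    return bool(live)
```

With n=5, `accepts([1, 1, 1, 2, 2, 3, 4, 3, 2], 5)` is true because state 5 is never killed, while `accepts([1, 3, 5, 2, 2, 2, 4], 5)` is false because every guess dies.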
I'm looking for an algorithm which produces the smallest DFA which matches any string from a given finite set of concrete strings, and nothing else. (Smallest as in fewest terminal symbols.)
Examples:
a, b -> a|b
a, ab -> a(|b)
ab, ac -> a(b|c)
aa, ab, ba, bb -> (a|b)(a|b)
x, xa, xb, xc, xac, xbc -> x(|ab)(|c)
I tried a naive algorithm which does repeated prefix/suffix extraction, but that cannot handle the last case, and does not produce the minimal result.
I'm sure this is a common problem but I haven't been able to find the proper terminology for it. Apologies for improper terminology and the ad-hoc notation.
Minimal DFA is pretty straightforward:
1. Create an NFA which has one branch for each string in the language
2. Determinize the NFA to get a DFA
3. Minimize the DFA
Each of these steps is easy to understand and automate; steps 2 and 3 have known algorithms, step 1 should be easy too.
This is not a particularly efficient algorithm, but it might serve as a useful starting point. To improve performance, you'd want to try to build some DFA directly and then minimize that; perhaps running the Myhill-Nerode theorem as a construction could work here. But this is performance, not correctness... for small DFAs there will be no issue just running as above.
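Here is a rough Python sketch of building a DFA directly and then minimizing it. It relies on two shortcuts that are my assumptions rather than the general algorithms: for a finite set, a trie already is the determinized one-branch-per-string NFA, and because the trie is acyclic, states can be merged bottom-up whenever they have the same "accepting?" flag and the same outgoing edges. The names `minimal_dfa` and `dfa_accepts` are made up for illustration, and the DFA is partial (a missing transition means reject).

```python
def minimal_dfa(words):
    """Return (start, transitions, accepting) for the minimal partial DFA of a finite set."""
    END = ""  # marker key; the empty string can never be an input symbol
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True

    registry = {}     # signature -> state id
    transitions = {}  # state id -> {symbol: state id}
    accepting = set()

    def canon(node):
        # Two trie nodes accept the same suffixes exactly when they agree on
        # finality and on the canonical ids of their children.
        is_final = END in node
        edges = tuple(sorted((ch, canon(child))
                             for ch, child in node.items() if ch != END))
        sig = (is_final, edges)
        if sig not in registry:
            sid = len(registry)
            registry[sig] = sid
            transitions[sid] = dict(edges)
            if is_final:
                accepting.add(sid)
        return registry[sig]

    return canon(trie), transitions, accepting


def dfa_accepts(dfa, word):
    """Run the partial DFA on `word`."""
    state, transitions, accepting = dfa
    for ch in word:
        state = transitions[state].get(ch)
        if state is None:
            return False
    return state in accepting
```

For the set x, xa, xb, xc, xac, xbc this yields four states, since the states reached after xa and xb are merged.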
Minimal Regular Expression is a harder problem, I think. You could use Arden's lemma as a starting point to get some regular expression for the language of the DFA generated using the technique described above. Then, in the absolute worst case, you could check whether any valid regular expression of shorter length gives your language exactly. Note that because your language is finite and you want to match it exactly, your regular expressions will not need Kleene star, so this may not be as horrible as it sounds; the only operations remaining are concatenation and union. This might be feasible, if not terribly efficient. A lot of these options would probably be pretty easy to rule out: for instance, an expression without star needs at least as many symbol occurrences as the length of the longest string in your collection, which gives an easy lower bound; there are probably tighter bounds you could find. The regular expression from Arden's lemma should give you a good upper bound.
How can I get started at finding a recursive/dynamic solution to the problem?
For example, what is the minimum number of adjacent swaps needed to convert the given string abaaccbabaabcab (representing all other characters as c) into one without any instance of "ab"? I find it hard to come up with a way of breaking the problem into (independent) sub-problems.
For strings consisting of only as and bs, the problem reduces to moving all bs before all as, but it becomes complicated when other characters are involved.
PS: Can we assert that in at least one minimum-swap solution, swaps never go beyond a block c...c, where ... consists only of as and bs? In that case, the as are placed either after the bs or before the first c, depending on which is closer, and vice versa for the bs.
Is it a good place to start?
Thanks for the help.
I would suggest tackling this not as a dynamic programming problem, but as a pathfinding algorithm.
Consider the graph whose nodes are all possible arrangements of the letters, with edges connecting arrangements that can be reached from each other with an adjacent swap. You want to find the shortest path from your starting arrangement to any arrangement with no instances of the word.
We obviously don't want to write down the whole graph, but A* search (https://en.wikipedia.org/wiki/A*_search_algorithm) doesn't require us to do so. It only requires us to write down the nodes that we visit, which we can do in a hash/dictionary/whatever your language calls it. The heuristic function that we can use is the number of instances of the forbidden word that are separated by at least one space. This will work very well in simple cases. If there is no solution, though, or in pathological cases like removing ab from aaaaabbbbb, we will visit the whole graph. I don't know whether there is a way to avoid it.
But for a random English word in random gibberish, you should very quickly find provably minimal solutions.
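As a starting point, here is a small Python sketch of the search over arrangements. To keep it obviously correct it uses plain breadth-first search, which behaves like A* with a heuristic of zero when every swap costs 1; plugging in the heuristic described above would mean swapping the queue for a priority queue. The function name `min_adjacent_swaps` is my own, and the `seen` set plays the role of the hash of visited nodes.

```python
from collections import deque


def min_adjacent_swaps(start, forbidden="ab"):
    """Fewest adjacent swaps to reach a string without `forbidden`, or None if impossible."""
    if forbidden not in start:
        return 0
    seen = {start}                 # arrangements already generated
    queue = deque([(start, 0)])    # (arrangement, swaps used so far)
    while queue:
        s, dist = queue.popleft()
        for i in range(len(s) - 1):
            if s[i] == s[i + 1]:
                continue  # swapping equal letters changes nothing
            t = s[:i] + s[i + 1] + s[i] + s[i + 2:]
            if t in seen:
                continue
            if forbidden not in t:
                return dist + 1    # BFS reaches goals in order of distance
            seen.add(t)
            queue.append((t, dist + 1))
    return None  # every reachable arrangement contains the forbidden word
```

On tiny inputs this gives 1 for "ab" (swap to "ba") and 4 for "aabb" (every b has to pass every a). It only explores the part of the graph within the answer's distance, but as noted that can still be the whole graph.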
Clearly, if we only have single instances of AB, separated by other characters, an optimal solution could be to just reverse each one:
ABcABdhlkAB -> BAcBAdhlkBA
And if we have large separated blocks of As followed by one or more Bs, separated by another character, placing another character in between the blocks would be optimal:
AAAABBBBBcAAAABBBBBBBd -> AAAAcBBBBBAAAAdBBBBBBB
The challenge then is only when multiple sequential blocks of As followed by Bs are not surrounded by different characters:
AAAABBBBBAAAABBBBBAAAABBBBBAAAABBBBBAAAABBBBBqwer
If there are no other characters, the only solution is to move all Bs left of the As. Otherwise, we have to find the optimal distribution of other characters to separate blocks of As from blocks of Bs that would result in the least cost coming from movement of B blocks left of A blocks.
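For the case with no other characters, the cost of moving all Bs left of the As can be computed directly: each B has to cross every A to its left, so the minimum number of adjacent swaps is the number of (A, B) pairs that are in the wrong order. A short Python sketch (the function name `swaps_for_ab_only` is just for illustration):

```python
def swaps_for_ab_only(s):
    """Minimum adjacent swaps to put every 'B' left of every 'A' (s contains only A/B)."""
    swaps = 0
    a_seen = 0
    for ch in s:
        if ch == "A":
            a_seen += 1
        elif ch == "B":
            swaps += a_seen  # this B must cross every A seen so far
        else:
            raise ValueError("only 'A' and 'B' are expected")
    return swaps
```

For example, AAAABBBBB costs 4 * 5 = 20 swaps. This number is also a useful baseline when deciding whether pulling another character in between two blocks is cheaper.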
I'm trying to write a program in Java that stores a dictionary in a HashMap (each word under a different key), compares a given word to the words in the dictionary, and comes up with a spelling suggestion if the word is not found -- basically a spell-check program.
I already came up with the comparison algorithm (i.e. Needleman-Wunsch, then Levenshtein distance), but got stuck figuring out which words in the dictionary HashMap to compare a given word, e.g. "hellooo", against.
I cannot compare "ohelloo" (which should be corrected to "hello") to each word in the dictionary because that would take too long, and I cannot compare it only to the words in the dictionary starting with 'o', because it's supposed to be "hello".
Any ideas?
The most common spelling mistakes are
Delete a letter (smaller word OR word split)
Swap adjacent letters
Alter letter (QWERTY adjacent letters)
Insert letter
Some reports say that 70-90% of mistakes fall into the above categories (edit distance 1).
Take a look at the article below, which provides a solution for single or double mistakes (edit distance 1 or 2). Almost everything you'll need is there!
How to write a spelling corrector
FYI: You can find implementations in various programming languages at the bottom of the aforementioned article. I've used it in some of my projects; practical accuracy is really good, sometimes above 95%, as claimed by the author.
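To give an idea of how the four kinds of edit-distance-1 mistakes turn into dictionary lookups, here is a short Python sketch in the spirit of that approach (not a copy of the article's code). It assumes lowercase a-z words and a plain set as the dictionary; `edits1` and `suggest` are names I picked for illustration. Instead of comparing the input to every dictionary word, it generates every string one or two edits away and keeps those that are in the dictionary.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one edit away from `word`: delete, transpose, replace, insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, dictionary):
    """Known words at edit distance 0, else 1, else 2 (empty set if none)."""
    if word in dictionary:
        return {word}
    one_away = edits1(word) & dictionary
    if one_away:
        return one_away
    # Apply a second round of edits to every one-edit candidate.
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)} & dictionary
```

With the toy dictionary `{"hello", "world"}`, both `suggest("hellooo", ...)` and `suggest("ohelloo", ...)` come back with `{"hello"}`, and the first letter being wrong does not matter because lookups are by whole word.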
--Based on OP's comment--
If you don't want to pre-compute every possible alteration and then look each one up in the map, I suggest that you use a Patricia trie (radix tree) instead of a HashMap. Unfortunately, you will still need to handle the "first-letter mistake" (e.g. remove the first letter, swap the first letter with the second, or replace it with a QWERTY-adjacent letter), but you can limit your search with high probability.
You can even combine it with an extra index map or trie of "reversed" words, or an extra index that omits the first N characters (e.g. the first 2), so you can catch errors that occur only in the prefix.
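One possible way to limit the search with a trie is sketched below in Python. Note the assumptions: it uses a plain (uncompressed) trie rather than a Patricia trie, it uses Levenshtein distance without transpositions, it assumes non-empty dictionary words, and all the names (`TrieNode`, `build_trie`, `fuzzy_search`) are hypothetical. The idea is to walk the trie while maintaining one row of the edit-distance table per trie node, and to abandon a branch as soon as every entry in its row exceeds the allowed distance.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # the full word, set only where a dictionary word ends


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def fuzzy_search(root, query, max_dist):
    """Return {word: distance} for dictionary words within max_dist edits of query."""
    found = {}

    def walk(node, ch, prev_row):
        # row[i] = edit distance between the trie prefix ending in `ch` and query[:i]
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            row.append(min(row[i - 1] + 1,                           # skip a query char
                           prev_row[i] + 1,                          # skip the trie char
                           prev_row[i - 1] + (query[i - 1] != ch)))  # match or replace
        if node.word is not None and row[-1] <= max_dist:
            found[node.word] = row[-1]
        if min(row) <= max_dist:  # otherwise nothing below this node can qualify
            for c, child in node.children.items():
                walk(child, c, row)

    first_row = list(range(len(query) + 1))  # distances from the empty prefix
    for c, child in root.children.items():
        walk(child, c, first_row)
    return found
```

With the words hello, help, hell and world, searching for "ohelloo" with `max_dist=2` returns only hello, at distance 2. In this variant a wrong first letter is just another edit, at the price of exploring more branches near the root.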
I understand Ukkonen's algorithm. I am only curious how to extend it to hold more than one string (each ending with a special character, say "$").
I read somewhere that, given strings s1 (say "abcddefx$") and s2 (say "abddefgh$"), I should insert s1 normally with Ukkonen's algorithm, then traverse down the tree with s2; that is, I should search for s2 in the tree.
Once I get to the node where the search ends ("ab", after 'b'), I should resume Ukkonen's algorithm from there.
I understand the basic logic behind this. What I am curious about is what happens to the old suffix links. Are they still valid?
I am also confused about my triple (active_node, active_length, remainder): should it be (node representing "ab", 0, 0) as I start the new pass?
For dealing with special characters you can use the Unicode Private Use Areas. These are a few special ranges of code points reserved for your own use; the one in the Basic Multilingual Plane (U+E000 to U+F8FF) holds 6,400 code points, and the two supplementary private use planes add 65,534 each. Depending on the Unicode support of the language you are using, this can be really easy or difficult.
If that does not work, instead of inserting characters into your tree, wrap them in some other sort of variable (struct, object, dictionary) to 'extend' their meaning. That way you can provide the extra information needed (is this the end of a string? which string is this the end of?). Then you can provide custom operators for equality on this new wrapper instead of using characters directly.
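A minimal Python sketch of the wrapper idea, assuming the tree is built over lists of symbols rather than raw strings. The names `EndOfString` and `to_symbols` are made up for illustration. A frozen dataclass gets equality and hashing for free, so the terminator of string 0 never compares equal to the terminator of string 1 or to an ordinary character, and it can still be used as a key for child edges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndOfString:
    """Terminator symbol that records which input string it closes."""
    string_id: int


def to_symbols(text, string_id):
    """Turn a str into a list of symbols ending in its own unique terminator."""
    return list(text) + [EndOfString(string_id)]


s1 = to_symbols("abcddefx", 0)
s2 = to_symbols("abddefgh", 1)
```

Here `s1[-1] == EndOfString(0)` holds, while `s1[-1] == s2[-1]` and `s1[-1] == "$"` are both false, which is exactly the "which string is this the end of?" information.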
The general idea is to use two nested for loops: take each character of string1 and compare it to every character of string2; if every character is found, string1 is included in string2.
So we loop over all the characters of string1 and, for each one, look through all the characters of string2, which gives O(n^2) running time.
The interviewer said this is not a good idea.
I have been thinking about it since, and I cannot come up with an approach that does not use two loops.
Perhaps I could first take all the characters of string1, convert them to their ASCII codes, and build a tree from those numbers, so that the search becomes very fast when comparing against string2.
Or does anyone have a better idea?
For example, if string1 is abc and string2 is cbattt, then every character of string1 is included in string2.
I mean character inclusion, not substring matching.
As iccthedral says, Boyer-Moore is probably what the interviewer was looking for.
Searching a text for a given pattern (pattern matching) is a very well-known problem. Known solutions:
KMP
witness table
Boyer-Moore
suffix tree
The solutions vary in some aspects: whether they can be generalized to 2D pattern matching or beyond, whether they need pre-processing, whether they can be generalized to an unbounded alphabet, running time, etc.
EDIT:
If you just want to know whether all the letters of one string appear in another string, why not use a table the size of your alphabet, indicating whether a given char can be found in the string? If the alphabet is unbounded or extremely large (not of constant size), use a hash table.
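A minimal Python sketch of the hash table variant; the function name `contains_all_chars` is my own. It makes one pass over each string, so the expected running time is linear in the combined length instead of quadratic.

```python
def contains_all_chars(needle, haystack):
    """True if every character of `needle` occurs somewhere in `haystack`."""
    present = set(haystack)                     # one pass over haystack
    return all(ch in present for ch in needle)  # one pass over needle
```

With the example from the question, `contains_all_chars("abc", "cbattt")` is true. For a small fixed alphabet such as ASCII, a 128-entry boolean array indexed by character code would do the same job as the set.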