How to understand the process of DFA construction in the KMP algorithm

I'm learning the KMP algorithm from the book Algorithms, 4th Edition. I understand most of it, but I have been stuck on the DFA construction process for a couple of days.
Take the pattern ABABAC, for example. When there's a mismatch at C (the DFA is in state 5), we should shift right one character in the text. So the pattern characters we already know about are BABA. But how do we figure out the next state of the DFA during construction? I failed to understand the text below:
For example, to decide what the DFA should do when we have a mismatch at j=5, for ABABAC, we use the DFA to learn that a full backup would leave us in state 3 for BABA, so we can copy dfa[][3] to dfa[][5].
What does "a full backup would leave us in state 3 for BABA" mean, and how do we reach that conclusion when there is no specific input? I also can't understand the diagram to the left of that text. Could anyone explain what it means? I have tried to understand it by myself for a couple of days, but still can't get it. Thank you!
You can read the relevant section of Algorithms, 4th Edition here.

When you're matching the input string, you can only get into state 5 after matching the first 5 characters of the pattern, and the first 5 characters of the pattern are ABABA. So no matter which input string you use, you know that the text preceding state 5 is "ABABA".
So if you get a mismatch in state 5, you could back up 4 characters and try matching again. But since you know what text has to appear before state 5, you don't actually need the input text to figure out what would happen: you can work out beforehand what state you'd end up in when you got back to the same place.
backup 4 characters and go to state 0:
0 : BABA
B doesn't match, so stay in state 0 and advance
0: ABA
A matches, so go to state 1
1: BA
B matches, go to state 2
2: A
A matches, go to state 3
3:
now we're back to the place in the input where we saw state 5 before, but now we're in state 3.
This will always happen when we get a mismatch in state 5, so instead of actually doing this, we just make a note that says "when we get a mismatch in state 5, go to state 3".
Note that most KMP implementations will actually make a failure table where failure_table[5]=3. Your example implementation is building the full DFA[char][state] instead, so it copies all the transitions from state 3 into state 5 for the failure cases. That says "when we get a mismatch in state 5, do whatever state 3 does", which works out the same.
UNDERSTAND EVERYTHING ABOVE BEFORE MOVING ON
Now let's speed up the calculation of those failure states...
When we get a mismatch in state 5, we can use the DFA we have so far to figure out what would happen if we backed up and rescanned the input starting at the next possible match, by applying the DFA to "BABA". We end up in state 3, so let's call state 3 the "failure state" for state 5.
It looks like we have to process 4 pattern characters to calculate the failure state for state 5, but we already did most of that work when we calculated the failure state for state 4 -- we applied the DFA to "BAB" and ended up in state 2.
So to figure out the failure state for state 5, we just start in the failure state for state 4 (state 2), and process the next character in the pattern -- the "A" that came after state 4 in the input.
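Here is a compact Python sketch of that construction (not the book's Java, and the variable names are mine): x tracks the failure state, and for each new state j the mismatch transitions are copied from x before the match transition is set.
def build_dfa(pattern, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    m = len(pattern)
    dfa = {c: [0] * m for c in alphabet}   # dfa[c][j]: next state on character c in state j
    dfa[pattern[0]][0] = 1
    x = 0                                  # failure state: where a full backup would leave us
    for j in range(1, m):
        for c in alphabet:
            dfa[c][j] = dfa[c][x]          # mismatch: do whatever the failure state does
        dfa[pattern[j]][j] = j + 1         # match: advance to the next state
        x = dfa[pattern[j]][x]             # update the failure state with one more pattern char
    return dfa

dfa = build_dfa("ABABAC")
# For j = 5 the mismatch transitions were copied from state 3, exactly as described above.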

Related

Is this NFA correctly accepting inputs that end with 00?

In a lecture it was said that this NFA accepts inputs ending with two zeros, or the input 0 by itself: https://ibb.co/9Wt0j7J
The alphabet is {0,1}
But if the input were 001, on some path we would also end up in the accepting state (z2), and it would not be possible to move to another state when reading the last character, the 1. That would mean a wrong input is accepted. So my question is: is the NFA really constructed correctly, without changing anything? And if so, why? Can I just assume that we go to an "empty" (invisible) error state, without mentioning it explicitly, if there is no arrow to another state?
Yes, it is a correct NFA for the given definition.
But if the input were 001, on some path we would also end up in the accepting state (z2), and it would not be possible to move to another state when reading the last character, the 1. That would mean a wrong input is accepted.
If you feed it 001, it will not accept, since an NFA checks all possible paths and eliminates the paths that get stuck. So you will reach z2 after the first two 0s, but after reading the 1 that path gets stuck and is eliminated.
Edit:
... an NFA accepts a string w if it is possible to make any
sequence of choices of next state, while reading the characters of w and go from the start state to any accepting state.
from the book Introduction to Automata Theory, Languages, and Computation by John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman (2006, p. 59).
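To make the "paths that get stuck are eliminated" idea concrete, here is a small Python simulation. The transition table below is my own three-state NFA for "ends with 00" (it ignores the separate "input = 0" case and is not necessarily the exact automaton from the linked image); it tracks the set of states reachable on every path at once.
NFA = {
    ("z0", "0"): {"z0", "z1"},   # nondeterministic guess: this 0 may begin the final "00"
    ("z0", "1"): {"z0"},
    ("z1", "0"): {"z2"},
    # z2 and ("z1", "1") have no outgoing transitions: such a path gets stuck and dies
}
START, ACCEPTING = "z0", {"z2"}

def accepts(word):
    current = {START}
    for ch in word:
        current = {s for state in current for s in NFA.get((state, ch), set())}
    return bool(current & ACCEPTING)

print(accepts("000"))   # True: some path ends in z2
print(accepts("001"))   # False: the path that reached z2 after "00" is stuck on the 1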

How to use machine learning to count words in text

Question:
Given a piece of text like "This is a test", how do we build a machine learning model that outputs the number of words it contains? For example, for this piece the word count is 4. After training, it should be possible to predict the word count of new text.
I know it is easy to write an ordinary program for this, for example something like:
PUNCTUATION = {'~', '`', '!', '@', '#', '$', '%', '^', '&', '*'}  # ...

def count_words(text):
    tokens = text.split()                                    # tokenize
    return sum(1 for t in tokens if t not in PUNCTUATION)    # count the non-punctuation tokens
However, in this question we are required to use a machine learning algorithm. I wonder how a machine can learn the concept of counting (currently, we know machine learning is good at classification). Any ideas or suggestions? Thanks in advance.
Failed attempts:
We can use something like word2vec (an encoder) to build word vectors. If we consider a seq2seq approach, we can train on pairs like "This is a test" -> "<s> 4 <e>" and "This is very very long sentence and the word count is greater than ten" -> "<s> 4 1 <e>" (4 1 representing the number 14). However, this does not work: the attention mechanism is designed to line up pieces of the input with pieces of the output, as in translation ("This is a test" --> 这(this) 是(is) 一个(a) 测试(test)), and it is hard to find a relationship between [This, is, a, test] and 4, which is an aggregate number (i.e. the model does not converge).
We know machine learning is good at classification. If we treat "4" as a class, the number of classes is infinite; if we use a trick and predict count/text.length instead, I have not been able to get a model that fits even the training data set (the model does not converge). For example, if we train the model on many short sentences, it fails to predict the word count of long sentences. It may also be related to an information paradox: we could encode the data in a book as a number 0.x and use a machine to mark a position on a rod, splitting it into two parts of lengths a and b with a/b = 0.x, but no such machine can actually be built.
What about a regression problem?
I think it would work quite well, and in the end it would output nearly whole numbers all the time.
Also, you can train a simple RNN to do the job, assuming you use a one-hot encoding and take the output from the last state.
If V_h is all zeros except at the space index (which will be 1), and V_x as well, then the network will actually sum the spaces, and if c is 1 at the end, the output will be the number of words, for every input length!
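To illustrate that such hand-set weights exist, here is a toy sketch (my own names and weight layout, not a trained model): a linear recurrence that accumulates the number of spaces, so that for single-spaced text the word count is that sum plus one.
import numpy as np

VOCAB = sorted(set("abcdefghijklmnopqrstuvwxyz "))
SPACE = VOCAB.index(" ")

W_x = np.zeros(len(VOCAB)); W_x[SPACE] = 1.0   # input weights: fire only on the space character
W_h = 1.0                                      # hidden weight: carry the running count forward
c = 1.0                                        # output weight

def count_words_rnn(text):
    h = 0.0
    for ch in text.lower():
        if ch not in VOCAB:
            continue                                         # ignore characters outside the toy vocabulary
        x = np.zeros(len(VOCAB)); x[VOCAB.index(ch)] = 1.0   # one-hot encoding
        h = W_h * h + W_x @ x                                # linear recurrence: h counts the spaces seen so far
    return int(c * h) + 1                                    # words = spaces + 1 for single-spaced text

print(count_words_rnn("This is a test"))   # 4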
I think we can treat it as a classification problem, with a character as the input and whether it is a word breaker as the output.
In other words, at some time point t, we output whether the input character at the same time point is a word breaker (YES) or not (NO). If yes, then increase the word count. If no, then read the next character.
In modern English I don't think there are many very long words, so a simple RNN model should do, perhaps without worrying about vanishing gradients.
Let me know what you think!
Use NLTK for counting words:
from nltk.tokenize import word_tokenize   # requires a one-time nltk.download('punkt')

text = "God is Great!"
# word_tokenize also returns punctuation tokens ('!'), so filter them out when counting words
word_count = len([t for t in word_tokenize(text) if t.isalnum()])
print(word_count)   # 3

Interview Q: Detecting a fighting game moveset

I got this as an interview question and I'm wondering what the optimal way of designing this system would be. The problem:
Say you have a fighting game where certain button combinations represent a special move. Implement two functions: register_move([button combo], movename), which takes a list of button inputs and a movename string, and on_keypress(button), which registers the current keypress and prints a movename if a button combo has been activated. The buttons are represented as characters: 'U', 'D', 'L', 'R', 'A', 'B'.
Example:
register_move(['A','B','U'],"Uppercut")
on_keypress('A')
on_keypress('B')
on_keypress('U') -> print "Uppercut"
You can assume moves are registered before on_keypress is called, so you don't have to retroactively look back at previous keypresses. You can use any language you like.
Build a Deterministic Finite State Automaton. The initial state is "no keys recognised". On each keypress, transition into a new state; if it is a final state you have a move. All undefined transitions transition into the starting state. For your example,
S --(a)--> A
A --(b)--> AB
AB --(u)--> ABU: process "Uppercut", move to S
X --(x)--> S
where X is any state, x is any input not otherwise covered by the rules.
More practically and less theoretically, you will end up with a trie, so using a trie library should be sufficient. Root is "no input", walk it until a leaf, or restart on a mispress.
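For example, a rough Python sketch of that trie (a plain dict of dicts; the names and the "$" leaf marker are mine). Undefined transitions simply restart at the root, matching the simplified rule above.
root = {}                      # trie node: button -> child node; "$" marks a completed combo

def register_move(combo, movename):
    node = root
    for button in combo:
        node = node.setdefault(button, {})
    node["$"] = movename

state = root                   # current position in the trie

def on_keypress(button):
    global state
    state = state.get(button) or root      # undefined transition: restart at the root
    if "$" in state:
        print(state["$"])
        state = root

register_move(['A', 'B', 'U'], "Uppercut")
on_keypress('A'); on_keypress('B'); on_keypress('U')   # prints "Uppercut"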
Considering the limited number of moves, you don't need a super efficient finite state machine to handle this.
You could simply store the strings in register_move, and have on_keypress memorize the last potentially valid sequence.
If the current key sequence is the prefix of at least one move (for instance "AB" being a prefix of "ABU"), you're done (just wait for the next keypress to see if a combo is reached).
If the sequence is no prefix, reset the sequence to the last keypress (for instance "ABD" -> "D"). This clears previous keypresses that correspond to no moves.
If the sequence corresponds to a move, perform the move (well, print it at least) and reset the sequence.
This would require doing a prefix search over every registered move combo, which is very quick if you have only a dozen or so. If for some reason you need to be quicker, you can indeed turn your list of combos into a prefix tree, but it would require a lot more code for little gain.
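A minimal Python sketch of that simple approach (illustrative only, not a reference solution):
moves = {}        # tuple of buttons -> move name
sequence = []     # recent keypresses that are still the prefix of at least one combo

def register_move(combo, movename):
    moves[tuple(combo)] = movename

def is_prefix(seq):
    return any(combo[:len(seq)] == tuple(seq) for combo in moves)

def on_keypress(button):
    global sequence
    sequence.append(button)
    if not is_prefix(sequence):
        sequence = [button] if is_prefix([button]) else []   # reset to the last keypress
    if tuple(sequence) in moves:
        print(moves[tuple(sequence)])                        # full combo matched
        sequence = []

register_move(['A', 'B', 'U'], "Uppercut")
on_keypress('A'); on_keypress('B'); on_keypress('U')   # prints "Uppercut"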

Computability: Is the language of DFAs that accept only even-length words in P?

I've been struggling with this one for a while and am not able to come up with anything. Any pointers would be really appreciated.
The problem is: given the language of all DFAs that accept only words of even length, prove whether or not it is in P.
I've considered making a Turing machine that walks over the given DFA with something like BFS/Dijkstra's algorithm in order to find all the paths from the starting state to the accepting ones, but I have no idea how to handle loops.
Thanks!
I think it's in P, at worst quadratic. Each state of the DFA can be in one of four parity states:
unvisited -- state 0
known to be reachable in an odd number of steps -- state 1
known to be reachable in an even number of steps -- state 2
known to be reachable in both, odd and even numbers of steps -- state 3
Mark all states as unvisited, put the starting state in a queue (FIFO, priority, whatever), set its parity state to 2.
child_parity(n)
    switch(n)
        case 0: error
        case 1: return 2
        case 2: return 1
        case 3: return 3

while(queue not empty)
    dfa_state <- queue
    step_parity = child_parity(dfa_state.parity_state)
    for next_state in dfa_state.children
        old_parity = next_state.parity_state
        next_state.parity_state |= step_parity
        if old_parity != next_state.parity_state   // we have learnt something new
            add next_state to queue                // remove duplicates if applicable

for as in accept_states
    if as.parity_state & 1 == 1
        return false
return true
Unless I'm overlooking something, each DFA state is treated at most twice, each time checking at most alphabet-size children for required action.
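A runnable Python sketch of the same idea (the dict-based DFA encoding and the names are mine): delta[state][symbol] gives the next state, and the bit flags match the parity states above.
from collections import deque

def accepts_only_even_words(delta, start, accepting):
    ODD, EVEN = 1, 2                       # parity bit flags, as in the states above
    parity = {s: 0 for s in delta}         # 0 = unvisited
    parity[start] = EVEN                   # the empty (even-length) word reaches the start state
    queue = deque([start])
    while queue:
        state = queue.popleft()
        step = (EVEN if parity[state] & ODD else 0) | (ODD if parity[state] & EVEN else 0)
        for child in delta[state].values():
            old = parity[child]
            parity[child] |= step
            if parity[child] != old:       # we have learnt something new: revisit the child
                queue.append(child)
    # the DFA accepts only even-length words iff no accepting state is reachable in an odd number of steps
    return not any(parity[s] & ODD for s in accepting)

# Example: a two-state DFA over {0,1} that accepts exactly the even-length words
delta = {"even": {"0": "odd", "1": "odd"}, "odd": {"0": "even", "1": "even"}}
print(accepts_only_even_words(delta, "even", {"even"}))   # True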
It would seem this only requires two states.
Your entry state would be the empty string, and would also be an accept state. Adding anything to the string would move it to the next state, which we can call the 'odd' state, and not make it an accept state. Adding another character puts us back in the original state.
I guess I'm not sure on the terminology anymore of whether a language is in P or not, so if you give me a definition there I can tell you whether this fits it, but this is one of the simplest DFAs around...

First-Occurrence Parallel String Matching Algorithm

To be up front, this is homework. That being said, it's extremely open ended and we've had almost zero guidance as to how to even begin thinking about this problem (or parallel algorithms in general). I'd like pointers in the right direction and not a full solution. Any reading that could help would be excellent as well.
I'm working on an efficient way to match the first occurrence of a pattern in a large amount of text using a parallel algorithm. The pattern is simple character matching, no regex involved. I've managed to come up with a possible way of finding all of the matches, but that then requires that I look through all of the matches and find the first one.
So the question is, will I have more success breaking the text up between processes and scanning that way? Or would it be best to have process-synchronized searching of some sort where the j'th process searches for the j'th character of the pattern? If then all processes return true for their match, the processes would change their position in matching said pattern and move up again, continuing until all characters have been matched and then returning the index of the first match.
What I have so far is extremely basic, and more than likely does not work. I won't be implementing this, but any pointers would be appreciated.
With p processors, a text of length t, and a pattern of length L, and a ceiling of L processors used:
for i = 0 to t-L:
    for j = 0 to p:
        processor j compares text[i+j] to pattern[j]
    on a false match:
        all processors terminate the current comparison, i++
    on a true match by all processors:
        iterate p characters at a time until L characters have been compared
        if all L comparisons return true:
            return i (position of pattern)
        else:
            i++
I am afraid that simply splitting the string evenly across the processors will not do: generally speaking, early escaping is difficult that way, so you'd be better off breaking the text into smaller chunks.
But let's first ask Herb Sutter to explain searching with parallel algorithms, on Dr. Dobb's. The idea is to use the non-uniformity of the distribution to get an early return. Of course Sutter is interested in any match, which is not the problem at hand, so let's adapt.
Here is my idea, let's say we have:
Text of length N
p Processors
heuristic: max is the maximum number of characters a chunk should contain, probably an order of magnitude greater than M the length of the pattern.
Now, what you want is to split your text into k equal chunks, where k is minimal and size(chunk) is maximal yet no greater than max.
Then we have a classical producer-consumer pattern: the p processes are fed the chunks of text, each process looking for the pattern in the chunk it receives.
The early escape is done with a flag. You can either set it to the index of the chunk in which you found the pattern (along with its position), or just set a boolean and store the result in the processes themselves (in which case you'll have to go through all the processes once they have stopped). The point is that each time a chunk is requested, the producer checks the flag and stops feeding the processes if a match has been found (since the processes are given the chunks in order).
Let's have an example, with 3 processors:
[ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]
                      x       x
The chunks 6 and 8 both contain the string.
The producer will first feed 1, 2 and 3 to the processes; then each process will advance at its own pace (it depends on how similar the searched text is to the pattern).
Let's say we find the pattern in 8 before we find it in 6. Then the process that was working on 7 finishes and asks for another chunk, and the producer stops it, since it would be irrelevant. Then the process working on 6 finishes, with a result, and thus we know that the first occurrence was in 6, and we have its position.
The key idea is that you don't want to look at the whole text! It's wasteful!
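Here is a rough Python sketch of that producer-consumer scheme (chunk size, worker count, and function names are all mine, and I use a naive str.find inside each chunk rather than a fancier matcher). Chunks overlap by len(pattern) - 1 characters so a match straddling a boundary isn't lost, and chunks that have not started yet are cancelled once an earlier chunk reports a hit.
from concurrent.futures import ProcessPoolExecutor

def scan_chunk(start, chunk_text, pattern):
    pos = chunk_text.find(pattern)
    return start + pos if pos >= 0 else -1

def parallel_find(text, pattern, workers=3, chunk_size=1 << 16):
    overlap = len(pattern) - 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scan_chunk, s, text[s:s + chunk_size + overlap], pattern)
                   for s in range(0, len(text), chunk_size)]
        first = -1
        for f in futures:                  # walk the chunks in text order
            pos = f.result()
            if pos >= 0:
                first = pos                # earliest chunk with a hit gives the first occurrence
                for g in futures:
                    g.cancel()             # early escape: drop chunks that have not started yet
                break
    return first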
Given a pattern of length L, and searching in a string of length N over P processors, I would just split the string over the processors. Each processor would take a chunk of length N/P + L-1, with the last L-1 characters overlapping the chunk belonging to the next processor. Then each processor would perform Boyer-Moore (the two pre-processing tables would be shared). When each finishes, it would return the result to the first processor, which maintains a table:
Process    Index
1          -1
2           2
3          23
After all processes have responded (or with a bit of thought you can arrange an early escape), you return the first match. This should be on average O(N/(L*P) + P).
The approach of having the i'th processor match the i'th character would require too much inter-process communication overhead.
EDIT: I realize you already have a solution, and are trying to figure out an approach that avoids having to find all the matches first. Well, I don't really think that's necessary. You can come up with some early-escape conditions, and they aren't that difficult, but I don't think they'll improve your performance much in general (unless you have some additional knowledge about the distribution of matches in your text).

Resources