Shift rules in the Boyer-Moore algorithm

There is something I could not figure out about the two shift rules (bad character and good suffix) in this algorithm. Do they work together, and what exactly decides which one to apply at each shift? This comprehensive explanation ended with a SIMPLE EXAMPLE that confused me. My question is: if the algorithm moves backward, why would it need the good-suffix shift to move to the right? I am sure I am missing something here. Could you help me understand the aforementioned example?

The missing point is that the algorithm moves backward through the pattern, not through the string: the comparison starts at the character at index n (where n is the pattern length), not at index 1. The search window itself still slides to the right over the text. The following visual example is very helpful to clarify that.
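A minimal Python sketch may also make the two directions concrete. This is Horspool's simplified variant of Boyer-Moore (bad-character rule only; the function and table names are mine, not from any linked page): characters are compared right-to-left inside the current window, but the window itself only ever slides right over the text.

    def horspool_search(text, pattern):
        n, m = len(text), len(pattern)
        if m == 0:
            return 0
        # bad-character table: distance from each pattern character's
        # rightmost occurrence (excluding the last position) to the end
        shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
        pos = 0
        while pos <= n - m:
            i = m - 1                      # start at the LAST character
            while i >= 0 and text[pos + i] == pattern[i]:
                i -= 1                     # scan backward through the pattern
            if i < 0:
                return pos                 # full match
            # the window always moves to the RIGHT, never left
            pos += shift.get(text[pos + m - 1], m)
        return -1

For example, horspool_search("here is a simple example", "example") compares 'e', the pattern's last character, first at every window position, yet each shift moves the window rightward.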

Related

Closure Number Method for Generate Parenthesis Problem

The standard Generate Parenthesis question on Leetcode is as follows
Given n pairs of parentheses, write a function to generate all combinations of well-formed parentheses.
For example, given n = 3, a solution set is:
[
"((()))",
"(()())",
"(())()",
"()(())",
"()()()"
]
In the solution tab they have explained the Closure Number Method, which I am finding difficult to understand.
I did a dry run of the code and even got the correct answer, but I can't seem to understand why it works. What is the intuition behind this method?
Any help would be greatly appreciated!
The basic idea of this algorithm is dynamic programming: you divide your problem into smaller problems that are easy to solve. In this example you make the sub-problems so small that the solution is either an empty string (if the size is 0) or "()" (for size 1).
You start from the knowledge that if you want a valid parenthesis string of a given length, then the first character must be "(" and somewhere later in the string there must be a matching ")". Otherwise the output is not valid.
Now, you don't know the position of that closing parenthesis, so you just try every position (the first for loop).
The second thing you know is that between the opening and the closing parenthesis, and after the closing parenthesis, there has to be something whose exact shape you don't know (because there are many possibilities), but it has to be a valid parenthesis sequence again.
Now this is just the problem you already solved. So you put in every possibility of valid parentheses (using a smaller input size). Because this is exactly what your algorithm already does, you can use a recursive call to do it.
Summarized: you know a part of the problem, and the rest of the problem is the same problem at a smaller size. So you solve the small part you know and recursively call the same method on the rest. Afterwards you put it all together and have your solution.
Dynamic programming is usually not that easy to understand, but it is very powerful. So don't worry if you don't understand it right away. Solving puzzles like these is the best way to learn dynamic programming.
The closure number of a sequence is the size of the smallest prefix of the sequence which is a valid sequence on its own.
If a sequence has a closure number of k, then you know that at index 0 there is '(' and at index k there is ')'.
The method solves the problem by checking all possible sizes of such a prefix; for each one it breaks the sequence into the prefix (with the elements at indices 0 and k removed) and the rest of the sequence, and solves the two sub-problems recursively.
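For reference, here is a short Python sketch of that recursion (the function name is mine; this mirrors the structure described above): if the first '(' encloses c pairs, then c pairs go inside it and n - 1 - c pairs go after it.

    def generate_parenthesis(n):
        # all well-formed strings of n parenthesis pairs, by closure number
        if n == 0:
            return [""]
        result = []
        for c in range(n):                   # c pairs inside the first '('
            for inside in generate_parenthesis(c):
                for after in generate_parenthesis(n - 1 - c):
                    result.append("(" + inside + ")" + after)
        return result

    print(generate_parenthesis(3))
    # ['()()()', '()(())', '(())()', '(()())', '((()))']

Every string is produced exactly once because each well-formed string has exactly one closure number.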

Minimum-size Trie Construction by Dynamic Programming (War Story: What’s Past is Prolog from The Algorithm Design Manual by Steven S Skiena)

I'm reading The Algorithm Design Manual by Steven S. Skiena, and trying to understand the solution to the problem in the war story "What's Past is Prolog".
The problem is also well described here.
Basically, the problem is: given an ordered list of strings, give an optimal solution to construct a trie of minimum size (with the string characters as nodes), under the constraint that the order of the strings must be preserved, while the character indices can be reordered.
Maybe this is not an appropriate question for Stack Overflow, but I'm still wondering if anyone could give me a hint on the solution, especially what this recurrence means by its arguments:
(image: the recurrence for the dynamic programming algorithm)
You can think about it this way:
Let's assume that we fix the index of the first character. All strings get split into r bins based on the value of the character in this position (bins are essentially subtrees).
We can work with each bin independently. It won't change the order across different bins because two strings in different bins are different in the first character.
Thus, we can solve the problem for each bin independently. After that, we need exactly one edge to connect the root to each bin (that is, to each subtree). That's where the C[i_k, j_k] + 1 term in the formula comes from.
As we want to minimize the total number of edges and we're free to pick the first position, we just try all possible options among m positions.
Note: this algorithm is correct under assumption that we can reorder the rest of the characters in each subtree independently. If it's not the case, the dynamic programming solution is incorrect.
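To make the recurrence concrete, here is a brute-force Python restatement of the idea (names are mine, and it enumerates position subsets directly rather than using the book's memoized ranges): pick a position, split the strings into bins by the character there, pay one edge per bin, and recurse.

    def min_trie_edges(strings, free_positions):
        # minimum trie edges for this group when the characters at
        # free_positions may still be tested in any order
        if not strings or not free_positions:
            return 0
        if len(strings) == 1:
            return len(free_positions)   # one edge per remaining level
        best = float("inf")
        for p in free_positions:         # try every position for this level
            rest = tuple(q for q in free_positions if q != p)
            bins = {}
            for s in strings:            # split into bins by character at p
                bins.setdefault(s[p], []).append(s)
            # one edge into each bin, plus the optimal cost of each subtree
            best = min(best, sum(1 + min_trie_edges(tuple(b), rest)
                                 for b in bins.values()))
        return best

    print(min_trie_edges(("cat", "cot", "cut"), (0, 1, 2)))
    # 5: share 'c' (position 0) and 't' (position 2), branch only on 1

Without memoization this is exponential; caching these subproblems over string ranges is what the C[i, j] table in the recurrence captures.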

What are the main differences between the Knuth-Morris-Pratt and Boyer-Moore search algorithms?

What are the main differences between the Knuth-Morris-Pratt search algorithm and the Boyer-Moore search algorithm?
I know KMP searches for Y in X, trying to find a pattern in Y, and saves the pattern in a vector. I also know that BM works better for small words, like DNA (ACTG).
What are the main differences in how they work? Which one is faster? Which one is less computer-greedy? In which cases?
Moore's UTexas webpage walks through both algorithms in a step-by-step fashion (he also provides various technical sources):
Knuth-Morris-Pratt
Boyer-Moore
According to the man himself,

The classic Boyer-Moore algorithm suffers from the phenomenon that it tends not to work so efficiently on small alphabets like DNA. The skip distance tends to stop growing with the pattern length because substrings re-occur frequently. By remembering more of what has already been matched, one can get larger skips through the text. One can even arrange "perfect memory" and thus look at each character at most once, whereas the Boyer-Moore algorithm, while linear, may inspect a character from the text multiple times. This idea of remembering more has been explored in the literature by others. It suffers from the need for very large tables or state machines.
However, there have been some modifications of BM that have made small-alphabet searching viable.
As a rough explanation:
Boyer-Moore's approach is to try to match the last character of the pattern instead of the first one, on the assumption that if there's no match at the end, there's no need to try to match at the beginning. This allows for "big jumps", so BM works better when the pattern and the text you are searching resemble natural text (i.e. English).
Knuth-Morris-Pratt searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. (Source: Wiki)
This means KMP is better suited for small alphabets like DNA (ACTG).
The Boyer-Moore technique matches the characters from right to left and works well on long patterns.
Knuth-Morris-Pratt matches the characters from left to right and works fast on short patterns.
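For concreteness, here is a compact Python sketch of KMP's left-to-right scan (the standard textbook version; variable names are mine):

    def kmp_search(text, pattern):
        if not pattern:
            return 0
        # failure[i] = length of the longest proper prefix of
        # pattern[:i + 1] that is also a suffix of it
        failure = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = failure[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            failure[i] = k
        # scan left to right, never re-examining matched text characters
        k = 0
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = failure[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                return i - k + 1           # index of the first match
        return -1

Contrast this with the Boyer-Moore sketch earlier on this page, which compares from the pattern's right end.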

Lexicographically smallest perfect matching

I want to find the lexicographically smallest perfect matching in a bipartite graph. I'm supposed to use Kuhn's algorithm, but I don't understand how to make the matching lexicographically smallest. Is that even possible within Kuhn's algorithm? I can provide my code, but it's classic enough.
As a hint, consider how you could determine where just the first node should be matched in the lex-min matching.
In most cases like this it is usually easier to make a reduction instead of modifying the algorithm:
1) Find a way to change the input of your problem so that lexicographic order breaks any ties (but in a manner such that perfect matchings still score higher than imperfect ones).
2) Run the modified graph through Kuhn's algorithm.
3) If needed, translate the answer back to the original problem.
I haven't tried to actually solve this myself or read the problem in detail. But this seems to be a textbook exercise and I feel this answer is enough :-)
Think about how you can create assignment prices that encourage lexicographically early matchings.
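One direct (if not the fastest) way to act on the first hint, sketched in Python with Kuhn's algorithm as the feasibility check (all names are mine; adj maps each left node to its right neighbours): fix the left nodes in order, each to its smallest partner that still leaves the rest perfectly matchable.

    def matchable(adj, left_nodes, banned_right):
        # plain Kuhn's algorithm: can every node in left_nodes be matched
        # while avoiding the right nodes in banned_right?
        match_to = {}
        def augment(u, visited):
            for v in adj[u]:
                if v in banned_right or v in visited:
                    continue
                visited.add(v)
                if v not in match_to or augment(match_to[v], visited):
                    match_to[v] = u
                    return True
            return False
        return all(augment(u, set()) for u in left_nodes)

    def lex_min_perfect_matching(adj, n_left):
        fixed, used = {}, set()
        for u in range(n_left):
            rest = list(range(u + 1, n_left))
            for v in sorted(adj[u]):       # smallest candidate first
                if v in used:
                    continue
                # keep v only if the remaining nodes stay matchable
                if matchable(adj, rest, used | {v}):
                    fixed[u], used = v, used | {v}
                    break
            else:
                return None                # no perfect matching exists
        return fixed

    print(lex_min_perfect_matching({0: [1, 2], 1: [0, 1], 2: [0, 2]}, 3))
    # {0: 1, 1: 0, 2: 2}

Each feasibility check is a full run of Kuhn's algorithm, so this is slower than a purpose-built method, but it is easy to get right.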

Number of simple mutations to change one string to another?

I'm sure you've all heard of the "Word game", where you try to change one word to another by changing one letter at a time, and only going through valid English words. I'm trying to implement an A* Algorithm to solve it (just to flesh out my understanding of A*) and one of the things that is needed is a minimum-distance heuristic.
That is, the minimum number of the following three mutations needed to turn an arbitrary string a into another string b:
1) Change one letter for another
2) Add one letter at a spot before or after any letter
3) Remove any letter
Examples
aabca => abaca:
aabca
abca
abaca
= 2
abcdebf => bgabf:
abcdebf
bcdebf
bcdbf
bgdbf
bgabf
= 4
I've tried many algorithms out; I can't seem to find one that gives the actual answer every time. In fact, sometimes I'm not sure if even my human reasoning is finding the best answer.
Does anyone know any algorithm for such purpose? Or maybe can help me find one?
(Just to clarify, I'm asking for an algorithm that can turn any arbitrary string into any other, disregarding whether they are valid English words.)
You want the minimum edit distance (or Levenshtein distance):
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
And one algorithm to determine the editing sequence is on the same page here.
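That algorithm is a straightforward dynamic program (often called Wagner-Fischer); here is a minimal Python version, checked against your two examples:

    def levenshtein(a, b):
        # dist[i][j] = edit distance between a[:i] and b[:j]
        m, n = len(a), len(b)
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i                  # delete all of a[:i]
        for j in range(n + 1):
            dist[0][j] = j                  # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if a[i - 1] == b[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + sub)  # substitution
        return dist[m][n]

    assert levenshtein("aabca", "abaca") == 2
    assert levenshtein("abcdebf", "bgabf") == 4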
An excellent reference on "Edit distance" is section 6.3 of the Algorithms textbook by S. Dasgupta, C. H. Papadimitriou, and U. V. Vazirani, a draft of which is available freely here.
If you have a reasonably small dictionary, a breadth-first search might work.
So start with all the words your word can mutate into, then all the words those can mutate into (except the original), then go down to the third level... until you find the word you are looking for.
You could eliminate divergent words (ones further away from the target), but doing so might cause you to fail in a case where you must go through some divergent state to reach the shortest path.
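A rough Python sketch of that search (names are mine; dictionary is assumed to be a set of valid words):

    import string
    from collections import deque

    def neighbors(word):
        # every string one mutation away: delete, substitute, or insert
        out = set()
        for i in range(len(word)):
            out.add(word[:i] + word[i + 1:])             # delete
            for c in string.ascii_lowercase:
                out.add(word[:i] + c + word[i + 1:])     # substitute
        for i in range(len(word) + 1):
            for c in string.ascii_lowercase:
                out.add(word[:i] + c + word[i:])         # insert
        out.discard(word)
        return out

    def ladder_steps(start, target, dictionary):
        # breadth-first search level by level; returns mutation count or -1
        if start == target:
            return 0
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            word, d = queue.popleft()
            for nxt in neighbors(word):
                if nxt == target:
                    return d + 1
                if nxt in dictionary and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
        return -1

Because BFS expands level by level, the first time the target appears it is guaranteed to be along a shortest path, which is why pruning "divergent" words is risky.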
