Boyer-Moore Algorithm: Reaching the End of the Text - algorithm

I'm working through this algorithm and the pattern and text are not matching up.
The text is: AADBCCAAA
The pattern is: CCAAA
I've created both the bad-symbol table and good-suffix table.
Bad Symbol:
! A C (! represents letters not in the pattern)
5 1 3
Good-suffix:
k d2
1 2
2 6
3 6
4 6
5 6
As for my searching:
AADBCCAAA
CCAAA Since there is no match, shift 3 because C causes the mismatch
This lines up the right-most A in the pattern with the 2nd to last A in the text. This means
d1 = 3-2 = 1 and d2 = 6. The max of the two is 6 so shift 6.
With Boyer-Moore, does this mean since you can't shift 6, it will just compare the pattern with the end of the text and find the match or am I doing something wrong?

Related

How to extract vectors from a given condition matrix in Octave

I'm trying to extract a matrix with two columns. The first column is the data that I want to group into a vector, while the second column is information about the group.
A =
1 1
2 1
7 2
9 2
7 3
10 3
13 3
1 4
5 4
17 4
1 5
6 5
the result that i seek are
A1 =
1
2
A2 =
7
9
A3 =
7
10
13
A4=
1
5
17
A5 =
1
6
as an illustration, I used the eval function but it didn't give the results I wanted
Assuming that you don't actually need individually named separated variables, the following will put the values into separate cells of a cell array, each of which can be an arbitrary size and which can be then retrieved using cell index syntax. It makes used of logical indexing so that each iteration of the for loop assigns to that cell in B just the values from the first column of A that have the correct number in the second column of A.
num_cells = max (A(:,2));
B = cell (num_cells,1);
for idx = 1:max(A(:,2))
B(idx) = A((A(:,2)==idx),1);
end
B =
{
[1,1] =
1
2
[2,1] =
7
9
[3,1] =
7
10
13
[4,1] =
1
5
17
[5,1] =
1
6
}
Cell arrays are accessed a bit differently than normal numeric arrays. Array indexing (with ()) will return another cell, e.g.:
>> B(1)
ans =
{
[1,1] =
1
2
}
To get the contents of the cell so that you can work with them like any other variable, index them using {}.
>> B{1}
ans =
1
2
How it works:
Use max(A(:,2)) to find out how many array elements are going to be needed. A(:,2) uses subscript notation to indicate every value of A in column 2.
Create an empty cell array B with the right number of cells to contain the separated parts of A. This isn't strictly necessary, but with large amounts of data, things can slow down a lot if you keep adding on to the end of an array. Pre-allocating is usually better.
For each iteration of the for loop, it determines which elements in the 2nd column of A have the value matching the value of idx. This returns a logical array. For example, for the third time through the for loop, idx = 3, and:
>> A_index3 = A(:,2)==3
A_index3 =
0
0
0
0
1
1
1
0
0
0
0
0
That is a logical array of trues/falses indicating which elements equal 3. You are allowed to mix both logical and subscripts when indexing. So using this we can retrieve just those values from the first column:
A(A_index3, 1)
ans =
7
10
13
we get the same result if we do it in a single line without the A_index3 intermediate placeholder:
>> A(A(:,2)==3, 1)
ans =
7
10
13
Putting it in a for loop where 3 is replaced by the loop variable idx, and we assign the answer to the idx location in B, we get all of the values separated into different cells.

Bipartie Matching to form an Array

I am given a number from 1 to N , and there are M relationship given in the form a and b where we can connect number a and b.
We have to form the valid array , A array is said to be valid if for any two consecutive indexes A[i] and A[i+1] is one of the M relationship
We have to construct a valid Array of Size N, it's always possible to construct that.
Solution: Make A Bipartite Graph of the following , but there is a loophole on this,
let N=6
M=6
1 2
2 3
1 3
4 5
5 6
3 4
So Bipartite Matching gives this:
Match[1]=2
Match[2]=3
Match[3]=1 // Here it form a Loop
Match[4]=5
Match[5]=6
So how to i print a valid Array of size N , since N can be very large so many loops can be formed ? Is there any other solution ?
Another Example:
let N=6
M=6
1 3
3 5
2 5
5 1
4 2
6 4
It's will form a loop 1->3->5->1
1 3 5 2 4 6

Strategy with regard to how to approach this algorithm?

I was asked this question in a test and I need help with regards to how I should approach the solution, not the actual answer. The question is
You have been given a 7 digit number(with each digit being distinct and 0-9). The number has this property
product of first 3 digits = product of last 3 digits = product of central 3 digits
Identify the middle digit.
Now, I can do this on paper by brute force(trial and error), the product is 72 and digits being
8,1,9,2,4,3,6
Now how do I approach the problem in a no brute force way?
Let the number is: a b c d e f g
So as per the rule(1):
axbxc = cxdxe = exfxg
more over we have(2):
axb = dxe and
cxd = fxg
This question can be solved with factorization and little bit of hit/trial.
Out of the digits from 1 to 9, 5 and 7 can rejected straight-away since these are prime numbers and would not fit in the above two equations.
The digits 1 to 9 can be factored as:
1 = 1, 2 = 2, 3 = 3, 4 = 2X2, 6 = 2X3, 8 = 2X2X2, 9 = 3X3
After factorization we are now left with total 7 - 2's, 4 - 3's and the number 1.
As for rule 2 we are left with only 4 possibilities, these 4 equations can be computed by factorization logic since we know we have overall 7 2's and 4 3's with us.
1: 1X8(2x2x2) = 2X4(2x2)
2: 1X6(3x2) = 3X2
3: 4(2x2)X3 = 6(3x2)X2
4: 9(3x3)X2 = 6(3x2)X3
Skipping 5 and 7 we are left with 7 digits.
With above equations we have 4 digits with us and are left with remaining 3 digits which can be tested through hit and trial. For example, if we consider the first case we have:
1X8 = 2X4 and are left with 3,6,9.
we have axbxc = cxdxe we can opt c with these 3 options in that case the products would be 24, 48 and 72.
24 cant be correct since for last three digits we are left with are 6,9,4(=216)
48 cant be correct since for last three digits we are left with 3,9,4(=108)
72 could be a valid option since the last three digits in that case would be 3,6,4 (=72)
This question is good to solve with Relational Programming. I think it very clearly lets the programmer see what's going on and how the problem is solved. While it may not be the most efficient way to solve problems, it can still bring desired clarity and handle problems up to a certain size. Consider this small example from Oz:
fun {FindDigits}
D1 = {Digit}
D2 = {Digit}
D3 = {Digit}
D4 = {Digit}
D5 = {Digit}
D6 = {Digit}
D7 = {Digit}
L = [D1 D2 D3] M = [D3 D4 D5] E= [D5 D6 D7] TotL in
TotL = [D1 D2 D3 D4 D5 D6 D7]
{Unique TotL} = true
{ProductList L} = {ProductList M} = {ProductList E}
TotL
end
(Now this would be possible to parameterize furthermore, but non-optimized to illustrate the point).
Here you first pick 7 digits with a function Digit/0. Then you create three lists, L, M and E consisting of the segments, as well as a total list to return (you could also return the concatenation, but I found this better for illustration).
Then comes the point, you specify relations that have to be intact. First, that the TotL is unique (distinct in your tasks wording). Then the next one, that the segment products have to be equal.
What now happens is that a search is conducted for your answers. This is a depth-first search strategy, but could also be breadth-first, and a solver is called to bring out all solutions. The search strategy is found inside the SolveAll/1 function.
{Browse {SolveAll FindDigits}}
Which in turns returns this list of answers:
[[1 8 9 2 4 3 6] [1 8 9 2 4 6 3] [3 6 4 2 9 1 8]
[3 6 4 2 9 8 1] [6 3 4 2 9 1 8] [6 3 4 2 9 8 1]
[8 1 9 2 4 3 6] [8 1 9 2 4 6 3]]
At least this way forward is not using brute force. Essentially you are searching for answers here. There might be heuristics that let you find the correct answer sooner (some mathematical magic, perhaps), or you can use genetic algorithms to search the space or other well-known strategies.
Prime factor of distinct digit (if possible)
0 = 0
1 = 1
2 = 2
3 = 3
4 = 2 x 2
5 = 5
6 = 2 x 3
7 = 7
8 = 2 x 2 x 2
9 = 3 x 3
In total:
7 2's + 4 3's + 1 5's + 1 7's
With the fact that When A=B=C, composition of prime factor of A must be same as composition of prime factor of B and that of C, 0 , 5 and 7 are excluded since they have unique prime factor that can never match with the fact.
Hence, 7 2's + 4 3's are left and we have 7 digit (1,2,3,4,6,8,9). As there are 7 digits only, the number is formed by these digits only.
Recall the fact, A, B and C must have same composition of prime factors. This implies that A, B and C have same number of 2's and 3's in their composition. So, we should try to achieve (in total for A and B and C):
9 OR 12 2's AND
6 3's
(Must be product of 3, lower bound is total number of prime factor of all digits, upper bound is lower bound * 2)
Consider point 2 (as it has one possibility), A has 2 3's and same for B and C. To have more number of prime factor in total, we need to put digit in connection digit between two product (third or fifth digit). Extract digits with prime factor 3 into two groups {3,6} and {9} and put digit into connection digit. The only possible way is to put 9 in connection digit and 3,6 on unconnected product. That mean xx9xx36 or 36xx9xx (order of 3,6 is not important)
With this result, we get 9 x middle x connection digit = connection digit x 3 x 6. Thus, middle = (3 x 6) / 9 = 2
My answer actually extends #Ansh's answer.
Let abcdefg be the digits of the number. Then
ab=de
cd=fg
From these relations we can exclude 0, 5 and 7 because there are no other multipliers of these numbers between 0 and 9. So we are left with seven numbers and each number is included once in each answer. We are going to examine how we can pair the numbers (ab, de, cd, fg).
What happens with 9? It can't be combined with 3 or 6 since then their product will have three times the factor 3 and we have at total 4 factors of 3. Similarly, 3 and 6 must be combined at least one time together in response to the two factors of 9. This gives a product of 18 and so 9 must be combined at least once with 2.
Now if 9x2 is in a corner then 3x6 must be in the middle. Meaning in the other corner there must be another multiplier of 3. So 9 and 2 are in the middle.
Let's suppose ab=3x6 (The other case is symmetric). Then d must be 9 or 2. But if d is 9 then f or g must be multiplier of 3. So d is 2 and e is 9. We can stop here and answer the middle digit is
2
Now we have 2c = fg and the remaining choices are 1, 4, 8. We see that the only solutions are c = 4, f = 1, g = 8 and c = 4, f = 8, g = 1.
So if is 3x6 is in the left corner we have the following solutions:
3642918, 3642981, 6342918, 6342981
If 3x6 is in the right corner we have the following solutions which are the reverse of the above:
8192463, 1892463, 8192436, 1892436
Here is how you can consider the problem:
Let's note the final solution N1 N2 N3 N4 N5 N6 N7 for the 3 numbers N1N2N3, N3N4N5 and N5N6N7
0, 5 and 7 are to exclude because they are prime and no other ciphers is a multiple of them. So if they had divided one of the 3 numbers, no other number could have divided the others.
So we get the 7 remaining ciphers : 1234689
where the product of the ciphers is 2^7*3^4
(N1*N2*N3) and (N5*N6*N7) are equals so their product is a square number. We can then remove, one of the number (N4) from the product of the previous point to find a square number (i.e. even exponents on both numbers)
N4 can't be 1, 3, 4, 6, 9.
We conclude N4 is 2 or 8
If N4 is 8 and it divides (N3*N4*N5), we can't use the remaining even numbers (2, 4, 6) to divides
both (N1*N2*N3) and (N6*N7*N8) by 8. So N4 is 2 and 8 does not belong to the second group (let's put it in N1).
Now, we have: 1st grp: 8XX, 2nd group: X2X 3rd group: XXX
Note: at this point we know that the product is 72 because it is 2^3*3^2 (the square root of 2^6*3^4) but the result is not really important. We have made the difficult part knowing the 7 numbers and the middle position.
Then, we know that we have to distribute 2^3 on (N1*N2*N3), (N3*N4*N5), (N5*N6*N7) because 2^3*2*2^3=2^7
We already gave 8 to N1, 2 to N4 and we place 6 to N6, and 4 to N5 position, resulting in each of the 3 numbers being a multiple of 8.
Now, we have: 1st grp: 8XX, 2nd group: X24 3rd group: 46X
We have the same way of thinking considering the odd number, we distribute 3^2, on each part knowing that we already have a 6 in the last group.
Last group will then get the 3. And first and second ones the 9.
Now, we have: 1st grp: 8X9, 2nd group: 924 3rd group: 463
And, then 1 at N2, which is the remaining position.
This problem is pretty easy if you look at the number 72 more carefully.
We have our number with this form abcdefg
and abc = cde = efg, with those digits 8,1,9,2,4,3,6
So, first, we can conclude that 8,1,9 must be one of the triple, because, there is no way 1 can go with other two numbers to form 72.
We can also conclude that 1 must be in the start/end of the whole number or middle of the triple.
So now we have 819defg or 918defg ...
Using some calculations with the rest of those digits, we can see that only 819defg is possible, because, we need 72/9 = 8,so only 2,4 is valid, while we cannot create 72/8 = 9 from those 2,4,3,6 digits, so -> 81924fg or 81942fg and 819 must be the triple that start or end our number.
So the rest of the job is easy, we need either 72/4 = 18 or 72/2 = 36, now, we can have our answers: 8192436 or 8192463.
7 digits: 8,1,9,2,4,3,6
say XxYxZ = 72
1) pick any two from above 7 digits. say X,Y
2) divide 72 by X and then Y.. you will get the 3rd number i.e Z.
we found XYZ set of 3-digits which gives result 72.
now repeat 1) and 2) with remaining 4 digits.
this time we found ABC which multiplies to 72.
lets say, 7th digit left out is I.
3) divide 72 by I. result R
4) divide R by one of XYZ. check if result is in ABC.
if No, repeat the step 3)
if yes, found the third pair.(assume you divided R by Y and the result is B)
YIB is the third pair.
so... solution will be.
XZYIBAC
You have your 7 numbers - instead of looking at it in groups of 3 divide up the number as such:
AB | C | D | E | FG
Get the value of AB and use it to get the value of C like so: C = ABC/AB
Next you want to do the same thing with the trailing 2 digits to find E using FG. E = EFG/FG
Now that you have C & E you can solve for D
Since CDE = ABC then D = ABC/CE
Remember your formulas - instead of looking at numbers create a formula aka an algorithm that you know will work every time.
ABC = CDE = EFG However, you have to remember that your = signs have to balance. You can see that D = ABC/CE = EFG/CE Once you know that, you can figure out what you need in order to solve the problem.
Made a quick example in a fiddle of the code:
http://jsfiddle.net/4ykxx9ve/1/
var findMidNum = function() {
var num = [8, 1, 9, 2, 4, 3, 6];
var ab = num[0] * num[1];
var fg = num[5] * num[6];
var abc = num[0] * num[1] * num[2];
var cde = num[2] * num[3] * num[4];
var efg = num[4] * num[5] * num[6];
var c = abc/ab;
var e = efg/fg;
var ce = c * e
var d = abc/ce;
console.log(d); //2
}();
You have been given a 7 digit number(with each digit being distinct and 0-9). The number has this property
product of first 3 digits = product of last 3 digits = product of central 3 digits
Identify the middle digit.
Now, I can do this on paper by brute force(trial and error), the product is 72 and digits being
8,1,9,2,4,3,6
Now how do I approach the problem in a no brute force way?
use linq and substring functions
example var item = array.Skip(3).Take(3) in such a way that you have a loop
for(f =0;f<charlen.length;f++){
var xItemSum = charlen[f].Skip(f).Take(f).Sum(f => f.Value);
}
// untested code

Find longest common substring of multiple strings using factor oracle enhanced with LRS array

Can we use a factor-oracle with suffix link (paper here) to compute the longest common substring of multiple strings? Here, substring means any part of the original string. For example "abc" is the substring of "ffabcgg", while "abg" is not.
I've found a way to compute the maximum length common substring of two strings s1 and s2. It works by concatenating the two strings using a character not in them, '$' for example. Then for each prefix of the concatenated string s with length i >= |s1| + 2, we calculate its LRS (longest repeated suffix) length lrs[i] and sp[i] (the end position of the first occurence of its LRS). Finally, the answer is
max{lrs[i]| i >= |s1| + 2 and sp[i] <= |s1|}
I've written a C++ program that uses this method, which can solve the problem within 200ms on my laptop when |s1|+|s2| <= 200000, using the factor oracle.
s1 = 'ffabcgg'
s2 = 'gfbcge'
s = s1+'$'+s2
= 'ffabcgg$gfbcge'
p: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
s: f f a b c g g $ g f b c g e
sp: 0 1 0 0 0 0 6 0 6 1 4 5 6 0
lrs:0 1 0 0 0 0 1 0 1 1 1 2 3 0
ans = lrs[13] = 3
I know the both problems can be solved using suffix-array and suffix-tree with high efficiency, but I wonder if there is a method using factor oracle to solve it. I am interested in this because the factor oracle is easy to construct (with 30 lines of C++, suffix-array needs about 60, and suffix-tree needs 150), and it runs faster than suffix-array and suffix-tree.
You can test your method of the first problem in this OnlineJudge, and the second problem in here.
Can we use a factor-oracle with suffix link (paper here) to compute
the longest common substring of multiple strings?
I don't think the algorithm is a very good fit (it is designed to factor a single string) but you can use it by concatenating the original strings with a unique separator.
Given abcdefg and hijcdekl and mncdop, find the longest common substring cd:
# combine with unique joiners
>>> s = "abcdefg" + "1" + "hijcdekl" + "2" + "mncdop"
>>> factor_oracle(s)
"cd"
As part of its linear-time and space algorithm, the factor-oracle quickly rediscover the break points between the input strings as part of its search for common factors (the unique joiners provide and immediate cue to stop extending the best factor found so far).

String similarity: how exactly does Bitap work?

I'm trying to wrap my head around the Bitap algorithm, but am having trouble understanding the reasons behind the steps of the algorithm.
I understand the basic premise of the algorithm, which is (correct me if i'm wrong):
Two strings: PATTERN (the desired string)
TEXT (the String to be perused for the presence of PATTERN)
Two indices: i (currently processing index in PATTERN), 1 <= i < PATTERN.SIZE
j (arbitrary index in TEXT)
Match state S(x): S(PATTERN(i)) = S(PATTERN(i-1)) && PATTERN[i] == TEXT[j], S(0) = 1
In english terms, PATTERN.substring(0,i) matches a substring of TEXT if the previous substring PATTERN.substring(0, i-1) was successfully matched and the character at PATTERN[i] is the same as the character at TEXT[j].
What I don't understand is the bit-shifting implementation of this. The official paper detailing this algorithm basically lays it out, but I can't seem to visualize what's supposed to go on. The algorithm specification is only the first 2 pages of the paper, but I'll highlight the important parts:
Here is the bit-shifting version of the concept:
Here is T[text] for a sample search string:
And here is a trace of the algorithm.
Specifically, I don't understand what the T table signifies, and the reason behind ORing an entry in it with the current state.
I'd be grateful if anyone can help me understand what exactly is going on
T is slightly confusing because you would normally number positions in the
pattern from left to right:
0 1 2 3 4
a b a b c
...whereas bits are normally numbered from right to left.
But writing the
pattern backwards above the bits makes it clear:
bit: 4 3 2 1 0
c b a b a
T[a] = 1 1 0 1 0
c b a b a
T[b] = 1 0 1 0 1
c b a b a
T[c] = 0 1 1 1 1
c b a b a
T[d] = 1 1 1 1 1
Bit n of T[x] is 0 if x appears in position n, or 1 if it does not.
Equivalently, you can think of this as saying that if the current character
in the input string is x, and you see a 0 in position n of T[x], then you
can only possibly be matching the pattern if the match started n characters
previously.
Now to the matching procedure. A 0 in bit n of the state means that we started matching the pattern n characters ago (where 0 is the current character). Initially, nothing matches.
[start]
1 1 1 1 1
As we consume characters trying to match, the state is shifted left (which shifts a zero in
to the bottom bit, bit 0) and OR-ed with the table entry for the current character. The first character is a; shifting left and OR-ing in T[a] gives:
a
1 1 1 1 0
The 0 bit that was shifted in is preserved, because a current character of a can
begin a match of the pattern. For any other character, the bit would be have been set to
1.
The fact that bit 0 of the state is now 0 means that we started matching the pattern on
the current character; continuing, we get:
a b
1 1 1 0 1
...because the 0 bit has been shifted left - think of it as saying that we started matching the pattern 1 character ago - and T[b] has a 0 in the same position, telling
us that a seeing a b in the current position is good if we started matching 1 character
ago.
a b d
1 1 1 1 1
d can't match anywhere; all the bits get set back to 1.
a b d a
1 1 1 1 0
As before.
a b d a b
1 1 1 0 1
As before.
b d a b a
1 1 0 1 0
a is good if the match started either 2 characters ago or on the current character.
d a b a b
1 0 1 0 1
b is good if the match started either 1 or 3 characters ago. The 0 in bit 3 means
that we've almost matched the whole pattern...
a b a b a
1 1 0 1 0
...but the next character is a, which is no good if the match started 4 characters
ago. However, shorter matches might still be good.
b a b a b
1 0 1 0 1
Still looking good.
a b a b c
0 1 1 1 1
Finally, c is good if the match started 4 characters before. The fact that
a 0 has made it all the way to the top bit means that we have a match.
Sorry for not allowing anyone else to answer, but I'm pretty sure I've figured it out now.
The concept essential for groking the algorithm is the representation of match states (defined in the original post) in binary. The article in the original post explains it formally; I'll try my hand at doing so colloquially:
Let's have STR, which is a String created with characters from a given alphabet.
Let's represent STR with a set of binary digits: STR_BINARY. The algorithm requires for this representation to be backwards (so, the first letter corresponds to the last digit, second letter with the second-to-last digit, etc.).
Let's assume RANDOM refers to a String with random characters from the same alphabet STR is created from.
In STR_BINARY, a 0 at a given index indicates that, RANDOM matches STR from STR[0] to
STR[(index of letter in STR that the 0 in STR_BINARY corresponds to)]. Empty spaces count as matches. A 1 indicates that RANDOM does not match STR in inside those same boundaries.
The algorithm becomes simpler to learn once this is understood.

Resources