Understanding the algorithm for pattern matching using an LCP array


Foreword: My question is mainly an algorithmic question, so even if you are not familiar with suffix and LCP arrays you can probably help me.
In this paper it is described how to efficiently use suffix and LCP arrays for string pattern matching.
I understand how SA and LCP work and how the algorithm's runtime can be improved from O(P*log(N)) (where P is the length of the pattern and N is the length of the string) to O(P+log(N)) (thanks to Chris Eelmaa's answer here and jogojapan's answer here).
I was trying to go through the algorithm in figure 4 which explains the usage of LLcp and RLcp. But I have problems understanding how it works.
The algorithm (taken from the source):
Explanation of the used variable names:
lcp(v,w) : Length of the longest common prefix of v and w
W = w0..wP-1 : pattern of length P
A = a0..aN-1 : the text (length N)
Pos[0..N-1] : suffix array
L_W : index (in A) of first occurrence of the matched pattern
M : middle index of current substring
L : lower bound
R : upper bound
Lcp : array of size N-2 such that Lcp[M] = lcp(A_Pos[L_M], A_Pos[M]) where L_M is the lower bound of the unique interval with M in the middle
Rcp : array of size N-2 such that Rcp[M] = lcp(A_Pos[R_M], A_Pos[M]) where R_M is the upper bound of the unique interval with M in the middle
Now I want to try the algorithm using the following example (partly taken from here):
SA | LCP | Suffix entry
-----------------------
5 | N/A | a
3 | 1 | ana
1 | 3 | anana
0 | 0 | banana
4 | 0 | na
2 | 2 | nana
A = "banana" ; N = 6
W = "ban" ; P = 3
I want to try to match a string, say ban and would expect the algorithm to return 0 as L_W.
Here is how I would step through the algorithm:
l = lcp("a", "ban") = 0
r = lcp("nana", "ban") = 0
if 0 = 3 or 'b' =< 'a' then // which is NOT the case for both conditions
    L_W = 0
else if 0 < 3 or 'b' =< 'n' then // which is the case for both conditions
    L_W = 6 // which means 'not found'
...
...
I feel like I am missing something but I can't find out what. Also I am wondering how the precomputed LCP array can be used instead of calling lcp(v,w).

I believe there was an error.
The first condition is fairly easy to understand. When the LCP length equals the pattern length, we are done. When the pattern is smaller than or equal to the smallest suffix, the only choice is that smallest suffix.
The second condition is wrong. We can prove it by contradiction. r < P || Wr <= a... means r >= P && Wr > a... If r >= P, how can we have Lw = N (not found), since we already have a common prefix of length r?
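To make the boundary checks concrete, here is a minimal Python sketch of the plain O(P*log(N)) search on the banana example, without the Llcp/Rlcp speed-up. It is an illustration under assumptions, not the paper's figure: the names `build_suffix_array`, `le_p`, `find_lw` and `search` are made up here, the suffix array is built naively, the text is assumed to be non-empty, and `find_lw` returns an index into Pos, which `search` maps back to a position in A.

```python
def build_suffix_array(text):
    # Naive construction, fine for short demo strings.
    return sorted(range(len(text)), key=lambda i: text[i:])


def le_p(pattern, suffix):
    # "pattern <= suffix" when only the first P characters are compared.
    return pattern <= suffix[:len(pattern)]


def find_lw(text, pos, pattern):
    # Smallest k with pattern <= (first P chars of) text[pos[k]:], or N if none.
    n = len(text)
    if le_p(pattern, text[pos[0]:]):
        return 0   # pattern is no larger than the smallest suffix
    if not le_p(pattern, text[pos[n - 1]:]):
        return n   # pattern is larger than every suffix: not found
    left, right = 0, n - 1
    while right - left > 1:
        mid = (left + right) // 2
        if le_p(pattern, text[pos[mid]:]):
            right = mid
        else:
            left = mid
    return right


def search(text, pattern):
    # Position in the text where the pattern occurs, or -1.
    pos = build_suffix_array(text)
    k = find_lw(text, pos, pattern)
    if k < len(text) and text[pos[k]:pos[k] + len(pattern)] == pattern:
        return pos[k]
    return -1
```

With A = "banana" and W = "ban", neither boundary check fires, the binary search stops at index 3 of Pos, and `search` returns Pos[3] = 0.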


Generating correct phrases from PEG grammars

I wrote a PEG parser generator just for fun (I will publish it on NPM some time), and thought it would be easy to add a randomised phrase generator on top of it. The idea is to automatically get correct phrases, given a grammar. So I set the following rules to generate strings from each type of parsers :
Sequence p1 p2 ... pn : Generate a phrase for each subparser and return the concatenation.
Alternative p1 | p2 | ... | pn : Randomly pick a subparser and generate a phrase with it.
Repetition p{n, m} : Pick a number x in [n, m] (or [n, n+2] if m === Infinity) and return a concatenation of x generated phrases from p.
Terminal : Just return the terminal literal.
When I take the following grammar :
S: NP VP
PP: P NP
NP: Det N | Det N PP | 'I'
VP: V NP | VP PP
V: 'shot' | 'killed' | 'wounded'
Det: 'an' | 'my'
N: 'elephant' | 'pajamas' | 'cat' | 'dog'
P: 'in' | 'outside'
It works great. Some examples :
my pajamas killed my elephant
an pajamas wounded my pajamas in my pajamas
an dog in I wounded my cat in I outside my elephant in my elephant in an pajamas outside an cat
I wounded my pajamas in my dog
This grammar has a recursion (PP: P NP > NP: Det N PP). When I take this other recursive grammar, for math expression this time :
expr: term (('+' | '-') term)*
term: fact (('*' | '/') fact)*
fact: '1' | '(' expr ')'
Almost one time in two, I get a "Maximum call stack size exceeded" error (in NodeJS). The other half of the time, I get correct expressions :
( 1 ) * 1 + 1
( ( 1 ) / ( 1 + 1 ) - ( 1 / ( 1 * 1 ) ) / ( 1 / 1 - 1 ) ) * 1
( ( ( 1 ) ) )
1
1 / 1
I guess the recursive production for fact gets called too often, too deep in the call stack, and this makes the whole thing just blow up.
How can I make my approach less naive in order to avoid those cases that explode the call stack ? Thank you.

Of course if a grammar describes arbitrarily long inputs, you can easily end up in a very deep recursion. A simple way to avoid this trap is keep a priority queue of partially expanded sentential forms where the key is length. Remove the shortest and replace each non-terminal in each possible way, emitting those that are now all terminals and adding the rest back onto the queue. You might also want to maintain an "already emitted" set to avoid emitting duplicates. If the grammar doesn't have anything like epsilon productions where a sentential form derives a shorter string, then this method produces all strings described by the grammar in non-decreasing length order. That is, once you've seen an output of length N, all strings of length N-1 and shorter have already appeared.
Since OP asked about details, here's an implementation for the expression grammar. It's simplified by rewriting the PEG as a CFG.
import heapq
def run():
    g = {
        '<expr>': [
            ['<term>'],
            ['<term>', '+', '<expr>'],
            ['<term>', '-', '<expr>'],
        ],
        '<term>': [
            ['<fact>'],
            ['<fact>', '*', '<term>'],
            ['<fact>', '/', '<term>'],
        ],
        '<fact>': [
            ['1'],
            ['(', '<expr>', ')']
        ],
    }
    gen(g)
def is_terminal(s):
    for sym in s:
        if sym.startswith('<'):
            return False
    return True
def gen(g, lim=10000):
    q = [(1, ['<expr>'])]
    n = 0
    while n < lim:
        _, s = heapq.heappop(q)
        # print("pop: " + ''.join(s))
        a = []
        b = s.copy()
        while b:
            sym = b.pop(0)
            if sym.startswith('<'):
                for rhs in g[sym]:
                    s_new = a.copy()
                    s_new.extend(rhs)
                    s_new.extend(b)
                    if is_terminal(s_new):
                        print(''.join(s_new))
                        n += 1
                    else:
                        # print("push: " + ''.join(s_new))
                        heapq.heappush(q, (len(s_new), s_new))
                break  # only generate leftmost derivations
            a.append(sym)
run()
Uncomment the extra print()s to see heap activity. Some example output:
1
(1)
1*1
1/1
1+1
1-1
((1))
(1*1)
(1/1)
(1)*1
(1)+1
(1)-1
(1)/1
(1+1)
(1-1)
1*(1)
1*1*1
1*1/1
1+(1)
1+1*1
1+1/1
1+1+1
1+1-1
1-(1)
1-1*1
1-1/1
1-1+1
1-1-1
1/(1)
1/1*1
1/1/1
1*1+1
1*1-1
1/1+1
1/1-1
(((1)))
((1*1))
((1/1))
((1))*1
((1))+1
((1))-1
((1))/1
((1)*1)
((1)+1)
((1)-1)
((1)/1)
((1+1))
((1-1))
(1)*(1)
(1)*1*1
(1)*1/1
(1)+(1)
(1)+1*1

Boyer-Moore Galil Rule

I was implementing the Boyer-Moore Algorithm for substring search in Python when I learned about the Galil Rule. I've looked around online for the Galil Rule but I haven't found anything more than a couple of sentences, and I cannot get access to the original paper. How can I implement this into my current algorithm?
i = 0
while i < (N - M + 1):
    skip = 0
    for j in reversed(range(0, M)):
        if pattern[j] != text[i + j]:
            skip = max(1, j - offsets[text[i+j]])
            break
    if skip == 0:
        return i
    i += skip
return -1
Notes:
offsets[c] = -1 if c is not in the pattern
offsets[c] = last index of c in the pattern
Example:
aaabcb
offsets[a] = 2
offsets[b] = 5
offsets[c] = 4
offsets[d] = -1
The few sentences I have found have said to keep track of when the first mismatch occurs in my inner loop (j, if the if-statement inside the inner loop is True) and the position in which I started the comparisons (i + j, in my case). I understand the intuition that I've already checked all the indices in between those, so I shouldn't have to do those comparisons again. I just don't understand how to connect the dots and arrive at an implementation.

The Galil rule is about exploiting periodicity in the pattern to reduce comparisons. Say you have a pattern abcabcab. It's periodic with smallest period abc. In general, a pattern P is periodic if there's a string U such that P is a prefix of UUUUU.... (In the above example, abcabcab is clearly a prefix of the repeating string abc = U.) We call the shortest such string the period of P. Let the length of that period be k (in the example above k = 3 since U = abc).
First of all, keep in mind that the Galil rule applies only after you've found an occurrence of P in the text. When you do that, the Galil rule says that you could shift by k (the periodicity of the pattern) and you only have to compare the last k characters of the now shifted pattern to determine if there was a match.
Here's an example:
P = ababa
T = bababababab
U = ab
k = 2
First occurrence: b[ababa]babab. Now you can shift by k = 2 and you only have to check the last two characters of the pattern:
T = bababa[ba]bab
P = aba[ba] // Only need to compare chars inside brackets for next match.
The rest of P must match since P is periodic and you shifted it by its period k from an existing match (this is crucial) so the repeating parts will nicely line up.
If you've found another match, just repeat. If you find a mismatch, however, you revert to the standard Boyer-Moore algorithm until you find another match. Remember, you can only use the Galil rule when you find a match and you shift by k (otherwise the pattern is not guaranteed to line up with the previous occurrence).
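A rough Python sketch of how the rule can be attached to a bad-character-only Boyer-Moore loop like the one in the question. Several things here are assumptions made for illustration: the function collects every occurrence instead of returning the first one, the period is found by brute force, and the names `smallest_period`, `find_all` and `known` are hypothetical.

```python
def smallest_period(pattern):
    # Brute force: smallest k > 0 such that shifting by k lines the pattern up with itself.
    m = len(pattern)
    return next(k for k in range(1, m + 1) if pattern[k:] == pattern[:m - k])


def find_all(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = {c: idx for idx, c in enumerate(pattern)}  # last index of each character
    k = smallest_period(pattern)
    matches = []
    i = 0
    known = 0  # leading pattern characters already known to match at alignment i
    while i <= n - m:
        j = m - 1
        # Compare right to left, but never re-check the first `known` characters.
        while j >= known and pattern[j] == text[i + j]:
            j -= 1
        if j < known:
            matches.append(i)
            i += k          # Galil rule: shift by the period after a match
            known = m - k   # only the last k characters need checking next time
        else:
            i += max(1, j - last.get(text[i + j], -1))  # plain bad-character shift
            known = 0       # after a mismatch nothing is known any more
    return matches
```

For the example above, `find_all("bababababab", "ababa")` reports the alignments 1, 3 and 5; after the first match each later alignment compares only two characters.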
Now, you might wonder, how to determine k for a given pattern P. You'll need to calculate the suffixes array N first, where N[i] will be the length of the longest common suffix of the prefix P[0, i] and P. (You can calculate the suffixes array by calculating the prefixes array Z on the reverse of P using the Z algorithm, as described here, for example.) Once you have the suffixes array, you can easily find k since it'll be the smallest k > 0 such that N[m - k - 1] == m - k (where m = |P|).
For example:
P = ababa
m = 5
N = [1, 0, 3, 0, 5]
k = 2 because N[m - k - 1] == N[5 - 2 - 1] == N[2] == 3 == 5 - k
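The same calculation as a small Python sketch. To stay short it builds the suffixes array N by direct comparison in O(m^2) rather than with the Z algorithm, and the function names are made up for this illustration. When no k < m satisfies the test, the sketch falls back to k = m, which is an assumption about the intended edge case.

```python
def suffix_lengths(p):
    # N[i] = length of the longest common suffix of p[0..i] and p (naive version).
    m = len(p)
    out = []
    for i in range(m):
        length = 0
        while length <= i and p[i - length] == p[m - 1 - length]:
            length += 1
        out.append(length)
    return out


def period_from_suffixes(p):
    # Smallest k > 0 with N[m - k - 1] == m - k; the whole length if there is none.
    m = len(p)
    n_arr = suffix_lengths(p)
    for k in range(1, m):
        if n_arr[m - k - 1] == m - k:
            return k
    return m
```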

The answer by @Lajos Nagy explains the idea of the Galil rule perfectly; however, there is a more straightforward way to calculate k:
Just use the prefix function of the KMP algorithm.
prefix[i] is the length of the longest proper prefix of P[0..i] which is also a suffix of P[0..i].
Then k = m - prefix[m-1].
This article has explained the details.
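A minimal sketch of that calculation in Python, assuming a non-empty pattern; the function names are chosen here for illustration.

```python
def prefix_function(p):
    # pi[i] = length of the longest proper prefix of p[0..i] that is also its suffix.
    pi = [0] * len(p)
    for i in range(1, len(p)):
        j = pi[i - 1]
        while j > 0 and p[i] != p[j]:
            j = pi[j - 1]
        if p[i] == p[j]:
            j += 1
        pi[i] = j
    return pi


def period(p):
    # k = m - prefix[m - 1]
    return len(p) - prefix_function(p)[-1]
```

For P = ababa the prefix function is [0, 0, 1, 2, 3], so k = 5 - 3 = 2, the same value as above.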

Number of Paths in a Triangle

I recently encountered a much more difficult variation of this problem, but realized I couldn't generate a solution for this very simple case. I searched Stack Overflow but couldn't find a resource that previously answered this.
You are given a triangle ABC, and you must compute the number of paths of certain length that start at and end at 'A'. Say our function f(3) is called, it must return the number of paths of length 3 that start and end at A: 2 (ABA,ACA).
I'm having trouble formulating an elegant solution. Right now, I've written a solution that generates all possible paths, but for larger lengths, the program is just too slow. I know there must be a nice dynamic programming solution that reuses sequences that we've previously computed but I can't quite figure it out. All help greatly appreciated.
My dumb code:
def paths(n, sequence):
    t = ['A','B','C']
    if len(sequence) < n:
        # try every node except the one we are currently on
        for node in set(t) - set(sequence[-1]):
            paths(n, sequence+node)
    else:
        if sequence[0] == 'A' and sequence[-1] == 'A':
            print(sequence)

Let PA(n) be the number of paths of length n (counting vertices, as in the question, so ABA has length 3) that start at A and end at A.
Let P!A(n) be the number of paths of length n, counted the same way, that start at B (or C) and end at A.
Then:
PA(1) = 1
PA(n) = 2 * P!A(n - 1)
P!A(1) = 0
P!A(2) = 1
P!A(n) = P!A(n - 1) + PA(n - 1)
= P!A(n - 1) + 2 * P!A(n - 2) (for n > 2) (substituting for PA(n-1))
We can solve the difference equations for P!A analytically, as we do for Fibonacci, by noting that (-1)^n and 2^n are both solutions of the difference equation, and then finding coefficients a, b such that P!A(n) = a*2^n + b*(-1)^n.
We end up with the equation P!A(n) = 2^n/6 + (-1)^n/3, and PA(n) being 2^(n-1)/3 - 2(-1)^n/3.
This gives us code:
def PA(n):
    # integer division: the numerator is always a multiple of 3
    return (pow(2, n-1) + 2*pow(-1, n-1)) // 3
for n in range(1, 30):
    print(n, PA(n))
Which gives output:
1 1
2 0
3 2
4 2
5 6
6 10
7 22
8 42
9 86
10 170
11 342
12 682
13 1366
14 2730
15 5462
16 10922
17 21846
18 43690
19 87382
20 174762
21 349526
22 699050
23 1398102
24 2796202
25 5592406
26 11184810
27 22369622
28 44739242
29 89478486

The trick is not to try to generate all possible sequences. Their number increases exponentially, so the time and memory required would be too great.
Instead, let f(n) be the number of sequences of length n beginning and ending with A, and let g(n) be the number of sequences of length n beginning with A but ending with B. To get things started, clearly f(1) = 1 and g(1) = 0. For n > 1 we have f(n) = 2g(n - 1), because the penultimate letter will be B or C and there are equal numbers of each. We also have g(n) = f(n - 1) + g(n - 1), because if a sequence begins with A and ends with B, the penultimate letter is either A or C.
These rules allow you to compute the numbers really quickly using memoization.
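The two rules translate almost literally into memoized Python. This is only a sketch: `functools.lru_cache` is one way to memoize, and for very large n an iterative loop over the same recurrences would avoid Python's recursion limit.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def f(n):
    # sequences of length n that begin and end with A
    return 1 if n == 1 else 2 * g(n - 1)


@lru_cache(maxsize=None)
def g(n):
    # sequences of length n that begin with A and end with B
    return 0 if n == 1 else f(n - 1) + g(n - 1)
```

Here f(3) is 2 (ABA, ACA) and f(5) is 6.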

My method is like this:
Define DP(l, end) = # of paths that end at end and have length l
Then DP(l,'A') = DP(l-1, 'B') + DP(l-1,'C'), and similarly for DP(l,'B') and DP(l,'C')
For the base case, i.e. l = 1, I return 1 if end is 'A' and 0 otherwise, so that all bigger states only count paths that start at 'A'
The answer is simply DP(n, 'A'), where n is the length
Below is sample code in C++. Calling it with 3 gives 2 as the answer; calling it with 5 gives 6 as the answer:
ABCBA, ACBCA, ABABA, ACACA, ABACA, ACABA
#include <bits/stdc++.h>
using namespace std;
int dp[500][500], n;
int DP(int l, int end){
    if(l<=0) return 0;
    if(l==1){
        if(end != 'A') return 0;
        return 1;
    }
    if(dp[l][end] != -1) return dp[l][end];
    if(end == 'A') return dp[l][end] = DP(l-1, 'B') + DP(l-1, 'C');
    else if(end == 'B') return dp[l][end] = DP(l-1, 'A') + DP(l-1, 'C');
    else return dp[l][end] = DP(l-1, 'A') + DP(l-1, 'B');
}
int main() {
    memset(dp,-1,sizeof(dp));
    scanf("%d", &n);
    printf("%d\n", DP(n, 'A'));
    return 0;
}
EDITED
To answer OP's comment below:
Firstly, DP (dynamic programming) is always about state.
Remember that here our state is DP(l,end), which represents the # of paths that have length l and end at end. To store states in a program we usually use an array, so dp[500][500] is nothing special, just the space to store the states DP(l,end) for all possible l and end (that's why I said that if you need a bigger length, change the size of the array).
But then you may ask: I understand the first dimension, which is for l, so 500 means l can be as large as 500, but what about the second dimension? I only need 'A', 'B', 'C', so why use 500?
Here is another trick (of C/C++): a char can be used as an int, and its value is its ASCII code. I do not remember the whole ASCII table, of course, but ASCII codes stay below 128, so a size around 300 is enough to cover all of them, including A(65), B(66), C(67).
So I just declare any size large enough to represent 'A','B','C' in the second dimension (that means 100 is actually more than enough, but I did not think that hard and declared 500, as they are almost the same in terms of order of magnitude).
You asked what DP[3][1] means: it means nothing, as I do not need or calculate the second dimension when it is 1. (Or one can think of the state dp(3,1) as having no physical meaning in our problem.)
In fact, I only ever use 65, 66 and 67.
So DP[3][65] means the # of paths of length 3 that end at char(65) = 'A'.

You can do better than the dynamic programming/recursion solution others have posted, for the given triangle and more general graphs. Whenever you are trying to compute the number of walks in a (possibly directed) graph, you can express this in terms of the entries of powers of a transfer matrix. Let M be a matrix whose entry m[i][j] is the number of paths of length 1 from vertex i to vertex j. For a triangle, the transfer matrix is
0 1 1
1 0 1.
1 1 0
Then M^n is a matrix whose i,j entry is the number of paths of length n from vertex i to vertex j. If A corresponds to vertex 1, you want the 1,1 entry of M^n.
Dynamic programming and recursion for the counts of paths of length n in terms of the paths of length n-1 are equivalent to computing M^n with n multiplications, M * M * M * ... * M, which can be fast enough. However, if you want to compute M^100, instead of doing 100 multiplies, you can use repeated squaring: Compute M, M^2, M^4, M^8, M^16, M^32, M^64, and then M^64 * M^32 * M^4. For larger exponents, the number of multiplies is about c log_2(exponent).
Instead of using that a path of length n is made up of a path of length n-1 and then a step of length 1, this uses that a path of length n is made up of a path of length k and then a path of length n-k.
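A short Python sketch of the repeated-squaring idea with plain lists, so it needs no libraries. The helper names are invented for this example. Note the convention: here length counts steps (edges), so the question's f(n), which counts vertices, corresponds to `closed_walks(n - 1)`.

```python
def mat_mul(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, exponent):
    # Repeated squaring: about log2(exponent) squarings instead of exponent multiplies.
    size = len(m)
    result = [[int(i == j) for j in range(size)] for i in range(size)]  # identity
    while exponent > 0:
        if exponent & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        exponent >>= 1
    return result


TRIANGLE = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]


def closed_walks(steps):
    # Number of walks with the given number of steps that start and end at A (vertex 0).
    return mat_pow(TRIANGLE, steps)[0][0]
```

For example, `closed_walks(2)` is 2, matching ABA and ACA.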

We can solve this with a for loop, although Anonymous described a closed form for it.
function f(n){
    var as = 0, abcs = 1;
    for (n=n-3; n>0; n--){
        as = abcs - as;
        abcs *= 2;
    }
    return 2*(abcs - as);
}
Here's why:
Look at one strand of the decision tree (the other one is symmetrical):
A
B C...
A C
B C A B
A C A B B C A C
B C A B B C A C A C A B B C A B
Num A's Num ABC's (starting with first B on the left)
0 1
1 (1-0) 2
1 (2-1) 4
3 (4-1) 8
5 (8-3) 16
11 (16-5) 32
Clearly, we can't use the strands that end with the A's...

You can write a recursive brute force solution and then memoize it (aka top down dynamic programming). Recursive solutions are more intuitive and easy to come up with. Here is my version:
from functools import cache
# search space (we have triangle with nodes)
nodes = ["A", "B", "C"]
@cache  # memoize!
def recurse(length, steps):
    # if length of the path is n and the last node is "A", then it's
    # a valid path and we can count it.
    if length == n and ((steps-1)%3 == 0 or (steps+1)%3 == 0):
        return 1
    # we don't want paths having len > n.
    if length > n:
        return 0
    # from each position, we have two possibilities, either go to next
    # node or previous node. Total paths will be sum of both the
    # possibilities. We do this recursively.
    return recurse(length+1, steps+1) + recurse(length+1, steps-1)

Find number of binary numbers with certain constraints

This is more of a puzzle than a coding problem. I need to find how many binary numbers can be generated satisfying certain constraints. The inputs are
(integer) Len - Number of digits in the binary number
(integer) x
(integer) y
The binary number has to be such that taking any x adjacent digits from the binary number should contain at least y 1's.
For example -
Len = 6, x = 3, y = 2
0 1 1 0 1 1 - Length is 6. Take any 3 adjacent digits from this and
there will be 2 1's
I had this C# coding question posed to me in an interview and I cannot figure out any algorithm to solve this. Not looking for code (although it's welcome), any sort of help, pointers are appreciated

This problem can be solved using dynamic programming. The main idea is to group the binary numbers according to the last x-1 bits and the length of each binary number. If appending a bit sequence to one number yields a number satisfying the constraint, then appending the same bit sequence to any number in the same group results in a number satisfying the constraint also.
For example, x = 4, y = 2. both of 01011 and 10011 have the same last 3 bits (011). Appending a 0 to each of them, resulting 010110 and 100110, both satisfy the constraint.
Here is pseudo code:
mask = (1<<(x-1)) - 1
count[0][0] = 1
for(i = 0; i < Len; ++i) {
    for(j = 0; j < 1<<i && j < 1<<(x-1); ++j) {
        // append 1: allowed while the string is shorter than x, or if the new x-bit window has >= y ones
        if(i<x-1 || count1Bit(j*2+1)>=y)
            count[i+1][(j*2+1)&mask] += count[i][j];
        // append 0: same test on the window that ends in 0
        if(i<x-1 || count1Bit(j*2)>=y)
            count[i+1][(j*2)&mask] += count[i][j];
    }
}
answer = 0
for(j = 0; j < 1<<(x-1); ++j)
    answer += count[Len][j];
This algorithm assumes that Len >= x. The time complexity is O(Len*2^x).
EDIT
The count1Bit(j) function counts the number of 1 in the binary representation of j.
The only input to this algorithm are Len, x, and y. It starts from an empty binary string [length 0, group 0], and iteratively tries to append 0 and 1 until length equals to Len. It also does the grouping and counting the number of binary strings satisfying the 1-bits constraint in each group. The output of this algorithm is answer, which is the number of binary strings (numbers) satisfying the constraints.
For a binary string in group [length i, group j], appending 0 to it results in a binary string in group [length i+1, group (j*2)%(2^(x-1))]; appending 1 to it results in a binary string in group [length i+1, group (j*2+1)%(2^(x-1))].
Let count[i,j] be the number of binary strings in group [length i, group j] satisfying the 1-bits constraint. If there are at least y 1 in the binary representation of j*2, then appending 0 to each of these count[i,j] binary strings yields a binary string in group [length i+1, group (j*2)%(2^(x-1))] which also satisfies the 1-bit constraint. Therefore, we can add count[i,j] into count[i+1,(j*2)%(2^(x-1))]. The case of appending 1 is similar.
The condition i<x-1 in the above algorithm is to keep the binary strings growing when length is less than x-1.
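The grouping idea can be sketched in runnable Python as follows. This is an illustration rather than the pseudo code verbatim: it keeps only the current length's groups in a dictionary instead of a two-dimensional array, and the name `count_valid` is made up.

```python
def count_valid(length, x, y):
    # Group strings by their last x-1 bits; counts[state] = number of valid strings in that group.
    mask = (1 << (x - 1)) - 1
    counts = {0: 1}  # the empty string
    for i in range(length):
        nxt = {}
        for state, ways in counts.items():
            for bit in (0, 1):
                window = state * 2 + bit  # last x bits once the string is long enough
                # No full window exists yet while the new length is below x.
                if i < x - 1 or bin(window).count("1") >= y:
                    key = window & mask
                    nxt[key] = nxt.get(key, 0) + ways
        counts = nxt
    return sum(counts.values())
```

For Len = 6, x = 3, y = 2 this gives 13.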

Using the example of LEN = 6, X = 3 and Y = 2...
Build an exhaustive bit pattern generator for X bits. A simple binary counter can do this. For example, if X = 3
then a counter from 0 to 7 will generate all possible bit patterns of length 3.
The patterns are:
000
001
010
011
100
101
110
111
Verify the adjacency requirement as the patterns are built. Reject any patterns that do not qualify.
Basically this boils down to rejecting any pattern containing fewer than 2 '1' bits (Y = 2). The list prunes down to:
011
101
110
111
For each member of the pruned list, add a '1' bit and retest the first X bits. Keep the new pattern if it passes the
adjacency test. Do the same with a '0' bit. For example this step proceeds as:
1011 <== Keep
1101 <== Keep
1110 <== Keep
1111 <== Keep
0011 <== Reject
0101 <== Reject
0110 <== Keep
0111 <== Keep
Which leaves:
1011
1101
1110
1111
0110
0111
Now repeat this process until the pruned set is empty or the member lengths become LEN bits long. In the end
the only patterns left are:
111011
111101
111110
111111
110110
110111
101101
101110
101111
011011
011101
011110
011111
Count them up and you are done.
Note that you only need to test the first X bits on each iteration because all the subsequent patterns were verified in prior steps.

Considering that input values are variable and wanted to see the actual output, I used recursive algorithm to determine all combinations of 0 and 1 for a given length :
private static void BinaryNumberWithOnes(int n, int dump, int ones, string s = "")
{
    if (n == 0)
    {
        if (BinaryWithoutDumpCountContainsnumberOfOnes(s, dump,ones))
            Console.WriteLine(s);
        return;
    }
    BinaryNumberWithOnes(n - 1, dump, ones, s + "0");
    BinaryNumberWithOnes(n - 1, dump, ones, s + "1");
}
and BinaryWithoutDumpCountContainsnumberOfOnes to determine if the binary number meets the criteria (dump is the window size x, ones is y)
private static bool BinaryWithoutDumpCountContainsnumberOfOnes(string binaryNumber, int dump, int ones)
{
    int current = 0;
    int count = binaryNumber.Length;
    // slide a window of `dump` digits over the string, including the last window
    while (current + dump <= count)
    {
        // count the 1s inside the current window
        var fail = binaryNumber.Substring(current, dump).Replace("0", "").Length < ones;
        if (fail)
        {
            return false;
        }
        current++;
    }
    return true;
}
Calling BinaryNumberWithOnes(6, 3, 2) will output all binary numbers that match
011011
011101
011110
011111
101101
101110
101111
110110
110111
111011
111101
111110
111111

Sounds like a nested for loop would do the trick. Pseudocode (not tested).
value = '0101010111110101010111' // change this line to format you would need
for (i = 0; i <= (Len-x); i++) { // loop over every window start, left to right
    kount = 0
    for (j = i; j < (i+x); j++) { // count '1' bits in the next 'x' bits
        kount += value[j] // add 0 or 1
    }
    if kount < y then return fail // this window has too few 1s
}
return success

The naive approach would be a tree-recursive algorithm.
Our recursive method would slowly build the number up, e.g. it would start at xxxxxx, return the sum of a call with 1xxxxx and 0xxxxx, which themselves will return the sum of a call with 10, 11 and 00, 01, etc. except if the x/y conditions are NOT satisfied for the string it would build by calling itself it does NOT go down that path, and if you are at a terminal condition (built a number of the correct length) you return 1. (note that since we're building the string up from left to right, you don't have to check x/y for the entire string, just also considering the newly added digit!)
By returning a sum over all calls then all of the returned 1s will pool together and be returned by the initial call, equalling the number of constructed strings.
No idea what the big O notation for time complexity is for this one, it could be as bad as O(2^n)*O(checking x/y conditions) but it will prune lots of branches off the tree in most cases.
UPDATE: One insight I had is that all branches of the recursive tree can be 'merged' if they have identical last x digits so far, because then the same checks would be applied to all digits hereafter so you may as well double them up and save a lot of work. This now requires building the tree explicitly instead of implicitly via recursive calls, and maybe some kind of hashing scheme to detect when branches have identical x endings, but for large length it would provide a huge speedup.
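One way to get the merging described in the update without building the tree explicitly is to memoize the recursion on the digits that still matter. The sketch below is an assumption about how that could look, not the author's code: it keys the cache on the number of digits built so far and the last x-1 digits, and the names `count_strings` and `extend` are invented.

```python
from functools import lru_cache


def count_strings(length, x, y):
    @lru_cache(maxsize=None)
    def extend(built, tail):
        # built: digits placed so far; tail: the last (up to) x-1 digits as a tuple
        if built == length:
            return 1
        total = 0
        for bit in (0, 1):
            window = tail + (bit,)
            # Only a full window of x digits can violate the constraint.
            if len(window) == x and sum(window) < y:
                continue  # prune this branch
            new_tail = window[-(x - 1):] if x > 1 else ()
            total += extend(built + 1, new_tail)
        return total

    return extend(0, ())
```

Branches that share the same tail and length hit the same cache entry, so the work is bounded by roughly Len * 2^(x-1) states instead of 2^Len strings.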

My approach is to start by getting the all binary numbers with the minimum number of 1's, which is easy enough, you just get every unique permutation of a binary number of length x with y 1's, and cycle each unique permutation "Len" times. By flipping the 0 bits of these seeds in every combination possible, we are guaranteed to iterate over all of the binary numbers that fit the criteria.
from itertools import permutations, cycle, combinations
def uniq(x):
    d = {}
    for i in x:
        d[i] = 1
    return d.keys()
def findn(l, x, y):
    window = []
    for i in range(y):
        window.append(1)
    for i in range(x-y):
        window.append(0)
    perms = uniq(permutations(window))
    seeds = []
    for p in perms:
        pr = cycle(p)
        seeds.append([next(pr) for i in range(l)])  ### a seed is a binary number fitting the criteria with minimum 1 bits
    bin_numbers = []
    for seed in seeds:
        if seed in bin_numbers: continue
        indexes = [i for i, x in enumerate(seed) if x == 0]  ### get indexes of 0 "bits"
        exit = False
        for i in range(len(indexes)+1):
            if exit: break
            for combo in combinations(indexes, i):  ### combinatorically flipping the zero bits in the seed
                new_num = seed[:]
                for index in combo: new_num[index] += 1
                if new_num in bin_numbers:
                    ### if our new binary number has been seen before
                    ### we can break out since we are doing a depth first traversal
                    exit = True
                    break
                else:
                    bin_numbers.append(new_num)
    print(len(bin_numbers))
findn(6,3,2)
Growth of this approach is definitely exponential, but I thought I'd share my approach in case it helps someone else get to a lower complexity solution...

Set some condition and introduce simple help variable.
L = 6, x = 3 , y = 2 introduce d = x - y = 1
Condition: if the list made of the hypothetical value of the next digit and the values of the previous x - 1 elements has a number of 0-digits > d, the next digit must be 1; otherwise add two branches, one with 1 and one with 0 as the concrete value.
Start: check(Condition) => both 0,1 due to number of total zeros in the 0-count check.
Empty => add 0 and 1
Step 1:Check(Condition)
0 (number of next value if 0 and previous x - 1 zeros > d(=1)) -> add 1 to sequence
1 -> add both 0,1 in two different branches
Step 2: check(Condition)
01 -> add 1
10 -> add 1
11 -> add 0,1 in two different branches
Step 3:
011 -> add 0,1 in two branches
101 -> add 1 (the next value if 0 and prev x-1 seq would be 010, so we prune and set only 1)
110 -> add 1
111 -> add 0,1
Step 4:
0110 -> obviously 1
0111 -> both 0,1
1011 -> both 0,1
1101 -> 1
1110 -> 1
1111 -> 0,1
Step 5:
01101 -> 1
01110 -> 1
01111 -> 0,1
10110 -> 1
10111 -> 0,1
11011 -> 0,1
11101 -> 1
11110 -> 1
11111 -> 0,1
Step 6 (Finish):
011011
011101
011110
011111
101101
101110
101111
110110
110111
111011
111101
111110
111111
Now count. I've tested for L = 6, x = 4 and y = 2 too, but consider to check the algorithm for special cases and extended cases.
Note: I'm pretty sure some algorithm with Disposition Theory bases should be a really massive improvement of my algorithm.

So in a series of Len binary digits, you are looking for a x-long segment that contains y 1's ..
See the execution: http://ideone.com/xuaWaK
Here's my Algorithm in Java:
import java.util.*;
import java.lang.*;
class Main
{
    public static ArrayList<String> solve (String input, int x, int y)
    {
        int s = 0;
        ArrayList<String> matches = new ArrayList<String>();
        String segment = null;
        for (int i=0; i<(input.length()-x); i++)
        {
            s = 0;
            segment = input.substring(i,(i+x));
            System.out.print(" i: "+i+" ");
            for (char c : segment.toCharArray())
            {
                System.out.print("*");
                if (c == '1')
                {
                    s = s + 1;
                }
            }
            if (s == y)
            {
                matches.add(segment);
            }
            System.out.println();
        }
        return matches;
    }
    public static void main (String [] args)
    {
        String input = "011010101001101110110110101010111011010101000110010";
        int x = 6;
        int y = 4;
        ArrayList<String> matches = null;
        matches = solve (input, x, y);
        for (String match : matches)
        {
            System.out.println(" > "+match);
        }
        System.out.println(" Number of matches is " + matches.size());
    }
}

The number of patterns of length X that contain at least Y 1 bits is countable. For the case x == y we know there is exactly one pattern of the 2^x possible patterns that meets the criteria. For smaller y we need to sum up the number of patterns which have excess 1 bits and the number of patterns that have exactly y bits.
choose(n, k) = n! / (k! (n - k)!)
numPatterns(x, y) {
    total = 0
    for (int j = x; j >= y; j--)
        total += choose(x, j)
    return total
}
For example :
X = 4, Y = 4 : 1 pattern
X = 4, Y = 3 : 1 + 4 = 5 patterns
X = 4, Y = 2 : 1 + 4 + 6 = 11 patterns
X = 4, Y = 1 : 1 + 4 + 6 + 4 = 15 patterns
X = 4, Y = 0 : 1 + 4 + 6 + 4 + 1 = 16
(all possible patterns have at least 0 1 bits)
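The window count above can be checked with a few lines of Python. This sketch assumes Python 3.8 or later for `math.comb`, and the name `num_patterns` simply mirrors the pseudo code.

```python
from math import comb


def num_patterns(x, y):
    # Number of x-bit patterns with at least y one-bits: sum of C(x, j) for j = y..x.
    return sum(comb(x, j) for j in range(y, x + 1))
```

It reproduces the table for X = 4: 1, 5, 11, 15 and 16 patterns for Y = 4 down to Y = 0.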
So let M be the number of X length patterns that meet the Y criteria. Now, that X length pattern is a subset of N bits. There are (N - x + 1) "window" positions for the sub pattern, and 2^N total patterns possible. If we start with any of our M patterns, we know that appending a 1 to the right and shifting to the next window will result in one of our known M patterns. The question is, how many of the M patterns can we add a 0 to, shift right, and still have a valid pattern in M?
Since we are adding a zero, we have to be either shifting away from a zero, or we have to already be in an M where we have an excess of 1 bits. To flip that around, we can ask how many of the M patterns have exactly Y bits and start with a 1. Which is the same as "how many patterns of length X-1 have Y-1 bits", which we know how to answer:
shiftablePatternCount = M - choose(X-1, Y-1)
So starting with M possibilities, we are going to increase by shiftablePatternCount when we slide to the right. All patterns in the new window are in the set of M, with some patterns now duplicated. We are going to shift a number of times to fill up N by (N - X), each time increasing the count by shiftablePatternCount, so the full answer should be :
totalCountOfMatchingPatterns = M + (N - X)*shiftablePatternCount
edit - realized a mistake. I need to count the duplicates of the shiftable patterns that are generated. I think that's doable. (draft still)

I am not sure about my answer, but here is my view. Just take a look at it:
Len = 4,
x = 3,
y = 2.
I took just two patterns, because every window must contain at least y 1's.
X 1 1 X
1 X 1 X
X - represents "don't care"
Now the count for the 1st pattern is 2 * 1 * 1 * 2 = 4
and for the 2nd pattern 1 * 2 * 1 * 2 = 4,
but 2 strings are common to both patterns, so subtract 2. That gives a total of 6 strings which satisfy the condition.

I happen to be using an algorithm similar to your problem, and while trying to find a way to improve it I found your question. So I will share it:
using System;
using System.Linq;

// Brute force: count the length-bit patterns that contain exactly oneBits 1 bits.
static int GetCount(int length, int oneBits)
{
    int result = 0;
    int count = 1 << length; // 2^length patterns (assumes length < 31)
    for (int i = 0; i < count; i++) // start at 0 so the all-zero pattern is included
    {
        string str = Convert.ToString(i, 2).PadLeft(length, '0');
        if (str.Count(c => c == '1') == oneBits)
        {
            result++;
        }
    }
    return result;
}
Not very efficient, I think, but an elegant solution.
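As a hedged aside that is not from the answer: the number of length-bit patterns with exactly k 1 bits is the binomial coefficient choose(length, k), so the enumeration can be replaced by a closed form. The Python sketch below compares the two on small inputs; both function names are illustrative.

```python
from math import comb

def count_exact_bruteforce(length, one_bits):
    """Enumerate all 2**length patterns and count those with exactly one_bits ones."""
    return sum(1 for i in range(2 ** length) if bin(i).count("1") == one_bits)

def count_exact(length, one_bits):
    """Closed form: choose which one_bits of the length positions hold a 1."""
    return comb(length, one_bits)

# Cross-check the closed form against the enumeration for small sizes.
for length in range(1, 9):
    for k in range(length + 1):
        assert count_exact(length, k) == count_exact_bruteforce(length, k)

print(count_exact(4, 2))  # 6
```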

Find the minimum number of operations required to compute a number using a specified range of numbers

Let me start with an example -
I have a range of numbers from 1 to 9. And let's say the target number that I want is 29.
In this case the minimum number of operations required would be 2: (9*3)+2 = 29. Similarly, for 18 the minimum number of operations is 1 (9*2 = 18).
I can use any of the 4 arithmetic operators - +, -, / and *.
How can I programmatically find out the minimum number of operations required?
Thanks in advance for any help provided.
clarification: integers only, no decimals allowed mid-calculation. i.e. the following is not valid (from comments below): ((9/2) + 1) * 4 == 22
I must admit I didn't think about this thoroughly, but for my purpose it doesn't matter if decimal numbers appear mid-calculation. ((9/2) + 1) * 4 == 22 is valid. Sorry for the confusion.

For the special case where set Y = [1..9] and n > 0:
n <= 9 : 0 operations
n <=18 : 1 operation (+)
otherwise : Remove any divisor found in Y. If this is not enough, do a recursion on the remainder for all offsets -9 .. +9. Offset 0 can be skipped as it has already been tried.
Notice how division is not needed in this case. For other Y this does not hold.
This algorithm is exponential in log(n). The exact analysis is a job for somebody with more knowledge about algebra than I.
For more speed, add pruning to eliminate some of the search for larger numbers.
Sample code:
def findop(n, maxlen=9999):
    # Return a short postfix list of numbers and operations
    # Simple solution to small numbers
    if n <= 9: return [n]
    if n <= 18: return [9, n-9, '+']
    # Find direct multiply
    x = divlist(n)
    if len(x) > 1:
        mults = len(x)-1
        x[-1:] = findop(x[-1], maxlen-2*mults)
        x.extend(['*'] * mults)
        return x
    shortest = 0
    # Try offsets 1..9 and -1..-9 (lists are needed to concatenate ranges in Python 3)
    for o in list(range(1, 10)) + list(range(-1, -10, -1)):
        x = divlist(n-o)
        if len(x) == 1: continue
        mults = len(x)-1
        # We spent len(divlist) + mults + 2 fields for offset.
        # The last number is expanded by the recursion, so it doesn't count.
        recursion_maxlen = maxlen - len(x) - mults - 2 + 1
        if recursion_maxlen < 1: continue
        x[-1:] = findop(x[-1], recursion_maxlen)
        x.extend(['*'] * mults)
        if o > 0:
            x.extend([o, '+'])
        else:
            x.extend([-o, '-'])
        if shortest == 0 or len(x) < shortest:
            shortest = len(x)
            maxlen = shortest - 1
            solution = x[:]
    if shortest == 0:
        # Fake solution, it will be discarded
        return '#' * (maxlen+1)
    return solution

def divlist(n):
    # Split n into factors 9..2, largest first; any remainder > 1 goes last
    l = []
    for d in range(9, 1, -1):
        while n % d == 0:
            l.append(d)
            n = n // d  # integer division keeps n an int in Python 3
    if n > 1: l.append(n)
    return l
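The sample code returns its result as a postfix (reverse Polish) list of numbers and operator tokens; for instance, the list [9, 3, '*', 2, '+'] encodes (9*3)+2. A small evaluator can replay such a list on a stack to check its value and count its operations. This is a sketch and not part of the answer: eval_postfix and count_ops are made-up helper names.

```python
import operator

# Operators that appear in the postfix lists produced above.
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def eval_postfix(tokens):
    """Evaluate a postfix list of ints and '+', '-', '*' tokens."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(tok)
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]

def count_ops(tokens):
    """Number of operator tokens in the list."""
    return sum(1 for tok in tokens if tok in OPS)

print(eval_postfix([9, 3, '*', 2, '+']))  # 29
print(count_ops([9, 3, '*', 2, '+']))     # 2
```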

The basic idea is to test all possibilities with k operations, for k starting from 0. Imagine you create a tree of height k that branches for every possible new operation with operand (4*9 branches per level). You need to traverse and evaluate the leaves of the tree for each k before moving to the next k.
I didn't test this pseudo-code:
for every k from 0 to infinity
    for every n from 1 to 9
        if compute(n,0,k):
            return k

boolean compute(n,j,k):
    if (j == k):
        return (n == target)
    else:
        for each operator in {+,-,*,/}:
            for every i from 1 to 9:
                if compute((n operator i),j+1,k):
                    return true
        return false
It evaluates strictly left to right, so it doesn't take arithmetic operator precedence and parentheses into account; that would require some rework.
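The pseudo-code above can be turned into a runnable Python sketch along these lines. Several details are assumptions rather than part of the answer: intermediate values are exact fractions (following the asker's later note that non-integers may appear mid-calculation), the search stops at a made-up max_ops bound instead of running to infinity, and the names reachable and min_operations are illustrative.

```python
from fractions import Fraction

DIGITS = range(1, 10)
OPS = {
    '+': lambda a, b: a + b,
    '-': lambda a, b: a - b,
    '*': lambda a, b: a * b,
    '/': lambda a, b: a / b,  # b is never 0 because operands are 1..9
}

def reachable(value, depth, k, target):
    """True if exactly k - depth further operations can turn value into target."""
    if depth == k:
        return value == target
    return any(
        reachable(op(value, Fraction(i)), depth + 1, k, target)
        for op in OPS.values()
        for i in DIGITS
    )

def min_operations(target, max_ops=4):
    """Smallest k <= max_ops for which a left-to-right chain hits target, else None."""
    for k in range(max_ops + 1):
        if any(reachable(Fraction(n), 0, k, target) for n in DIGITS):
            return k
    return None

print(min_operations(29))  # 2
print(min_operations(18))  # 1
print(min_operations(7))   # 0
```

Because k grows from 0, the first k that succeeds is the minimum for left-to-right chains; expressions that need other groupings are still outside this sketch.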

Really cool question :)
Notice that you can start from the end! From your example (9*3)+2 = 29 is equivalent to saying (29-2)/3=9. That way we can avoid the double loop in cyborg's answer. This suggests the following algorithm for set Y and result r:
nextleaves = {r}
nops = 0
while(true):
    nops = nops+1
    leaves = nextleaves
    nextleaves = {}
    for leaf in leaves:
        for y in Y:
            if (leaf+y) or (leaf-y) or (leaf*y) or (leaf/y) is in Y:
                return(nops)
            else:
                add (leaf+y) and (leaf-y) and (leaf*y) and (leaf/y) to nextleaves
This is the basic idea; performance can certainly be improved, for instance by avoiding "backtracks" such as r+a-a or r*a*b/a.
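A hedged Python version of this backward search might look as follows. It is a sketch under stated assumptions: exact fractions are used for intermediate values, a seen set drops values that were already reached (one simple way to cut the "backtracks" mentioned above), the max_ops bound is an added safety limit, and the function name is made up.

```python
from fractions import Fraction

def min_operations_backward(target, digits=range(1, 10), max_ops=4):
    """Breadth-first search from target back towards a single digit.

    Returns the number of operations, or None if max_ops is exceeded.
    """
    digit_set = {Fraction(d) for d in digits}
    start = Fraction(target)
    if start in digit_set:
        return 0  # the target is itself one of the allowed numbers
    leaves = {start}
    seen = {start}
    for nops in range(1, max_ops + 1):
        next_leaves = set()
        for leaf in leaves:
            for y in digit_set:
                # Each candidate undoes one forward operation with operand y.
                for candidate in (leaf + y, leaf - y, leaf * y, leaf / y):
                    if candidate in digit_set:
                        return nops
                    if candidate not in seen:
                        seen.add(candidate)
                        next_leaves.add(candidate)
        leaves = next_leaves
    return None

print(min_operations_backward(29))  # 2
print(min_operations_backward(18))  # 1
```

Like the pseudocode, this only covers chains that apply one operand at a time to a running value.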

I guess my idea is similar to the one of Peer Sommerlund:
For big numbers, you advance fast by multiplying with big digits.
Is Y=29 prime? If not, divide it by its largest divisor in 2 to 9.
Otherwise you could subtract a number to reach a divisible number. 27 is fine, since it is divisible by 9, so
(29-2)/9=3 =>
3*9+2 = 29
So maybe (I didn't think this through to the end): search for the next number below Y that is divisible by 9. If you don't reach a number which is a digit, repeat.
The formula is the steps reversed.
(I'll try it for some numbers. :) )
I tried with 2551, which is
echo $((((3*9+4)*9+4)*9+4))
But I didn't test for every intermediate result whether it is prime.
But
echo $((8*8*8*5-9))
is 2 operations fewer. Maybe I can investigate this later.
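The "next multiple of 9 below, then divide" idea can be sketched in Python as follows. This only illustrates the greedy variant that always divides by 9, and greedy_by_nine is a made-up name. As the two echo lines show, the result is not always the shortest expression.

```python
def greedy_by_nine(n):
    """Return (expression, operation_count) for an integer n >= 1.

    Repeatedly writes n as q*9 + r with 0 <= r < 9 until q is a single digit.
    """
    if n <= 9:
        return str(n), 0
    q, r = divmod(n, 9)
    expr, ops = greedy_by_nine(q)
    if ops:
        expr = "(" + expr + ")"  # parenthesise a compound sub-expression
    expr, ops = expr + "*9", ops + 1
    if r:
        expr, ops = expr + "+" + str(r), ops + 1
    return expr, ops

print(greedy_by_nine(2551))  # ('((3*9+4)*9+4)*9+4', 6)
print(greedy_by_nine(29))    # ('3*9+2', 2)
```

For 2551 this reproduces the first expression with 6 operations, whereas 8*8*8*5-9 needs only 4.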
