Mapping integers to strings in a given string space - algorithm

Suppose I have an alphabet of 'abcd' and a maximum string length of 3. This gives me 85 possible strings, including the empty string. What I would like to do is map an integer in the range [0,85) to a string in my string space without using a lookup table. Something like this:
0 => ''
1 => 'a'
...
4 => 'd'
5 => 'aa'
6 => 'ab'
...
84 => 'ddd'
This is simple enough to do if the string is fixed length using this pseudocode algorithm (n is the input integer):
str = ''
for k in 0..maxLen do
    str += alphabet[n % alphabet.length]
    n /= alphabet.length
done
I can't figure out a good, efficient way of doing it, though, when the length of the string could be anywhere in the range [0,3]. This is going to be running in a tight loop with random inputs, so I would like to avoid any unnecessary branching or lookups.

Shift your index by one and ignore the empty string temporarily. So you'd map 0 -> "a", ..., 83 -> "ddd".
Then the mapping is
n -> base-4-encode(n - number of shorter strings)
With 26 symbols, that's the Excel-column-numbering scheme.
With s symbols, there are s + s^2 + ... + s^l nonempty strings of length at most l. Leaving aside the trivial case s = 1, that sum is (a partial sum of a geometric series) s*(s^l - 1)/(s-1).
So, given n, find the largest l such that s*(s^l - 1)/(s-1) <= n, i.e.
l = floor(log((s-1)*n/s + 1) / log(s))
Then let m = n - s*(s^l - 1)/(s-1) and encode m as an l+1-symbol string in base s ('a' ~> 0, 'b' ~> 1, ...).
For the problem including the empty string, map 0 to the empty string and for n > 0 encode n-1 as above.
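The steps above can be sketched in Python as follows. This is a minimal sketch, not the answerer's code: the function name int_to_string is made up for illustration, and the length is found with an integer loop over the block sizes s^0, s^1, ... rather than with the logarithm, to sidestep floating-point rounding.

```python
def int_to_string(n, alphabet="abcd"):
    """Map n >= 0 to the n-th string, ordered by length and then alphabetically."""
    s = len(alphabet)
    length, block = 0, 1          # block = s**length = number of strings of this length
    while n >= block:             # subtract the count of all shorter strings
        n -= block
        length += 1
        block *= s
    digits = []
    for _ in range(length):       # fixed-length base-s encoding of what is left
        n, d = divmod(n, s)
        digits.append(alphabet[d])
    return "".join(reversed(digits))

print([int_to_string(i) for i in range(7)])
```

The loop runs at most maxLen + 1 times, so for a fixed maximum length it could be unrolled if branching matters.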

In Haskell
encode cs n = reverse $ encode' n where
    len = length cs
    encode' 0 = ""
    encode' n = (cs !! ((n-1) `mod` len)) : encode' ((n-1) `div` len)
Check:
*Main> map (encode "abcd") [0..84] ["","a","b","c","d","aa","ab","ac","ad","ba","bb","bc","bd","ca","cb","cc","cd","da","db","dc","dd","aaa","aab","aac","aad","aba","abb","abc","abd","aca","acb","acc","acd","ada","adb","adc","add","baa","bab","bac","bad","bba","bbb","bbc","bbd","bca","bcb","bcc","bcd","bda","bdb","bdc","bdd","caa","cab","cac","cad","cba","cbb","cbc","cbd","cca","ccb","ccc","ccd","cda","cdb","cdc","cdd","daa","dab","dac","dad","dba","dbb","dbc","dbd","dca","dcb","dcc","dcd","dda","ddb","ddc","ddd"]

Figure out the number of strings for each length: N0, N1, N2 & N3 (actually, you won't need N3). Then, use those values to partition your space of integers: 0..N0-1 are length 0, N0..N0+N1-1 are length 1, etc. Within each partition, you can use your fixed-length algorithm.
At worst, you've greatly reduced the size of your lookup table.

Here is a C# solution:
static string F(int x, int alphabetSize)
{
    string ret = "";
    while (x > 0)
    {
        x--;
        ret = (char)('a' + (x % alphabetSize)) + ret;
        x /= alphabetSize;
    }
    return ret;
}
If you want to optimize this further, you may want to do something to avoid the string concatenations. For example, you could store the result into a preallocated char[] array.

Related

Counting the number of ways to make up a string

I have just started learning dynamic programming and was able to do some of the basic problems, such as Fibonacci, the knapsack and a few more. Coming across the problem below, I got stuck and do not know how to proceed. What confuses me is what the base case and the overlapping subproblems would be here. Not knowing this prevents me from developing a recurrence relation. They are not as apparent in this example as they were in the previous ones I have solved thus far.
Suppose we are given some string origString, a string toMatch and some number maxNum greater than or equal to 0. How can we count in how many ways it is possible to take maxNum number of nonempty and nonoverlapping substrings of the string origString to make up the string toMatch?
Example:
If origString = "ppkpke", and toMatch = "ppke"
maxNum = 1: countWays("ppkpke", "ppke", 1) will give 0 because toMatch is not a substring of origString.
maxNum = 2: countWays("ppkpke", "ppke", 2) will give 4 because 4 different combinations of 2 substrings of "ppkpke" can make "ppke".
Those strings are "ppk" & "e", "pp" & "ke", "p" & "pke" (excluding "p") and "p" & "pke" (excluding "k")
As an initial word of caution, I’d say that although my solution happens to match the expected output for the tiny test set, it is very likely wrong. It’s up to you to double-check it on other examples you may have etc.
The algorithm walks the longer string and tries to spread the shorter string over it. The incremental state of the algorithm consists of tuples of 3 elements:
long string coordinate i (origString[i] == toMatch[j])
short string coordinate j (origString[i] == toMatch[j])
number of ways we made it into that state
Then we just walk along the strings over and over again, using stored, previously discovered state, and sum up the total number(s) of ways each state was achieved — in the typical dynamic programming fashion.
For a state to count as a solution, j must be at the end of the short string and the number of iterations of the dynamic algorithm must be equivalent to the number of substrings we wanted at that point (because each iteration added one substring).
It is not entirely clear to me from the assignment whether maxNum actually means something like “exactNum”, i.e. exactly that many substrings, or whether we should sum across all lower or equal numbers of substrings. So the function returns a dictionary like { #substrings : #decompositions }, so that the output can be adjusted as needed.
#!/usr/bin/env python
def countWays(origString, toMatch, maxNum):
    origLen = len(origString)
    matchLen = len(toMatch)
    # Initial state: every way of matching a prefix of toMatch with one substring.
    state = {}
    for i in range(origLen):
        for j in range(matchLen):
            o = i + j
            # Stop at the end of origString or at the first mismatch.
            if o >= origLen or origString[o] != toMatch[j]:
                break
            state[(o, j)] = 1
    sums = {}
    for n in range(1, maxNum):
        if not state:
            break
        # Extend every known state by one more substring.
        nextState = {}
        for istart, jstart in state:
            prev = state[(istart, jstart)]
            for i in range(istart + 1, origLen):
                for j in range(jstart + 1, matchLen):
                    o = i + j - jstart - 1
                    if o >= origLen or origString[o] != toMatch[j]:
                        break
                    nextState[(o, j)] = prev + nextState.get((o, j), 0)
        sums[n] = sum(state[(i, j)] for i, j in state if j == matchLen - 1)
        state = nextState
    sums[maxNum] = sum(state[(i, j)] for i, j in state if j == matchLen - 1)
    return sums
result = countWays(origString='ppkpke', toMatch='ppke', maxNum=5)
print('for an exact number of substrings:', result)
# result.get(k, 0) covers counts that were skipped because the state ran empty early.
print(' for up to a number of substrings:', {
    n: s for n, s in ((m, sum(result.get(k, 0) for k in range(1, m + 1)))
                      for m in range(1, 1 + max(result.keys())))})
This code is a quick and ugly hack and nothing more. There is huge room for improvement, including (but not limited to) the use of generator functions (yield), memoization, etc. Here’s some output:
for an exact number of substrings: {1: 0, 2: 4, 3: 8, 4: 4, 5: 0}
for up to a number of substrings: {1: 0, 2: 4, 3: 12, 4: 16, 5: 16}
It would be an interesting (and nicely challenging) exercise to store a bit more of the dynamic state (e.g. to keep it for each n) and then reconstruct and pretty-print (efficiently) the exact string (de)compositions that were counted.
Here is a recursive solution.
It compares the first character of source and target, and if they're equal, chooses to either take it (advancing by 1 char in both strings) or not take it (advancing by 1 char in source but not in target). The value of k is decremented every time a new substring is created; there is an additional variable continued which is True if we're in the middle of building a substring, and False otherwise.
def countWays(source, target, k, continued=False):
    if len(target) == 0:
        return (k == 0)
    elif (k == 0 and not continued) or len(source) == 0:
        return 0
    elif source[0] == target[0]:
        if continued:
            return countWays(source[1:], target[1:], k, True) + countWays(source[1:], target[1:], k-1, True) + countWays(source[1:], target, k, False)
        else:
            return countWays(source[1:], target[1:], k-1, True) + countWays(source[1:], target, k, False)
    else:
        return countWays(source[1:], target, k, False)
print(countWays('ppkpke', 'ppke', 1))
# 0
print(countWays('ppkpke', 'ppke', 2))
# 4
print(countWays('ppkpke', 'ppke', 3))
# 8
print(countWays('ppkpke', 'ppke', 4))
# 4
print(countWays('ppkpke', 'ppke', 5))
# 0
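The recursion above revisits the same (position, position, k, continued) combinations many times, which is where the overlapping subproblems the question asks about show up. A possible memoized variant, sketched here under the assumption that the same counting rules are wanted, indexes into the strings instead of slicing them and caches results with functools.lru_cache. The name count_ways_memo and the inner helper go are made up for this illustration.

```python
from functools import lru_cache

def count_ways_memo(source, target, k):
    """Same counting rules as the recursive solution, cached on (i, j, k, continued)."""
    @lru_cache(maxsize=None)
    def go(i, j, k, continued):
        if j == len(target):                  # target fully built
            return 1 if k == 0 else 0
        if (k == 0 and not continued) or i == len(source):
            return 0
        total = go(i + 1, j, k, False)        # skip source[i]
        if source[i] == target[j]:
            total += go(i + 1, j + 1, k - 1, True)   # start a new substring here
            if continued:
                total += go(i + 1, j + 1, k, True)   # extend the current substring
        return total
    return go(0, 0, k, False)

print([count_ways_memo('ppkpke', 'ppke', k) for k in range(1, 6)])
```

The number of distinct states is bounded by len(source) * len(target) * (k + 1) * 2, so the running time becomes polynomial rather than exponential.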

Number of Ways To arrange Sequence

I have M characters, and from these characters I need to make a sequence of length N such that no two consecutive characters are the same and the first and last characters of the sequence are fixed. I need to find the total number of such sequences.
My Approach:
Dynamic programming.
If first and last character are '0' and '1'
dp[1][0]=1 , dp[1][1]=1
for(int i=2;i<N;i++)
    for(int j=0;j<M;j++)
        for(int k=0;k<M;k++)
            if(j!=k) dp[i][j]+=dp[i-1][k]
So the final answer would be the summation of dp[N-1][i] for i!=1
Problem:
Here the length N is too large, around 10^15, and M is around 128. How can I find the number of sequences without using arrays?
Assume M is fixed. Let D(n) be the number of sequences of length n with no two consecutive characters equal where the first and last characters differ (but are fixed). Let S(n) be the number of such sequences of length n where the first and last characters are the same (but are fixed).
For example, D(6) is the number of strings of the form a????b (for some a and b -- noting that for counting it doesn't matter which two characters we chose, and where the ? represent other characters). Similarly, S(6) is the number of strings of the form a????a.
Consider a sequence of length n>3 of the form a....?b. The ? can be any of M-1 characters (anything except b). One of these is a. So D(n) = S(n-1) + (M-2)D(n-1). Using a similar argument, one can figure out that S(n) = (M-1)D(n-1).
For example, how many strings are there of the form a??b? Well, the character just before the b could be a or something else. How many strings are there when it's a? Well, it's the same as the number of strings of the form a?a. How many strings are there when it's something else? Well, it's the same as the number of strings of the form a?c multiplied by the number of choices we had for c (namely: M-2, everything except for a, which we've already counted, and b, which is excluded by the rules).
If n is odd, we can consider the middle character. Consider a sequence of length n of the form a...?...b. The ? (which is in the center of the string) can be a, b, or one of the other M-2 characters. Thus D(2n+1) = S(n+1)D(n+1) + D(n+1)S(n+1) + (M-2)D(n+1)D(n+1). Similarly, S(2n+1) = S(n+1)S(n+1) + (M-1)D(n+1)D(n+1).
For small n, S(2)=0, S(3)=M-1, D(2)=1, D(3)=M-2.
We can use the above equations (the first set for even n>3, the second set for odd n>3, and the base cases for n=2 or 3) to compute the result you need in O(log N) arithmetic operations. Presumably the question asks you to compute the result modulo something (since the result grows like O(M^(N-2))), but that's easy to incorporate into the results.
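The first pair of equations and the base cases can be sanity-checked against brute-force enumeration for small n and M. The sketch below is only a check, not part of the method: the helper names brute and recurrence are made up, characters are represented as integers 0..M-1, and the recurrence is iterated linearly rather than in O(log N) steps.

```python
from itertools import product

def brute(n, M, first, last):
    """Count length-n sequences over range(M) with fixed ends and no equal neighbours."""
    count = 0
    for mid in product(range(M), repeat=n - 2):
        seq = (first, *mid, last)
        if all(a != b for a, b in zip(seq, seq[1:])):
            count += 1
    return count

def recurrence(n, M):
    """Return (S(n), D(n)) by iterating S(n) = (M-1)D(n-1), D(n) = S(n-1) + (M-2)D(n-1)."""
    S, D = 0, 1                      # base case n = 2
    for _ in range(n - 2):
        S, D = (M - 1) * D, S + (M - 2) * D
    return S, D

for n in range(2, 8):
    assert recurrence(n, 4) == (brute(n, 4, 0, 0), brute(n, 4, 0, 1))
print(recurrence(4, 128))
```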
Working code that uses this approach:
def C(n, m, p):
    if n == 2:
        return 0, 1
    if n == 3:
        return (m-1)%p, (m-2)%p
    if n % 2 == 0:
        S, D = C(n-1, m, p)
        return ((m-1) * D)%p, (S + (m-2) * D)%p
    else:
        S, D = C((n-1)//2+1, m, p)
        return (S*S + (m-1)*D*D)%p, (2*S*D + (m-2)*D*D)%p
Note that in this code, C(n, m, p) returns two numbers -- S(n)%p and D(n)%p.
For example:
>>> p = 2**64 - 59 # Some large prime
>>> print(C(4, 128, p))
(16002, 16003)
>>> print(C(5, 128, p))
(2032381, 2032380)
>>> print(C(10**15, 128, p))
(12557489471374801501, 12557489471374801502)
Looking at these examples, it seems like D(n) = S(n) + (-1)^n. If that's true, the code can be simplified a bit I guess.
Another, perhaps easier, way to do it efficiently is to use a matrix and the first set of equations. (Sorry for the ascii art -- this diagram is a vector = matrix * vector):
(D(n)) = (M-2 1) * (D(n-1))
(S(n)) = (M-1 0) (S(n-1))
Telescoping this, and using that D(2)=1, S(2)=0:
(D(n)) = (M-2 1)^(n-2) (1)
(S(n)) = (M-1 0) (0)
You can perform the matrix power using exponentiation by squaring in O(log n) time.
Here's working code, including the examples (which you can check produce the same values as the code above). Most of the code is actually matrix multiply and matrix power -- you can probably replace a lot of it with numpy code if you use that package.
def mat_mul(M, N, p):
    R = [[0, 0], [0, 0]]
    for i in range(2):
        for j in range(2):
            for k in range(2):
                R[i][j] += M[i][k] * N[k][j]
            R[i][j] %= p
    return R
def mat_pow(M, n, p):
    if n == 0:
        return [[1, 0], [0, 1]]
    if n == 1:
        return M
    if n % 2 == 0:
        R = mat_pow(M, n//2, p)
        return mat_mul(R, R, p)
    return mat_mul(M, mat_pow(M, n-1, p), p)
def Cmat(n, m, p):
    M = [((m-2), 1), (m-1, 0)]
    M = mat_pow(M, n-2, p)
    return M[1][0], M[0][0]
p = 2**64 - 59
print(Cmat(4, 128, p))
print(Cmat(5, 128, p))
print(Cmat(10**15, 128, p))
You only need to count the number of acceptable sequences, not find them explicitly. It turns out that it doesn't matter what the majority of the characters are. There are only 4 kinds of characters that matter:
The first character
The last character
The last-used character, so you don't repeat characters consecutively
All other characters
In other words, you don't need to iterate over all 10^15 characters. You only need to consider the four cases above, since most characters can be lumped together into the last case.

How to use KMP failure function to determine minimum length repeated substring?

I want to solve UVA 10298 -"Power Strings" problem using KMP algorithm. In this blog a technique is shown how failure function can be used to calculate minimum length repeated substring. The technique is as follows:
Compute prefix-suffix table pi[ ] for the given string.
Let len be the string length, and last_in_pi be the value stored at the last index of pi table.
Check whether len % (len - last_in_pi) == 0 is true or not. If it is true then the length of the minimum length repeated substring is (len - last_in_pi), otherwise it is the length of the given string.
I understand what the failure function is and how it is used to find a pattern in a text, but I am struggling to understand the proof of correctness of this technique.
Remember that Pi[i] is defined as the (length of the) longest prefix of your_string that is a proper suffix (so not the whole string) of the substring your_string[0 ... i].
There is an example on the blog post you linked to:
    0 1 2 3 4 5
S : a b a b a b
Pi: 0 0 1 2 3 4
Where we have:
a b a
a b a b
Etc. I hope this makes it clear what Pi (the prefix function / table) does.
Now, the blog says:
The last value of prefix table = 4..
Now If it is a repeated string than , It’s minimal length would be 2. (6(string length) – 4) , Now
So you have to check if len % (len - last_in_pi) == 0. If yes, then len - last_in_pi is the length of the shortest repeated string (the period string).
This works because, if you rotate a string with len(period) positions either way, it will match itself. len - last_in_pi tells you how much you'd need to rotate.
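The technique described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the blog's code: the function names prefix_function and shortest_repeat_length are chosen here, and the empty string is treated as having period 0 by assumption.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]            # fall back to the next shorter border
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi

def shortest_repeat_length(s):
    """Length of the shortest substring whose repetition gives s (len(s) if none)."""
    if not s:
        return 0
    n = len(s)
    p = n - prefix_function(s)[-1]   # len - last_in_pi
    return p if n % p == 0 else n

print(prefix_function("ababab"), shortest_repeat_length("ababab"))
```

For UVA 10298 the requested power would then be len(s) // shortest_repeat_length(s).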
Problem
S (of length Ls) is the given string. M (of length Lm) is the largest proper suffix of S, which is also a prefix of S. We have to prove Ls - Lm is the length of the shortest period of S.
Proof by Contradiction
Let's say there were a period Y whose length Ly < Ls - Lm (i.e, it's shorter than the one the above technique gives).
An important property to note is that M is a proper prefix of Y or vice-versa depending on their lengths. We can denote this as M = n*Y + Z, where n >= 0 and Z is the additional part and Lz < Ly. Z forms a prefix to Y, since Y repeats itself. Let Y = Z + W.
Consider M the suffix. Append the previous Ly number of characters from the original string S to it. This won't exceed the string length because (Ly < Ls - Lm). The new suffix is (n + 1)*Y + Z.
Consider M the prefix. Now append the next Ly number of characters from the original string S to it. The new prefix here is
M + (next Ly characters from S)
-> n*Y + Z + (Ly characters)
-> n*Y + Z + (Ly - Lz characters) + (Lz characters)
-> n*Y + (Z + W) + (Z)
{The `Ly - Lz` characters should be `W` because `Z` and these together form `Y`; the last Lz characters are actually the first Lz characters of Y, which is nothing but Z}
-> (n + 1)*Y + Z
Now we have a proper suffix of S which is also a prefix and is greater than M. But we started off saying M is the longest proper suffix which is also a prefix. So it's a contradiction, implying such a Y can not exist.
Assume you have a string s of size n, which looks like s = x1x2x3...x[n-2]x[n-1]x[n]
Assume s has a maximum common prefix/suffix of length len
Then its period is p = (n - len), iff n % p == 0
Induction:
Denote prefix = s[1...len], postfix = s[p+1...n]
Then we have prefix[1...p] == postfix[1...p] == s[p+1...2p]
Since s[p+1...2p] == prefix[p+1...2p] so postfix[1...p] == postfix[p+1...2p]
Recursively postfix[p+1...2p] == s[2p+1...3p] == prefix[2p+1...3p]
...

Find number of binary numbers with certain constraints

This is more of a puzzle than a coding problem. I need to find how many binary numbers can be generated satisfying certain constraints. The inputs are
(integer) Len - Number of digits in the binary number
(integer) x
(integer) y
The binary number has to be such that taking any x adjacent digits from the binary number should contain at least y 1's.
For example -
Len = 6, x = 3, y = 2
0 1 1 0 1 1 - Length is 6. Take any 3 adjacent digits from this and there will be at least 2 1's
I had this C# coding question posed to me in an interview and I cannot figure out any algorithm to solve it. I'm not looking for code (although it's welcome); any sort of help or pointers are appreciated.
This problem can be solved using dynamic programming. The main idea is to group the binary numbers according to the last x-1 bits and the length of each binary number. If appending a bit sequence to one number yields a number satisfying the constraint, then appending the same bit sequence to any number in the same group results in a number satisfying the constraint also.
For example, x = 4, y = 2. Both 01011 and 10011 have the same last 3 bits (011). Appending a 0 to each of them gives 010110 and 100110, and both satisfy the constraint.
Here is pseudo code:
mask = (1<<(x-1)) - 1
count[0][0] = 1
for(i = 0; i < Len; ++i) {
    for(j = 0; j < 1<<i && j < 1<<(x-1); ++j) {
        if(i<x-1 || count1Bit(j*2+1)>=y)
            count[i+1][(j*2+1)&mask] += count[i][j];
        if(i<x-1 || count1Bit(j*2)>=y)
            count[i+1][(j*2)&mask] += count[i][j];
    }
}
answer = 0
for(j = 0; j < 1<<(x-1); ++j)
    answer += count[Len][j];
This algorithm assumes that Len >= x. The time complexity is O(Len*2^x).
EDIT
The count1Bit(j) function counts the number of 1 in the binary representation of j.
The only inputs to this algorithm are Len, x, and y. It starts from an empty binary string [length 0, group 0], and iteratively tries to append 0 and 1 until the length equals Len. It also does the grouping and counts the number of binary strings satisfying the 1-bits constraint in each group. The output of this algorithm is answer, which is the number of binary strings (numbers) satisfying the constraints.
For a binary string in group [length i, group j], appending 0 to it results in a binary string in group [length i+1, group (j*2)%(2^(x-1))]; appending 1 to it results in a binary string in group [length i+1, group (j*2+1)%(2^(x-1))].
Let count[i,j] be the number of binary strings in group [length i, group j] satisfying the 1-bits constraint. If there are at least y 1 in the binary representation of j*2, then appending 0 to each of these count[i,j] binary strings yields a binary string in group [length i+1, group (j*2)%(2^(x-1))] which also satisfies the 1-bit constraint. Therefore, we can add count[i,j] into count[i+1,(j*2)%(2^(x-1))]. The case of appending 1 is similar.
The condition i<x-1 in the above algorithm is to keep the binary strings growing when length is less than x-1.
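The grouping idea described above can be sketched in Python as follows. This is a minimal sketch, not the answerer's code: the name count_valid is made up, a dictionary keyed by the last x-1 bits stands in for the count[i][j] table, and only the current length's groups are kept in memory.

```python
def count_valid(length, x, y):
    """Count binary strings of the given length whose every x-bit window has >= y ones."""
    mask = (1 << (x - 1)) - 1
    counts = {0: 1}                          # group (last x-1 bits) -> number of strings
    for i in range(length):                  # i = length of the strings in counts
        nxt = {}
        for group, c in counts.items():
            for bit in (0, 1):
                window = group * 2 + bit     # the last x bits after appending `bit`
                if i + 1 >= x and bin(window).count("1") < y:
                    continue                 # a full window would violate the constraint
                key = window & mask
                nxt[key] = nxt.get(key, 0) + c
        counts = nxt
    return sum(counts.values())

print(count_valid(6, 3, 2))
```

Like the pseudo code, this does O(Len * 2^x) work, since there are at most 2^(x-1) groups per length.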
Using the example of LEN = 6, X = 3 and Y = 2...
Build an exhaustive bit pattern generator for X bits. A simple binary counter can do this. For example, if X = 3
then a counter from 0 to 7 will generate all possible bit patterns of length 3.
The patterns are:
000
001
010
011
100
101
110
111
Verify the adjacency requirement as the patterns are built. Reject any patterns that do not qualify.
Basically this boils down to rejecting any pattern containing fewer than 2 '1' bits (Y = 2). The list prunes down to:
011
101
110
111
For each member of the pruned list, add a '1' bit and retest the first X bits. Keep the new pattern if it passes the
adjacency test. Do the same with a '0' bit. For example this step proceeds as:
1011 <== Keep
1101 <== Keep
1110 <== Keep
1111 <== Keep
0011 <== Reject
0101 <== Reject
0110 <== Keep
0111 <== Keep
Which leaves:
1011
1101
1110
1111
0110
0111
Now repeat this process until the pruned set is empty or the member lengths become LEN bits long. In the end
the only patterns left are:
111011
111101
111110
111111
110110
110111
101101
101110
101111
011011
011101
011110
011111
Count them up and you are done.
Note that you only need to test the first X bits on each iteration because all the subsequent patterns were verified in prior steps.
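The grow-and-prune procedure walked through above can be sketched in Python. The function name enumerate_patterns is made up for this sketch, patterns are kept as strings for readability, and LEN >= X is assumed.

```python
def enumerate_patterns(length, x, y):
    """List all bit strings of the given length whose every x-bit window has >= y ones."""
    # Start from every x-bit pattern that has at least y '1' bits.
    patterns = [format(v, "0{}b".format(x)) for v in range(1 << x)]
    patterns = [p for p in patterns if p.count("1") >= y]
    while patterns and len(patterns[0]) < length:
        grown = []
        for p in patterns:
            for bit in "10":
                candidate = bit + p
                # Only the new leading window needs testing; the rest was verified earlier.
                if candidate[:x].count("1") >= y:
                    grown.append(candidate)
        patterns = grown
    return patterns

result = enumerate_patterns(6, 3, 2)
print(len(result), sorted(result))
```

This lists every match, so its cost grows with the number of valid patterns; it is mainly useful for inspecting small cases.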
Since the input values are variable and I wanted to see the actual output, I used a recursive algorithm to generate all combinations of 0 and 1 for a given length:
private static void BinaryNumberWithOnes(int n, int dump, int ones, string s = "")
{
    if (n == 0)
    {
        if (BinaryWithoutDumpCountContainsnumberOfOnes(s, dump, ones))
            Console.WriteLine(s);
        return;
    }
    BinaryNumberWithOnes(n - 1, dump, ones, s + "0");
    BinaryNumberWithOnes(n - 1, dump, ones, s + "1");
}
and BinaryWithoutDumpCountContainsnumberOfOnes to determine whether the binary number meets the criteria:
private static bool BinaryWithoutDumpCountContainsnumberOfOnes(string binaryNumber, int dump, int ones)
{
    int current = 0;
    int count = binaryNumber.Length;
    // Slide a window of `dump` digits over the number, including the last position.
    while (current + dump <= count)
    {
        // Count the 1s inside the window itself (Substring), not in the rest of the number.
        var fail = binaryNumber.Substring(current, dump).Replace("0", "").Length < ones;
        if (fail)
        {
            return false;
        }
        current++;
    }
    return true;
}
Calling BinaryNumberWithOnes(6, 3, 2) will output all binary numbers that match:
011011
011101
011110
011111
101101
101110
101111
110110
110111
111011
111101
111110
111111
Sounds like a nested for loop would do the trick. Pseudocode (not tested).
value = '0101010111110101010111' // change this line to format you would need
for (i = 0; i <= (Len-x); i++) { // loop over every window start, from left to right
    kount = 0
    for (j = i; j < (i+x); j++) { // count '1' bits in the next 'x' bits
        kount += value[j] // add 0 or 1
    }
    if kount < y then return fail // every window needs at least y 1's
}
return success
The naive approach would be a tree-recursive algorithm.
Our recursive method would slowly build the number up, e.g. it would start at xxxxxx, return the sum of a call with 1xxxxx and 0xxxxx, which themselves will return the sum of a call with 10, 11 and 00, 01, etc. However, if the x/y conditions are NOT satisfied for the string it would build by calling itself, it does NOT go down that path, and if you are at a terminal condition (built a number of the correct length) you return 1. (Note that since we're building the string up from left to right, you don't have to check x/y for the entire string, just for the window ending at the newly added digit!)
By returning a sum over all calls then all of the returned 1s will pool together and be returned by the initial call, equalling the number of constructed strings.
No idea what the big O notation for time complexity is for this one, it could be as bad as O(2^n)*O(checking x/y conditions) but it will prune lots of branches off the tree in most cases.
UPDATE: One insight I had is that all branches of the recursive tree can be 'merged' if they have identical last x digits so far, because then the same checks would be applied to all digits hereafter so you may as well double them up and save a lot of work. This now requires building the tree explicitly instead of implicitly via recursive calls, and maybe some kind of hashing scheme to detect when branches have identical x endings, but for large length it would provide a huge speedup.
My approach is to start by getting all the binary numbers with the minimum number of 1's, which is easy enough: you just get every unique permutation of a binary number of length x with y 1's, and cycle each unique permutation until it is "Len" digits long. The idea is then to flip the 0 bits of these seeds in every combination possible to reach the other binary numbers that fit the criteria. Note that valid numbers need not repeat with period x (for Len = 6, x = 3, y = 2, the number 101110 is valid but cannot be obtained by flipping zeros of any seed), so this enumeration can undercount.
from itertools import permutations, cycle, combinations
def uniq(x):
    d = {}
    for i in x:
        d[i]=1
    return d.keys()
def findn( l, x, y ):
    window = []
    for i in range(y):
        window.append(1)
    for i in range(x-y):
        window.append(0)
    perms = uniq(permutations(window))
    seeds=[]
    for p in perms:
        pr = cycle(p)
        seeds.append([ next(pr) for i in range(l) ]) ###a seed is a binary number fitting the criteria with minimum 1 bits
    bin_numbers=[]
    for seed in seeds:
        if seed in bin_numbers: continue
        indexes = [ i for i, x in enumerate(seed) if x == 0] ### get indexes of 0 "bits"
        exit = False
        for i in range(len(indexes)+1):
            if( exit ): break
            for combo in combinations(indexes, i): ### combinatorically flipping the zero bits in the seed
                new_num = seed[:]
                for index in combo: new_num[index]+=1
                if new_num in bin_numbers:
                    ### if our new binary number has been seen before
                    ### we can break out since we are doing a depth first traversal
                    exit=True
                    break
                else:
                    bin_numbers.append(new_num)
    print(len(bin_numbers))
findn(6,3,2)
Growth of this approach is definitely exponential, but I thought I'd share my approach in case it helps someone else get to a lower complexity solution...
Set some condition and introduce simple help variable.
L = 6, x = 3 , y = 2 introduce d = x - y = 1
Condition: if the list consisting of the next number's hypothetical value and the previous x - 1 elements' values has a number of 0-digits > d, the next number's concrete value must be 1; otherwise add two branches with both 1 and 0 as concrete values.
Start: check(Condition) => both 0,1 due to number of total zeros in the 0-count check.
Empty => add 0 and 1
Step 1:Check(Condition)
0 (number of next value if 0 and previous x - 1 zeros > d(=1)) -> add 1 to sequence
1 -> add both 0,1 in two different branches
Step 2: check(Condition)
01 -> add 1
10 -> add 1
11 -> add 0,1 in two different branches
Step 3:
011 -> add 0,1 in two branches
101 -> add 1 (the next value if 0 and prev x-1 seq would be 010, so we prune and set only 1)
110 -> add 1
111 -> add 0,1
Step 4:
0110 -> obviously 1
0111 -> both 0,1
1011 -> both 0,1
1101 -> 1
1110 -> 1
1111 -> 0,1
Step 5:
01101 -> 1
01110 -> 1
01111 -> 0,1
10110 -> 1
10111 -> 0,1
11011 -> 0,1
11101 -> 1
11110 -> 1
11111 -> 0,1
Step 6 (Finish):
011011
011101
011110
011111
101101
101110
101111
110110
110111
111011
111101
111110
111111
Now count. I've tested for L = 6, x = 4 and y = 2 too, but consider checking the algorithm for special cases and extended cases.
Note: I'm pretty sure some algorithm with Disposition Theory bases should be a really massive improvement of my algorithm.
So in a series of Len binary digits, you are looking for a x-long segment that contains y 1's ..
See the execution: http://ideone.com/xuaWaK
Here's my Algorithm in Java:
import java.util.*;
import java.lang.*;
class Main
{
    public static ArrayList<String> solve (String input, int x, int y)
    {
        int s = 0;
        ArrayList<String> matches = new ArrayList<String>();
        String segment = null;
        // <= so that the window ending at the last digit is examined as well
        for (int i=0; i<=(input.length()-x); i++)
        {
            s = 0;
            segment = input.substring(i,(i+x));
            System.out.print(" i: "+i+" ");
            // count the '1' digits in this x-long segment
            for (char c : segment.toCharArray())
            {
                System.out.print("*");
                if (c == '1')
                {
                    s = s + 1;
                }
            }
            if (s == y)
            {
                matches.add(segment);
            }
            System.out.println();
        }
        return matches;
    }
    public static void main (String [] args)
    {
        String input = "011010101001101110110110101010111011010101000110010";
        int x = 6;
        int y = 4;
        ArrayList<String> matches = null;
        matches = solve (input, x, y);
        for (String match : matches)
        {
            System.out.println(" > "+match);
        }
        System.out.println(" Number of matches is " + matches.size());
    }
}
The number of patterns of length X that contain at least Y 1 bits is countable. For the case x == y we know there is exactly one pattern of the 2^x possible patterns that meets the criteria. For smaller y we need to sum up the number of patterns which have excess 1 bits and the number of patterns that have exactly y bits.
choose(n, k) = n! / (k! (n - k)!)
numPatterns(x, y) {
    total = 0
    for (int j = x; j >= y; j--)
        total += choose(x, j)
    return total
}
For example :
X = 4, Y = 4 : 1 pattern
X = 4, Y = 3 : 1 + 4 = 5 patterns
X = 4, Y = 2 : 1 + 4 + 6 = 11 patterns
X = 4, Y = 1 : 1 + 4 + 6 + 4 = 15 patterns
X = 4, Y = 0 : 1 + 4 + 6 + 4 + 1 = 16
(all possible patterns have at least 0 1 bits)
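The table above can be reproduced with a short Python sketch of numPatterns. It assumes Python 3.8 or later for math.comb; the snake_case name num_patterns is just this sketch's spelling of the pseudocode function.

```python
from math import comb

def num_patterns(x, y):
    """Number of x-bit patterns containing at least y one-bits."""
    return sum(comb(x, j) for j in range(y, x + 1))

for y in range(4, -1, -1):
    print("X = 4, Y =", y, ":", num_patterns(4, y))
```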
So let M be the number of X length patterns that meet the Y criteria. Now, that X length pattern is a subset of N bits. There are (N - x + 1) "window" positions for the sub pattern, and 2^N total patterns possible. If we start with any of our M patterns, we know that appending a 1 to the right and shifting to the next window will result in one of our known M patterns. The question is, how many of the M patterns can we add a 0 to, shift right, and still have a valid pattern in M?
Since we are adding a zero, we have to be either shifting away from a zero, or we have to already be in an M where we have an excess of 1 bits. To flip that around, we can ask how many of the M patterns have exactly Y bits and start with a 1. Which is the same as "how many patterns of length X-1 have Y-1 bits", which we know how to answer:
shiftablePatternCount = M - choose(X-1, Y-1)
So starting with M possibilities, we are going to increase by shiftablePatternCount when we slide to the right. All patterns in the new window are in the set of M, with some patterns now duplicated. We are going to shift a number of times to fill up N by (N - X), each time increasing the count by shiftablePatternCount, so the full answer should be :
totalCountOfMatchingPatterns = M + (N - X)*shiftablePatternCount
edit - realized a mistake. I need to count the duplicates of the shiftable patterns that are generated. I think that's doable. (draft still)
I am not sure about my answer, but here is my view. Take a look at it:
Len=4,
x=3,
y=2.
I just took two patterns, because each pattern must contain at least y 1's.
X 1 1 X
1 X 1 X
X represents don't care.
Now the count for the 1st expression is 2*1*1*2 = 4
and for the 2nd expression 1*2*1*2 = 4,
but 2 patterns are common to both, so subtract 2. So there will be a total of 6 patterns which satisfy the condition.
I happen to be using an algorithm similar to your problem; while trying to find a way to improve it, I found your question. So I will share:
// requires: using System; using System.Linq;
static int GetCount(int length, int oneBits){
    int result = 0;
    double count = Math.Pow(2, length);
    for (int i = 1; i <= count - 1; i++)
    {
        string str = Convert.ToString(i, 2).PadLeft(length, '0');
        if (str.ToCharArray().Count(c => c == '1') == oneBits)
        {
            result++;
        }
    }
    return result;
}
Not very efficient, I think, but an elegant solution.

algorithm to find closest string using same characters

Given a list L of n character strings, and an input character string S, what is an efficient way to find the character string in L that contains the most characters that exist in S? We want to find the string in L that is most-closely made up of the letters contained in S.
The obvious answer is to loop through all n strings and check to see how many characters in the current string exist in S. However, this algorithm will be run frequently, and the list L of n string will be stored in a database... loop manually through all n strings would require something like big-Oh of n*m^2, where n is the number of strings in L, and m is the max length of any string in L, as well as the max length of S... in this case m is actually a constant of 150.
Is there a better way than just a simple loop? Is there a data structure I can load the n strings into that would give me fast search ability? Is there an algorithm that uses the pre-calculated meta-data about each of the n strings that would perform better than a loop?
I know there are a lot of geeks out there that are into the algorithms. So please help!
Thanks!
If you are after substrings, a Trie or Patricia trie might be a good starting point.
If you don't care about the order, just about the number of each symbol or letter, I would calculate the histogram of all strings and then compare them with the histogram of the input.
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Hello World => ...11..1...3..2..1....1...
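This will lower the cost to O(m + 26 * n), that is O(m) to build the histogram of the input plus 26 comparisons against each of the n stored histograms, plus a one-time preprocessing pass over L, if you consider only case-insensitive Latin letters.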
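A minimal sketch of the histogram step in Python, assuming only the 26 case-insensitive Latin letters count (the helper names `letter_histogram` and `render` are made up for illustration):

```python
from collections import Counter
import string

def letter_histogram(text):
    """Count each Latin letter case-insensitively; everything else is ignored."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    return [counts[letter] for letter in string.ascii_uppercase]

def render(hist):
    """Show counts the way the example does: '.' for zero, the count otherwise."""
    return "".join(str(c) if c else "." for c in hist)

print(string.ascii_uppercase)
print(render(letter_histogram("Hello World")))  # ...11..1...3..2..1....1...
```

The histograms of the n stored strings would be computed once and kept next to the strings, so only the input's histogram has to be built per query.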
If m is constant, you could interpret the histogram as a 26 dimensional vector on a 26 dimensional unit sphere by normalizing it. Then you could just calculate the Dot Product of two vectors yielding the cosine of the angle between the two vectors, and this value should be proportional to the similarity of the strings.
Assuming m = 3, an alphabet A = { 'U', 'V', 'W' } of size three only, and the following list of strings.
L = { "UUU", "UVW", "WUU" }
The histograms are the following.
H = { (3, 0, 0), (1, 1, 1), (2, 0, 1) }
A histogram h = (x, y, z) is normalized to h' = (x/r, y/r, z/r) with r the Euclidean norm of the histogram h - that is r = sqrt(x² + y² + z²).
H' = { (1.000, 0.000, 0.000), (0.577, 0.577, 0.577), (0.894, 0.000, 0.447) }
The input S = "VVW" has the histogram hs = (0, 2, 1) and the normalized histogram hs' = (0.000, 0.894, 0.447).
Now we can calculate the similarity of two histograms h1 = (a, b, c) and h2 = (x, y, z) as the Euclidean distance between them.
d(h1, h2) = sqrt((a - x)² + (b - y)² + (c - z)²)
For the example we obtain.
d((3, 0, 0), (0, 2, 1)) = 3.742
d((1, 1, 1), (0, 2, 1)) = 1.414
d((2, 0, 1), (0, 2, 1)) = 2.828
Hence "UVW" is closest to "VVW" (smaller numbers indicate higher similarity).
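The three distances can be reproduced with a few lines of Python. This is only a sketch for the toy alphabet used in the example; `ALPHABET`, `histogram` and `euclidean` are illustrative names:

```python
from math import sqrt

ALPHABET = "UVW"  # toy alphabet from the example

def histogram(s):
    """Count how often each alphabet letter occurs in s."""
    return tuple(s.count(ch) for ch in ALPHABET)

def euclidean(h1, h2):
    """Euclidean distance between two histograms of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

L = ["UUU", "UVW", "WUU"]
hs = histogram("VVW")
distances = {s: euclidean(histogram(s), hs) for s in L}
closest = min(distances, key=distances.get)

for s in L:
    print(s, round(distances[s], 3))  # UUU 3.742, UVW 1.414, WUU 2.828
print("closest:", closest)            # closest: UVW
```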
Using the normalized histograms h1' = (a', b', c') and h2' = (x', y', z') we can calculate the similarity as the dot product of both histograms.
d'(h1', h2') = a'x' + b'y' + c'z'
For the example we obtain.
d'((1.000, 0.000, 0.000), (0.000, 0.894, 0.447)) = 0.000
d'((0.577, 0.577, 0.577), (0.000, 0.894, 0.447)) = 0.774
d'((0.894, 0.000, 0.447), (0.000, 0.894, 0.447)) = 0.200
Again "UVW" is determined to be closest to "VVW" (larger numbers indicate higher similarity).
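The dot-product variant can be sketched the same way. The code below assumes every string contains at least one letter of the alphabet, since an all-zero histogram cannot be normalized; the names are again illustrative:

```python
from math import sqrt

ALPHABET = "UVW"  # toy alphabet from the example

def histogram(s):
    """Count how often each alphabet letter occurs in s."""
    return tuple(s.count(ch) for ch in ALPHABET)

def normalize(h):
    """Scale a histogram to unit Euclidean length (fails on an all-zero histogram)."""
    r = sqrt(sum(x * x for x in h))
    return tuple(x / r for x in h)

def cosine(h1, h2):
    """Dot product of the normalized histograms, i.e. the cosine of their angle."""
    return sum(a * b for a, b in zip(normalize(h1), normalize(h2)))

L = ["UUU", "UVW", "WUU"]
hs = histogram("VVW")
similarities = {s: cosine(histogram(s), hs) for s in L}
best = max(similarities, key=similarities.get)
print(best)  # UVW
```

At full precision the middle value is 3/sqrt(15), about 0.775; the 0.774 quoted above comes from multiplying the already rounded entries.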
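Both versions yield different numbers, but here they agree on the ranking. That agreement is not guaranteed in general: the Euclidean distance on raw histograms is sensitive to string length, while the dot product of normalized histograms only compares letter proportions. One could also use other norms, Manhattan distance (L1 norm) for example. Norms in finite-dimensional vector spaces are all equivalent in the sense that each is bounded by a constant multiple of the other, but that alone does not guarantee that every norm picks the same nearest string.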
Sounds like you need a trie. Tries are used to search for words similar to the way a spell checker will work. So if the String S has the characters in the same order as the Strings in L then this may work for you.
If however, the order of the characters in S is not relevant - like a set of scrabble tiles and you want to search for the longest word - then this is not your solution.
What you want is a BK-Tree. It's a bit unintuitive, but very cool - and it makes it possible to search for elements within a Levenshtein (edit) distance threshold in O(log n) time.
If you care about ordering in your input strings, use them as is. If you don't, you can sort the individual characters before inserting them into the BK-Tree (or querying with them).
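To make the idea concrete, here is a small BK-tree sketch in Python. It is not a tuned implementation; the class name `BKTree`, the tuple-based node layout and the sample words are assumptions made for illustration:

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Each node is (word, children), children keyed by distance to the node's word."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word already stored
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, query, max_dist):
        """Return sorted (distance, word) pairs with distance <= max_dist."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: matches can only sit under edges
            # labelled within [d - max_dist, d + max_dist].
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(levenshtein)
for w in ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]:
    tree.add(w)
print(tree.search("caqe", 1))  # [(1, 'cake'), (1, 'cape')]
```

For the order-insensitive case, one would insert and query with `"".join(sorted(word))` instead of the raw word.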
I believe what you're looking for can be found here: Fuzzy Logic Based Search Technique
It's pretty heavy, but so is what you're asking for. It talks about word similarities, and character misplacement.
i.e:
L I N E A R T R N A S F O R M
L I N A E R T R A N S F O R M
L E N E A R T R A N S F R M
It seems to me that the order of the characters is not important in your problem, but you are searching for "near-anagrams" of the word S.
If that's so, then you can represent every word in the set L as an array of 26 integers (assuming your alphabet has 26 letters). You can represent S similarly as an array of 26 integers; now to find the best match you just run once through the set L and calculate a distance metric between the S-vector and the current L-vector, however you want to define the distance metric (e.g. Euclidean / sum-of-squares or Manhattan / sum of absolute differences). This is an O(n) algorithm because the vectors have constant length.
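A rough Python sketch of that single pass, using the Manhattan distance and assuming a 26-letter, case-insensitive alphabet (function names and sample words are invented for the example):

```python
import string

def letter_vector(word):
    """26 counts, one per Latin letter; case and non-letters are ignored."""
    upper = word.upper()
    return [upper.count(letter) for letter in string.ascii_uppercase]

def manhattan(u, v):
    """Sum of absolute differences between two count vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def best_match(candidates, s):
    """One pass over the candidates; each comparison touches 26 entries."""
    target = letter_vector(s)
    return min(candidates, key=lambda w: manhattan(letter_vector(w), target))

words = ["stone", "listen", "banana"]
print(best_match(words, "silent"))  # listen
print(best_match(words, "notes"))   # stone
```

In a real setup the vectors for L would be precomputed and stored rather than rebuilt on every query, as they are here for brevity.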
Here is a T-SQL function that has been working great for me, gives you the edit distance:
Example:
SELECT TOP 1 [StringValue], edit_distance([StringValue], 'Input Value')
FROM [SomeTable]
ORDER BY edit_distance([StringValue], 'Input Value')
The Function:
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
    DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
        @cv0 varbinary(8000), @cv1 varbinary(8000)
    SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
    WHILE @j <= @s2_len
        SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
    WHILE @i <= @s1_len
    BEGIN
        SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
        WHILE @j <= @s2_len
        BEGIN
            SET @c = @c + 1
            SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
                CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
            IF @c > @c_temp SET @c = @c_temp
            SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
            IF @c > @c_temp SET @c = @c_temp
            SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
        END
        SELECT @cv1 = @cv0, @i = @i + 1
    END
    RETURN @c
END
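For cross-checking results outside the database, the same two-row dynamic programme can be sketched in Python. The names `cv0` and `cv1` follow the T-SQL above; `closest` is a made-up helper that mimics the TOP 1 query. One difference, as far as I can tell: the T-SQL function returns 0 when the first string is empty, because its running cost variable is never updated, whereas this sketch returns the last cell of the final row, which is the length of the second string in that case.

```python
def edit_distance(s1, s2):
    """Levenshtein distance, keeping only the previous (cv1) and current (cv0) rows."""
    cv1 = list(range(len(s2) + 1))        # row for the empty prefix of s1
    for i, s1_char in enumerate(s1, start=1):
        cv0 = [i]                         # first column: delete i characters
        for j, s2_char in enumerate(s2, start=1):
            c = min(cv0[j - 1] + 1,                     # insertion
                    cv1[j - 1] + (s1_char != s2_char),  # match or substitution
                    cv1[j] + 1)                         # deletion
            cv0.append(c)
        cv1 = cv0
    return cv1[-1]

def closest(values, target):
    """Rough analogue of SELECT TOP 1 ... ORDER BY edit_distance(...)."""
    return min(values, key=lambda v: edit_distance(v, target))

print(edit_distance("kitten", "sitting"))                    # 3
print(closest(["transform", "linear", "trnasform"], "transfrm"))  # transform
```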

Resources