How to get the permutations without repetition of all the combinations of a set of letters? [duplicate] - algorithm

This question already has answers here:
Algorithm to return all combinations of k elements from n
(77 answers)
Closed 3 years ago.
I'm trying to create an algorithm that after giving it a number of random letters, for example: abcd, it shows you all the possible permutations, in this case, the answer would be:
abcd abc abd acd ab ac ad a bcd bc bd b cd c d
As you can see, the different permutations must have different letters, I have spent hours trying to do it without success, this is what I did so far:
vector<char> letters;
letters.push_back('a');
letters.push_back('b');
letters.push_back('c');
letters.push_back('d');
int number_of_letters = 4;
int number_of_repetitions;
vector<char> combination;
vector<char> comb_copy;
for(int i = 0; i < number_of_letters; i++){
number_of_repetitions = 1;
changing_letters = number_of_letters - (i + 1);
for(int j = i+1; j < number_of_letters; j++){
combination.push_back(letters[j]);
}
comb_copy = combination;
for(int i = 0; i < comb_copy.size(); i++){
cout << comb_copy[i];
}
while(number_of_repetitions <= changing_letters){
comb_copy = combinations(comb_copy, number_of_repetitions);
for(int i = 0; i < comb_copy.size(); i++){
cout << comb_copy[i];
}
comb_copy = combination;
number_of_repetitions++;
}
}
vector<char> combinations(vector<char> combi, int reps){
combi.erase(combi.end() - reps);
return (reps > 1?combinations(combi, reps-1):combi);
}
And this is what I'm getting:
abcd abc abd acd bc bd c
I would need to get:
abcd abc abd acd ab ac ad a bcd bc bd b cd c d
Can someone help me? :)
Thanks!!

You want to get all subsets of n items except for empty one. There are 2^n-1 such subsets. A pair of methods to generate them:
1) Make loop for value R in range 1..2^n-1. At every step represent R in binary. Every non-zero bit of R corresponds to index of initial array used in result. For example R=5=0101b denotes "ac" (0-th and 2-nd elements)
2) Use recursive procedure, at every level make two recursive calls - adding elelent with current index and omitting it. At the deepest level output result (if not empty)
Python examples:
s = "abcd"
n = len(s)
for r in range(1, 1<<n):
result = ""
for i in range(n):
if r & (1 << i):
result += s[i]
print(result, end = " ")
a b ab c ac bc abc d ad bd abd cd acd bcd abcd
def subs(s, result, idx):
if idx == len(s):
if len(result) > 0:
print(result, end = " ")
else:
subs(s, result + s[idx], idx + 1)
subs(s, result, idx + 1)
subs("abcd", "", 0)
abcd abc abd ab acd ac ad a bcd bc bd b cd c d

Related

Given two sequences, find the maximal overlap between ending of one and beginning of the other

I need to find an efficient (pseudo)code to solve the following problem:
Given two sequences of (not necessarily distinct) integers (a[1], a[2], ..., a[n]) and (b[1], b[2], ..., b[n]), find the maximum d such that a[n-d+1] == b[1], a[n-d+2] == b[2], ..., and a[n] == b[d].
This is not homework, I actually came up with this when trying to contract two tensors along as many dimensions as possible. I suspect an efficient algorithm exists (maybe O(n)?), but I cannot come up with something that is not O(n^2). The O(n^2) approach would be the obvious loop on d and then an inner loop on the items to check the required condition until hitting the maximum d. But I suspect something better than this is possible.
You can utilize the z algorithm, a linear time (O(n)) algorithm that:
Given a string S of length n, the Z Algorithm produces an array Z
where Z[i] is the length of the longest substring starting from S[i]
which is also a prefix of S
You need to concatenate your arrays (b+a) and run the algorithm on the resulting constructed array till the first i such that Z[i]+i == m+n.
For example, for a = [1, 2, 3, 6, 2, 3] & b = [2, 3, 6, 2, 1, 0], the concatenation would be [2, 3, 6, 2, 1, 0, 1, 2, 3, 6, 2, 3] which would yield Z[10] = 2 fulfilling Z[i] + i = 12 = m + n.
For O(n) time/space complexity, the trick is to evaluate hashes for each subsequence. Consider the array b:
[b1 b2 b3 ... bn]
Using Horner's method, you can evaluate all the possible hashes for each subsequence. Pick a base value B (bigger than any value in both of your arrays):
from b1 to b1 = b1 * B^1
from b1 to b2 = b1 * B^1 + b2 * B^2
from b1 to b3 = b1 * B^1 + b2 * B^2 + b3 * B^3
...
from b1 to bn = b1 * B^1 + b2 * B^2 + b3 * B^3 + ... + bn * B^n
Note that you can evaluate each sequence in O(1) time, using the result of the previous sequence, hence all the job costs O(n).
Now you have an array Hb = [h(b1), h(b2), ... , h(bn)], where Hb[i] is the hash from b1 until bi.
Do the same thing for the array a, but with a little trick:
from an to an = (an * B^1)
from an-1 to an = (an-1 * B^1) + (an * B^2)
from an-2 to an = (an-2 * B^1) + (an-1 * B^2) + (an * B^3)
...
from a1 to an = (a1 * B^1) + (a2 * B^2) + (a3 * B^3) + ... + (an * B^n)
You must note that, when you step from one sequence to another, you multiply the whole previous sequence by B and add the new value multiplied by B. For example:
from an to an = (an * B^1)
for the next sequence, multiply the previous by B: (an * B^1) * B = (an * B^2)
now sum with the new value multiplied by B: (an-1 * B^1) + (an * B^2)
hence:
from an-1 to an = (an-1 * B^1) + (an * B^2)
Now you have an array Ha = [h(an), h(an-1), ... , h(a1)], where Ha[i] is the hash from ai until an.
Now, you can compare Ha[d] == Hb[d] for all d values from n to 1, if they match, you have your answer.
ATTENTION: this is a hash method, the values can be large and you may have to use a fast exponentiation method and modular arithmetics, which may (hardly) give you collisions, making this method not totally safe. A good practice is to pick a base B as a really big prime number (at least bigger than the biggest value in your arrays). You should also be careful as the limits of the numbers may overflow at each step, so you'll have to use (modulo K) in each operation (where K can be a prime bigger than B).
This means that two different sequences might have the same hash, but two equal sequences will always have the same hash.
This can indeed be done in linear time, O(n), and O(n) extra space. I will assume the input arrays are character strings, but this is not essential.
A naive method would -- after matching k characters that are equal -- find a character that does not match, and go back k-1 units in a, reset the index in b, and then start the matching process from there. This clearly represents a O(n²) worst case.
To avoid this backtracking process, we can observe that going back is not useful if we have not encountered the b[0] character while scanning the last k-1 characters. If we did find that character, then backtracking to that position would only be useful, if in that k sized substring we had a periodic repetition.
For instance, if we look at substring "abcabc" somewhere in a, and b is "abcabd", and we find that the final character of b does not match, we must consider that a successful match might start at the second "a" in the substring, and we should move our current index in b back accordingly before continuing the comparison.
The idea is then to do some preprocessing based on string b to log back-references in b that are useful to check when there is a mismatch. So for instance, if b is "acaacaacd", we could identify these 0-based backreferences (put below each character):
index: 0 1 2 3 4 5 6 7 8
b: a c a a c a a c d
ref: 0 0 0 1 0 0 1 0 5
For example, if we have a equal to "acaacaaca" the first mismatch happens on the final character. The above information then tells the algorithm to go back in b to index 5, since "acaac" is common. And then with only changing the current index in b we can continue the matching at the current index of a. In this example the match of the final character then succeeds.
With this we can optimise the search and make sure that the index in a can always progress forwards.
Here is an implementation of that idea in JavaScript, using the most basic syntax of that language only:
function overlapCount(a, b) {
// Deal with cases where the strings differ in length
let startA = 0;
if (a.length > b.length) startA = a.length - b.length;
let endB = b.length;
if (a.length < b.length) endB = a.length;
// Create a back-reference for each index
// that should be followed in case of a mismatch.
// We only need B to make these references:
let map = Array(endB);
let k = 0; // Index that lags behind j
map[0] = 0;
for (let j = 1; j < endB; j++) {
if (b[j] == b[k]) {
map[j] = map[k]; // skip over the same character (optional optimisation)
} else {
map[j] = k;
}
while (k > 0 && b[j] != b[k]) k = map[k];
if (b[j] == b[k]) k++;
}
// Phase 2: use these references while iterating over A
k = 0;
for (let i = startA; i < a.length; i++) {
while (k > 0 && a[i] != b[k]) k = map[k];
if (a[i] == b[k]) k++;
}
return k;
}
console.log(overlapCount("ababaaaabaabab", "abaababaaz")); // 7
Although there are nested while loops, these do not have more iterations in total than n. This is because the value of k strictly decreases in the while body, and cannot become negative. This can only happen when k++ was executed that many times to give enough room for such decreases. So all in all, there cannot be more executions of the while body than there are k++ executions, and the latter is clearly O(n).
To complete, here you can find the same code as above, but in an interactive snippet: you can input your own strings and see the result interactively:
function overlapCount(a, b) {
// Deal with cases where the strings differ in length
let startA = 0;
if (a.length > b.length) startA = a.length - b.length;
let endB = b.length;
if (a.length < b.length) endB = a.length;
// Create a back-reference for each index
// that should be followed in case of a mismatch.
// We only need B to make these references:
let map = Array(endB);
let k = 0; // Index that lags behind j
map[0] = 0;
for (let j = 1; j < endB; j++) {
if (b[j] == b[k]) {
map[j] = map[k]; // skip over the same character (optional optimisation)
} else {
map[j] = k;
}
while (k > 0 && b[j] != b[k]) k = map[k];
if (b[j] == b[k]) k++;
}
// Phase 2: use these references while iterating over A
k = 0;
for (let i = startA; i < a.length; i++) {
while (k > 0 && a[i] != b[k]) k = map[k];
if (a[i] == b[k]) k++;
}
return k;
}
// I/O handling
let [inputA, inputB] = document.querySelectorAll("input");
let output = document.querySelector("pre");
function refresh() {
let a = inputA.value;
let b = inputB.value;
let count = overlapCount(a, b);
let padding = a.length - count;
// Apply some HTML formatting to highlight the overlap:
if (count) {
a = a.slice(0, -count) + "<b>" + a.slice(-count) + "</b>";
b = "<b>" + b.slice(0, count) + "</b>" + b.slice(count);
}
output.innerHTML = count + " overlapping characters:\n" +
a + "\n" +
" ".repeat(padding) + b;
}
document.addEventListener("input", refresh);
refresh();
body { font-family: monospace }
b { background:yellow }
input { width: 90% }
a: <input value="acacaacaa"><br>
b: <input value="acaacaacd"><br>
<pre></pre>

Caculating total combinations

I don't know how to go about this programming problem.
Given two integers n and m, how many numbers exist such that all numbers have all digits from 0 to n-1 and the difference between two adjacent digits is exactly 1 and the number of digits in the number is atmost 'm'.
What is the best way to solve this problem? Is there a direct mathematical formula?
Edit: The number cannot start with 0.
Example:
for n = 3 and m = 6 there are 18 such numbers (210, 2101, 21012, 210121 ... etc)
Update (some people have encountered an ambiguity):
All digits from 0 to n-1 must be present.
This Python code computes the answer in O(nm) by keeping track of the numbers ending with a particular digit.
Different arrays (A,B,C,D) are used to track numbers that have hit the maximum or minimum of the range.
n=3
m=6
A=[1]*n # Number of ways of being at digit i and never being to min or max
B=[0]*n # number of ways with minimum being observed
C=[0]*n # number of ways with maximum being observed
D=[0]*n # number of ways with both being observed
A[0]=0 # Cannot start with 0
A[n-1]=0 # Have seen max so this 1 moves from A to C
C[n-1]=1 # Have seen max if start with highest digit
t=0
for k in range(m-1):
A2=[0]*n
B2=[0]*n
C2=[0]*n
D2=[0]*n
for i in range(1,n-1):
A2[i]=A[i+1]+A[i-1]
B2[i]=B[i+1]+B[i-1]
C2[i]=C[i+1]+C[i-1]
D2[i]=D[i+1]+D[i-1]
B2[0]=A[1]+B[1]
C2[n-1]=A[n-2]+C[n-2]
D2[0]=C[1]+D[1]
D2[n-1]=B[n-2]+D[n-2]
A=A2
B=B2
C=C2
D=D2
x=sum(d for d in D2)
t+=x
print t
After doing some more research, I think there may actually be a mathematical approach after all, although the math is advanced for me. Douglas S. Stones pointed me in the direction of Joseph Myers' (2008) article, BMO 2008–2009 Round 1 Problem 1—Generalisation, which derives formulas for calculating the number of zig-zag paths across a rectangular board.
As I understand it, in Anirudh's example, our board would have 6 rows of length 3 (I believe this would mean n=3 and r=6 in the article's terms). We can visualize our board so:
0 1 2 example zig-zag path: 0
0 1 2 1
0 1 2 0
0 1 2 1
0 1 2 2
0 1 2 1
Since Myers' formula m(n,r) would generate the number for all the zig-zag paths, that is, the number of all 6-digit numbers where all adjacent digits are consecutive and digits are chosen from (0,1,2), we would still need to determine and subtract those that begin with zero and those that do not include all digits.
If I understand correctly, we may do this in the following way for our example, although generalizing the concept to arbitrary m and n may prove more complicated:
Let m(3,6) equal the number of 6-digit numbers where all adjacent digits
are consecutive and digits are chosen from (0,1,2). According to Myers,
m(3,r) is given by formula and also equals OEIS sequence A029744 at
index r+2, so we have
m(3,6) = 16
How many of these numbers start with zero? Myers describes c(n,r) as the
number of zig-zag paths whose colour is that of the square in the top
right corner of the board. In our case, c(3,6) would include the total
for starting-digit 0 as well as starting-digit 2. He gives c(3,2r) as 2^r,
so we have
c(3,6) = 8. For starting-digit 0 only, we divide by two to get 4.
Now we need to obtain only those numbers that include all the digits in
the range, but how? We can do this be subtracting m(n-1,r) from m(n,r).
In our case, we have all the m(2,6) that would include only 0's and 1's,
and all the m(2,6) that would include 1's and 2's. Myers gives
m(2,anything) as 2, so we have
2*m(2,6) = 2*2 = 4
But we must remember that one of the zero-starting numbers is included
in our total for 2*m(2,6), namely 010101. So all together we have
m(3,6) - c(3,6)/2 - 4 + 1
= 16 - 4 - 4 + 1
= 9
To complete our example, we must follow a similar process for m(3,5),
m(3,4) and m(3,3). Since it's late here, I might follow up tomorrow...
One approach could be to program it recursively, calling the function to add as well as subtract from the last digit.
Haskell code:
import Data.List (sort,nub)
f n m = concatMap (combs n) [n..m]
combs n m = concatMap (\x -> combs' 1 [x]) [1..n - 1] where
combs' count result
| count == m = if test then [concatMap show result] else []
| otherwise = combs' (count + 1) (result ++ [r + 1])
++ combs' (count + 1) (result ++ [r - 1])
where r = last result
test = (nub . sort $ result) == [0..n - 1]
Output:
*Main> f 3 6
["210","1210","1012","2101","12101","10121","21210","21012"
,"21010","121210","121012","121010","101212","101210","101012"
,"212101","210121","210101"]
In response to Anirudh Rayabharam's comment, I hope the following code will be more 'pseudocode' like. When the total number of digits reaches m, the function g outputs 1 if the solution has hashed all [0..n-1], and 0 if not. The function f accumulates the results for g for starting digits [1..n-1] and total number of digits [n..m].
Haskell code:
import qualified Data.Set as S
g :: Int -> Int -> Int -> Int -> (S.Set Int, Int) -> Int
g n m digitCount lastDigit (hash,hashCount)
| digitCount == m = if test then 1 else 0
| otherwise =
if lastDigit == 0
then g n m d' (lastDigit + 1) (hash'',hashCount')
else if lastDigit == n - 1
then g n m d' (lastDigit - 1) (hash'',hashCount')
else g n m d' (lastDigit + 1) (hash'',hashCount')
+ g n m d' (lastDigit - 1) (hash'',hashCount')
where test = hashCount' == n
d' = digitCount + 1
hash'' = if test then S.empty else hash'
(hash',hashCount')
| hashCount == n = (S.empty,hashCount)
| S.member lastDigit hash = (hash,hashCount)
| otherwise = (S.insert lastDigit hash,hashCount + 1)
f n m = foldr forEachNumDigits 0 [n..m] where
forEachNumDigits numDigits accumulator =
accumulator + foldr forEachStartingDigit 0 [1..n - 1] where
forEachStartingDigit startingDigit accumulator' =
accumulator' + g n numDigits 1 startingDigit (S.empty,0)
Output:
*Main> f 3 6
18
(0.01 secs, 571980 bytes)
*Main> f 4 20
62784
(1.23 secs, 97795656 bytes)
*Main> f 4 25
762465
(11.73 secs, 1068373268 bytes)
model your problem as 2 superimposed lattices in 2 dimensions, specifically as pairs (i,j) interconnected with oriented edges ((i0,j0),(i1,j1)) where i1 = i0 + 1, |j1 - j0| = 1, modified as follows:
dropping all pairs (i,j) with j > 9 and its incident edges
dropping all pairs (i,j) with i > m-1 and its incident edges
dropping edge ((0,0), (1,1))
this construction results in a structure like in this diagram:
:
the requested numbers map to paths in the lattice starting at one of the green elements ((0,j), j=1..min(n-1,9)) that contain at least one pink and one red element ((i,0), i=1..m-1, (i,n-1), i=0..m-1 ). to see this, identify the i-th digit j of a given number with point (i,j). including pink and red elements ('extremal digits') guarantee that all available diguts are represented in the number.
Analysis
for convenience, let q1, q2 denote the position-1.
let q1 be the position of a number's first digit being either 0 or min(n-1,9).
let q2 be the position of a number's first 0 if the digit at position q1 is min(n-1,9) and vv.
case 1: first extremal digit is 0
the number of valid prefixes containing no 0 can be expressed as sum_{k=1..min(n-1,9)} (paths_to_0(k,1,q1), the function paths_to_0 being recursively defined as
paths_to_0(0,q1-1,q1) = 0;
paths_to_0(1,q1-1,q1) = 1;
paths_to_0(digit,i,q1) = 0; if q1-i < digit;
paths_to_0(x,_,_) = 0; if x >= min(n-1,9)
// x=min(n-1,9) mustn't occur before position q2,
// x > min(n-1,9) not at all
paths_to_0(x,_,_) = 0; if x <= 0;
// x=0 mustn't occur before position q1,
// x < 0 not at all
and else paths_to_0(digit,i,q1) =
paths_to_0(digit+1,i+1,q1) + paths_to_0(digit-1,i+1,q1);
similarly we have
paths_to_max(min(n-1,9),q2-1,q2) = 0;
paths_to_max(min(n-2,8),q2-1,q2) = 1;
paths_to_max(digit,i,q2) = 0 if q2-i < n-1;
paths_to_max(x,_,_) = 0; if x >= min(n-1,9)
// x=min(n-1,9) mustn't occur before
// position q2,
// x > min(n-1,9) not at all
paths_to_max(x,_,_) = 0; if x < 0;
and else paths_to_max(digit,q1,q2) =
paths_max(digit+1,q1+1,q2) + paths_to_max(digit-1,q1+1,q2);
and finally
paths_suffix(digit,length-1,length) = 2; if digit > 0 and digit < min(n-1,9)
paths_suffix(digit,length-1,length) = 1; if digit = 0 or digit = min(n-1,9)
paths_suffix(digit,k,length) = 0; if length > m-1
or length < q2
or k > length
paths_suffix(digit,k,0) = 1; // the empty path
and else paths_suffix(digit,k,length) =
paths_suffix(digit+1,k+1,length) + paths_suffix(digit-1,k+1,length);
... for a grand total of
number_count_case_1(n, m) =
sum_{first=1..min(n-1,9), q1=1..m-1-(n-1), q2=q1..m-1, l_suffix=0..m-1-q2} (
paths_to_0(first,1,q1)
+ paths_to_max(0,q1,q2)
+ paths_suffix(min(n-1,9),q2,l_suffix+q2)
)
case 2: first extremal digit is min(n-1,9)
case 2.1: initial digit is not min(n-1,9)
this is symmetrical to case 1 with all digits d replaced by min(n,10) - d. as the lattice structure is symmetrical, this means number_count_case_2_1 = number_count_case_1.
case 2.2: initial digit is min(n-1,9)
note that q1 is 1 and the second digit must be min(n-2,8).
thus
number_count_case_2_2 (n, m) =
sum_{q2=1..m-2, l_suffix=0..m-2-q2} (
paths_to_max(1,1,q2)
+ paths_suffix(min(n-1,9),q2,l_suffix+q2)
)
so the grand grand total will be
number_count ( n, m ) = 2 * number_count_case_1 (n, m) + number_count_case_2_2 (n, m);
Code
i don't know whether a closed expression for number_count exists, but the following perl code will compute it (the code is but a proof of concept as it does not use memoization techniques to avoid recomputing results already obtained):
use strict;
use warnings;
my ($n, $m) = ( 5, 7 ); # for example
$n = ($n > 10) ? 10 : $n; # cutoff
sub min
sub paths_to_0 ($$$) {
my (
$d
, $at
, $until
) = #_;
#
if (($d == 0) && ($at == $until - 1)) { return 0; }
if (($d == 1) && ($at == $until - 1)) { return 1; }
if ($until - $at < $d) { return 0; }
if (($d <= 0) || ($d >= $n))) { return 0; }
return paths_to_0($d+1, $at+1, $until) + paths_to_0($d-1, $at+1, $until);
} # paths_to_0
sub paths_to_max ($$$) {
my (
$d
, $at
, $until
) = #_;
#
if (($d == $n-1) && ($at == $until - 1)) { return 0; }
if (($d == $n-2) && ($at == $until - 1)) { return 1; }
if ($until - $at < $n-1) { return 0; }
if (($d < 0) || ($d >= $n-1)) { return 0; }
return paths_to_max($d+1, $at+1, $until) + paths_to_max($d-1, $at+1, $until);
} # paths_to_max
sub paths_suffix ($$$) {
my (
$d
, $at
, $until
) = #_;
#
if (($d < $n-1) && ($d > 0) && ($at == $until - 1)) { return 2; }
if ((($d == $n-1) && ($d == 0)) && ($at == $until - 1)) { return 1; }
if (($until > $m-1) || ($at > $until)) { return 0; }
if ($until == 0) { return 1; }
return paths_suffix($d+1, $at+1, $until) + paths_suffix($d-1, $at+1, $until);
} # paths_suffix
#
# main
#
number_count =
sum_{first=1..min(n-1,9), q1=1..m-1-(n-1), q2=q1..m-1, l_suffix=0..m-1-q2} (
paths_to_0(first,1,q1)
+ paths_to_max(0,q1,q2)
+ paths_suffix(min(n-1,9),q2,l_suffix+q2)
)
my ($number_count, $number_count_2_2) = (0, 0);
my ($first, $q1, i, $l_suffix);
for ($first = 1; $first <= $n-1; $first++) {
for ($q1 = 1; $q1 <= $m-1 - ($n-1); $q1++) {
for ($q2 = $q1; $q2 <= $m-1; $q2++) {
for ($l_suffix = 0; $l_suffix <= $m-1 - $q2; $l_suffix++) {
$number_count =
$number_count
+ paths_to_0($first,1,$q1)
+ paths_to_max(0,$q1,$q2)
+ paths_suffix($n-1,$q2,$l_suffix+$q2)
;
}
}
}
}
#
# case 2.2
#
for ($q2 = 1; $q2 <= $m-2; $q2++) {
for ($l_suffix = 0; $l_suffix <= $m-2 - $q2; $l_suffix++) {
$number_count_2_2 =
$number_count_2_2
+ paths_to_max(1,1,$q2)
+ paths_suffix($n-1,$q2,$l_suffix+$q2)
;
}
}
$number_count = 2 * $number_count + number_count_2_2;

Find all possible combinations of a String representation of a number

Given a mapping:
A: 1
B: 2
C: 3
...
...
...
Z: 26
Find all possible ways a number can be represented. E.g. For an input: "121", we can represent it as:
ABA [using: 1 2 1]
LA [using: 12 1]
AU [using: 1 21]
I tried thinking about using some sort of a dynamic programming approach, but I am not sure how to proceed. I was asked this question in a technical interview.
Here is a solution I could think of, please let me know if this looks good:
A[i]: Total number of ways to represent the sub-array number[0..i-1] using the integer to alphabet mapping.
Solution [am I missing something?]:
A[0] = 1 // there is only 1 way to represent the subarray consisting of only 1 number
for(i = 1:A.size):
A[i] = A[i-1]
if(input[i-1]*10 + input[i] < 26):
A[i] += 1
end
end
print A[A.size-1]
To just get the count, the dynamic programming approach is pretty straight-forward:
A[0] = 1
for i = 1:n
A[i] = 0
if input[i-1] > 0 // avoid 0
A[i] += A[i-1];
if i > 1 && // avoid index-out-of-bounds on i = 1
10 <= (10*input[i-2] + input[i-1]) <= 26 // check that number is 10-26
A[i] += A[i-2];
If you instead want to list all representations, dynamic programming isn't particularly well-suited for this, you're better off with a simple recursive algorithm.
First off, we need to find an intuitive way to enumerate all the possibilities. My simple construction, is given below.
let us assume a simple way to represent your integer in string format.
a1 a2 a3 a4 ....an, for instance in 121 a1 -> 1 a2 -> 2, a3 -> 1
Now,
We need to find out number of possibilities of placing a + sign in between two characters. + is to mean characters concatenation here.
a1 - a2 - a3 - .... - an, - shows the places where '+' can be placed. So, number of positions is n - 1, where n is the string length.
Assume a position may or may not have a + symbol shall be represented as a bit.
So, this boils down to how many different bit strings are possible with the length of n-1, which is clearly 2^(n-1). Now in order to enumerate the possibilities go through every bit string and place right + signs in respective positions to get every representations,
For your example, 121
Four bit strings are possible 00 01 10 11
1 2 1
1 2 + 1
1 + 2 1
1 + 2 + 1
And if you see a character followed by a +, just add the next char with the current one and do it sequentially to get the representation,
x + y z a + b + c d
would be (x+y) z (a+b+c) d
Hope it helps.
And you will have to take care of edge cases where the size of some integer > 26, of course.
I think, recursive traverse through all possible combinations would do just fine:
mapping = {"1":"A", "2":"B", "3":"C", "4":"D", "5":"E", "6":"F", "7":"G",
"8":"H", "9":"I", "10":"J",
"11":"K", "12":"L", "13":"M", "14":"N", "15":"O", "16":"P",
"17":"Q", "18":"R", "19":"S", "20":"T", "21":"U", "22":"V", "23":"W",
"24":"A", "25":"Y", "26":"Z"}
def represent(A, B):
if A == B == '':
return [""]
ret = []
if A in mapping:
ret += [mapping[A] + r for r in represent(B, '')]
if len(A) > 1:
ret += represent(A[:-1], A[-1]+B)
return ret
print represent("121", "")
Assuming you only need to count the number of combinations.
Assuming 0 followed by an integer in [1,9] is not a valid concatenation, then a brute-force strategy would be:
Count(s,n)
x=0
if (s[n-1] is valid)
x=Count(s,n-1)
y=0
if (s[n-2] concat s[n-1] is valid)
y=Count(s,n-2)
return x+y
A better strategy would be to use divide-and-conquer:
Count(s,start,n)
if (len is even)
{
//split s into equal left and right part, total count is left count multiply right count
x=Count(s,start,n/2) + Count(s,start+n/2,n/2);
y=0;
if (s[start+len/2-1] concat s[start+len/2] is valid)
{
//if middle two charaters concatenation is valid
//count left of the middle two characters
//count right of the middle two characters
//multiply the two counts and add to existing count
y=Count(s,start,len/2-1)*Count(s,start+len/2+1,len/2-1);
}
return x+y;
}
else
{
//there are three cases here:
//case 1: if middle character is valid,
//then count everything to the left of the middle character,
//count everything to the right of the middle character,
//multiply the two, assign to x
x=...
//case 2: if middle character concatenates the one to the left is valid,
//then count everything to the left of these two characters
//count everything to the right of these two characters
//multiply the two, assign to y
y=...
//case 3: if middle character concatenates the one to the right is valid,
//then count everything to the left of these two characters
//count everything to the right of these two characters
//multiply the two, assign to z
z=...
return x+y+z;
}
The brute-force solution has time complexity of T(n)=T(n-1)+T(n-2)+O(1) which is exponential.
The divide-and-conquer solution has time complexity of T(n)=3T(n/2)+O(1) which is O(n**lg3).
Hope this is correct.
Something like this?
Haskell code:
import qualified Data.Map as M
import Data.Maybe (fromJust)
combs str = f str [] where
charMap = M.fromList $ zip (map show [1..]) ['A'..'Z']
f [] result = [reverse result]
f (x:xs) result
| null xs =
case M.lookup [x] charMap of
Nothing -> ["The character " ++ [x] ++ " is not in the map."]
Just a -> [reverse $ a:result]
| otherwise =
case M.lookup [x,head xs] charMap of
Just a -> f (tail xs) (a:result)
++ (f xs ((fromJust $ M.lookup [x] charMap):result))
Nothing -> case M.lookup [x] charMap of
Nothing -> ["The character " ++ [x]
++ " is not in the map."]
Just a -> f xs (a:result)
Output:
*Main> combs "121"
["LA","AU","ABA"]
Here is the solution based on my discussion here:
private static int decoder2(int[] input) {
int[] A = new int[input.length + 1];
A[0] = 1;
for(int i=1; i<input.length+1; i++) {
A[i] = 0;
if(input[i-1] > 0) {
A[i] += A[i-1];
}
if (i > 1 && (10*input[i-2] + input[i-1]) <= 26) {
A[i] += A[i-2];
}
System.out.println(A[i]);
}
return A[input.length];
}
Just us breadth-first search.
for instance 121
Start from the first integer,
consider 1 integer character first, map 1 to a, leave 21
then 2 integer character map 12 to L leave 1.
This problem can be done in o(fib(n+2)) time with a standard DP algorithm.
We have exactly n sub problems and button up we can solve each problem with size i in o(fib(i)) time.
Summing the series gives fib (n+2).
If you consider the question carefully you see that it is a Fibonacci series.
I took a standard Fibonacci code and just changed it to fit our conditions.
The space is obviously bound to the size of all solutions o(fib(n)).
Consider this pseudo code:
Map<Integer, String> mapping = new HashMap<Integer, String>();
List<String > iterative_fib_sequence(string input) {
int length = input.length;
if (length <= 1)
{
if (length==0)
{
return "";
}
else//input is a-j
{
return mapping.get(input);
}
}
List<String> b = new List<String>();
List<String> a = new List<String>(mapping.get(input.substring(0,0));
List<String> c = new List<String>();
for (int i = 1; i < length; ++i)
{
int dig2Prefix = input.substring(i-1, i); //Get a letter with 2 digit (k-z)
if (mapping.contains(dig2Prefix))
{
String word2Prefix = mapping.get(dig2Prefix);
foreach (String s in b)
{
c.Add(s.append(word2Prefix));
}
}
int dig1Prefix = input.substring(i, i); //Get a letter with 1 digit (a-j)
String word1Prefix = mapping.get(dig1Prefix);
foreach (String s in a)
{
c.Add(s.append(word1Prefix));
}
b = a;
a = c;
c = new List<String>();
}
return a;
}
old question but adding an answer so that one can find help
It took me some time to understand the solution to this problem – I refer accepted answer and #Karthikeyan's answer and the solution from geeksforgeeks and written my own code as below:
To understand my code first understand below examples:
we know, decodings([1, 2]) are "AB" or "L" and so decoding_counts([1, 2]) == 2
And, decodings([1, 2, 1]) are "ABA", "AU", "LA" and so decoding_counts([1, 2, 1]) == 3
using the above two examples let's evaluate decodings([1, 2, 1, 4]):
case:- "taking next digit as single digit"
taking 4 as single digit to decode to letter 'D', we get decodings([1, 2, 1, 4]) == decoding_counts([1, 2, 1]) because [1, 2, 1, 4] will be decode as "ABAD", "AUD", "LAD"
case:- "combining next digit with the previous digit"
combining 4 with previous 1 as 14 as a single to decode to letter N, we get decodings([1, 2, 1, 4]) == decoding_counts([1, 2]) because [1, 2, 1, 4] will be decode as "ABN" or "LN"
Below is my Python code, read comments
def decoding_counts(digits):
# defininig count as, counts[i] -> decoding_counts(digits[: i+1])
counts = [0] * len(digits)
counts[0] = 1
for i in xrange(1, len(digits)):
# case:- "taking next digit as single digit"
if digits[i] != 0: # `0` do not have mapping to any letter
counts[i] = counts[i -1]
# case:- "combining next digit with the previous digit"
combine = 10 * digits[i - 1] + digits[i]
if 10 <= combine <= 26: # two digits mappings
counts[i] += (1 if i < 2 else counts[i-2])
return counts[-1]
for digits in "13", "121", "1214", "1234121":
print digits, "-->", decoding_counts(map(int, digits))
outputs:
13 --> 2
121 --> 3
1214 --> 5
1234121 --> 9
note: I assumed that input digits do not start with 0 and only consists of 0-9 and have a sufficent length
For Swift, this is what I came up with. Basically, I converted the string into an array and goes through it, adding a space into different positions of this array, then appending them to another array for the second part, which should be easy after this is done.
//test case
let input = [1,2,2,1]
func combination(_ input: String) {
var arr = Array(input)
var possible = [String]()
//... means inclusive range
for i in 2...arr.count {
var temp = arr
//basically goes through it backwards so
// adding the space doesn't mess up the index
for j in (1..<i).reversed() {
temp.insert(" ", at: j)
possible.append(String(temp))
}
}
print(possible)
}
combination(input)
//prints:
//["1 221", "12 21", "1 2 21", "122 1", "12 2 1", "1 2 2 1"]
def stringCombinations(digits, i=0, s=''):
if i == len(digits):
print(s)
return
alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
total = 0
for j in range(i, min(i + 1, len(digits) - 1) + 1):
total = (total * 10) + digits[j]
if 0 < total <= 26:
stringCombinations(digits, j + 1, s + alphabet[total - 1])
if __name__ == '__main__':
digits = list()
n = input()
n.split()
d = list(n)
for i in d:
i = int(i)
digits.append(i)
print(digits)
stringCombinations(digits)

how to match dna sequence pattern

I am getting a trouble finding an approach to solve this problem.
Input-output sequences are as follows :
**input1 :** aaagctgctagag
**output1 :** a3gct2ag2
**input2 :** aaaaaaagctaagctaag
**output2 :** a6agcta2ag
Input nsequence can be of 10^6 characters and largest continuous patterns will be considered.
For example for input2 "agctaagcta" output will not be "agcta2gcta" but it will be "agcta2".
Any help appreciated.
Explanation of the algorithm:
Having a sequence S with symbols s(1), s(2),…, s(N).
Let B(i) be the best compressed sequence with elements s(1), s(2),…,s(i).
So, for example, B(3) will be the best compressed sequence for s(1), s(2), s(3).
What we want to know is B(N).
To find it, we will proceed by induction. We want to calculate B(i+1), knowing B(i), B(i-1), B(i-2), …, B(1), B(0), where B(0) is empty sequence, and and B(1) = s(1). At the same time, this constitutes a proof that the solution is optimal. ;)
To calculate B(i+1), we will pick the best sequence among the candidates:
Candidate sequences where the last block has one element:
B(i )s(i+1)1
B(i-1)s(i+1)2 ; only if s(i) = s(i+1)
B(i-2)s(i+1)3 ; only if s(i-1) = s(i) and s(i) = s(i+1)
…
B(1)s(i+1)[i-1] ; only if s(2)=s(3) and s(3)=s(4) and … and s(i) = s(i+1)
B(0)s(i+1)i = s(i+1)i ; only if s(1)=s(2) and s(2)=s(3) and … and s(i) = s(i+1)
Candidate sequences where the last block has 2 elements:
B(i-1)s(i)s(i+1)1
B(i-3)s(i)s(i+1)2 ; only if s(i-2)s(i-1)=s(i)s(i+1)
B(i-5)s(i)s(i+1)3 ; only if s(i-4)s(i-3)=s(i-2)s(i-1) and s(i-2)s(i-1)=s(i)s(i+1)
…
Candidate sequences where the last block has 3 elements:
…
Candidate sequences where the last block has 4 elements:
…
…
Candidate sequences where last block has n+1 elements:
s(1)s(2)s(3)………s(i+1)
For each possibility, the algorithm stops when the sequence block is no longer repeated. And that’s it.
The algorithm will be some thing like this in psude-c code:
B(0) = “”
for (i=1; i<=N; i++) {
// Calculate all the candidates for B(i)
BestCandidate=null
for (j=1; j<=i; j++) {
Calculate all the candidates of length (i)
r=1;
do {
Candidadte = B([i-j]*r-1) s(i-j+1)…s(i-1)s(i) r
If ( (BestCandidate==null)
|| (Candidate is shorter that BestCandidate))
{
BestCandidate=Candidate.
}
r++;
} while ( ([i-j]*r <= i)
&&(s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j))
}
B(i)=BestCandidate
}
Hope that this can help a little more.
The full C program performing the required task is given below. It runs in O(n^2). The central part is only 30 lines of code.
EDIT I have restructured a little bit the code, changed the names of the variables and added some comment in order to be more readable.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
// This struct represents a compressed segment like atg4, g3, agc1
struct Segment {
char *elements;
int nElements;
int count;
};
// As an example, for the segment agagt3 elements would be:
// {
// elements: "agagt",
// nElements: 5,
// count: 3
// }
struct Sequence {
struct Segment lastSegment;
struct Sequence *prev; // Points to a sequence without the last segment or NULL if it is the first segment
int totalLen; // Total length of the compressed sequence.
};
// as an example, for the sequence agt32ta5, the representation will be:
// {
// lastSegment:{"ta" , 2 , 5},
// prev: #A,
// totalLen: 8
// }
// and A will be
// {
// lastSegment{ "agt", 3, 32},
// prev: NULL,
// totalLen: 5
// }
// This function converts a sequence to a string.
// You have to free the string after using it.
// The strategy is to construct the string from right to left.
char *sequence2string(struct Sequence *S) {
char *Res=malloc(S->totalLen + 1);
char *digits="0123456789";
int p= S->totalLen;
Res[p]=0;
while (S!=NULL) {
// first we insert the count of the last element.
// We do digit by digit starting with the units.
int C = S->lastSegment.count;
while (C) {
p--;
Res[p] = digits[ C % 10 ];
C /= 10;
}
p -= S->lastSegment.nElements;
strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);
S = S ->prev;
}
return Res;
}
// Compresses a dna sequence.
// Returns a string with the in sequence compressed.
// The returned string must be freed after using it.
char *dnaCompress(char *in) {
int i,j;
int N = strlen(in);; // Number of elements of a in sequence.
// B is an array of N+1 sequences where B(i) is the best compressed sequence sequence of the first i characters.
// What we want to return is B[N];
struct Sequence *B;
B = malloc((N+1) * sizeof (struct Sequence));
// We first do an initialization for i=0
B[0].lastSegment.elements="";
B[0].lastSegment.nElements=0;
B[0].lastSegment.count=0;
B[0].prev = NULL;
B[0].totalLen=0;
// and set totalLen of all the sequences to a very HIGH VALUE in this case N*2 will be enougth, We will try different sequences and keep the minimum one.
for (i=1; i<=N; i++) B[i].totalLen = INT_MAX; // A very high value
for (i=1; i<=N; i++) {
// at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
for (j=1; j<=i; j++) {
// Here we will check all the candidates where the last segment has j elements
int r=1; // number of times the last segment is repeated
int rNDigits=1; // Number of digits of r
int rNDigitsBound=10; // We will increment r, so this value is when r will have an extra digit.
// when r = 0,1,...,9 => rNDigitsBound = 10
// when r = 10,11,...,99 => rNDigitsBound = 100
// when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.
do {
// Here we analitze a candidate B(i).
// where the las segment has j elements repeated r times.
int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
if (CandidateLen < B[i].totalLen) {
B[i].lastSegment.elements = in + i - j*r;
B[i].lastSegment.nElements = j;
B[i].lastSegment.count = r;
B[i].prev = &(B[i-j*r]);
B[i].totalLen = CandidateLen;
}
r++;
if (r == rNDigitsBound ) {
rNDigits++;
rNDigitsBound *= 10;
}
} while ( (i - j*r >= 0)
&& (strncmp(in + i -j, in + i - j*r, j)==0));
}
}
char *Res=sequence2string(&(B[N]));
free(B);
return Res;
}
int main(int argc, char** argv) {
char *compressedDNA=dnaCompress(argv[1]);
puts(compressedDNA);
free(compressedDNA);
return 0;
}
Forget Ukonnen. Dynamic programming it is. With 3-dimensional table:
sequence position
subsequence size
number of segments
TERMINOLOGY: For example, having a = "aaagctgctagag", sequence position coordinate would run from 1 to 13. At sequence position 3 (letter 'g'), having subsequence size 4, the subsequence would be "gctg". Understood? And as for the number of segments, then expressing a as "aaagctgctagag1" consists of 1 segment (the sequence itself). Expressing it as "a3gct2ag2" consists of 3 segments. "aaagctgct1ag2" consists of 2 segments. "a2a1ctg2ag2" would consist of 4 segments. Understood? Now, with this, you start filling a 3-dimensional array 13 x 13 x 13, so your time and memory complexity seems to be around n ** 3 for this. Are you sure you can handle it for million-bp sequences? I think that greedy approach would be better, because large DNA sequences are unlikely to repeat exactly. And, I would suggest that you widen your assignment to approximate matches, and you can publish it straight in a journal.
Anyway, you will start filling the table of compressing a subsequence starting at some position (dimension 1) with length equal to dimension 2 coordinate, having at most dimension 3 segments. So you first fill the first row, representing compressions of subsequences of length 1 consisting of at most 1 segment:
a a a g c t g c t a g a g
1(a1) 1(a1) 1(a1) 1(g1) 1(c1) 1(t1) 1(g1) 1(c1) 1(t1) 1(a1) 1(g1) 1(a1) 1(g1)
The number is the character cost (always 1 for these trivial 1-char sequences; number 1 does not count into the character cost), and in the parenthesis, you have the compression (also trivial for this simple case). The second row will be still simple:
2(a2) 2(a2) 2(ag1) 2(gc1) 2(ct1) 2(tg1) 2(gc1) 2(ct1) 2(ta1) 2(ag1) 2(ga1) 2(ag1)
There is only 1 way to decompose a 2-character sequence into 2 subsequences -- 1 character + 1 character. If they are identical, the result is like a + a = a2. If they are different, such as a + g, then, because only 1-segment sequences are admissible, the result cannot be a1g1, but must be ag1. The third row will be finally more interesting:
2(a3) 2(aag1) 3(agc1) 3(gct1) 3(ctg1) 3(tgc1) 3(gct1) 3(cta1) 3(tag1) 3(aga1) 3(gag1)
Here, you can always choose between 2 ways of composing the compressed string. For example, aag can be composed either as aa + g or a + ag. But again, we cannot have 2 segments, as in aa1g1 or a1ag1, so we must be satisfied with aag1, unless both components consist of the same character, as in aa + a => a3, with character cost 2. We can continue onto 4 th line:
4(aaag1) 4(aagc1) 4(agct1) 4(gctg1) 4(ctgc1) 4(tgct1) 4(gcta1) 4(ctag1) 4(taga1) 3(ag2)
Here, on the first position, we cannot use a3g1, because only 1 segment is allowed at this layer. But at the last position, compression to character cost 3 is agchieved by ag1 + ag1 = ag2. This way, one can fill the whole first-level table all the way up to the single subsequence of 13 characters, and each subsequence will have its optimal character cost and its compression under the first-level constraint of at most 1 segment associated with it.
Then you go to the 2nd level, where 2 segments are allowed... And again, from the bottom up, you identify the optimum cost and compression of each table coordinate under the given level's segment count constraint, by comparing all the possible ways to compose the subsequence using already computed positions, until you fill the table completely and thus compute the global optimum. There are some details to solve, but sorry, I'm not gonna code this for you.
After trying my own way for a while, my kudos to jbaylina for his beautiful algorithm and C implementation. Here's my attempted version of jbaylina's algorithm in Haskell, and below it further development of my attempt at a linear-time algorithm that attempts to compress segments that include repeated patterns in a one-by-one fashion:
import Data.Map (fromList, insert, size, (!))
compress s = (foldl f (fromList [(0,([],0)),(1,([s!!0],1))]) [1..n - 1]) ! n
where
n = length s
f b i = insert (size b) bestCandidate b where
add (sequence, sLength) (sequence', sLength') =
(sequence ++ sequence', sLength + sLength')
j' = [1..min 100 i]
bestCandidate = foldr combCandidates (b!i `add` ([s!!i,'1'],2)) j'
combCandidates j candidate' =
let nextCandidate' = comb 2 (b!(i - j + 1)
`add` ((take j . drop (i - j + 1) $ s) ++ "1", j + 1))
in if snd nextCandidate' <= snd candidate'
then nextCandidate'
else candidate' where
comb r candidate
| r > uBound = candidate
| not (strcmp r True) = candidate
| snd nextCandidate <= snd candidate = comb (r + 1) nextCandidate
| otherwise = comb (r + 1) candidate
where
uBound = div (i + 1) j
prev = b!(i - r * j + 1)
nextCandidate = prev `add`
((take j . drop (i - j + 1) $ s) ++ show r, j + length (show r))
strcmp 1 _ = True
strcmp num bool
| (take j . drop (i - num * j + 1) $ s)
== (take j . drop (i - (num - 1) * j + 1) $ s) =
strcmp (num - 1) True
| otherwise = False
Output:
*Main> compress "aaagctgctagag"
("a3gct2ag2",9)
*Main> compress "aaabbbaaabbbaaabbbaaabbb"
("aaabbb4",7)
Linear-time attempt:
import Data.List (sortBy)
group' xxs sAccum (chr, count)
| null xxs = if null chr
then singles
else if count <= 2
then reverse sAccum ++ multiples ++ "1"
else singles ++ if null chr then [] else chr ++ show count
| [x] == chr = group' xs sAccum (chr,count + 1)
| otherwise = if null chr
then group' xs (sAccum) ([x],1)
else if count <= 2
then group' xs (multiples ++ sAccum) ([x],1)
else singles
++ chr ++ show count ++ group' xs [] ([x],1)
where x:xs = xxs
singles = reverse sAccum ++ (if null sAccum then [] else "1")
multiples = concat (replicate count chr)
sequences ws strIndex maxSeqLen = repeated' where
half = if null . drop (2 * maxSeqLen - 1) $ ws
then div (length ws) 2 else maxSeqLen
repeated' = let (sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = repeated
in (sequence,(sequenceStart, sequenceEnd'))
repeated = foldr divide ([],(strIndex,strIndex),False) [1..half]
equalChunksOf t a = takeWhile(==t) . map (take a) . iterate (drop a)
divide chunkSize b#(sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) =
let t = take (2*chunkSize) ws
t' = take chunkSize t
in if t' == drop chunkSize t
then let ts = equalChunksOf t' chunkSize ws
lenTs = length ts
sequenceEnd = strIndex + lenTs * chunkSize
newEnd = if sequenceEnd > sequenceEnd'
then sequenceEnd else sequenceEnd'
in if chunkSize > 1
then if length (group' (concat (replicate lenTs t')) [] ([],0)) > length (t' ++ show lenTs)
then (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),True)
else b
else if notSinglesFlag
then b
else (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),False)
else b
addOne a b
| null (fst b) = a
| null (fst a) = b
| otherwise =
let (((start,end,patLen,lenS),sequence):rest,(sStart,sEnd)) = a
(((start',end',patLen',lenS'),sequence'):rest',(sStart',sEnd')) = b
in if sStart' < sEnd && sEnd < sEnd'
then let c = ((start,end,patLen,lenS),sequence):rest
d = ((start',end',patLen',lenS'),sequence'):rest'
in (c ++ d, (sStart, sEnd'))
else a
segment xs baseIndex maxSeqLen = segment' xs baseIndex baseIndex where
segment' zzs#(z:zs) strIndex farthest
| null zs = initial
| strIndex >= farthest && strIndex > 0 = ([],(0,0))
| otherwise = addOne initial next
where
next#(s',(start',end')) = segment' zs (strIndex + 1) farthest'
farthest' | null s = farthest
| otherwise = if start /= end && end > farthest then end else farthest
initial#(s,(start,end)) = sequences zzs strIndex maxSeqLen
areExclusive ((a,b,_,_),_) ((a',b',_,_),_) = (a' >= b) || (b' <= a)
combs [] r = [r]
combs (x:xs) r
| null r = combs xs (x:r) ++ if null xs then [] else combs xs r
| otherwise = if areExclusive (head r) x
then combs xs (x:r) ++ combs xs r
else if l' > lowerBound
then combs xs (x: reduced : drop 1 r) ++ combs xs r
else combs xs r
where lowerBound = l + 2 * patLen
((l,u,patLen,lenS),s) = head r
((l',u',patLen',lenS'),s') = x
reduce = takeWhile (>=l') . iterate (\x -> x - patLen) $ u
lenReduced = length reduce
reduced = ((l,u - lenReduced * patLen,patLen,lenS - lenReduced),s)
buildString origStr sequences = buildString' origStr sequences 0 (0,"",0)
where
buildString' origStr sequences index accum#(lenC,cStr,lenOrig)
| null sequences = accum
| l /= index =
buildString' (drop l' origStr) sequences l (lenC + l' + 1, cStr ++ take l' origStr ++ "1", lenOrig + l')
| otherwise =
buildString' (drop u' origStr) rest u (lenC + length s', cStr ++ s', lenOrig + u')
where
l' = l - index
u' = u - l
s' = s ++ show lenS
(((l,u,patLen,lenS),s):rest) = sequences
compress [] _ accum = reverse accum ++ (if null accum then [] else "1")
compress zzs#(z:zs) maxSeqLen accum
| null (fst segment') = compress zs maxSeqLen (z:accum)
| (start,end) == (0,2) && not (null accum) = compress zs maxSeqLen (z:accum)
| otherwise =
reverse accum ++ (if null accum || takeWhile' compressedStr 0 /= 0 then [] else "1")
++ compressedStr
++ compress (drop lengthOriginal zzs) maxSeqLen []
where segment'#(s,(start,end)) = segment zzs 0 maxSeqLen
combinations = combs (fst $ segment') []
takeWhile' xxs count
| null xxs = 0
| x == '1' && null (reads (take 1 xs)::[(Int,String)]) = count
| not (null (reads [x]::[(Int,String)])) = 0
| otherwise = takeWhile' xs (count + 1)
where x:xs = xxs
f (lenC,cStr,lenOrig) (lenC',cStr',lenOrig') =
let g = compare ((fromIntegral lenC + if not (null accum) && takeWhile' cStr 0 == 0 then 1 else 0) / fromIntegral lenOrig)
((fromIntegral lenC' + if not (null accum) && takeWhile' cStr' 0 == 0 then 1 else 0) / fromIntegral lenOrig')
in if g == EQ
then compare (takeWhile' cStr' 0) (takeWhile' cStr 0)
else g
(lenCompressed,compressedStr,lengthOriginal) =
head $ sortBy f (map (buildString (take end zzs)) (map reverse combinations))
Output:
*Main> compress "aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb" 100 []
"a9b9a9b9"
*Main> compress "aaabbbaaabbbaaabbbaaabbb" 100 []
"aaabbb4"

How to calculate the index (lexicographical order) when the combination is given

I know that there is an algorithm that permits, given a combination of number (no repetitions, no order), calculates the index of the lexicographic order.
It would be very useful for my application to speedup things...
For example:
combination(10, 5)
1 - 1 2 3 4 5
2 - 1 2 3 4 6
3 - 1 2 3 4 7
....
251 - 5 7 8 9 10
252 - 6 7 8 9 10
I need that the algorithm returns the index of the given combination.
es: index( 2, 5, 7, 8, 10 ) --> index
EDIT: actually I'm using a java application that generates all combinations C(53, 5) and inserts them into a TreeMap.
My idea is to create an array that contains all combinations (and related data) that I can index with this algorithm.
Everything is to speedup combination searching.
However I tried some (not all) of your solutions and the algorithms that you proposed are slower that a get() from TreeMap.
If it helps: my needs are for a combination of 5 from 53 starting from 0 to 52.
Thank you again to all :-)
Here is a snippet that will do the work.
#include <iostream>
int main()
{
const int n = 10;
const int k = 5;
int combination[k] = {2, 5, 7, 8, 10};
int index = 0;
int j = 0;
for (int i = 0; i != k; ++i)
{
for (++j; j != combination[i]; ++j)
{
index += c(n - j, k - i - 1);
}
}
std::cout << index + 1 << std::endl;
return 0;
}
It assumes you have a function
int c(int n, int k);
that will return the number of combinations of choosing k elements out of n elements.
The loop calculates the number of combinations preceding the given combination.
By adding one at the end we get the actual index.
For the given combination there are
c(9, 4) = 126 combinations containing 1 and hence preceding it in lexicographic order.
Of the combinations containing 2 as the smallest number there are
c(7, 3) = 35 combinations having 3 as the second smallest number
c(6, 3) = 20 combinations having 4 as the second smallest number
All of these are preceding the given combination.
Of the combinations containing 2 and 5 as the two smallest numbers there are
c(4, 2) = 6 combinations having 6 as the third smallest number.
All of these are preceding the given combination.
Etc.
If you put a print statement in the inner loop you will get the numbers
126, 35, 20, 6, 1.
Hope that explains the code.
Convert your number selections to a factorial base number. This number will be the index you want. Technically this calculates the lexicographical index of all permutations, but if you only give it combinations, the indexes will still be well ordered, just with some large gaps for all the permutations that come in between each combination.
Edit: pseudocode removed, it was incorrect, but the method above should work. Too tired to come up with correct pseudocode at the moment.
Edit 2: Here's an example. Say we were choosing a combination of 5 elements from a set of 10 elements, like in your example above. If the combination was 2 3 4 6 8, you would get the related factorial base number like so:
Take the unselected elements and count how many you have to pass by to get to the one you are selecting.
1 2 3 4 5 6 7 8 9 10
2 -> 1
1 3 4 5 6 7 8 9 10
3 -> 1
1 4 5 6 7 8 9 10
4 -> 1
1 5 6 7 8 9 10
6 -> 2
1 5 7 8 9 10
8 -> 3
So the index in factorial base is 1112300000
In decimal base, it's
1*9! + 1*8! + 1*7! + 2*6! + 3*5! = 410040
This is Algorithm 2.7 kSubsetLexRank on page 44 of Combinatorial Algorithms by Kreher and Stinson.
r = 0
t[0] = 0
for i from 1 to k
if t[i - 1] + 1 <= t[i] - 1
for j from t[i - 1] to t[i] - 1
r = r + choose(n - j, k - i)
return r
The array t holds your values, for example [5 7 8 9 10]. The function choose(n, k) calculates the number "n choose k". The result value r will be the index, 251 for the example. Other inputs are n and k, for the example they would be 10 and 5.
zero-base,
# v: array of length k consisting of numbers between 0 and n-1 (ascending)
def index_of_combination(n,k,v):
idx = 0
for p in range(k-1):
if p == 0: arrg = range(1,v[p]+1)
else: arrg = range(v[p-1]+2, v[p]+1)
for a in arrg:
idx += combi[n-a, k-1-p]
idx += v[k-1] - v[k-2] - 1
return idx
Null Set has the right approach. The index corresponds to the factorial-base number of the sequence. You build a factorial-base number just like any other base number, except that the base decreases for each digit.
Now, the value of each digit in the factorial-base number is the number of elements less than it that have not yet been used. So, for combination(10, 5):
(1 2 3 4 5) == 0*9!/5! + 0*8!/5! + 0*7!/5! + 0*6!/5! + 0*5!/5!
== 0*3024 + 0*336 + 0*42 + 0*6 + 0*1
== 0
(10 9 8 7 6) == 9*3024 + 8*336 + 7*42 + 6*6 + 5*1
== 30239
It should be pretty easy to calculate the index incrementally.
If you have a set of positive integers 0<=x_1 < x_2< ... < x_k , then you could use something called the squashed order:
I = sum(j=1..k) Choose(x_j,j)
The beauty of the squashed order is that it works independent of the largest value in the parent set.
The squashed order is not the order you are looking for, but it is related.
To use the squashed order to get the lexicographic order in the set of k-subsets of {1,...,n) is by taking
1 <= x1 < ... < x_k <=n
compute
0 <= n-x_k < n-x_(k-1) ... < n-x_1
Then compute the squashed order index of (n-x_k,...,n-k_1)
Then subtract the squashed order index from Choose(n,k) to get your result, which is the lexicographic index.
If you have relatively small values of n and k, you can cache all the values Choose(a,b) with a
See Anderson, Combinatorics on Finite Sets, pp 112-119
I needed also the same for a project of mine and the fastest solution I found was (Python):
import math
def nCr(n,r):
f = math.factorial
return f(n) / f(r) / f(n-r)
def index(comb,n,k):
r=nCr(n,k)
for i in range(k):
if n-comb[i]<k-i:continue
r=r-nCr(n-comb[i],k-i)
return r
My input "comb" contained elements in increasing order You can test the code with for example:
import itertools
k=3
t=[1,2,3,4,5]
for x in itertools.combinations(t, k):
print x,index(x,len(t),k)
It is not hard to prove that if comb=(a1,a2,a3...,ak) (in increasing order) then:
index=[nCk-(n-a1+1)Ck] + [(n-a1)C(k-1)-(n-a2+1)C(k-1)] + ... =
nCk -(n-a1)Ck -(n-a2)C(k-1) - .... -(n-ak)C1
There's another way to do all this. You could generate all possible combinations and write them into a binary file where each comb is represented by it's index starting from zero. Then, when you need to find an index, and the combination is given, you apply a binary search on the file. Here's the function. It's written in VB.NET 2010 for my lotto program, it works with Israel lottery system so there's a bonus (7th) number; just ignore it.
Public Function Comb2Index( _
ByVal gAr() As Byte) As UInt32
Dim mxPntr As UInt32 = WHL.AMT.WHL_SYS_00 '(16.273.488)
Dim mdPntr As UInt32 = mxPntr \ 2
Dim eqCntr As Byte
Dim rdAr() As Byte
modBinary.OpenFile(WHL.WHL_SYS_00, _
FileMode.Open, FileAccess.Read)
Do
modBinary.ReadBlock(mdPntr, rdAr)
RP: If eqCntr = 7 Then GoTo EX
If gAr(eqCntr) = rdAr(eqCntr) Then
eqCntr += 1
GoTo RP
ElseIf gAr(eqCntr) < rdAr(eqCntr) Then
If eqCntr > 0 Then eqCntr = 0
mxPntr = mdPntr
mdPntr \= 2
ElseIf gAr(eqCntr) > rdAr(eqCntr) Then
If eqCntr > 0 Then eqCntr = 0
mdPntr += (mxPntr - mdPntr) \ 2
End If
Loop Until eqCntr = 7
EX: modBinary.CloseFile()
Return mdPntr
End Function
P.S. It takes 5 to 10 mins to generate 16 million combs on a Core 2 Duo. To find the index using binary search on file takes 397 milliseconds on a SATA drive.
Assuming the maximum setSize is not too large, you can simply generate a lookup table, where the inputs are encoded this way:
int index(a,b,c,...)
{
int key = 0;
key |= 1<<a;
key |= 1<<b;
key |= 1<<c;
//repeat for all arguments
return Lookup[key];
}
To generate the lookup table, look at this "banker's order" algorithm. Generate all the combinations, and also store the base index for each nItems. (For the example on p6, this would be [0,1,5,11,15]). Note that by you storing the answers in the opposite order from the example (LSBs set first) you will only need one table, sized for the largest possible set.
Populate the lookup table by walking through the combinations doing Lookup[combination[i]]=i-baseIdx[nItems]
EDIT: Never mind. This is completely wrong.
Let your combination be (a1, a2, ..., ak-1, ak) where a1 < a2 < ... < ak. Let choose(a,b) = a!/(b!*(a-b)!) if a >= b and 0 otherwise. Then, the index you are looking for is
choose(ak-1, k) + choose(ak-1-1, k-1) + choose(ak-2-1, k-2) + ... + choose (a2-1, 2) + choose (a1-1, 1) + 1
The first term counts the number of k-element combinations such that the largest element is less than ak. The second term counts the number of (k-1)-element combinations such that the largest element is less than ak-1. And, so on.
Notice that the size of the universe of elements to be chosen from (10 in your example) does not play a role in the computation of the index. Can you see why?
Sample solution:
class Program
{
static void Main(string[] args)
{
// The input
var n = 5;
var t = new[] { 2, 4, 5 };
// Helping transformations
ComputeDistances(t);
CorrectDistances(t);
// The algorithm
var r = CalculateRank(t, n);
Console.WriteLine("n = 5");
Console.WriteLine("t = {2, 4, 5}");
Console.WriteLine("r = {0}", r);
Console.ReadKey();
}
static void ComputeDistances(int[] t)
{
var k = t.Length;
while (--k >= 0)
t[k] -= (k + 1);
}
static void CorrectDistances(int[] t)
{
var k = t.Length;
while (--k > 0)
t[k] -= t[k - 1];
}
static int CalculateRank(int[] t, int n)
{
int k = t.Length - 1, r = 0;
for (var i = 0; i < t.Length; i++)
{
if (t[i] == 0)
{
n--;
k--;
continue;
}
for (var j = 0; j < t[i]; j++)
{
n--;
r += CalculateBinomialCoefficient(n, k);
}
n--;
k--;
}
return r;
}
static int CalculateBinomialCoefficient(int n, int k)
{
int i, l = 1, m, x, y;
if (n - k < k)
{
x = k;
y = n - k;
}
else
{
x = n - k;
y = k;
}
for (i = x + 1; i <= n; i++)
l *= i;
m = CalculateFactorial(y);
return l/m;
}
static int CalculateFactorial(int n)
{
int i, w = 1;
for (i = 1; i <= n; i++)
w *= i;
return w;
}
}
The idea behind the scenes is to associate a k-subset with an operation of drawing k-elements from the n-size set. It is a combination, so the overall count of possible items will be (n k). It is a clue that we could seek the solution in Pascal Triangle. After a while of comparing manually written examples with the appropriate numbers from the Pascal Triangle, we will find the pattern and hence the algorithm.
I used user515430's answer and converted to python3. Also this supports non-continuous values so you could pass in [1,3,5,7,9] as your pool instead of range(1,11)
from itertools import combinations
from scipy.special import comb
from pandas import Index
debugcombinations = False
class IndexedCombination:
def __init__(self, _setsize, _poolvalues):
self.setsize = _setsize
self.poolvals = Index(_poolvalues)
self.poolsize = len(self.poolvals)
self.totalcombinations = 1
fast_k = min(self.setsize, self.poolsize - self.setsize)
for i in range(1, fast_k + 1):
self.totalcombinations = self.totalcombinations * (self.poolsize - fast_k + i) // i
#fill the nCr cache
self.choose_cache = {}
n = self.poolsize
k = self.setsize
for i in range(k + 1):
for j in range(n + 1):
if n - j >= k - i:
self.choose_cache[n - j,k - i] = comb(n - j,k - i, exact=True)
if debugcombinations:
print('testnth = ' + str(self.testnth()))
def get_nth_combination(self,index):
n = self.poolsize
r = self.setsize
c = self.totalcombinations
#if index < 0 or index >= c:
# raise IndexError
result = []
while r:
c, n, r = c*r//n, n-1, r-1
while index >= c:
index -= c
c, n = c*(n-r)//n, n-1
result.append(self.poolvals[-1 - n])
return tuple(result)
def get_n_from_combination(self,someset):
n = self.poolsize
k = self.setsize
index = 0
j = 0
for i in range(k):
setidx = self.poolvals.get_loc(someset[i])
for j in range(j + 1, setidx + 1):
index += self.choose_cache[n - j, k - i - 1]
j += 1
return index
#just used to test whether nth_combination from the internet actually works
def testnth(self):
n = 0
_setsize = self.setsize
mainset = self.poolvals
for someset in combinations(mainset, _setsize):
nthset = self.get_nth_combination(n)
n2 = self.get_n_from_combination(nthset)
if debugcombinations:
print(str(n) + ': ' + str(someset) + ' vs ' + str(n2) + ': ' + str(nthset))
if n != n2:
return False
for x in range(_setsize):
if someset[x] != nthset[x]:
return False
n += 1
return True
setcombination = IndexedCombination(5, list(range(1,10+1)))
print( str(setcombination.get_n_from_combination([2,5,7,8,10])))
returns 188

Resources