Algorithm on interview - algorithm

Recently I was asked the following interview question:
You have two sets of numbers of the same length N, for example A = [3, 5, 9] and B = [7, 5, 1]. Next, for each position i in range 0..N-1, you can pick either number A[i] or B[i], so at the end you will have another array C of length N which consists in elements from A and B. If sum of all elements in C is less than or equal to K, then such array is good. Please write an algorithm to figure out the total number of good arrays by given arrays A, B and number K.
The only solution I've come up is Dynamic Programming approach, when we have a matrix of size NxK and M[i][j] represents how many combinations could we have for number X[i] if current sum is equal to j. But looks like they expected me to come up with a formula. Could you please help me with that? At least what direction should I look for? Will appreciate any help. Thanks.

After some consideration, I believe this is an NP-complete problem. Consider:
A = [0, 0, 0, ..., 0]
B = [b1, b2, b3, ..., bn]
Note that every construction of the third set C = ( A[i] or B[i] for i = 0..n ) is is just the union of some subset of A and some subset of B. In this case, since every subset of A sums to 0, the sum of C is the same as the sum of some subset of B.
Now your question "How many ways can we construct C with a sum less than K?" can be restated as "How many subsets of B sum to less than K?". Solving this problem for K = 1 and K = 0 yields the solution to the subset sum problem for B (the difference between the two solutions is the number of subsets that sum to 0).
By similar argument, even in the general case where A contains nonzero elements, we can construct an array S = [b1-a1, b2-a2, b3-a3, ..., bn-an], and the question becomes "How many subsets of S sum to less than K - sum(A)?"
Since the subset sum problem is NP-complete, this problem must be also. So with that in mind, I would venture that the dynamic programming solution you proposed is the best you can do, and certainly no magic formula exists.

" Please write an algorithm to figure out the total number of good
arrays by given arrays A, B and number K."
Is it not the goal?
int A[];
int B[];
int N;
int K;
int Solutions = 0;
void FindSolutons(int Depth, int theSumSoFar) {
if (theSumSoFar > K) return;
if (Depth >= N) {
Solutions++;
return;
}
FindSolutions(Depth+1,theSumSoFar+A[Depth]);
FindSolutions(Depth+1,theSumSoFar+B[Depth]);
}
Invoke FindSolutions with both arguments set to zero. On return the Solutions will be equal to the number of good arrays;

this is how i would try to solve the problem
(Sorry if its stupid)
think of arrays
A=[3,5,9,8,2]
B=[7,5,1,8,2]
if
elements
0..N-1
number of choices
2^N
C1=0,C2=0
for all A[i]=B[i]
{
C1++
C2+=A[i]+B[i]
}
then create new two arrays like
A1=[3,5,9]
B1=[7,5,1]
also now C2 is 10
now number of all choices are reduced to 2^(N-C1)
now calculate all good numbers
using 'K' as K=K-C2
unfortunately
no matter what method you use, you have
to calculate sum 2^(N-C1) times

So there's 2^N choices, since at each point you either pick from A or from B. In the specific example you give where N happens to be 3 there are 8. For discussion you can characterise each set of decisions as a bit pattern.
So as a brute-force approach would try every single bit pattern.
But what should be obvious is that if the first few bits produce a number too large then every subsequent possible group of tail bits will also produce a number that is too large. So probably a better way to model it is a tree where you don't bother walking down the limbs that have already grown beyond your limit.
You can also compute the maximum totals that can be reached from each bit to the end of the table. If at any point your running total plus the maximum that you can obtain from here on down is less than K then every subtree from where you are is acceptable without any need for traversal. The case, as discussed in the comments, where every single combination is acceptable is a special case of this observation.
As pointed out by Serge below, a related observation is to us minimums and use the converse logic to cancel whole subtrees without traversal.
A potential further optimisation rests behind the observation that, as long as we shuffle each in the same way, changing the order of A and B has no effect because addition is commutative. You can therefore make an effort to ensure either that the maximums grow as quickly as possible or the minimums grow as slowly as possible, to try to get the earliest possible exit from traversal. In practice you'd probably want to apply a heuristic comparing the absolute maximum and minimum (both of which you've computed anyway) to K.
That being the case, a recursive implementation is easiest, e.g. (in C)
/* assume A, B and N are known globals */
unsigned int numberOfGoodArraysFromBit(
unsigned int bit,
unsigned int runningTotal,
unsigned int limit)
{
// have we ended up in an unacceptable subtree?
if(runningTotal > limit) return 0;
// have we reached the leaf node without at any
// point finding this subtree to be unacceptable?
if(bit >= N) return 1;
// maybe every subtree is acceptable?
if(runningTotal + MAXV[bit] <= limit)
{
return 1 << (N - bit);
}
// maybe no subtrees are acceptable?
if(runningTotal + MINV[bit] > limit)
{
return 0;
}
// if we can't prima facie judge the subtreees,
// we'll need specifically to evaluate them
return
numberOfGoodArraysFromBit(bit+1, runningTotal+A[bit], limit) +
numberOfGoodArraysFromBit(bit+1, runningTotal+B[bit], limit);
}
// work out the minimum and maximum values at each position
for(int i = 0; i < N; i++)
{
MAXV[i] = MAX(A[i], B[i]);
MINV[i] = MIN(A[i], B[i]);
}
// hence work out the cumulative totals from right to left
for(int i = N-2; i >= 0; i--)
{
MAXV[i] += MAXV[i+1];
MINV[i] += MINV[i+1];
}
// to kick it off
printf("Total valid combinations is %u", numberOfGoodArraysFromBit(0, 0, K));
I'm just thinking extemporaneously; it's likely better solutions exist.

Related

Find minimum cost to convert array to arithmetic progression

I recently encountered this question in an interview. I couldn't really come up with an algorithm for this.
Given an array of unsorted integers, we have to find the minimum cost in which this array can be converted to an Arithmetic Progression where a cost of 1 unit is incurred if any element is changed in the array. Also, the value of the element ranges between (-inf,inf).
I sort of realised that DP can be used here, but I couldn't solve the equation. There were some constraints on the values, but I don't remember them. I am just looking for high level pseudo code.
EDIT
Here's a correct solution, unfortunately, while simple to understand it's not very efficient at O(n^3).
function costAP(arr) {
if(arr.length < 3) { return 0; }
var minCost = arr.length;
for(var i = 0; i < arr.length - 1; i++) {
for(var j = i + 1; j < arr.length; j++) {
var delta = (arr[j] - arr[i]) / (j - i);
var cost = 0;
for(var k = 0; k < arr.length; k++) {
if(k == i) { continue; }
if((arr[k] + delta * (i - k)) != arr[i]) { cost++; }
}
if(cost < minCost) { minCost = cost; }
}
}
return minCost;
}
Find the relative delta between every distinct pair of indices in the array
Use the relative delta to test the cost of transforming the whole array to AP using that delta
Return the minimum cost
Louis Ricci had the right basic idea of looking for the largest existing arithmetic progression, but assumed that it would have to appear in a single run, when in fact the elements of this progression can appear in any subset of the positions, e.g.:
1 42 3 69 5 1111 2222 8
requires just 4 changes:
42 69 1111 2222
1 3 5 8
To calculate this, notice that every AP has a rightmost element. We can suppose each element i of the input vector to be the rightmost AP position in turn, and for each such i consider all positions j to the left of i, determining the step size implied for each (i, j) combination and, when this is integer (indicating a valid AP), add one to the the number of elements that imply this step size and end at position i -- since all such elements belong to the same AP. The overall maximum is then the longest AP:
struct solution {
int len;
int pos;
int step;
};
solution longestArithProg(vector<int> const& v) {
solution best = { -1, 0, 0 };
for (int i = 1; i < v.size(); ++i) {
unordered_map<int, int> bestForStep;
for (int j = 0; j < i; ++j) {
int step = (v[i] - v[j]) / (i - j);
if (step * (i - j) == v[i] - v[j]) {
// This j gives an integer step size: record that j lies on this AP
int len = ++bestForStep[step];
if (len > best.len) {
best.len = len;
best.pos = i;
best.step = step;
}
}
}
}
++best.len; // We never counted the final element in the AP
return best;
}
The above C++ code uses O(n^2) time and O(n) space, since it loops over every pair of positions i and j, performing a single hash read and write for each. To answer the original problem:
int howManyChangesNeeded(vector<int> const& v) {
return v.size() - longestArithProg(v).len;
}
This problem has a simple geometric interpretation, which shows that it can be solved in O(n^2) time and probably can't be solved any faster than that (reduction from 3SUM). Suppose our array is [1, 2, 10, 3, 5]. We can write that array as a sequence of points
(0,1), (1,2), (2,10), (3,3), (4,5)
in which the x-value is the index of the array item and the y-value is the value of the array item. The question now becomes one of finding a line which passes the maximum possible number of points in that set. The cost of converting the array is the number of points not on a line, which is minimized when the number of points on a line is maximized.
A fairly definitive answer to that question is given in this SO posting: What is the most efficient algorithm to find a straight line that goes through most points?
The idea: for each point P in the set from left to right, find the line passing through that point and a maximum number of points to the right of P. (We don't need to look at points to the left of P because they would have been caught in an earlier iteration).
To find the maximum number of P-collinear points to the right of P, for each such point Q calculate the slope of the line segment PQ. Tally up the different slopes in a hash map. The slope which maps to the maximum number of hits is what you're looking for.
Technical issue: you probably don't want to use floating point arithmetic to calculate the slopes. On the other hand, if you use rational numbers, you potentially have to calculate the greatest common divisor in order to compare fractions by comparing numerator and denominator, which multiplies running time by a factor of log n. Instead, you should check equality of rational numbers a/b and c/d by testing whether ad == bc.
The SO posting referenced above gives a reduction from 3SUM, i.e., this problem is 3SUM-hard which shows that if this problem could be solved substantially faster than O(n^2), then 3SUM could also be solved substantially faster than O(n^2). This is where the condition that the integers are in (-inf,inf) comes in. If it is known that the integers are from a bounded set, the reduction from 3SUM is not definitive.
An interesting further question is whether the idea in the Wikipedia for solving 3SUM in O(n + N log N) time when the integers are in the bounded set (-N,N) can be used to solve the minimum cost to convert an array to an AP problem in time faster than O(n^2).
Given the array a = [a_1, a_2, ..., a_n] of unsorted integers, let diffs = [a_2-a_1, a_3-a_2, ..., a_n-a_(n-1)].
Find the maximum occurring value in diffs and adjust any values in a necessary so that all neighboring values differ by this amount.
Interestingly,even I had the same question in my campus recruitment test today.While doing the test itself,I realised that this logic of altering elements based on most frequent differences between 2 subsequent elements in the array fails in some cases.
Eg-4,5,8,9 .According to the logic of a2-a1,a3-a2 as proposed above,answer shud be 1 which is not the case.
As you suggested DP,I feel it can be on the lines of considering 2 values for each element in array-cost when it is modified as well as when it is not modified and return minimum of the 2.Finally terminate when you reach end of the array.

Modification to subsetsum algorithm by pisinger

I was looking at the algorithm by pisinger as detailed here
Fast solution to Subset sum algorithm by Pisinger
and on wikipedia http://en.wikipedia.org/wiki/Subset_sum_problem
For the case that each xi is positive and bounded by a fixed constant C, Pisinger found a linear time algorithm having time complexity O(NC).[3] (Note that this is for the version of the problem where the target sum is not necessarily zero, otherwise the problem would be trivial.)
It seems that, with his approach, there are two constraints. The first one in particular says that all values in any input, must be <= C.
It sounds to me that, with just that constraint alone, this is not an algorithm that can solve the original subset sum problem (with no restrictions).
But suppose C=1000000 for example. Then before we even run the algorithm on the input list. Can't we just divide every element by some number d (and also divide the target sum by d too) such that every number in the input list will be <= C.
When the algorithm returns some subset s, we just multiply every element in s by d. And we will have our true subset.
With that observation, it feels like it not worth mentioning that we need that C constraint. Or at least say that the input can be changed W.L.O.G (without loss of generality). So we can still solve the same problem no matter the input. That constraint is not making the problem any harder in other words. So really his algorithms only constraint is that it can only handle numbers >=0.
Am I right?
Thanks
What adrian.budau said is correct.
In Polynomial solution,
every value of array should remain as integer after the division you said.
otherwise DP solution does not work anymore, because item values are used as index in DP array.
Following is my Polynomial(DP) solution for subsetsum problem(C=ELEMENT_MAX_VALUE)
bool subsetsum_dp(VI& v, int sum)
{
const int MAX_ELEMENT = 100;
const int MAX_ELEMENT_VALUE = 1000;
static int dp[MAX_ELEMENT+1][MAX_ELEMENT*MAX_ELEMENT_VALUE + 1]; memset(dp, -1, sizeof(dp));
int n = S(v);
dp[0][0] = 1, dp[0][v[0]] = 1;//include, excluse the value
FOR(i, 1, n)
{
REP(j, MAX_ELEMENT*MAX_ELEMENT_VALUE + 1)
{
if (dp[i-1][j] != -1) dp[i][j] = 1, dp[i][j + v[i]] = 1;//include, excluse the value
}
}
if (dp[n-1][sum] == 1) return true;
return false;
}

Algorithm required for this task?

Inputs: n (int) and n values (float) that represent exchange rates
(different between them) with a random value between 4 and 5.
Output: compute the maximum number of values that can be used (in the
same order) to represent an ascending then descending curve?
e.x. The eight values
4.5 4.6 4.3 4.0 4.8 4.4 4.7 4.1
should output
5 (4.5 4.6 4.8 4.4 4.1)
My approach
If I try successive ifs, I get a random array that respects the curve condition, but not the longest.
I have not tried backtracking because I am not that familiar with it, but something tells me I have to compute all the solutions with it then pick the longest.
And lastly: brute force, but since it is an assignment for algorithm design; I may as well not hand it in. :)
Is there a simpler/more efficient/faster method?
Here's my try based on Daniel Lemire's algorithm. It seems it doesn't take into account the positions 0, i and n. I'm sure the ifs are the problem, how can I fix them?
for(int i = 0; i<n-1; i++){
int countp=0; // count ascending
int countn=0; // count descending
for(int j=0;j<=i;j++){
if(currency[j]<currency[j+1]){
countp++;
System.out.print(j+" ");
}
}
System.out.print("|| ");
for(int j=i;j<n-1;j++){
if(currency[j]>currency[j+1]){
countn++;
System.out.print(j+" ");
}
}
System.out.println();
if(countn+countp>maxcount) maxcount=countn+countp;
}
Firstly, you want to be able to compute the longest monotonic subsequence from one point to another. (Whether it is increasing or decreasing does not affect the problem much.) To do this, you may use dynamic programming. For example, to solve the problem given indexes 0 to i, you start by solving the problem from 0 to 0 (trivial!), then from 0 to 1, then from 0 to 2, and so on, each time recording (in an array) your best solution.
For example, here is some code in python to compute the longest non-decreasing sequence going from index 0 to index i. We use an array (bbest) to store the solution from 0 to j for all j's from 0 to i: that is, the length of the longest non-decreasing subsequence from 0 to j. (The strategy used is dynamic programming.)
def countasc(array,i):
mmin = array[0] # must start with mmin
mmax= array[i] # must end with mmax
bbest=[1] # going from 0 to 0 the best we can do is length 1
for j in range(1,i+1): # j goes from 1 to i
if(array[j]>mmax):
bbest.append(0) # can't be used
continue
best = 0 # store best result
for k in range(j-1,-1,-1): # count backward from j-1 to 0
if(array[k]>array[j]) :
continue # can't be used
if(bbest[k]+1>best):
best = bbest[k]+1
bbest.append(best)
return bbest[-1] # return last value of array bbest
or equivalently in Java (provided by request):
int countasc(float[] array,int i) {
float mmin = array[0];
float mmax = array[i];
ArrayList<Integer> bbest= new ArrayList<Integer>();
bbest.add(1);
for (int j = 1; j<=i;++j) {
if(array[j]>mmax){
bbest.add(0);
continue;
}
int best = 0;
for(int k = j-1; k>=0;--k) {
if(array[k]>array[j])
continue;
if(bbest.get(k).intValue()+1>best)
best = bbest.get(k).intValue()+1;
}
bbest.add(best);
}
return bbest.get(bbest.size()-1);
}
You can write the same type of function to find the longest non-increasing sequence from i to n-1 (left as an exercise).
Note that countasc runs in linear time.
Now, we can solve the actual problem:
Start with S, an empty array
For i an index that goes from 0 to n-1 :
compute the length of the longest increasing subsequence from 0 to i (see function countasc above)
compute the length of the longest decreasing subsequence from n-1 to i
add these two numbers, add the sum to S
return the max of S
It has quadratic complexity. I am sure you can improve this solution. There is a lot of redundancy in this approach. For example, for speed, you should probably not repeatedly call countasc with an uninitialized array bbest: it can be computed once. Possibly you can bring down the complexity to O(n log n) with some more work.
A first step is to understand how to solve the related longest increasing subsequence problem. For this problem, there is a simple algorithm that is O(n^2) though the optimal algorithm is O(n log n). Understanding these algorithms should put you on the right track to a solution.

Most efficient way of randomly choosing a set of distinct integers

I'm looking for the most efficient algorithm to randomly choose a set of n distinct integers, where all the integers are in some range [0..maxValue].
Constraints:
maxValue is larger than n, and possibly much larger
I don't care if the output list is sorted or not
all integers must be chosen with equal probability
My initial idea was to construct a list of the integers [0..maxValue] then extract n elements at random without replacement. But that seems quite inefficient, especially if maxValue is large.
Any better solutions?
Here is an optimal algorithm, assuming that we are allowed to use hashmaps. It runs in O(n) time and space (and not O(maxValue) time, which is too expensive).
It is based on Floyd's random sample algorithm. See my blog post about it for details.
The code is in Java:
private static Random rnd = new Random();
public static Set<Integer> randomSample(int max, int n) {
HashSet<Integer> res = new HashSet<Integer>(n);
int count = max + 1;
for (int i = count - n; i < count; i++) {
Integer item = rnd.nextInt(i + 1);
if (res.contains(item))
res.add(i);
else
res.add(item);
}
return res;
}
For small values of maxValue such that it is reasonable to generate an array of all the integers in memory then you can use a variation of the Fisher-Yates shuffle except only performing the first n steps.
If n is much smaller than maxValue and you don't wish to generate the entire array then you can use this algorithm:
Keep a sorted list l of number picked so far, initially empty.
Pick a random number x between 0 and maxValue - (elements in l)
For each number in l if it smaller than or equal to x, add 1 to x
Add the adjusted value of x into the sorted list and repeat.
If n is very close to maxValue then you can randomly pick the elements that aren't in the result and then find the complement of that set.
Here is another algorithm that is simpler but has potentially unbounded execution time:
Keep a set s of element picked so far, initially empty.
Pick a number at random between 0 and maxValue.
If the number is not in s, add it to s.
Go back to step 2 until s has n elements.
In practice if n is small and maxValue is large this will be good enough for most purposes.
One way to do it without generating the full array.
Say I want a randomly selected subset of m items from a set {x1, ..., xn} where m <= n.
Consider element x1. I add x1 to my subset with probability m/n.
If I do add x1 to my subset then I reduce my problem to selecting (m - 1) items from {x2, ..., xn}.
If I don't add x1 to my subset then I reduce my problem to selecting m items from {x2, ..., xn}.
Lather, rinse, and repeat until m = 0.
This algorithm is O(n) where n is the number of items I have to consider.
I rather imagine there is an O(m) algorithm where at each step you consider how many elements to remove from the "front" of the set of possibilities, but I haven't convinced myself of a good solution and I have to do some work now!
If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).
If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've got so far (let's call that m items picked out of n that you've gone through), and then you take the next number with probability (M-m)/(N-n) and discard it otherwise. You then update m and n appropriately and continue. This is a O(N) algorithm with low constant cost.
If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to sort M so you can find them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take M/(1-M/N) selections to complete the process. Thus, your cost for this algorithm is approximately O(M*(N/(N-M))*log(M)).
These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.
(Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, then you can use the resampling strategy, but pick those numbers to not include; this can be a win, even if you have to push all almost-N numbers around, if your random number generation is expensive.)
My solution is the same as Mark Byers'. It takes O(n^2) time, hence it's useful when n is much smaller than maxValue. Here's the implementation in python:
def pick(n, maxValue):
chosen = []
for i in range(n):
r = random.randint(0, maxValue - i)
for e in chosen:
if e <= r:
r += 1
else:
break;
bisect.insort(chosen, r)
return chosen
The trick is to use a variation of shuffle or in other words a partial shuffle.
function random_pick( a, n )
{
N = len(a);
n = min(n, N);
picked = array_fill(0, n, 0); backup = array_fill(0, n, 0);
// partially shuffle the array, and generate unbiased selection simultaneously
// this is a variation on fisher-yates-knuth shuffle
for (i=0; i<n; i++) // O(n) times
{
selected = rand( 0, --N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
value = a[ selected ];
a[ selected ] = a[ N ];
a[ N ] = value;
backup[ i ] = selected;
picked[ i ] = value;
}
// restore partially shuffled input array from backup
// optional step, if needed it can be ignored
for (i=n-1; i>=0; i--) // O(n) times
{
selected = backup[ i ];
value = a[ N ];
a[ N ] = a[ selected ];
a[ selected ] = value;
N++;
}
return picked;
}
NOTE the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffling) and does not need hasmaps (which may not be available and/or usualy hide a complexity behind their implementation, e.g fetch time is not O(1), it might even be O(n) in worst case)
adapted from here
Linear congruential generator modulo maxValue+1. I'm sure I've written this answer before, but I can't find it...
UPDATE: I am wrong. The output of this is not uniformly distributed. Details on why are here.
I think this algorithm below is optimum. I.e. you cannot get better performance than this.
For choosing n numbers out of m numbers, the best offered algorithm so far is presented below. Its worst run time complexity is O(n), and needs only a single array to store the original numbers. It partially shuffles the first n elements from the original array, and then you pick those first n shuffled numbers as your solution.
This is also a fully working C program. What you find is:
Function getrand: This is just a PRNG that returns a number from 0 up to upto.
Function randselect: This is the function that randmoly chooses n unique numbers out of m many numbers. This is what this question is about.
Function main: This is only to demonstrate a use for other functions, so that you could compile it into a program and have fun.
#include <stdio.h>
#include <stdlib.h>
int getrand(int upto) {
long int r;
do {
r = rand();
} while (r > upto);
return r;
}
void randselect(int *all, int end, int select) {
int upto = RAND_MAX - (RAND_MAX % end);
int binwidth = upto / end;
int c;
for (c = 0; c < select; c++) {
/* randomly choose some bin */
int bin = getrand(upto)/binwidth;
/* swap c with bin */
int tmp = all[c];
all[c] = all[bin];
all[bin] = tmp;
}
}
int main() {
int end = 1000;
int select = 5;
/* initialize all numbers up to end */
int *all = malloc(end * sizeof(int));
int c;
for (c = 0; c < end; c++) {
all[c] = c;
}
/* select select unique numbers randomly */
srand(0);
randselect(all, end, select);
for (c = 0; c < select; c++) printf("%d ", all[c]);
putchar('\n');
return 0;
}
Here is the output of an example code where I randomly output 4 permutations out of a pool of 8 numbers for 100,000,000 many times. Then I use those many permutations to compute the probability of having each unique permutation occur. I then sort them by this probability. You notice that the numbers are fairly close, which I think means that it is uniformly distributed. The theoretical probability should be 1/1680 = 0.000595238095238095. Note how the empirical test is close to the theoretical one.

Algorithm for finding all combinations of k elements from stack of n "randomly"

I have a problem resembling the one described here:
Algorithm to return all combinations of k elements from n
I am looking for something similar that covers all possible combinations of k from n. However, I need a subset to vary a lot from the one drawn previously. For example, if I were to draw a subset of 3 elements from a set of 8, the following algorithm wouldn't be useful to me since every subset is very similar to the one previously drawn:
11100000,
11010000,
10110000,
01110000,
...
I am looking for an algorithm thats picks the subsets in a more "random" looking fashion, ie. where the majority of elements in one subset is not reused in the next:
11100000,
00010011,
00101100,
...
Does anyone know of such an algorithm?
I hope my question made sence and that someone can help me out =)
Kind regards,
Christian
How about first generating all possible combinations of k from n, and then rearranging them with help of a random function.
If you have the result in a vector, loop through the vector: for each element let it change place with the element at a random position.
This of course becomes slow for large k and n.
This is not really random, but depending on your needs it might suit you.
Calculate the number of possible combinations. Let's name them N.
Calculate a large number which is coprime to N. Let's name it P.
Order the combinations and give them numbers from 1 to N. Let's name them C1 to CN
Iterate for output combinations. The first one will be VP mod N, the second one will be C2*P mod N, the third one C3*P mod N, etc. In essence, Outputi = Ci*P mod N. Mod is meant as the modulus operator.
If P is picked carefully, you will get seemingly random combinations. Values close to 1 or to N will produce values that differ little. Better pick values close to, say N/4 or N/5. You can also randomize the generation of P for every iteration that you need.
As a follow-up to my comment on this answer, here is some code that allows one to determine the composition of a subset from its "index", in colex order.
Shamelessly stolen from my own assignments.
//////////////////////////////////////
// NChooseK
//
// computes n!
// --------
// k!(n-k)!
//
// using Pascal's identity
// i.e. (n,k) = (n-1,k-1) + (n-1,k)
//
// easily optimizable by memoization
long long NChooseK(int n, int k)
{
if(k >= 0 && k <= n && n >= 1)
{
if( k > n / 2)
k = n - k;
if(k == 0 || n == 0)
return 1;
else
return NChooseK(n-1, k-1) + NChooseK(n-1, k);
}
else
return 0;
}
///////////////////////////////////////////////////////////////////////
// SubsetColexUnrank
// The unranking works by finding each element
// in turn, beginning with the biggest, leftmost one.
// We just have to find, for each element, how many subsets there are
// before the one beginning with the elements we have already found.
//
// It stores its results (indices of the elements present in the subset) into T, in ascending order.
void SubsetColexUnrank(long long r, int * T, int subsetSize)
{
assert( subsetSize >= 1 );
// For each element in the k-subset to be found
for(int i = subsetSize; i >= 1; i--)
{
// T[i] cannot be less than i
int x = i;
// Find the smallest element such that, of all the k-subsets that contain it,
// none has a rank that exceeds r.
while( NChooseK(x, i) <= r )
x++;
// update T with the newly found element
T[i] = x;
// if the subset we have to find is not the first one containing this element
if(r > 0)
{
// finding the next element of our k-subset
// is like finding the first one of the same subset
// divided by {T[i]}
r -= NChooseK(x - 1, i);
}
}
}
Random-in, random-out.
The colex order is such that its unranking function does not need the size of the set from which to pick the elements to work; the number of elements is assumed to be NChooseK(size of the set, size of the subset).
How about randomly choosing the k elements. ie choose the pth where p is random between 1 and n, then reorder what's left and choose the qth where q is between 1 and n-1 etc?
or maybe i misunderstood. do you still want all possibilities? in that case you can always generate them first and then choose random entries from your list
By "random looking" I think you mean lexicographically distant.. does this apply to combination i vs. i-1, or i vs. all previous combinations?
If so, here are some suggestions:
since most of the combinators yield ordered output, there are two options:
design or find a generator which somehow yields non-ordered output
enumerate and store enough/all combinations in a tie'd array file/db
if you decide to go with door #2, then you can just access randomly ordered combinations by random integers between 1 and the # of combinations
Just as a final check, compare the current and previous combination using a measure of difference/distance between combinations, e.g. for an unsigned Bit::Vector in Perl:
$vec1->Lexicompare($vec2) >= $MIN_LEX_DIST
You might take another look behind door #1, since even for moderate values of n and k you can get a big array:
EDIT:
Just saw your comment to AnnK... maybe the lexicompare might still help you skip similar combinations?
Depending on what you are trying to do, you could do something like playing cards. Keep two lists: Source is your source (unused) list; and Used the second is the "already-picked" list. As you randomly pick k items from Source, you move them to your Used list.
If there are k items left in Source when you need to pick again, you pick them all and swap the lists. If there are fewer than k items, you pick j items from Used and add them to Source to make k items in Source, then pick them all and swap the lists.
This is kind of like picking k cards from a deck. You discard them to the used pile. Once you reach the end or need more cards, you shuffle the old ones back into play.
This is just to make sure each set is definitely different from the previous subsets.
Also, this will not really guarantee that all possible subsets are picked before old ones start being repeated.
The good is that you don't need to worry about pre-calculating all the subsets, and your memory requirements are linear with your data (2 n-sized lists).

Resources