I'm trying to find a name for my problem, so I don't have to re-invent the wheel when coding an algorithm that solves it...
I have, say, 2,000 binary (row) vectors and I need to pick 500 of them. In the picked sample I take column sums, and I want my sample to be as close as possible to a pre-defined distribution of the column sums. I'll be working with 20 to 60 columns.
A tiny example:
Out of the vectors:
110
010
011
110
100
I need to pick 2 to get column sums 2, 1, 0. The solution (exact in this case) would be
110
100
My ideas so far
one could maybe call this a binary multidimensional knapsack, but I did not find any algorithms for that
Linear Programming could help, but I'd need a step-by-step explanation, as I have no experience with it
as an exact solution is not always feasible, something like simulated annealing could work well
a hacky way using constraint solvers comes to mind: first set the constraints tight and gradually loosen them until some solution is found, given that CSP should be much faster than ILP...?
My concrete, practical suggestion (if the approximation guarantee works out for you) would be to apply the maximum entropy method (Chapter 7 of Boyd and Vandenberghe's book Convex Optimization; you can probably find several implementations with your favorite search engine). Use it to find the maximum entropy probability distribution on row indexes such that (1) no row index is more likely than 1/500, and (2) the expected value of the row vector chosen is 1/500th of the predefined distribution. Given this distribution, choose each row independently with probability 500 times its likelihood, which will give you 500 rows on average. If you need exactly 500, repeat until you get exactly 500 (this shouldn't take too many tries, due to concentration bounds).
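For illustration, here is a minimal sketch of that recipe in Python. The data, the choice of scipy.optimize as the solver, and all names are my own assumptions, not part of the book's method:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
rows = rng.integers(0, 2, size=(200, 5))   # 200 binary row vectors, 5 columns
target = np.array([30, 20, 25, 15, 10])    # predefined column sums
m = 50                                     # sample size

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)            # avoid log(0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},            # valid distribution
    {"type": "eq", "fun": lambda p: rows.T @ p - target / m},  # E[row] = target/m
]
bounds = [(0.0, 1.0 / m)] * len(rows)      # no index more likely than 1/m
p0 = np.full(len(rows), 1.0 / len(rows))
res = minimize(neg_entropy, p0, bounds=bounds, constraints=constraints)

# choose each row independently with probability m * p_i (m rows on average)
picked = rng.random(len(rows)) < m * res.x
print(picked.sum(), rows[picked].sum(axis=0), target)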
First, I will make some assumptions about this problem:
A column sum that overshoots the target counts the same as one that undershoots it.
The column sums are equally weighted in the solution (i.e. if one solution has the first column sum off by 1, and another has the third column sum off by 1, the two solutions are equally good).
The closest problem I can think of is the subset sum problem, which itself can be seen as a special case of the knapsack problem.
However, both of these problems are NP-complete. This means there is no known polynomial-time algorithm that solves them, even though it is easy to verify a solution.
If I were you, the two arguably most practical approaches to this problem are linear programming and machine learning.
Depending on how many columns you are optimising over, with linear programming you can control how finely tuned you want the solution to be, in exchange for time. You should read up on this, because it is fairly simple and efficient.
With machine learning, you need a lot of data (sets of vectors and their solutions). You don't even need to specify exactly what you want; many machine learning algorithms can generally deduce what to optimise from the data set.
Both solutions have pros and cons; decide which one to use based on your circumstances and problem set.
This definitely can be modeled as an (integer!) linear program, as many problems can. Once you have the model, you can use a program such as lpsolve to solve it.
We model "vector i is selected" as a variable x_i, which can be 0 or 1.
Then for each column c, we have a constraint:
sum of all (x_i * value of i in column c) = target for column c
Taking your example, in lp_solve this could look like:
min: ;
+x1 +x4 +x5 >= 2;
+x1 +x4 +x5 <= 2;
+x1 +x2 +x3 +x4 <= 1;
+x1 +x2 +x3 +x4 >= 1;
+x3 <= 0;
+x3 >= 0;
bin x1, x2, x3, x4, x5;
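The same model can be sketched in Python, assuming the PuLP library (my own choice; any ILP front end would do). The pairs of >=/<= constraints above collapse to equalities:

from pulp import LpProblem, LpVariable, LpBinary, lpSum

vectors = [(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
target = (2, 1, 0)

prob = LpProblem("pick_rows")  # pure feasibility problem, no objective needed
x = [LpVariable(f"x{i}", cat=LpBinary) for i in range(len(vectors))]
for c, t in enumerate(target):
    prob += lpSum(x[i] * v[c] for i, v in enumerate(vectors)) == t
prob.solve()
print([i for i, var in enumerate(x) if var.value() == 1])  # e.g. [0, 4]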
If you are fine with a heuristic based search approach, here is one.
Go over the list and pick the bit string with the minimum sum of squared digit-wise differences from the goal. For example, if we are looking for 2, 1, 0, and we are scoring 0, 1, 1, we would do it in the following way:
Take the digit wise difference:
2, 0, 1
Square the digit wise difference:
4, 0, 1
Sum:
5
As a side note, squaring the difference is a common scoring method in heuristic search. In your case, it makes sense because bit strings that have a 1 as the first digit are a lot more interesting to us. In your case this simple algorithm would pick first 110, then 100, which is the best solution.
In any case, there are some optimizations that could be made to this; I will post them here if this kind of approach is what you are looking for, but this is the core of the algorithm.
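Here is a minimal sketch of my reading of this greedy idea: repeatedly take the vector whose addition brings the running column sums closest, in squared distance, to the target. The function name and the tie-breaking rule are my own choices:

def greedy_pick(vectors, target, k):
    chosen, sums = [], [0] * len(target)
    pool = list(vectors)
    for _ in range(k):
        # squared distance of the new running sums from the target
        def score(v):
            return sum((t - (s + b)) ** 2 for t, s, b in zip(target, sums, v))
        best = min(pool, key=score)
        pool.remove(best)
        chosen.append(best)
        sums = [s + b for s, b in zip(sums, best)]
    return chosen

print(greedy_pick([(1,1,0), (0,1,0), (0,1,1), (1,1,0), (1,0,0)], (2,1,0), 2))
# [(1, 1, 0), (1, 0, 0)]  - first 110, then 100, as described above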
You have a given target vector. You want to select M vectors out of N whose sum is closest to the target. Let's say you use the Euclidean distance to measure whether one selection is better than another.
If you want an exact sum, have a look at the k-sum problem, which is a generalization of the 3SUM problem. The problem is harder than the subset sum problem, because you want an exact number of elements adding up to a target value. There is a solution in O(N^(M/2) * lg N), but that means more than 2000^250 * 7.6 > 10^826 operations in your case (in the favorable case where vector operations have a cost of 1).
First conclusion: do not try to get an exact result unless your vectors have some characteristics that may reduce the complexity.
Here's a hill climbing approach:
sort the vectors by number of 1's: 111... first, 000... last;
use the polynomial time approximate algorithm for the subset sum;
you have an approximate solution with K elements. Because of the order of the elements (the big ones come first), K should be as small as possible:
if K >= M, you take the first M vectors of the solution and that's probably near the best you can do.
if K < M, you can remove the first vector and try to replace it with 2 or more vectors from the rest of the N vectors, using the same technique, until you have M vectors. To summarize: split the big vectors into smaller ones until you reach the correct number of vectors.
Here's a proof of concept with numbers, in Python:
import random

def distance(x, y):
    return abs(x - y)

def show(ls):
    if len(ls) < 10:
        return str(ls)
    else:
        return ", ".join(map(str, ls[:5] + ("...",) + ls[-5:]))

def find(is_xs, target):
    # see https://en.wikipedia.org/wiki/Subset_sum_problem#Pseudo-polynomial_time_dynamic_programming_solution
    S = [(0, ())]  # we store indices along with values to get the path
    for i, x in is_xs:
        T = [(x + t, js + (i,)) for t, js in S]
        U = sorted(S + T)
        y, ks = U[0]
        S = [(y, ks)]
        for z, ls in U:
            if z == target:  # use the euclidean distance here if you want an approximation
                return ls
            if z != y and z < target:
                y, ks = z, ls
                S.append((z, ls))
    ls = S[-1][1]  # take the closest element to target
    return ls

N = 2000
M = 500
target = 1000
xs = [random.randint(0, 10) for _ in range(N)]
print("Take {} numbers out of {} to make a sum of {}".format(M, show(tuple(xs)), target))
xs = sorted(xs, reverse=True)
is_xs = list(enumerate(xs))
print("Sorted numbers: {}".format(show(tuple(is_xs))))
ls = find(is_xs, target)
print("FIRST TRY: {} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
splits = 0
while len(ls) < M:
    first_x = xs[ls[0]]
    js_ys = [(i, x) for i, x in is_xs if i not in ls and x != first_x]
    replace = find(js_ys, first_x)
    splits += 1
    if len(replace) < 2 or len(replace) + len(ls) - 1 > M or sum(xs[i] for i in replace) != first_x:
        print("Give up: can't replace {}.\nAdd the lowest elements.".format(first_x))
        ls += tuple([i for i, x in is_xs if i not in ls][len(ls) - M:])
        break
    print("Replace {} (={}) by {} (={})".format(ls[:1], first_x, replace, sum(xs[i] for i in replace)))
    ls = tuple(sorted(ls[1:] + replace))  # use a heap?
    print("{} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
print("AFTER {} splits: {} -> {}".format(splits, show(ls), sum(x for i, x in is_xs if i in ls)))
The result is obviously not guaranteed to be optimal.
Remarks:
Complexity: find has a polynomial time complexity (see the Wikipedia page) and is called at most M^2 times, hence the complexity remains polynomial. In practice, the process is reasonably fast (split calls have a small target).
Vectors: to ensure that you reach the target with the minimum number of elements, you can improve the order of the elements. Your target is (t_1, ..., t_c): if you sort the t_j from max to min, you get the most important columns first. You can then sort the vectors by number of 1s, and then by the presence of a 1 in the most important columns (see the sketch after these remarks). E.g. target = 4 8 6 => 1 1 1 > 0 1 1 > 1 1 0 > 1 0 1 > 0 1 0 > 0 0 1 > 1 0 0 > 0 0 0.
find (vectors): if the current sum exceeds the target in all columns, then you're not converging toward the target (any vector you add will bring you farther from it): don't add that sum to S (the z >= target case for numbers).
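A small sketch of that ordering (my own interpretation of the remark; all names are made up):

target = (4, 8, 6)
# columns sorted from the largest target value to the smallest
cols_by_importance = sorted(range(len(target)), key=lambda c: -target[c])

def key(vect):
    # more 1s first, then 1s in the most important columns first
    return (-sum(vect), tuple(-vect[c] for c in cols_by_importance))

vects = [(1,1,1), (0,1,1), (1,1,0), (1,0,1), (0,1,0), (0,0,1), (1,0,0), (0,0,0)]
print(sorted(vects, key=key))
# reproduces the order 111 > 011 > 110 > 101 > 010 > 001 > 100 > 000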
I propose a simple ad hoc algorithm which, broadly speaking, is a kind of gradient descent. It seems to work relatively well for input vectors whose distribution of 1s is "similar" to the target sum vector, and probably also for all "nice" input vectors, as defined in a comment of yours. The solution is not exact, but the approximation seems good.
The distance between the sum vector of the output vectors and the target vector is taken to be Euclidean. To minimize it means minimizing the sum of the squared differences of the sum vector and the target vector (the square root is not needed, because it is monotonic). The algorithm does not guarantee to yield the sample that minimizes the distance from the target, but it makes a serious attempt at doing so, by always moving in some locally optimal direction.
The algorithm can be split into 3 parts.
First of all, the first M candidate output vectors out of the N input vectors (e.g., N=2000, M=500) are put into one list, and the remaining vectors are put into another.
Then "approximately optimal" swaps between vectors in the two lists are performed, until either the distance would not decrease any more or a predefined maximum number of iterations is reached. An approximately optimal swap is one where removing the first vector from the output list causes a maximal decrease, or minimal increase, of the distance, and then, after that removal, adding the second vector to the same list causes a maximal decrease of the distance. The whole swap is skipped if the net result is not a decrease of the distance.
Then, as a last phase, "optimal" swaps are performed, again stopping on no decrease in distance or on the maximum number of iterations. Optimal swaps cause a maximal decrease of the distance, without requiring the removal of the first vector to be optimal in itself. To find an optimal swap, all vector pairs have to be checked. This phase is much more expensive, being O(M(N-M)), while the previous "approximate" phase is O(M+(N-M)) = O(N). Luckily, when entering this phase, most of the work has already been done by the previous phase.
from typing import List, Tuple


def get_sample(vects: List[Tuple[int]], target: Tuple[int], n_out: int,
               max_approx_swaps: int = None, max_optimal_swaps: int = None,
               verbose: bool = False) -> List[Tuple[int]]:
    """
    Get a sample of the input vectors having a sum close to the target vector.

    Closeness is measured in Euclidean metrics. The output is not guaranteed to be
    optimal (minimum square distance from target), but a serious attempt is made.
    The max_* parameters can be used to avoid too long execution times,
    tune them to your needs by setting verbose to True, or leave them None (∞).

    :param vects: the list of vectors (tuples) with the same number of "columns"
    :param target: the target vector, with the same number of "columns"
    :param n_out: the requested sample size
    :param max_approx_swaps: the max number of approximately optimal vector swaps,
        None means unlimited (default: None)
    :param max_optimal_swaps: the max number of optimal vector swaps,
        None means unlimited (default: None)
    :param verbose: print some info if True (default: False)
    :return: the sample of n_out vectors having a sum close to the target vector
    """
    def square_distance(v1, v2):
        return sum((e1 - e2) ** 2 for e1, e2 in zip(v1, v2))

    n_vec = len(vects)
    assert n_vec > 0
    assert n_out > 0
    n_rem = n_vec - n_out
    assert n_rem > 0
    output = vects[:n_out]
    remain = vects[n_out:]
    n_col = len(vects[0])
    assert n_col == len(target) > 0
    sumvect = (0,) * n_col
    for outvect in output:
        sumvect = tuple(map(int.__add__, sumvect, outvect))
    sqdist = square_distance(sumvect, target)
    if verbose:
        print(f"sqdist = {sqdist:4} after"
              f" picking the first {n_out} vectors out of {n_vec}")
    if max_approx_swaps is None:
        max_approx_swaps = sqdist
    n_approx_swaps = 0
    while sqdist and n_approx_swaps < max_approx_swaps:
        # find the best vect to subtract (the square distance MAY increase)
        sqdist_0 = None
        index_0 = None
        sumvect_0 = None
        for index in range(n_out):
            tmp_sumvect = tuple(map(int.__sub__, sumvect, output[index]))
            tmp_sqdist = square_distance(tmp_sumvect, target)
            if sqdist_0 is None or sqdist_0 > tmp_sqdist:
                sqdist_0 = tmp_sqdist
                index_0 = index
                sumvect_0 = tmp_sumvect
        # find the best vect to add,
        # but only if there is a net decrease of the square distance
        sqdist_1 = sqdist
        index_1 = None
        sumvect_1 = None
        for index in range(n_rem):
            tmp_sumvect = tuple(map(int.__add__, sumvect_0, remain[index]))
            tmp_sqdist = square_distance(tmp_sumvect, target)
            if sqdist_1 > tmp_sqdist:
                sqdist_1 = tmp_sqdist
                index_1 = index
                sumvect_1 = tmp_sumvect
        if sumvect_1:
            output[index_0], remain[index_1] = remain[index_1], output[index_0]
            sqdist = sqdist_1
            sumvect = sumvect_1
            n_approx_swaps += 1
        else:
            break
    if verbose:
        print(f"sqdist = {sqdist:4} after {n_approx_swaps}"
              f" approximately optimal swap{'s'[n_approx_swaps == 1:]}")
    diffvect = tuple(map(int.__sub__, sumvect, target))
    if max_optimal_swaps is None:
        max_optimal_swaps = sqdist
    n_optimal_swaps = 0
    while sqdist and n_optimal_swaps < max_optimal_swaps:
        # find the best pair to swap,
        # but only if the square distance decreases
        best_sqdist = sqdist
        best_diffvect = diffvect
        best_pair = None
        for i0 in range(n_out):  # loop over the current output vectors
            tmp_diffvect = tuple(map(int.__sub__, diffvect, output[i0]))
            for i1 in range(n_rem):
                new_diffvect = tuple(map(int.__add__, tmp_diffvect, remain[i1]))
                new_sqdist = sum(d * d for d in new_diffvect)
                if best_sqdist > new_sqdist:
                    best_sqdist = new_sqdist
                    best_diffvect = new_diffvect
                    best_pair = (i0, i1)
        if best_pair:
            output[best_pair[0]], remain[best_pair[1]] = \
                remain[best_pair[1]], output[best_pair[0]]
            sqdist = best_sqdist
            diffvect = best_diffvect
            n_optimal_swaps += 1
        else:
            break
    if verbose:
        print(f"sqdist = {sqdist:4} after {n_optimal_swaps}"
              f" optimal swap{'s'[n_optimal_swaps == 1:]}")
    return output
from random import randrange

C = 30      # number of columns
N = 2000    # total number of vectors
M = 500     # number of output vectors
F = 0.9     # fill factor of the target sum vector
T = int(M * F)  # maximum value + 1 that can appear in the target sum vector
A = 10000   # maximum number of approximately optimal swaps, may be None (∞)
B = 10      # maximum number of optimal swaps, may be None (unlimited)

target = tuple(randrange(T) for _ in range(C))
vects = [tuple(int(randrange(M) < t) for t in target) for _ in range(N)]
sample = get_sample(vects, target, M, A, B, True)
Typical output:
sqdist = 2639 after picking the first 500 vectors out of 2000
sqdist = 9 after 27 approximately optimal swaps
sqdist = 1 after 4 optimal swaps
P.S.: As it stands, this algorithm is not limited to binary input vectors; integer vectors would work too. Intuitively, though, I suspect that the quality of the optimization could suffer, and that this algorithm is more appropriate for binary vectors.
P.P.S.: Execution times with your kind of data are probably acceptable with standard CPython, but get better (like a couple of seconds, almost a factor of 10) with PyPy. To handle bigger data sets, the algorithm would have to be translated to C or some other language, which should not be difficult at all.
I have a sorted array of N intervals of different lengths. I am plotting these intervals with alternating blue/green colors.
I am trying to find a method or algorithm to "downsample" the array of intervals to produce a visually similar plot, but with fewer elements.
Ideally I could write some function where I can pass the target number of output intervals as an argument. The output length only has to come close to the target.
input = [
[0, 5, "blue"],
[5, 6, "green"],
[6, 10, "blue"],
// ...etc
]
output = downsample(input, 25)
// [[0, 10, "blue"], ... ]
Below is a picture of what I am trying to accomplish. In this example the input has about 250 intervals, and the output about 25. The input length can vary a lot.
Update 1:
Below is my original post, which I initially deleted because there were issues with displaying the equations and I also wasn't very confident that it really makes sense. But later I figured out that the optimisation problem I described can actually be solved efficiently with DP (dynamic programming).
So I did a sample C++ implementation. Here are some results:
Here is a live demo that you can play with in your browser (make sure your browser supports WebGL2, like Chrome or Firefox). It takes a bit to load the page.
Here is the C++ implementation: link
Update 2:
It turns out the proposed solution has the following nice property: we can easily control the importance of the two parts F1 and F2 of the cost function. Simply change the cost function to F(α) = F1 + αF2, where α >= 1.0 is a free parameter. The DP algorithm remains the same.
Here are some results for different α values, using the same number of intervals N:
Live demo (WebGL2 required)
As can be seen, higher α means it is more important to cover the original input intervals even if this means covering more of the background in-between.
Original post
Even though some good algorithms have already been proposed, I would like to propose a slightly unusual approach: interpreting the task as an optimisation problem. Although I don't know how to solve the optimisation problem efficiently (or whether it can be solved in reasonable time at all), it might be useful to someone purely as a concept.
First, without loss of generality, let's declare the blue color to be the background. We will be painting N green intervals on top of it (N is the number provided to the downsample() function in the OP's description). The ith interval is defined by its starting coordinate 0 <= xi < xmax and its width wi >= 0 (xmax is the maximum coordinate from the input).
Let's also define the array G(x) to be the number of green cells in the interval [0, x) of the input data. This array can easily be pre-calculated. We will use it to quickly calculate the number of green cells in an arbitrary interval [x, y), namely: G(y) - G(x).
We can now introduce the first part of the cost function for our optimisation problem:
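(The formula itself was an image in the original post and is missing here; a plausible reconstruction from the definitions above is the total amount of background covered by the generated intervals:

F1 = sum for i = 1..N of [ wi - (G(xi + wi) - G(xi)) ]

where each term is the part of the ith generated interval that is not green.)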
The smaller F1 is, the less of the background our generated intervals cover, so we will be searching for the xi, wi that minimise it. Ideally we want F1=0, which would mean that the intervals do not cover any of the background (covering all of the green while touching none of the background is of course not possible, because N is less than the number of input intervals).
However, this function is not enough to describe the problem, because we can obviously minimise it by taking empty intervals: F1(x, 0)=0. Instead, we want to cover as much as possible of the input intervals. Let's introduce the second part of the cost function, which corresponds to this requirement:
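(Reconstructing the missing formula from context again, F2 plausibly measures how much green is left uncovered, assuming the generated intervals do not overlap:

F2 = G(xmax) - sum for i = 1..N of (G(xi + wi) - G(xi)) )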
The smaller F2 is, the more input intervals are covered. Ideally we want F2=0 which would mean that we covered all of the input rectangles. However, minimising F2 competes with minimising F1.
Finally, we can state our optimisation problem: find xi, wi that minimize F=F1 + F2
How to solve this problem? Not sure. Maybe use some metaheuristic approach for global optimisation such as Simulated annealing or Differential evolution. These are typically easy to implement, especially for this simple cost function.
The best case would be for some kind of DP algorithm to exist that solves it efficiently, but that seems unlikely.
I would advise you to use the Haar wavelet. It is a very simple algorithm which was often used to provide progressive loading of big images on websites.
Here you can see how it works with a 2D function. That is what you could use. Alas, the document is in Ukrainian, but the code is in C++, so it is readable :)
This document provides an example with a 3D object:
Pseudocode for compressing with the Haar wavelet can be found in Wavelets for Computer Graphics: A Primer, Part 1.
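To give the flavor, here is a tiny sketch of one level of the 1D Haar transform (my own illustration, not taken from the linked documents): the averages are the downsampled signal, the details are what reconstruction would need.

def haar_step(signal):
    # pairwise averages (the coarser signal) and pairwise details
    averages = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    details = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return averages, details

sig = [5, 1, 4, 4, 6, 0, 2, 2]
avg, det = haar_step(sig)
print(avg)  # [3.0, 4.0, 3.0, 2.0]  - the downsampled signal
print(det)  # [2.0, 0.0, 3.0, 0.0]  - what reconstruction would need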
You could do the following:
1. Write out the points that divide the whole strip into intervals as the array [a[0], a[1], a[2], ..., a[n-1]]. In your example, the array would be [0, 5, 6, 10, ...].
2. Calculate the double-interval lengths a[2]-a[0], a[3]-a[1], a[4]-a[2], ..., a[n-1]-a[n-3] and find the least of them. Let it be a[k+2]-a[k]. If two or more lengths share the lowest value, choose one of them randomly. In your example, you would get the array [6, 5, ...] and search for the minimum value in it.
3. Swap the intervals (a[k], a[k+1]) and (a[k+1], a[k+2]). Basically, you need to assign a[k+1] = a[k] + a[k+2] - a[k+1] to keep the lengths, and then remove the points a[k] and a[k+2] from the array, because two pairs of same-colored intervals have now been merged into two larger intervals. Thus, the number of blue intervals and the number of green intervals each decrease by one after this step.
4. If you're satisfied with the current number of intervals, end the process; otherwise go to step 1.
You performed step 2 in order to decrease the "color shift", because, at step 3, the left interval is moved a[k+2]-a[k+1] to the right and the right interval is moved a[k+1]-a[k] to the left. The sum of these distances, a[k+2]-a[k], can be considered a measure of the change you're introducing into the whole picture.
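Here is a minimal sketch of the procedure (my own; the answer leaves boundary handling at the ends of the strip open, so I simply exclude the outermost points from the search):

def reduce_once(points):
    # choose k so that both merged pairs keep neighbours on each side
    ks = range(1, len(points) - 3)
    k = min(ks, key=lambda i: points[i + 2] - points[i])
    merged = points[k] + points[k + 2] - points[k + 1]  # the swapped middle point
    # drop a[k] and a[k+2]; the swapped middle point replaces all three
    return points[:k] + [merged] + points[k + 3:]

points = [0, 5, 6, 10, 12, 13, 20]
while len(points) > 5:
    points = reduce_once(points)
print(points)  # [0, 5, 6, 11, 20]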
Main advantages of this approach:
It is simple.
It doesn't give a preference to either of the two colors. You don't need to assign one of the colors to be the background and the other to be the painting color. The picture can be considered both as "green-on-blue" and "blue-on-green". This reflects the quite common use case where the two colors just describe two opposite states (like the bit 0/1, or a "yes/no" answer) of some process extended in time or in space.
It always keeps the balance between the colors, i.e. the sum of the interval lengths of each color remains the same during the reduction process. Thus the total brightness of the picture doesn't change. This is important, as the total brightness can be considered an "indicator of completeness" in some cases.
Here's another attempt at dynamic programming that's slightly different from Georgi Gerganov's, although the idea to formulate a dynamic program may have been inspired by his answer. Neither the implementation nor the concept is guaranteed to be sound, but I did include a code sketch with a visual example :)
The search space in this case is not reliant on the total unit width but rather on the number of intervals. It's O(N * n^2) time and O(N * n) space, where N and n are the target and given number of (green) intervals, respectively, because we assume that any newly chosen green interval must be bound by two green intervals (rather than extend arbitrarily into the background).
The idea also uses the prefix-sum technique used to calculate runs with a majority element: add 1 when you see the target element (in this case green) and subtract 1 for the others (that algorithm is also amenable to multiple elements, with parallel prefix-sum tracking). (I'm not sure that restricting candidate intervals to sections with a majority of the target colour is always warranted, but it may be a useful heuristic depending on the desired outcome. It's also adjustable: we can easily adjust it to check for a fraction other than 1/2.)
Where Georgi Gerganov's program seeks to minimise, this dynamic program seeks to maximise two ratios. Let h(i, k) represent the best sequence of green intervals up to the ith given interval, utilising k intervals, where each is allowed to stretch back to the left edge of some previous green interval. We speculate that
h(i, k) = max(r + C*r1 + h(i-l, k-1))
where, in the current candidate interval, r is the ratio of green to the length of the stretch, and r1 is the ratio of green to the total given green. r1 is multiplied by an adjustable constant to give more weight to the volume of green covered. l is the length of the stretch.
JavaScript code (for debugging, it includes some extra variables and log lines):
function rnd(n, d=2){
  let m = Math.pow(10,d)
  return Math.round(m*n) / m;
}

function f(A, N, C){
  let ps = [[0,0]];
  let psBG = [0];
  let totalG = 0;
  A.unshift([0,0]);

  for (let i=1; i<A.length; i++){
    let [l,r,c] = A[i];

    if (c == 'g'){
      totalG += r - l;
      let prevI = ps[ps.length-1][1];
      let d = l - A[prevI][1];
      let prevS = ps[ps.length-1][0];
      ps.push(
        [prevS - d, i, 'l'],
        [prevS - d + r - l, i, 'r']
      );
      psBG[i] = psBG[i-1];

    } else {
      psBG[i] = psBG[i-1] + r - l;
    }
  }

  //console.log(JSON.stringify(A));
  //console.log('');
  //console.log(JSON.stringify(ps));
  //console.log('');
  //console.log(JSON.stringify(psBG));

  let m = new Array(N + 1);
  m[0] = new Array((ps.length >> 1) + 1);

  for (let i=0; i<m[0].length; i++)
    m[0][i] = [0,0];

  // for each in N
  for (let i=1; i<=N; i++){
    m[i] = new Array((ps.length >> 1) + 1);

    for (let ii=0; ii<m[0].length; ii++)
      m[i][ii] = [0,0];

    // for each interval
    for (let j=i; j<m[0].length; j++){
      m[i][j] = m[i][j-1];

      for (let k=j; k>i-1; k--){
        // our anchors are the right
        // side of each interval, k's are the left
        let jj = 2*j;
        let kk = 2*k - 1;

        // positive means green
        // is a majority
        if (ps[jj][0] - ps[kk][0] > 0){
          let bg = psBG[ps[jj][1]] - psBG[ps[kk][1]];
          let s = A[ps[jj][1]][1] - A[ps[kk][1]][0] - bg;
          let r = s / (bg + s);
          let r1 = C * s / totalG;
          let candidate = r + r1 + m[i-1][j-1][0];

          if (candidate > m[i][j][0]){
            m[i][j] = [
              candidate,
              ps[kk][1] + ',' + ps[jj][1],
              bg, s, r, r1, k, m[i-1][j-1][0]
            ];
          }
        }
      }
    }
  }

  /*
  for (row of m)
    console.log(JSON.stringify(
      row.map(l => l.map(x => typeof x != 'number' ? x : rnd(x)))));
  */

  let result = new Array(N);
  let j = m[0].length - 1;

  for (let i=N; i>0; i--){
    let [_,idxs,w,x,y,z,k] = m[i][j];
    let [l,r] = idxs.split(',');
    result[i-1] = [A[l][0], A[r][1], 'g'];
    j = k - 1;
  }

  return result;
}

function show(A, last){
  if (last[1] != A[A.length-1])
    A.push(last);

  let s = '';
  let j;

  for (let i=A.length-1; i>=0; i--){
    let [l, r, c] = A[i];
    let cc = c == 'g' ? 'X' : '.';

    for (let j=r-1; j>=l; j--)
      s = cc + s;

    if (i > 0)
      for (let j=l-1; j>=A[i-1][1]; j--)
        s = '.' + s;
  }

  for (let j=A[0][0]-1; j>=0; j--)
    s = '.' + s;

  console.log(s);
  return s;
}

function g(A, N, C){
  const ts = f(A, N, C);
  //console.log(JSON.stringify(ts));
  show(A, A[A.length-1]);
  show(ts, A[A.length-1]);
}

var a = [
  [0,5,'b'],
  [5,9,'g'],
  [9,10,'b'],
  [10,15,'g'],
  [15,40,'b'],
  [40,41,'g'],
  [41,43,'b'],
  [43,44,'g'],
  [44,45,'b'],
  [45,46,'g'],
  [46,55,'b'],
  [55,65,'g'],
  [65,100,'b']
];

// (input, N, C)
g(a, 2, 2);
console.log('');
g(a, 3, 2);
console.log('');
g(a, 4, 2);
console.log('');
g(a, 4, 5);
I would suggest using K-means. It is an algorithm used to group data (more detailed explanations here: https://en.wikipedia.org/wiki/K-means_clustering and here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
Below is a brief sketch of how the function could look. Hope it is helpful.
from sklearn.cluster import KMeans
import numpy as np

def downsample(input, cluster=25):
    # you will need to group your intervals in a numpy array as shown below;
    # for the sake of the example I will take just a random array
    X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
    # n_clusters will be the same as the desired output size
    kmeans = KMeans(n_clusters=cluster, random_state=0).fit(X)
    # then you can iterate through the labels that were assigned to every
    # entry of your input, in our case the intervals
    kmeans_list = [[] for _ in range(cluster)]  # one list per label
    for i in range(0, X.shape[0]):
        kmeans_list[kmeans.labels_[i]].append(X[i])
    # after that you basically have a list of lists, and every inner list
    # contains all the points that correspond to a specific label
    ret = []  # return list
    for label_list in kmeans_list:
        left = 10001000  # a big enough number to exceed anything in the input
        right = -left    # same here
        for entry in label_list:
            left = min(left, entry[0])
            right = max(right, entry[1])
        ret.append([left, right])
    return ret
There are 70 coins, and one of them is fake. We need to detect the fake coin in the minimum number of weighings. You have only a balance scale, and you know that the fake coin is lighter.
I am not sure if the simulation of the problem below is right or wrong, i.e. representing it as an array and doing the comparisons as I have done in my code. I am trying to simulate it with an array of all ones, except for one zero, which is considered the fake coin. Below is my code. Please let me know if I have got it wrong.
It would really be helpful if someone could prove/explain why 3-way division is better, in simple maths.
Pseudo code for the below code:
INPUT: integer n

if n = 1 then
    the coin is fake
else
    divide the coins into piles of A = ceiling(n/3), B = ceiling(n/3),
    and C = n - 2*ceiling(n/3)
    weigh A and B
    if the scale balances then
        iterate with C
    else
        iterate with the lighter of A and B
Code:
import random

def getmin(data, start, end, total_items):
    # data[start:end] is the current pile of coins; end is exclusive
    if total_items == 1:
        # for sure we have the fake coin
        return (0, start)
    elif total_items == 2:
        # note: the second coin sits at index end - 1, since end is exclusive
        if data[start] > data[end - 1]:
            return (1, end - 1)
        else:
            return (1, start)
    else:
        partition = total_items // 3
        a_weight = sum(data[start:start + partition])
        b_weight = sum(data[start + partition:start + 2 * partition])
        if a_weight == b_weight:
            result = getmin(data, start + 2 * partition, end,
                            end - (start + 2 * partition))
            return (1 + result[0], result[1])
        elif a_weight > b_weight:
            result = getmin(data, start + partition, start + 2 * partition,
                            partition)
            return (1 + result[0], result[1])
        else:
            result = getmin(data, start, start + partition, partition)
            return (1 + result[0], result[1])

n = int(input())
data = [1] * n
data[random.randint(0, n - 1)] = 0
total_weighing, position = getmin(data, 0, len(data), len(data))
print(total_weighing, position)
The complexity of this algorithm is O(log3(N)), because you reduce the problem size to 1/3 in each iteration. Complexity-wise, O(log3(n)) == O(log2(n)) == O(log10(n)), so it doesn't matter whether you divide the problem size by 3 or by 10. The only better complexity is O(1), and that would mean you can find the fake coin in a fixed number of operations regardless of the size of the problem, which is quite unlikely.
Note that in this algorithm we assume that we can find the sum of a range of elements in O(1); otherwise the algorithm's complexity is O(n).
You ask "why 3-way division is better, in simple maths." Better than what? In this problem, it's the best solution because it achieves the answer in the fewest weighings. The properties of a trivial balance scale yield three basic outcomes: left side heavier, right side heavier, and both sides equal. That's a 3-way decision, so information theory says the best algorithm divides the objects into three (roughly equal) groups at each phase, if you can practically achieve it.
You need 4 weighings for 28-81 coins.
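A quick check of that information-theoretic bound (my own snippet): w weighings distinguish at most 3^w coins, so the minimum number of weighings is the smallest w with 3^w >= n.

def weighings_needed(n):
    # smallest w such that 3**w >= n; avoids floating-point log pitfalls
    w = 0
    while 3 ** w < n:
        w += 1
    return w

print([(n, weighings_needed(n)) for n in (27, 28, 70, 81, 82)])
# [(27, 3), (28, 4), (70, 4), (81, 4), (82, 5)]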
Fortunately, your problem allows for exhaustive testing.
The code above performs one trial of random testing. That's okay for starters, but with only 70 cases to check, I recommend that you try them all. Wrap your main program in a loop over range(70), something like this:
n = 70
for bad_coin in range(70):
    data = [1] * n
    data[bad_coin] = 0
    total_weighing, position = getmin(data, 0, n, n)
    print("trial", bad_coin)
    if total_weighing != 4:
        print("Wrong number of weighings:", total_weighing)
    if position != bad_coin:
        print("Wrong ID:", position)
This will quickly show any error in your program for the assigned 70 coins.
BTW, replace the if statements with assert, if you're comfortable with that feature.
Consider this way of solving the Subset sum problem:
def subset_summing_to_zero(activities):
    subsets = {0: []}
    for (activity, cost) in activities.items():
        old_subsets = subsets
        subsets = {}
        for (prev_sum, subset) in old_subsets.items():
            subsets[prev_sum] = subset
            new_sum = prev_sum + cost
            new_subset = subset + [activity]
            if 0 == new_sum:
                new_subset.sort()
                return new_subset
            else:
                subsets[new_sum] = new_subset
    return []
I have it from here:
http://news.ycombinator.com/item?id=2267392
There is also a comment which says that it is possible to make it "more efficient".
How?
Also, are there any other ways to solve the problem which are at least as fast as the one above?
Edit
I'm interested in any kind of idea which would lead to speed-up. I found:
https://en.wikipedia.org/wiki/Subset_sum_problem#cite_note-Pisinger09-2
which mentions a linear time algorithm. But I don't have the paper, perhaps you, dear people, know how it works? An implementation perhaps? Completely different approach perhaps?
Edit 2
There is now a follow-up:
Fast solution to Subset sum algorithm by Pisinger
I respect the alacrity with which you're trying to solve this problem! Unfortunately, you're trying to solve a problem that's NP-complete, meaning that any further improvement that breaks the polynomial time barrier will prove that P = NP.
The implementation you pulled from Hacker News appears to be consistent with the pseudo-polytime dynamic programming solution, where any additional improvement must, by definition, advance the state of current research into this problem and all of its algorithmic isoforms. In other words: while a constant-factor speedup is possible, you're very unlikely to see an algorithmic improvement to this solution in the context of this thread.
However, you can use an approximate algorithm if you require a polytime solution with a tolerable degree of error. In pseudocode blatantly stolen from Wikipedia, this would be:
initialize a list S to contain one element 0
for each i from 1 to N do
    let T be a list consisting of xi + y, for all y in S
    let U be the union of T and S
    sort U
    make S empty
    let y be the smallest element of U
    add y to S
    for each element z of U in increasing order do
        // trim the list by eliminating numbers close to one another
        // and throw out elements greater than s
        if y + cs/N < z ≤ s, set y = z and add z to S
if S contains a number between (1 − c)s and s, output yes, otherwise no
Python implementation, preserving the original terms as closely as possible:
from bisect import bisect

def ssum(X, c, s):
    """ Simple impl. of the polytime approximate subset sum algorithm
    Returns True if the subset exists within our given error; False otherwise
    """
    S = [0]
    N = len(X)
    for xi in X:
        T = [xi + y for y in S]
        U = set().union(T, S)
        U = sorted(U)  # coercion to list
        S = []
        y = U[0]
        S.append(y)
        for z in U:
            if y + (c * s) / N < z and z <= s:
                y = z
                S.append(z)
    if not c:  # for zero error, check equivalence
        return S[bisect(S, s) - 1] == s
    return bisect(S, (1 - c) * s) != bisect(S, s)
... where X is your bag of terms, c is your precision (between 0 and 1), and s is the target sum.
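A small usage example (my own, not from the original answer):

print(ssum([1, 2, 3, 9], 0.1, 12))  # True: 3 + 9 = 12 is within 10% of 12
print(ssum([1, 2, 3, 9], 0.0, 7))   # False: no exact subset sums to 7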
For more details, see the Wikipedia article.
(Additional reference, further reading on CSTheory.SE)
While my previous answer describes the polytime approximate algorithm for this problem, a request was specifically made for an implementation of Pisinger's polytime dynamic programming solution when all xi in X are positive:
from bisect import bisect

def balsub(X, c):
    """ Simple impl. of Pisinger's generalization of KP for subset sum problems
    satisfying xi >= 0, for all xi in X. Returns the state array "st", which may
    be used to determine if an optimal solution exists to this subproblem of SSP.
    """
    if not X:
        return False
    X = sorted(X)
    n = len(X)
    b = bisect(X, c)
    r = X[-1]
    w_sum = sum(X[:b])
    stm1 = {}
    st = {}
    for u in range(c - r + 1, c + 1):
        stm1[u] = 0
    for u in range(c + 1, c + r + 1):
        stm1[u] = 1
    stm1[w_sum] = b
    for t in range(b, n + 1):
        for u in range(c - r + 1, c + r + 1):
            st[u] = stm1[u]
        for u in range(c - r + 1, c + 1):
            u_tick = u + X[t - 1]
            st[u_tick] = max(st[u_tick], stm1[u])
        for u in reversed(range(c + 1, c + X[t - 1] + 1)):
            for j in reversed(range(stm1[u], st[u])):
                u_tick = u - X[j - 1]
                st[u_tick] = max(st[u_tick], j)
    return st
Wow, that was headache-inducing. This needs proofreading, because, while it implements balsub, I can't define the right comparator to determine if the optimal solution to this subproblem of SSP exists.
I don't know much Python, but there is an approach called meet in the middle.
Pseudocode:
Divide activities into two subarrays, A1 and A2
For both A1 and A2, calculate the subset hashes, H1 and H2, the way you do it in your question
for each (cost, a1) in H1
    if H2.contains(-cost)
        return a1 + H2[-cost]
This will allow you to double the number of elements of activities you can handle in reasonable time.
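A minimal Python sketch of that pseudocode for a plain list of numbers (my own illustration; the dictionaries play the role of H1 and H2):

def mitm_zero_sum(xs):
    def subset_sums(arr):
        table = {0: ()}                     # sum -> one subset achieving it
        for x in arr:
            for s, subset in list(table.items()):
                table.setdefault(s + x, subset + (x,))
        return table
    h1 = subset_sums(xs[:len(xs) // 2])
    h2 = subset_sums(xs[len(xs) // 2:])
    for s, subset in h1.items():
        other = h2.get(-s)
        if other is not None and (subset or other):  # skip empty + empty
            return subset + other
    return None

print(mitm_zero_sum([3, -1, 7, -9, 2]))  # (3, -1, 7, -9) sums to zero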
I apologize for "discussing" the problem, but a Subset Sum problem in which the x values are bounded is not the NP version of the problem. Dynamic programming solutions are known for bounded-x-value problems. This is done by representing the x values as sums of unit lengths. The dynamic programming solutions have a number of fundamental iterations that is linear in the total unit length of the x's. However, Subset Sum is in NP when the precision of the numbers equals N, that is, when the number of base-2 place values needed to state the x's is N. For N = 40, the x's have to be in the billions. In the NP problem the unit length of the x's increases exponentially with N, which is why the dynamic programming solutions are not a polynomial time solution to the NP Subset Sum problem. That being the case, there are still practical instances of the Subset Sum problem where the x's are bounded and the dynamic programming solution is valid.
Here are three ways to make the code more efficient:
1. The code stores a list of activities for each partial sum. It is more efficient in terms of both memory and time to just store the most recent activity needed to make the sum, and work out the rest by backtracking once a solution is found.
2. For each activity the dictionary is repopulated with the old contents (subsets[prev_sum] = subset). It is faster to simply grow a single dictionary.
3. Splitting the values in two and applying a meet in the middle approach.
Applying the first two optimisations results in the following code which is more than 5 times faster:
def subset_summing_to_zero2(activities):
    subsets = {0: -1}
    for (activity, cost) in activities.items():
        for prev_sum in list(subsets.keys()):  # snapshot: subsets is mutated below
            new_sum = prev_sum + cost
            if 0 == new_sum:
                new_subset = [activity]
                while prev_sum:
                    activity = subsets[prev_sum]
                    new_subset.append(activity)
                    prev_sum -= activities[activity]
                return sorted(new_subset)
            if new_sum in subsets:
                continue
            subsets[new_sum] = activity
    return []
Also applying the third optimisation results in something like:
def subset_summing_to_zero3(activities):
    A = list(activities.items())
    mid = len(A) // 2

    def make_subsets(A):
        subsets = {0: -1}
        for (activity, cost) in A:
            for prev_sum in list(subsets.keys()):  # snapshot, as above
                new_sum = prev_sum + cost
                if new_sum and new_sum in subsets:
                    continue
                subsets[new_sum] = activity
        return subsets

    subsets = make_subsets(A[:mid])
    subsets2 = make_subsets(A[mid:])

    def follow_trail(new_subset, subsets, s):
        while s:
            activity = subsets[s]
            new_subset.append(activity)
            s -= activities[activity]

    new_subset = []
    for s in subsets:
        if -s in subsets2:
            follow_trail(new_subset, subsets, s)
            follow_trail(new_subset, subsets2, -s)
            if len(new_subset):
                break
    return sorted(new_subset)
Define bound to be the largest absolute value of the elements.
The algorithmic benefit of the meet in the middle approach depends a lot on bound.
For a low bound (e.g. bound=1000 and n=300), the meet in the middle approach only gets a factor of about 2 improvement over the first improved method. This is because the dictionary called subsets is densely populated.
However, for a high bound (e.g. bound=100,000 and n=30) the meet in the middle takes 0.03 seconds compared to 2.5 seconds for the first improved method (and 18 seconds for the original code)
For high bounds, the meet in the middle will take about the square root of the number of operations of the normal method.
It may seem surprising that meet in the middle is only twice as fast for low bounds. The reason is that the number of operations in each iteration depends on the number of keys in the dictionary. After adding k activities we might expect there to be 2**k keys, but if bound is small then many of these keys collide, so we only have O(bound*k) keys instead.
Thought I'd share my Scala solution for the discussed pseudo-polytime algorithm described on Wikipedia. It's a slightly modified version: it figures out how many unique subsets there are. This is very much related to a HackerRank problem described at https://www.hackerrank.com/challenges/functional-programming-the-sums-of-powers. My coding style might not be excellent, I'm still learning Scala :) Maybe this is still helpful for someone.
import java.io.ByteArrayInputStream
import scala.io.StdIn.readInt

object Solution extends App {
  var input = "1000\n2"
  System.setIn(new ByteArrayInputStream(input.getBytes()))
  println(calculateNumberOfWays(readInt, readInt))

  def calculateNumberOfWays(X: Int, N: Int) = {
    val maxValue = Math.pow(X, 1.0 / N).toInt
    val listOfValues = (1 until maxValue + 1).toList
    val listOfPowers = listOfValues.map(value => Math.pow(value, N).toInt)
    val lists = (0 until maxValue).toList.foldLeft(List(List(0)): List[List[Int]])((newList, i) =>
      newList :+ (newList.last union (newList.last.map(y => y + listOfPowers.apply(i)).filter(z => z <= X))))
    lists.last.count(_ == X)
  }
}
I have the following problem:
Given N objects (N < 30) whose values are multiples of a constant k (i.e. k, 2k, 3k, 4k, 6k, 8k, 12k, 16k, 24k and 32k), I need an algorithm that distributes all items to M players (M <= 6) in such a way that the total value of the objects each player gets is as even as possible (in other words, I want to distribute all objects to all players in the fairest way possible).
EDIT: By fairest distribution I mean that the difference between the value of the objects any two players get is minimal.
Another similar case: I have N coins of different values and I need to divide them equally among M players; sometimes they don't divide exactly, and I need to find the next best case of distribution (where no player is angry because another one got too much money).
I don't need (pseudo)code to solve this (also, this is not homework :) ), but I'll appreciate any ideas or links to algorithms that could solve this.
Thanks!
The problem is strongly NP-complete (see the 3-partition problem; thanks, Paul). This means there is no known way to guarantee an optimal solution in reasonable time.
Instead you'll want to go for a good approximate solution generator. These can often get very close to the optimal answer in very short time. I can recommend the simulated annealing technique, which you will also be able to use for a ton of other NP-complete problems.
The idea is this:
Distribute the items randomly.
Continually make random swaps between two random players, as long as it makes the system more fair, or only a little less fair (see the wiki for details).
Stop when you have something fair enough, or you have run out of time.
This solution is much stronger than the 'greedy' algorithms many suggest. The greedy algorithm is the one where you continually add the largest remaining item to the 'poorest' player. An example of a test case where greedy fails is [10,9,8,7,7,5,5] with three players: greedy ends at totals like (15,16,20), while (17,17,17) is achievable.
I did an implementation of SA for you. It follows the wiki article strictly, for educational purposes. If you optimize it, I would say a 100x improvement wouldn't be unrealistic.
import random, math

values = [10, 9, 8, 7, 7, 5, 5]
M = 3
kmax = 1000
emax = 0

def s0():
    s = [[] for i in range(M)]
    for v in values:
        random.choice(s).append(v)
    return s

def E(s):
    avg = sum(values) / M
    return sum(abs(avg - sum(p)) ** 2 for p in s)

def neighbour(s):
    snew = [p[:] for p in s]
    while True:
        p1, p2 = random.sample(range(M), 2)
        if s[p1]: break
    item = random.randrange(len(s[p1]))
    snew[p2].append(snew[p1].pop(item))
    return snew

def P(e, enew, T):
    if enew < e: return 1
    return math.exp((e - enew) / T)

def temp(r):
    return (1 - r) * 100

s = s0()
e = E(s)
sbest = s
ebest = e
k = 0
while k < kmax and e > emax:
    snew = neighbour(s)
    enew = E(snew)
    if enew < ebest:
        sbest = snew; ebest = enew
    if P(e, enew, temp(k / kmax)) > random.random():
        s = snew; e = enew
    k += 1
print(sbest)
Update: After playing around with branch and bound, I now believe this method to be superior, as it gives perfect results for the N=30, M=6 case within a second. However, I guess you could play around with the simulated annealing approach just as much.
The greedy solution suggested by a few people seems like the best option. I ran it a bunch of times with some random values, and it seems to get it right every time.
If it's not optimal, it's at the very least very close, and it runs in O(nm) or so (I can't be bothered to do the math right now)
C# Implementation:
using System.Collections.Generic;
using System.Linq;

static List<List<int>> Dist(int n, IList<int> values)
{
    var result = new List<List<int>>();
    for (int i = 1; i <= n; i++)
        result.Add(new List<int>());

    var sortedValues = values.OrderByDescending(val => val);
    foreach (int val in sortedValues)
    {
        var lowest = result.OrderBy(a => a.Sum()).First();
        lowest.Add(val);
    }
    return result;
}
How about this:
1. Order the k values.
2. Order the players.
3. Loop over the k values, giving the next one to the next player.
4. When you get to the end of the players, turn around and continue giving the k values to the players in the opposite direction.
Repeatedly give the available object with the largest value to the player who has the least total value of objects assigned to him.
This is a straight-forward implementation of Justin Peel's answer:
M = 3
players = [[] for i in range(M)]
values = [10, 4, 3, 1, 1, 1]
values.sort()
values.reverse()
for v in values:
    lowest = sorted(players, key=lambda x: sum(x))[0]
    lowest.append(v)
print(players)
print([sum(p) for p in players])
I am a beginner with Python, but it seems to work okay. This example will print
[[10], [4, 1], [3, 1, 1]]
[10, 5, 5]
30 ^ 6 isn't that large (it's less than 1 billion). Go through every possible allocation, and pick the one that's the fairest by whatever measure you define.
EDIT:
The purpose was to use the greedy solution with a small improvement in the implementation, which may be more transparent in C#:
static List<List<int>> Dist(int n, IList<int> values)
{
    var result = new List<List<int>>();
    for (int i = 1; i <= n; i++)
        result.Add(new List<int>());

    var sortedValues = values.OrderByDescending(val => val); //Assume the most efficient sorting algorithm - O(N log(N))
    foreach (int val in sortedValues)
    {
        var lowest = result.OrderBy(a => a.Sum()).First(); //This can be done in O(M * log(n)) [M - size of sortedValues, n - size of result]
        lowest.Add(val);
    }
    return result;
}
Regarding this stage:
var lowest = result.OrderBy(a => a.Sum()).First();//This can be done in O(M * log(n)) [M - size of sortedValues, n - size of result]
The idea is that the list is always kept sorted (in this code it is done by OrderBy). Eventually, this sorting won't take more than O(log(n)), because we just need to INSERT at most one item into a sorted list, which should cost the same as a binary search.
Because we need to repeat this phase sortedValues.Length times, the whole algorithm runs in O(M * log(n)).
So, in words, it can be rephrased as: repeat the steps below until you run out of values:
1. Give the biggest remaining value to the player with the smallest sum.
2. Check if this player still has the smallest sum.
3. If yes, go to step 1.
4. Otherwise, re-insert the player that just received a value into the sorted players list.
Step 4 is the O(log(n)) step, as the list is always kept sorted.
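For what it's worth, the same bookkeeping can be sketched in Python with a priority queue (my own addition; heapq stands in for the sorted player list):

import heapq

def distribute(values, m):
    heap = [(0, i, []) for i in range(m)]  # (total value, player id, items)
    heapq.heapify(heap)
    for v in sorted(values, reverse=True):
        total, i, items = heapq.heappop(heap)  # poorest player so far
        items.append(v)
        heapq.heappush(heap, (total + v, i, items))
    return [items for _, _, items in heap]

print(distribute([10, 4, 3, 1, 1, 1], 3))
# [[4, 1], [10], [3, 1, 1]] - totals 5, 10, 5, matching the example above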