Consider a dataset of N rows with weights. This is the basic algorithm:
1. Normalize the weights so that they sum to 1.
2. Back up the weights into another column to record the sample probabilities.
3. Randomly choose 1 row (without replacement), given the sample probabilities, and add it to the sample dataset.
4. Remove the drawn row from the original dataset, and recompute the sample probabilities by normalizing the weights of the remaining rows.
5. Repeat steps 3 and 4 until the sum of weights in the sample reaches or exceeds a threshold (assume 0.6).
Here is a toy example:
import pandas as pd
import numpy as np

def sampler(n):
    df = pd.DataFrame(np.random.rand(n), columns=['weight'])
    df['weight'] = df['weight']/df['weight'].sum()
    df['samp_prob'] = df['weight']
    samps = pd.DataFrame(columns=['weight'])
    while True:
        choice = np.random.choice(df.index, 1, replace=False, p=df['samp_prob'])[0]
        samps.loc[choice, 'weight'] = df.loc[choice, 'weight']
        df.drop(choice, axis=0, inplace=True)
        df['samp_prob'] = df['weight']/df['weight'].sum()
        if samps['weight'].sum() >= 0.6:
            break
    return samps
The problem with the toy example is that its run time grows steeply as n increases.
Starting approach
A few observations:
Dropping rows at every iteration, which creates new dataframes each time, hurts performance.
The loop doesn't look easy to vectorize, but it should be easy to work with the underlying array data for performance. The idea is to use masks and avoid re-creating dataframes or arrays. To start, we use a two-column array whose columns correspond to 'weight' and 'samp_prob'.
So, with those in mind, the starting approach would be something like this -
def sampler2(n):
    a = np.random.rand(n,2)
    a[:,0] /= a[:,0].sum()
    a[:,1] = a[:,0]
    N = len(a)
    idx = np.arange(N)
    mask = np.ones(N,dtype=bool)
    while True:
        choice = np.random.choice(idx[mask], 1, replace=False, p=a[mask,1])[0]
        mask[choice] = 0
        a_masked = a[mask,0]
        a[mask,1] = a_masked/a_masked.sum()
        if a[~mask,0].sum() >= 0.6:
            break
    out = a[~mask,0]
    return out
Improvement #1
A later observation revealed that the first column of the array doesn't change across iterations. So we can optimize the masked summations over the first column by pre-computing the total sum; at each iteration, a[~mask,0].sum() is then simply the total sum minus a_masked.sum(). This leads us to the first improvement, listed below -
def sampler3(n):
    a = np.random.rand(n,2)
    a[:,0] /= a[:,0].sum()
    a[:,1] = a[:,0]
    N = len(a)
    idx = np.arange(N)
    mask = np.ones(N,dtype=bool)
    a0_sum = a[:,0].sum()
    while True:
        choice = np.random.choice(idx[mask], 1, replace=False, p=a[mask,1])[0]
        mask[choice] = 0
        a_masked = a[mask,0]
        a_masked_sum = a_masked.sum()
        a[mask,1] = a_masked/a_masked_sum
        if a0_sum - a_masked_sum >= 0.6:
            break
    out = a[~mask,0]
    return out
Improvement #2
Now, slicing and masking into the columns of a 2D array could be improved by using two separate arrays instead, given that the first column wasn't changing between iterations. That gives us a modified version, like so -
def sampler4(n):
    a = np.random.rand(n)
    a /= a.sum()
    b = a.copy()
    N = len(a)
    idx = np.arange(N)
    mask = np.ones(N,dtype=bool)
    a_sum = a.sum()
    while True:
        choice = np.random.choice(idx[mask], 1, replace=False, p=b[mask])[0]
        mask[choice] = 0
        a_masked = a[mask]
        a_masked_sum = a_masked.sum()
        b[mask] = a_masked/a_masked_sum
        if a_sum - a_masked_sum >= 0.6:
            break
    out = a[~mask]
    return out
Runtime test -
In [250]: n = 1000
In [251]: %timeit sampler(n) # original app
...: %timeit sampler2(n)
...: %timeit sampler3(n)
...: %timeit sampler4(n)
1 loop, best of 3: 655 ms per loop
10 loops, best of 3: 50 ms per loop
10 loops, best of 3: 44.9 ms per loop
10 loops, best of 3: 38.4 ms per loop
In [252]: n = 2000
In [253]: %timeit sampler(n) # original app
...: %timeit sampler2(n)
...: %timeit sampler3(n)
...: %timeit sampler4(n)
1 loop, best of 3: 1.32 s per loop
10 loops, best of 3: 134 ms per loop
10 loops, best of 3: 119 ms per loop
10 loops, best of 3: 100 ms per loop
Thus, we are getting 17x+ and 13x+ speedups with the final version over the original method for n=1000 and n=2000 sizes!
I think you can rewrite this while loop to do it in a single pass:
while True:
    choice = np.random.choice(df.index, 1, replace=False, p=df['samp_prob'])[0]
    samps.loc[choice, 'weight'] = df.loc[choice, 'weight']
    df.drop(choice, axis=0, inplace=True)
    df['samp_prob'] = df['weight']/df['weight'].sum()
    if samps['weight'].sum() >= 0.6:
        break
to something more like:
n = len(df.index)
ind = np.random.choice(n, n, replace=False, p=df["samp_prob"])
res = df.iloc[ind]
# first *position* (not label) where the running sum of weights reaches 0.6
i = np.argmax(res["weight"].cumsum().values >= 0.6)
samps = res.iloc[:i+1]
The key parts are that choice can take multiple elements (indeed the entire array) whilst still respecting the probabilities. The cumsum allows you to cut off after passing the 0.6 threshold.
In this example you can see that the order is random, but that 4 (the most probable element) tends to be chosen nearer the front.
In [11]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[11]: array([0, 4, 3, 2, 1])
In [12]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[12]: array([3, 4, 1, 2, 0])
In [13]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[13]: array([0, 4, 3, 1, 2])
In [14]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[14]: array([4, 3, 0, 2, 1])
In [15]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[15]: array([4, 2, 3, 0, 1])
In [16]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[16]: array([3, 4, 2, 0, 1])
Note: replace=False ensures the probabilities are effectively "reweighted", in the sense that an element that has been picked can't be picked again.
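As a rough sketch, the single-pass idea can be packaged into one function working on plain NumPy arrays. The function name, the `rng` parameter and the use of `numpy.random.default_rng` are illustrative assumptions, not part of the answer above. The sketch also relies on the answer's premise that one `choice` call with `replace=False` is an acceptable stand-in for the draw-and-renormalize loop.

```python
import numpy as np

def sample_until_threshold(weights, threshold=0.6, rng=None):
    """Return indices drawn without replacement until their normalized
    weights sum to at least `threshold` (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()                       # normalized sampling probabilities
    # One call yields a full weighted ordering without replacement.
    order = rng.choice(len(p), size=len(p), replace=False, p=p)
    running = np.cumsum(p[order])         # running total in draw order
    # First position at which the running total reaches the threshold.
    cut = int(np.searchsorted(running, threshold, side="left"))
    return order[:cut + 1]

picked = sample_until_threshold(np.random.default_rng(0).random(1000))
```

The returned array holds positions into `weights`, so `weights[picked]` gives the sampled weights in draw order.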
I have a network for semantic segmentation and the last layer of my model applies a sigmoid activation, so all predictions are scaled between 0 and 1. There is a validation metric, tf.keras.metrics.MeanIoU(num_classes), which compares classified predictions (0 or 1) with validation labels (0 or 1). So if I make a prediction and apply this metric, will it automatically map the continuous predictions to binary with threshold = 0.5? Is there any way to define the threshold manually?
No, tf.keras.metrics.MeanIoU will not automatically map the continuous predictions to binary with threshold = 0.5.
It converts continuous predictions to class labels by keeping only the integer part before the decimal point: with num_classes=2, predictions such as 0.99, 0.50 and 0.01 are treated as 0, while 1.99 and 1.01 are treated as 1. So if your predicted values lie between 0 and 1 and num_classes=2, everything is considered 0 unless the prediction is exactly 1.
The experiments below demonstrate this behavior in TensorFlow version 2.2.0:
All binary result :
import tensorflow as tf
m = tf.keras.metrics.MeanIoU(num_classes=2)
_ = m.update_state([0, 0, 1, 1], [0, 0, 1, 1])
m.result().numpy()
Output -
1.0
Change one prediction to continuous 0.99 - Here it considers 0.99 as 0.
import tensorflow as tf
m = tf.keras.metrics.MeanIoU(num_classes=2)
_ = m.update_state([0, 0, 1, 1], [0, 0, 1, 0.99])
m.result().numpy()
Output -
0.5833334
Change one prediction to continuous 0.01 - Here it considers 0.01 as 0.
import tensorflow as tf
m = tf.keras.metrics.MeanIoU(num_classes=2)
_ = m.update_state([0, 0, 1, 1], [0, 0.01, 1, 1])
m.result().numpy()
Output -
1.0
Change one prediction to continuous 1.99 - Here it considers 1.99 as 1.
%tensorflow_version 2.x
import tensorflow as tf
m = tf.keras.metrics.MeanIoU(num_classes=2)
_ = m.update_state([0, 0, 1, 1], [0, 0, 1, 1.99])
m.result().numpy()
Output -
1.0
So the ideal way is to define a function that converts the continuous predictions to binary before evaluating MeanIoU.
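To make the effect of thresholding concrete, here is a small framework-free sketch (the helper names `binarize` and `mean_iou` are made up for illustration and are not TensorFlow APIs). It thresholds the predictions first and then computes a two-class mean IoU by hand. With tf.keras the same idea would presumably be to apply something like `tf.cast(y_pred > 0.5, tf.int32)` before calling `update_state`.

```python
def binarize(preds, threshold=0.5):
    """Map continuous scores to 0/1 labels using an explicit threshold."""
    return [1 if p > threshold else 0 for p in preds]

def mean_iou(y_true, y_pred, num_classes=2):
    """Mean intersection-over-union over the classes that actually occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union:                      # skip classes absent from both inputs
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

y_true = [0, 0, 1, 1]
raw = [0, 0, 1, 0.99]
print(mean_iou(y_true, [int(p) for p in raw]))  # truncation: 0.99 counts as 0
print(mean_iou(y_true, binarize(raw)))          # thresholding: 0.99 counts as 1
```

The first call mimics the truncation shown in the experiments, the second the behavior the question was hoping for.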
Hope this answers your question. Happy Learning.
Try this (TensorFlow 1.x API; mind the indentation):
import tensorflow as tf
from tensorflow.keras import backend as K  # assumed import; the original snippet used K without defining it

def mean_iou(y_true, y_pred):
    th = 0.5
    y_pred_ = tf.to_int32(y_pred > th)
    score, up_opt = tf.metrics.mean_iou(y_true, y_pred_, 2)
    K.get_session().run(tf.local_variables_initializer())
    with tf.control_dependencies([up_opt]):
        score = tf.identity(score)
    return score
Say you have a vertical game board of length n (being the number of spaces). And you have a three-sided die that has the options: go forward one, stay and go back one. If you go below or above the number of board game spaces it is an invalid game. The only valid move once you reach the end of the board is "stay". Given an exact number of die rolls t, is it possible to algorithmically work out the number of unique dice rolls that result in a winning game?
So far I've tried producing a list of every possible combination of (-1,0,1) for the given number of die rolls and sorting through the list to see if any add up to the length of the board and also meet all the requirements for being a valid game. But this is impractical for dice rolls above 20.
For example:
t=1, n=2; Output=1
t=3, n=2; Output=3
You can use a dynamic programming approach. The sketch of a recurrence is:
M(0, 1) = 1
M(t, n) = M(t-1, n-1) + M(t-1, n) + M(t-1, n+1)
Of course you have to consider the border cases (like going off the board, or not being allowed to leave the end of the board), but that's easy to code.
Here's some Python code:
def solve(N, T):
    M, M2 = [0]*N, [0]*N
    M[0] = 1
    for i in range(T):
        M, M2 = M2, M
        for j in range(N):
            # arrive from below, stay, or arrive from above (never from the final square)
            M[j] = (j>0 and M2[j-1]) + M2[j] + (j+1<N-1 and M2[j+1])
    return M[N-1]

print(solve(3, 2)) #1
print(solve(2, 1)) #1
print(solve(2, 3)) #3
print(solve(5, 20)) #19535230
Bonus: fancy "one-liner" with list comprehension and reduce
from functools import reduce  # reduce is not a builtin on Python 3

def solve(N, T):
    return reduce(
        lambda M, _: [(j>0 and M[j-1]) + M[j] + (j<N-2 and M[j+1]) for j in range(N)],
        range(T), [1]+[0]*(N-1))[-1]
Let M[i, j] be an N by N matrix with M[i, j] = 1 if |i-j| <= 1 and 0 otherwise (and the special case for the "stay" rule of M[N, N-1] = 0)
This matrix counts paths of length 1 from position i to position j.
To find paths of length t, simply raise M to the t'th power. This can be performed efficiently by linear algebra packages.
The solution can be read off: M^t[1, N].
For example, computing paths of length 20 on a board of size 5 in an interactive Python session:
>>> import numpy
>>> M = numpy.matrix('1 1 0 0 0;1 1 1 0 0; 0 1 1 1 0; 0 0 1 1 1; 0 0 0 0 1')
>>> M
matrix([[1, 1, 0, 0, 0],
[1, 1, 1, 0, 0],
[0, 1, 1, 1, 0],
[0, 0, 1, 1, 1],
[0, 0, 0, 0, 1]])
>>> M ** 20
matrix([[31628466, 51170460, 51163695, 31617520, 19535230],
[51170460, 82792161, 82787980, 51163695, 31617520],
[51163695, 82787980, 82792161, 51170460, 31628465],
[31617520, 51163695, 51170460, 31628466, 19552940],
[ 0, 0, 0, 0, 1]])
So there's M^20[1, 5], or 19535230 paths of length 20 from start to finish on a board of size 5.
Try a backtracking algorithm. Recursively "dive down" to depth t and only continue with dice values that could still result in a valid state, probably by passing a "remaining budget" around.
For example, with n=10 and t=20, when you have reached depth 10 of 20 and your budget is still 10 (i.e. the steps forward and backward so far have cancelled out), the remaining recursion steps down to depth t would discard the 0 and -1 possibilities, because they could no longer result in a valid state at the end.
A backtracking algorithm for this case is still very heavy (exponential), but better than first generating every possibility and then filtering.
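A minimal sketch of that pruning idea, assuming squares are numbered 0 to n-1, the game starts on square 0, and the "remaining budget" is simply the number of rolls left compared with the distance still to cover (the function name `count_games` is made up here):

```python
def count_games(n, t):
    """Count roll sequences of length t that end on the last of n squares."""
    goal = n - 1

    def dive(pos, rolls_left):
        # Prune: the end can no longer be reached with the rolls that remain.
        if goal - pos > rolls_left:
            return 0
        # No rolls left; the pruning above guarantees pos == goal here.
        if rolls_left == 0:
            return 1
        # On the final square only "stay" is valid, so one continuation exists.
        if pos == goal:
            return 1
        total = dive(pos + 1, rolls_left - 1) + dive(pos, rolls_left - 1)
        if pos > 0:                      # stepping below the board is invalid
            total += dive(pos - 1, rolls_left - 1)
        return total

    return dive(0, t)

print(count_games(2, 1))
print(count_games(2, 3))
```

Because it still visits every valid game one by one, this is only practical for small t; the counting approaches in the other answers scale far better.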
Since zeros can be added anywhere, we'll multiply those possibilities by the different arrangements of (-1)'s:
X (space 1) X (space 2) X (space 3) X (space 4) X
(-1)'s can only appear in spaces 1,2 or 3, not in space 4. I got help with the mathematical recurrence that counts the number of ways to place minus ones without skipping backwards.
JavaScript code:
function C(n,k){if(k==0||n==k)return 1;var p=n;for(var i=2;i<=k;i++)p*=(n+1-i)/i;return p}
function sumCoefficients(arr,cs){
    var s = 0, i = -1;
    while (arr[++i]){
        s += cs[i] * arr[i];
    }
    return s;
}
function f(n,t){
    // the comma after the first declaration keeps the other names local
    var numMinusOnes = (t - (n-1)) >> 1,
        result = C(t,n-1),
        numPlaces = n - 2,
        cs = [];
    for (var i=1; numPlaces-i>=i-1; i++){
        cs.push(-Math.pow(-1,i) * C(numPlaces + 1 - i,i));
    }
    var As = new Array(cs.length),
        An;
    As[0] = 1;
    for (var m=1; m<=numMinusOnes; m++){
        var zeros = t - (n-1) - 2*m;
        An = sumCoefficients(As,cs);
        As.unshift(An);
        As.pop();
        result += An * C(zeros + 2*m + n-1,zeros);
    }
    return result;
}
Output:
console.log(f(5,20))
19535230
I need help optimizing this loop. Matrix_1 is an (n x 2) int matrix and Matrix_2 is an (m x 2) matrix, where m and n vary.
index_j = 1;
for index_k = 1:size(Matrix_1,1)
    for index_l = 1:size(Matrix_2,1)
        M2_Index_Dist(index_j,:) = [index_l, sqrt(bsxfun(@plus,sum(Matrix_1(index_k,:).^2,2),sum(Matrix_2(index_l,:).^2,2)')-2*(Matrix_1(index_k,:)*Matrix_2(index_l,:)'))];
        index_j = index_j + 1;
    end
end
I need M2_Index_Dist to provide a ((n*m) x 2) matrix with the index of matrix_2 in the first column and the distance in the second column.
Output example:
M2_Index_Dist = [ 1, 5.465
2, 56.52
3, 6.21
1, 35.3
2, 56.52
3, 0
1, 43.5
2, 9.3
3, 236.1
1, 8.2
2, 56.52
3, 5.582]
Here's how to apply bsxfun with your formula (||A-B|| = sqrt(||A||^2 + ||B||^2 - 2*A*B)):
d = real(sqrt(bsxfun(@plus, dot(Matrix_1,Matrix_1,2), ...
    bsxfun(@minus, dot(Matrix_2,Matrix_2,2).', 2 * Matrix_1*Matrix_2.')))).';
You can avoid the final transpose if you change your interpretation of the matrix.
Note: There shouldn't be any complex values to handle with real but it's there in case of very small differences that may lead to tiny negative numbers.
Edit: It may be faster without dot:
d = sqrt(bsxfun(@plus, sum(Matrix_1.*Matrix_1,2), ...
    bsxfun(@minus, sum(Matrix_2.*Matrix_2,2)', 2 * Matrix_1*Matrix_2.'))).';
Or with just one call to bsxfun:
d = sqrt(bsxfun(@plus, sum(Matrix_1.*Matrix_1,2), sum(Matrix_2.*Matrix_2,2)') ...
    - 2 * Matrix_1*Matrix_2.').';
Note: This last order of operations gives results identical to yours, rather than differing by an error of ~1e-14.
Edit 2: To replicate M2_Index_Dist:
% d is size(Matrix_2,1)-by-size(Matrix_1,1), so the grid must have the same shape
II = ndgrid(1:size(Matrix_2,1),1:size(Matrix_1,1));
M2_Index_Dist = [II(:) d(:)];
If I understand correctly, this does what you want:
ind = repmat((1:size(Matrix_2,1)).',size(Matrix_1,1),1); %'// first column: index
d = pdist2(Matrix_2,Matrix_1); %// compute distance between each pair of rows
d = d(:); %// second column: distance
result = [ind d]; %// build result from first column and second column
As you see, this code calls pdist2 to compute the distance between every pair of rows of your matrices. By default this function uses Euclidean distance.
If you don't have pdist2 (which is part of the Statistics Toolbox), you can replace line 2 above with bsxfun:
d = squeeze(sqrt(sum(bsxfun(@minus,Matrix_2,permute(Matrix_1, [3 2 1])).^2,2)));
I have an array of non-negative values. I want to build an array of values whose sum is 20 so that they are proportional to the first array.
This would be an easy problem, except that I want the proportional array to sum to exactly 20, compensating for any rounding error.
For example, the array
input = [400, 400, 0, 0, 100, 50, 50]
would yield
output = [8, 8, 0, 0, 2, 1, 1]
sum(output) = 20
However, most cases are going to have a lot of rounding errors, like
input = [3, 3, 3, 3, 3, 3, 18]
naively yields
output = [1, 1, 1, 1, 1, 1, 10]
sum(output) = 16 (ouch)
Is there a good way to apportion the output array so that it adds up to 20 every time?
There's a very simple answer to this question: I've done it many times. After each assignment into the new array, you reduce the values you're working with as follows:
1. Call the first array A, and the new, proportional array B (which starts out empty).
2. Call the sum of A's elements T.
3. Call the desired sum S.
4. For each element of the array (i) do the following:
   a. B[i] = round(A[i] / T * S) (rounding to the nearest integer, penny or whatever is required)
   b. T = T - A[i]
   c. S = S - B[i]
That's it! Easy to implement in any programming language or in a spreadsheet.
The solution is optimal in that the resulting array's elements will never be more than 1 away from their ideal, non-rounded values. Let's demonstrate with your example:
T = 36, S = 20. B[1] = round(A[1] / T * S) = 2. (ideally, 1.666....)
T = 33, S = 18. B[2] = round(A[2] / T * S) = 2. (ideally, 1.666....)
T = 30, S = 16. B[3] = round(A[3] / T * S) = 2. (ideally, 1.666....)
T = 27, S = 14. B[4] = round(A[4] / T * S) = 2. (ideally, 1.666....)
T = 24, S = 12. B[5] = round(A[5] / T * S) = 2. (ideally, 1.666....)
T = 21, S = 10. B[6] = round(A[6] / T * S) = 1. (ideally, 1.666....)
T = 18, S = 9. B[7] = round(A[7] / T * S) = 9. (ideally, 10)
Notice that when comparing every value in B with its ideal value in parentheses, the difference is never more than 1.
It's also interesting to note that rearranging the elements in the array can result in different corresponding values in the resulting array. I've found that arranging the elements in ascending order is best, because it results in the smallest average percentage difference between actual and ideal.
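The steps above can be sketched in Python roughly as follows. The function name `apportion` is an assumption for illustration, integer inputs are assumed, and halves are rounded up explicitly because Python's built-in round() rounds halves to the nearest even number.

```python
import math

def apportion(values, target):
    """Scale `values` to integers that sum to `target`, shrinking the
    remaining total (T) and remaining target (S) after every assignment."""
    remaining_total = sum(values)    # T
    remaining_target = target        # S
    result = []
    for a in values:
        if remaining_total == 0:
            share = 0                # only zeros are left
        else:
            # round(A[i] / T * S), written to round halves up
            share = math.floor(a * remaining_target / remaining_total + 0.5)
        result.append(share)
        remaining_total -= a
        remaining_target -= share
    return result

print(apportion([3, 3, 3, 3, 3, 3, 18], 20))  # [2, 2, 2, 2, 2, 1, 9]
```

The last non-zero element always receives whatever is left of the target, which is what forces the exact sum.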
Your problem is similar to proportional representation, where you want to share N seats (in your case 20) among parties proportionally to the votes they obtain, in your case [3, 3, 3, 3, 3, 3, 18].
There are several methods used in different countries to handle the rounding problem. My code below uses the Hagenbach-Bischoff quota method used in Switzerland, which basically allocates the seats remaining after an integer division by (N+1) to parties which have the highest remainder:
def proportional(nseats,votes):
    """assign n seats proportionally to votes using Hagenbach-Bischoff quota
    :param nseats: int number of seats to assign
    :param votes: iterable of int or float weighting each party
    :result: list of ints seats allocated to each party
    """
    quota=sum(votes)/(1.+nseats) #force float
    frac=[vote/quota for vote in votes]
    res=[int(f) for f in frac]
    n=nseats-sum(res) #number of seats remaining to allocate
    if n==0: return res #done
    if n<0: return [min(x,nseats) for x in res] # see siamii's comment
    #give the remaining seats to the n parties with the largest remainder
    remainders=[ai-bi for ai,bi in zip(frac,res)]
    limit=sorted(remainders,reverse=True)[n-1]
    #n parties with remainder larger than limit get an extra seat
    for i,r in enumerate(remainders):
        if r>=limit:
            res[i]+=1
            n-=1 # attempt to handle perfect equality
            if n==0: return res #done
    raise RuntimeError("seats left unallocated") #should never happen
However this method doesn't always give the same number of seats to parties with perfect equality as in your case:
proportional(20,[3, 3, 3, 3, 3, 3, 18])
[2,2,2,2,1,1,10]
You have set 3 incompatible requirements. An integer-valued array proportional to [1,1,1] cannot be made to sum to exactly 20. You must choose to break one of the "sum to exactly 20", "proportional to input", and "integer values" requirements.
If you choose to break the requirement for integer values, then use floating point or rational numbers. If you choose to break the exact sum requirement, then you've already solved the problem. Choosing to break proportionality is a little trickier. One approach you might take is to figure out how far off your sum is, and then distribute corrections randomly through the output array. For example, if your input is:
[1, 1, 1]
then you could first make it sum as well as possible while still being proportional:
[7, 7, 7]
and since 20 - (7+7+7) = -1, choose one element to decrement at random:
[7, 6, 7]
If the error was 4, you would choose four elements to increment.
A naïve solution that doesn't perform well, but will provide the right result...
Write an iterator that given an array with eight integers (candidate) and the input array, output the index of the element that is farthest away from being proportional to the others (pseudocode):
function next_index(candidate, input)
    // Calculate weights
    for i in 1 .. 8
        w[i] = candidate[i] / input[i]
    end for
    // find the smallest weight
    min = infinity   // starting from 0 would never select anything
    min_index = 1
    for i in 1 .. 8
        if w[i] < min then
            min = w[i]
            min_index = i
        end if
    end for
    return min_index
end function
Then just do this
result = [0, 0, 0, 0, 0, 0, 0, 0]
result[next_index(result, input)]++ for 1 .. 20
If there is no optimal solution, it'll skew towards the beginning of the array.
Using the approach above, you can reduce the number of iterations by rounding down (as you did in your example) and then just use the approach above to add what has been left out due to rounding errors:
result = <<approach using rounding down>>
while sum(result) < 20
result[next_index(result, input)]++
So the answers and comments above were helpful... particularly the decreasing sum comment from @Frederik.
The solution I came up with takes advantage of the fact that for an input array v, sum(v_i * 20) is divisible by sum(v). So for each value in v, I multiply by 20 and divide by the sum. I keep the quotient, and accumulate the remainder. Whenever the accumulator reaches or exceeds sum(v), I add one to the value. That way I'm guaranteed that all the remainders get rolled into the results.
Is that legible? Here's the implementation in Python:
def proportion(values, total):
    # set up by getting the sum of the values and starting
    # with an empty result list and accumulator
    sum_values = sum(values)
    new_values = []
    acc = 0
    for v in values:
        # for each value, find quotient and remainder
        q, r = divmod(v * total, sum_values)
        if acc + r < sum_values:
            # if the accumulator plus remainder is too small, just add and move on
            acc += r
        else:
            # we've accumulated enough to go over sum(values), so add 1 to result
            if acc > r:
                # add to previous
                new_values[-1] += 1
            else:
                # add to current
                q += 1
            acc -= sum_values - r
        # save the new value
        new_values.append(q)
    # accumulator is guaranteed to be zero at the end
    print(new_values, sum_values, acc)
    return new_values
(I added an enhancement that if the accumulator > remainder, I increment the previous value instead of the current value)
Let's say I have an array of floating point numbers, in sorted (let's say ascending) order, whose sum is known to be an integer N. I want to "round" these numbers to integers while leaving their sum unchanged. In other words, I'm looking for an algorithm that converts the array of floating-point numbers (call it fn) to an array of integers (call it in) such that:
1. the two arrays have the same length
2. the sum of the array of integers is N
3. the difference between each floating-point number fn[i] and its corresponding integer in[i] is less than 1 (or equal to 1 if you really must)
4. given that the floats are in sorted order (fn[i] <= fn[i+1]), the integers will also be in sorted order (in[i] <= in[i+1])
Given that those four conditions are satisfied, an algorithm that minimizes the rounding variance (sum((in[i] - fn[i])^2)) is preferable, but it's not a big deal.
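For experimenting with candidate algorithms, the four conditions can be written down as a small checker. This is only a sketch: the name `satisfies_conditions` and the `strict` flag are made up here, and the integer array is called `ints` because `in` is a reserved word in Python.

```python
def satisfies_conditions(fn, ints, strict=True):
    """Check the four conditions for floats `fn` and candidate integers `ints`.
    With strict=False a difference of exactly 1 is tolerated."""
    n = round(sum(fn))                               # the integer total N
    same_length = len(fn) == len(ints)
    same_sum = sum(ints) == n
    if strict:
        close = all(abs(i - f) < 1 for f, i in zip(fn, ints))
    else:
        close = all(abs(i - f) <= 1 for f, i in zip(fn, ints))
    ordered = all(a <= b for a, b in zip(ints, ints[1:]))
    return same_length and same_sum and close and ordered

print(satisfies_conditions([0.5, 0.5, 11], [0, 1, 11]))
print(satisfies_conditions([0.5, 0.5, 11], [0, 0, 12]))
print(satisfies_conditions([0.5, 0.5, 11], [0, 0, 12], strict=False))
```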
Examples:
[0.02, 0.03, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14]
=> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[0.1, 0.3, 0.4, 0.4, 0.8]
=> [0, 0, 0, 1, 1]
[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
=> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[0.4, 0.4, 0.4, 0.4, 9.2, 9.2]
=> [0, 0, 1, 1, 9, 9] is preferable
=> [0, 0, 0, 0, 10, 10] is acceptable
[0.5, 0.5, 11]
=> [0, 1, 11] is fine
=> [0, 0, 12] is technically not allowed but I'd take it in a pinch
To answer some excellent questions raised in the comments:
Repeated elements are allowed in both arrays (although I would also be interested to hear about algorithms that work only if the array of floats does not include repeats)
There is no single correct answer - for a given input array of floats, there are generally multiple arrays of ints that satisfy the four conditions.
The application I had in mind was - and this is kind of odd - distributing points to the top finishers in a game of MarioKart ;-) Never actually played the game myself, but while watching someone else I noticed that there were 24 points distributed among the top 4 finishers, and I wondered how it might be possible to distribute the points according to finishing time (so if someone finishes with a large lead they get a larger share of the points). The game tracks point totals as integers, hence the need for this kind of rounding.
For the curious, here is the test script I used to identify which algorithms worked.
One option you could try is "cascade rounding".
For this algorithm you keep track of two running totals: one of floating point numbers so far, and one of the integers so far.
To get the next integer you add the next fp number to your running total, round the running total, then subtract the integer running total from the rounded running total:-
number   running total   integer   integer running total
1.3      1.3             1         1
1.7      3.0             2         3
1.9      4.9             2         5
2.2      7.1             2         7
2.8      9.9             3         10
3.1      13.0            3         13
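A short sketch of cascade rounding in Python, assuming halves are rounded up (the function name `cascade_round` is illustrative):

```python
import math

def cascade_round(values):
    """Round each value so that the running integer total always equals
    the rounded running float total."""
    result = []
    float_total = 0.0   # running total of the original numbers
    int_total = 0       # running total of the integers emitted so far
    for v in values:
        float_total += v
        rounded_total = math.floor(float_total + 0.5)
        result.append(rounded_total - int_total)
        int_total = rounded_total
    return result

print(cascade_round([1.3, 1.7, 1.9, 2.2, 2.8, 3.1]))  # [1, 2, 2, 2, 3, 3]
```

The integers always sum to the rounded total of the inputs, but nothing in the procedure by itself keeps the output sorted, so that condition has to be checked separately.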
Here is one algorithm which should accomplish the task. The main difference from other algorithms is that this one always rounds the numbers in the correct order, minimizing roundoff error.
The language is a pseudo language, probably derived from JavaScript or Lua, but it should explain the point. Note the one-based indexing (which is nicer with x to y for loops. :p)
// Temp array with same length as fn.
tempArr = Array(fn.length)
// Calculate the expected sum.
arraySum = sum(fn)
lowerSum = 0
// Populate temp array.
for i = 1 to fn.length
    tempArr[i] = { result: floor(fn[i]),              // Lower bound
                   difference: fn[i] - floor(fn[i]),  // Roundoff error
                   index: i }                         // Original index
    // Calculate the lower sum
    lowerSum = lowerSum + tempArr[i].result
end for
// Sort the temp array on the roundoff error
sort(tempArr, "difference")
// Now arraySum - lowerSum gives us the difference between sums of these
// arrays. tempArr is ordered in such a way that the numbers closest to the
// next one are at the end.
difference = arraySum - lowerSum
// Add 1 to those most likely to round up to the next number so that
// the difference is nullified.
for i = (tempArr.length - difference + 1) to tempArr.length
    tempArr[i].result = tempArr[i].result + 1
end for
// Optionally sort the array based on the original index.
sort(tempArr, "index")
One really easy way is to take all the fractional parts and sum them up. That number by the definition of your problem must be a whole number. Distribute that whole number evenly starting with the largest of your numbers. Then give one to the second largest number... etc. until you run out of things to distribute.
Note this is pseudocode... and may be off by one in an index... it's late and I am sleepy.
float accumulator = 0;
for (i = 0; i < num_elements; i++) /* assumes 0 based array */
{
    accumulator += (fn[i] - floor(fn[i]));
    fn[i] = floor(fn[i]); /* keep only the integer part */
}
i = num_elements;
while ((accumulator > 0.5) && (i > 0)) /* 0.5 guards against floating-point residue */
{
    fn[i-1] += 1; /* assumes 0 based array */
    accumulator -= 1;
    i--;
}
Update: There are other methods of distributing the accumulated values based on how much truncation was performed on each value. This would require keeping a separate list called loss[i] = fn[i] - floor(fn[i]). You can then iterate over the fn[i] list and give 1 to the item with the greatest loss repeatedly (setting its loss[i] to 0 afterwards). It's complicated but I guess it works.
How about:
a) start: array is [0.1, 0.2, 0.4, 0.5, 0.8], N=2 (the sum of the array), presuming it's sorted
b) round them all the usual way: array is [0 0 0 1 1]
c) get the sum of the new array and subtract it from N to get the remainder.
d) while remainder>0, iterate through elements, going from the last one
- check if the new value would break rule 3.
- if not, add 1
e) in case that remainder<0, iterate from first one to the last one
- check if the new value would break rule 3.
- if not, subtract 1
Essentially what you'd do is distribute the leftovers after rounding to the most likely candidates.
Round the floats as you normally would, but keep track of the delta from rounding and associated index into fn and in.
Sort the second array by delta.
While sum(in) < N, work forwards from the largest negative delta, incrementing the rounded value (making sure you still satisfy rule #3).
Or, while sum(in) > N, work backwards from the largest positive delta, decrementing the rounded value (making sure you still satisfy rule #3).
Example:
[0.02, 0.03, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14] N=1
1. [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] sum=0
   and [[-0.02, 0], [-0.03, 1], [-0.05, 2], [-0.06, 3], [-0.07, 4], [-0.08, 5],
        [-0.09, 6], [-0.1, 7], [-0.11, 8], [-0.12, 9], [-0.13, 10], [-0.14, 11]]
2. sorting will reverse the array
3. working from the largest negative remainder, you get [-0.14, 11].
   Increment `in[11]` and you get [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] sum=1
Done.
Can you try something like this?
in [i] = fn [i] - int (fn [i]);
fn_res [i] = fn [i] - in [i];
fn_res → is the resultant fraction.
(I thought this was basic ...), Are we missing something?
Well, 4 is the pain point. Otherwise you could do things like "usually round down and accumulate leftover; round up when accumulator >= 1". (edit: actually, that might still be OK as long as you swapped their position?)
There might be a way to do it with linear programming? (that's maths "programming", not computer programming - you'd need some maths to find the feasible solution, although you could probably skip the usual "optimisation" part).
As an example of the linear programming - with the example [1.3, 1.7, 1.9, 2.2, 2.8, 3.1] you could have the rules:
1 <= i <= 2
1 <= j <= 2
1 <= k <= 2
2 <= l <= 3
2 <= m <= 3
3 <= n <= 4
i <= j <= k <= l <= m <= n
i + j + k + l + m + n = 13
Then apply some linear/matrix algebra ;-p Hint: there are products to do the above based on things like the "Simplex" algorithm. Common university fodder, too (I wrote one at uni for my final project).
The problem, as I see it, is that the sorting algorithm is not specified. Or more like - whether it's a stable sort or not.
Consider the following array of floats:
[ 0.2 0.2 0.2 0.2 0.2 ]
The sum is 1. The integer array then should be:
[ 0 0 0 0 1 ]
However, if the sorting algorithm isn't stable, it could sort the "1" somewhere else in the array...
Keep the summed fractional parts under 1, and check that the result stays sorted.
Something like:
i = 0; res = 0;
while (i < sizeof(fn) / sizeof(float)) {
    res += fn[i] - floor(fn[i]);
    if (res >= 1) {
        res--;
        in[i] = ceil(fn[i]);
    }
    else
        in[i] = floor(fn[i]);
    if (i > 0 && in[i-1] > in[i])
        swap(in[i-1], in[i]);
    i++;
}
(It's paper code, so I didn't check the validity.)
Below is a Python and NumPy implementation of @mikko-rantanen's code. It took me a bit to put this together, so this may be helpful to future Googlers despite the age of the topic.
import numpy as np
from math import floor
original_array = np.array([1.2, 1.5, 1.4, 1.3, 1.7, 1.9])
# Calculate the last valid index of the original array
# Need to subtract 1, as indices start at 0, but the element count
# starts at 1
array_len = original_array.size - 1 # Index starts at 0, but count at 1
# Calculate expected sum of original values (must be integer)
expected_sum = np.sum(original_array)
# Collect values for temporary array population
array_list = []
lower_sum = 0
for i, j in enumerate(np.nditer(original_array)):
    array_list.append([i, floor(j), j - floor(j)]) # Original index, lower bound, roundoff error
    # Calculate the lower sum of values
    lower_sum += floor(j)
# Populate temporary array
temp_array = np.array(array_list)
# Sort temporary array based on roundoff error
temp_array = temp_array[temp_array[:,2].argsort()]
# Calculate difference between expected sum and the lower sum
# This is the number of integers that need to be rounded up from the lower sum
# The sort order (roundoff error) ensures that the value closest to be
# rounded up is at the bottom of the array
# round() first: the float sum may sit just below the intended integer
difference = int(round(expected_sum - lower_sum))
# Add one to the number most likely to round up to eliminate the difference
temp_array_len, _ = temp_array.shape
for i in range(temp_array_len - difference, temp_array_len):
    temp_array[i,1] += 1
# Re-sort the array based on original index
temp_array = temp_array[temp_array[:,0].argsort()]
# Return array to one-dimensional format of original array
array_list = []
for i in range(temp_array_len):
    array_list.append(int(temp_array[i,1]))
new_array = np.array(array_list)
Calculate the sum of the floors and the sum of the numbers.
Round the sum of the numbers and subtract the sum of the floors; the difference is how many ceilings we need (how many +1s we have to hand out).
Sort the array by the difference between each number's ceiling and the number itself, from small to large.
For the first diff elements (diff is how many ceilings we need), set the result to the ceiling of the number. For the others, set the result to the floor.
import java.util.Arrays;

public class Float_Ceil_or_Floor {
    public static int[] getNearlyArrayWithSameSum(double[] numbers) {
        NumWithDiff[] numWithDiffs = new NumWithDiff[numbers.length];
        double sum = 0.0;
        int floorSum = 0;
        for (int i = 0; i < numbers.length; i++) {
            int floor = (int)numbers[i];
            int ceil = floor;
            if (floor < numbers[i]) ceil++; // a number like 4.0 has the same floor and ceiling
            floorSum += floor;
            sum += numbers[i];
            numWithDiffs[i] = new NumWithDiff(i, ceil, floor, ceil - numbers[i]);
        }
        // sort array by its diffWithCeil, smallest first
        Arrays.sort(numWithDiffs, (a, b) -> Double.compare(a.diffWithCeil, b.diffWithCeil));
        int roundSum = (int) Math.round(sum);
        int diff = roundSum - floorSum;
        int[] res = new int[numbers.length];
        for (NumWithDiff n : numWithDiffs) {
            // write each result back to its original position, not its sorted position
            if (diff > 0 && n.floor != n.ceil) {
                res[n.index] = n.ceil;
                diff--;
            } else {
                res[n.index] = n.floor;
            }
        }
        return res;
    }
    public static void main(String[] args) {
        double[] arr = { 1.2, 3.7, 100, 4.8 };
        int[] res = getNearlyArrayWithSameSum(arr);
        for (int i : res) System.out.print(i + " ");
    }
}
class NumWithDiff {
    int index; // position in the original array
    int ceil;
    int floor;
    double diffWithCeil;
    public NumWithDiff(int i, int c, int f, double d) {
        this.index = i;
        this.ceil = c;
        this.floor = f;
        this.diffWithCeil = d;
    }
}
Without minimizing the variance, here's a trivial one:
Sort values from left to right.
Round all down to the next integer.
Let the sum of those integers be K. Increase the N-K rightmost values by 1.
Restore original order.
This obviously satisfies your conditions 1.-4. Alternatively, you could round to the closest integer, and increase N-K of the ones you had rounded down. You can do this greedily by the difference between the original and rounded value, but each run of rounded-down values must only be increased from right to left, to maintain sorted order.
If you can accept a small change in the total while improving the variance, this will probabilistically preserve totals in Python:
import math
import random
# strict < so that a value with no fractional part is never rounded up
integer_list = [int(x) + int(random.random() < math.modf(x)[0]) for x in my_list]
To explain: it rounds all numbers down and adds one with a probability equal to the fractional part, i.e. on average one in ten values of 0.1 will become 1 and the rest 0.
This works for statistical data where you are converting a large number of fractional persons into either 1 person or 0 persons.