The LightGBM predict method with pred_contrib=True returns an array of shape (n_samples, (n_features + 1) * n_classes).
What is the order of data in the second dimension of this array?
In other words, there are two questions:
What is the correct way to reshape this array to use it: shape = (n_samples, n_features + 1, n_classes) or shape = (n_samples, n_classes, n_features + 1)?
In the feature dimension, there are n_features entries, one for each feature, plus a (useless) entry for the contribution not related to any feature. What is the order of these entries: are the feature contributions in entries 1, ..., n_features (in the same order the features appear in the dataset), with the remaining (useless) entry at index 0, or is it something else?
The answers are as follows:
The correct shape is (n_samples, n_classes, n_features + 1).
The feature contributions are in the entries 1,..., n_features in the same order they appear in the dataset, with the remaining (useless) entry at index 0.
The following code shows it convincingly:
import lightgbm, pandas, numpy
params = {'objective': 'multiclass', 'num_classes': 4, 'num_iterations': 10000,
'metric': 'multiclass', 'early_stopping_rounds': 10}
train_df = pandas.DataFrame({'f0': [0, 1, 2, 3] * 50, 'f1': [0, 0, 1] * 66 + [1, 2]}, dtype=float)
val_df = train_df.copy()
train_target = pandas.Series([0, 1, 2, 3] * 50)
val_target = pandas.Series([0, 1, 2, 3] * 50)
train_set = lightgbm.Dataset(train_df, train_target)
val_set = lightgbm.Dataset(val_df, val_target)
model = lightgbm.train(params=params, train_set=train_set, valid_sets=[val_set, train_set])
feature_contribs = model.predict(val_df, pred_contrib=True)
print('Shape of SHAP:', feature_contribs.shape)
# Shape of SHAP: (200, 12)
print('Averages over samples:', numpy.mean(feature_contribs, axis=0))
# Averages over samples: [ 3.99942301e-13 -4.02281771e-13 -4.30029167e+00 -1.90606677e-05
# 1.90606677e-05 -4.04157656e+00 2.24205077e-05 -2.24205077e-05
# -4.04265615e+00 -3.70370401e-15 5.20335728e-18 -4.30029167e+00]
feature_contribs.shape = (200, 4, 3)
print('Mean feature contribs:', numpy.mean(feature_contribs, axis=(0, 1)))
# Mean feature contribs: [ 8.39960111e-07 -8.39960113e-07 -4.17120401e+00]
(In the code above, each printed output appears as a comment on the line or lines following the corresponding statement.)
The explanation is as follows.
I have created a dataset with two features and with labels identical to the second of these features.
I would expect significant contribution from the second feature only.
After averaging the SHAP output over the samples, we get an array of the shape (12,) with nonzero values at the positions 2, 5, 8, 11 (zero-based).
This shows that the correct shape of this array is (4, 3).
After reshaping this way and averaging over the samples and the classes, we get an array of the shape (3,) with the nonzero entry at the end.
This shows that the last entry of this array corresponds to the last feature, which means that the entry at position 0 does not correspond to any feature and the following entries correspond to the features.
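As a small convenience, here is a sketch (my own illustration, not part of the original answer's code) of how the output from the code above can be unpacked under this layout; contribs, other and per_feature are hypothetical names:

n_classes, n_features = 4, 2

contribs = feature_contribs.reshape(-1, n_classes, n_features + 1)
# Following the layout described above: index 0 holds the entry not tied to any
# feature, indices 1..n_features hold the per-feature contributions in dataset
# column order.
other = contribs[:, :, 0]          # shape (n_samples, n_classes)
per_feature = contribs[:, :, 1:]   # shape (n_samples, n_classes, n_features)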
I'm trying to figure out how to solve a problem that seems like a tricky variation of a common algorithmic problem but requires additional logic to handle specific requirements.
Given a list of coins and an amount, I need to count the total number of possible ways to extract the given amount using an unlimited supply of the available coins (this is the classical change-making problem, https://en.wikipedia.org/wiki/Change-making_problem, easily solved using dynamic programming) that also satisfy some additional requirements:
the extracted coins can be split into two sets of equal size (but not necessarily of equal sum);
the order of elements inside each set doesn't matter, but the order of the sets does.
Examples
Amount of 6 euros and coins [1, 2]: solutions are 4
[(1,1), (2,2)]
[(1,1,1), (1,1,1)]
[(2,2), (1,1)]
[(1,2), (1,2)]
Amount of 8 euros and coins [1, 2, 6]: solutions are 7
[(1,1,2), (1,1,2)]
[(1,2,2), (1,1,1)]
[(1,1,1,1), (1,1,1,1)]
[(2), (6)]
[(1,1,1), (1,2,2)]
[(2,2), (2,2)]
[(6), (2)]
So far I have tried different approaches, but the only way I found was to collect all the possible solutions (using dynamic programming) and then filter out the non-splittable ones (those with an odd number of coins) and the duplicates. I'm quite sure there is a combinatorial way to calculate the total number of duplicates, but I can't figure it out.
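For reference, a minimal brute-force sketch of the approach described above (enumerate coin multisets, keep those with an even number of coins, and count the distinct ordered equal-size splits); the function name is mine and this is only meant to reproduce the example counts:

from itertools import combinations

def count_splittable(amount, coins):
    coins = sorted(coins)

    def multisets(target, start):
        # yield non-decreasing tuples of coins summing to target
        if target == 0:
            yield ()
            return
        for i in range(start, len(coins)):
            if coins[i] <= target:
                for rest in multisets(target - coins[i], i):
                    yield (coins[i],) + rest

    total = 0
    for combo in multisets(amount, 0):
        if len(combo) % 2 == 0:
            # each distinct multiset usable as the first half gives one ordered split
            halves = set(tuple(sorted(h)) for h in combinations(combo, len(combo) // 2))
            total += len(halves)
    return total

print(count_splittable(6, [1, 2]))     # 4
print(count_splittable(8, [1, 2, 6]))  # 7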
(The following method first enumerates partitions. My other answer generates the assignments in a bottom-up fashion.) If you'd like to count the splits of the coin exchange by coin count, and exclude redundant assignments of coins to each party (for example, splitting 1 + 2 + 2 + 1 into two parts of equal cardinality can only be done as (1,1) | (2,2), (2,2) | (1,1) or (1,2) | (1,2), and element order within each part does not matter), we can rely on an enumeration of partitions where order is disregarded.
However, we would need to know the multiset of elements in each partition (or an aggregate of similar ones) in order to count the possibilities of dividing them in two. For example, to count the ways to split 1 + 2 + 2 + 1, we would first count how many of each coin we have:
Python code:
def partitions_with_even_number_of_parts_as_multiset(n, coins):
    results = []

    def C(m, n, s, p):
        # m = number of coin types still available, n = remaining amount,
        # s = multiset counts collected so far, p = parity flag (True if the
        # number of parts so far is odd)
        if n < 0 or m <= 0:
            return
        if n == 0:
            if not p:
                results.append(s)
            return
        C(m - 1, n, s, p)
        _s = s[:]
        _s[m - 1] += 1
        C(m, n - coins[m - 1], _s, not p)

    C(len(coins), n, [0] * len(coins), False)
    return results
Output:
=> partitions_with_even_number_of_parts_as_multiset(6, [1,2,6])
=> [[6, 0, 0], [2, 2, 0]]
(the second entry, [2, 2, 0], is the one that represents two 1's and two 2's)
Now since we are counting the ways to choose half of these, we need to find the coefficient of x^2 in the polynomial multiplication
(x^2 + x + 1) * (x^2 + x + 1) = x^4 + 2x^3 + 3x^2 + 2x + 1
which represents the three ways to choose two from the multiset count [2,2]:
2,0 => 1,1
0,2 => 2,2
1,1 => 1,2
In Python, we can use numpy.polymul to multiply polynomial coefficients. Then we look up the appropriate coefficient in the result.
For example:
import numpy
def count_split_partitions_by_multiset_count(multiset):
    # Python 2 code (xrange, integer division)
    coefficients = (multiset[0] + 1) * [1]
    for i in xrange(1, len(multiset)):
        coefficients = numpy.polymul(coefficients, (multiset[i] + 1) * [1])
    return coefficients[sum(multiset) / 2]
Output:
=> count_split_partitions_by_multiset_count([2,2,0])
=> 3
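Putting the two functions above together (using the Python 2 definitions as given), the total for the question's second example can then be computed like this; the result, 7, matches the expected count:

total = sum(count_split_partitions_by_multiset_count(m)
            for m in partitions_with_even_number_of_parts_as_multiset(8, [1, 2, 6]))
print(total)  # 7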
Here is a table implementation and a little elaboration on algrid's beautiful answer. This produces an answer for f(500, [1, 2, 6, 12, 24, 48, 60]) in about 2 seconds.
The simple declaration, C(n, k, S) = sum(C(n - s_i, k - 1, S[i:])), means adding up all the ways to reach the current sum n using exactly k coins, where s_i ranges over the coins in S. Then, if we split n into all the ways it can be partitioned in two, we can just add all the ways each of those parts can be made from the same number, k, of coins.
The beauty of fixing the subset of coins we choose from to a diminishing list means that any arbitrary combination of coins will only be counted once - it will be counted in the calculation where the leftmost coin in the combination is the first coin in our diminishing subset (assuming we order them in the same way). For example, the arbitrary subset [6, 24, 48], taken from [1, 2, 6, 12, 24, 48, 60], would only be counted in the summation for the subset [6, 12, 24, 48, 60] since the next subset, [12, 24, 48, 60] would not include 6 and the previous subset [2, 6, 12, 24, 48, 60] has at least one 2 coin.
Python code:
import time

def f(n, coins):
    t0 = time.time()
    min_coins = min(coins)
    # m[_n][k][i] = number of ways to make sum _n with exactly k coins,
    # using only the coin subset coins[:i + 1]
    m = [[[0] * len(coins) for k in xrange(n / min_coins + 1)] for _n in xrange(n + 1)]
    # Initialize base case
    for i in xrange(len(coins)):
        m[0][0][i] = 1
    for i in xrange(len(coins)):
        for _i in xrange(i + 1):
            for _n in xrange(coins[_i], n + 1):
                for k in xrange(1, _n / min_coins + 1):
                    m[_n][k][i] += m[_n - coins[_i]][k - 1][_i]
    # pair up halves that use the same number of coins, k
    result = 0
    for a in xrange(1, n + 1):
        b = n - a
        for k in xrange(1, n / min_coins + 1):
            result = result + m[a][k][len(coins) - 1] * m[b][k][len(coins) - 1]
    total_time = time.time() - t0
    return (result, total_time)

print f(500, [1, 2, 6, 12, 24, 48, 60])
I'm looking for a solution to my problem. Say I have a number X; now I want to generate 20 random numbers whose sum equals X, but I want those random numbers to have some entropy in them. So for example, if X = 50, the algorithm should generate
3
11
0
6
19
7
etc. The sum of the given numbers should equal 50.
Is there any simple way to do that?
Thanks
Simple way:
Generate a random number between 1 and X: call it R1;
subtract R1 from X, then generate a random number between 1 and (X - R1): call it R2. Repeat the process until the Ri add up to X, i.e. (X - Rn) is zero. Note: each consecutive number Ri will tend to be smaller than the earlier ones. If you want the final sequence to look more random, simply permute the resulting Ri numbers; for example, if for X = 50 you generate an array like 22, 11, 9, 5, 2, 1, permute it to get something like 9, 22, 2, 11, 1, 5. You can also put a limit on how large any random number can be.
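A minimal Python sketch of this procedure (my wording and names, not the answerer's code); note that, as described, it does not guarantee a fixed count of 20 numbers:

import random

def random_split(x):
    parts = []
    remaining = x
    while remaining > 0:
        r = random.randint(1, remaining)  # R1, R2, ... as described above
        parts.append(r)
        remaining -= r
    random.shuffle(parts)  # permute so later values aren't systematically smaller
    return parts

print(random_split(50))  # e.g. [9, 22, 2, 11, 1, 5]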
One fairly straightforward way to get k random values that sum to N is to create an array of size k+1, add values 0 and N, and fill the rest of the array with k-1 randomly generated values between 1 and N-1. Then sort the array and take the differences between successive pairs.
Here's an implementation in Ruby:
def sum_k_values_to_n(k = 20, n = 50)
  a = Array.new(k + 1) { 1 + rand(n - 1) }
  a[0] = 0
  a[-1] = n
  a.sort!
  (1..(a.length - 1)).collect { |i| a[i] - a[i-1] }
end
p sum_k_values_to_n(3, 10) # produces, e.g., [2, 3, 5]
p sum_k_values_to_n # produces, e.g., [5, 2, 3, 1, 6, 0, 4, 4, 5, 0, 2, 1, 0, 5, 7, 2, 1, 1, 0, 1]
I bumped into this question and I am not sure if my solution is optimal.
Problem
Given N weighted (Wi) and possibly overlapping intervals (representing meeting schedules), find the minimum number and capacity of meeting rooms needed to conduct all meetings.
Example
|---10------|. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .|---------8---------|
|------8-----| |----------10-----------|
|--------6-------|
For the above schedule, we would need two meeting rooms with capacities of 10 and 10. (Am I correct?)
My Solution
Take a set of rooms and traverse the intervals from the left. If a meeting room with enough capacity is available, use it; if none meets the criteria, make a new room or grow an existing room to the new capacity.
Example:
Start of 10 - { 10 }
Start of 8 - { 10, 8 }
End of 10 - { 10-free, 8 }
Start of 6 - { 10, 8 }
End of 8 - {10, 8-free }
Start of 10 = { 10, 8+=2 } OR {10, 10 }
and so on.....
This is essentially greedy.
Can someone prove this non-optimal?
What's the solution if this is non-optimal? DP?
I believe that this problem is equivalent to "Minimum Number of Platforms Required for a Railway/Bus Station" problem.
This article http://www.geeksforgeeks.org/minimum-number-platforms-required-railwaybus-station/ explains well how to approach it.
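A minimal sketch of the sort-and-sweep idea from that article (my own illustration; times are plain HHMM integers, and this only gives the number of rooms/platforms, not their capacities):

def min_platforms(arrivals, departures):
    arrivals, departures = sorted(arrivals), sorted(departures)
    i = j = need = best = 0
    while i < len(arrivals):
        if arrivals[i] <= departures[j]:  # another interval opens before the earliest one closes
            need += 1
            best = max(best, need)
            i += 1
        else:                             # the earliest open interval has closed
            need -= 1
            j += 1
    return best

print(min_platforms([900, 940, 950, 1100, 1500, 1800],
                    [910, 1200, 1120, 1130, 1900, 2000]))  # 3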
Intuition
I will give it a try. The naive approach is to enumerate all possible solutions and pick the best one. With this in mind, finding k rooms which can accommodate n meetings is equivalent to finding a k-way partition of n points. An example of a 2-way partition of 5 meetings is [ 0,2,4 ] and [ 1,3 ] in the OP example:
|---0------| |---------4---------|
|------1-----| |----------3-----------|
|--------2-------|
So the basic idea is to enumerate all k-way partitions of the n meetings, with the constraint that two overlapping meetings cannot belong to the same cluster. For example, [ 0,1,2 ] and [ 3,4 ] is not a valid partition because meetings [ 0,1,2 ] cannot all take place in the same room; the same goes for meetings [ 3,4 ]. Fortunately, the constraint is easy to implement with a recursive approach.
Algorithm
With Python, it looks like this:
def kWay( A, k, overlap ) :
    """
    A = list of meeting IDs, k = number of rooms,
    overlap[ meeting ID m ] = set of meetings overlapping with m
    """
    if k == 1 : # only 1 room: all meetings go there
        yield [ A[:] ]
    elif k == len(A) : # n rooms and n meetings: put 1 meeting per room
        yield [ [a] for a in A ]
    else :
        for partition in kWay( A[1:], k, overlap ) : # add new meeting to one existing room
            for i, ci in enumerate( partition ) :
                isCompatible = all( A[0] not in overlap[x] for x in ci ) # avoid 2 overlapping meetings in the same room
                res = partition[:i] + [ ci + [ A[0] ] ] + partition[ i+1: ]
                if isCompatible :
                    yield res
        for partition in kWay( A[1:], k-1, overlap ) : # add new meeting to a new room
            isValid = ( set(A[1:]) & set.union( * ( overlap[a] for a in A[ 1: ] ) ) == set() ) # avoid 2 overlapping meetings in the same room
            if (k-1>1) or ( k-1==1 and isValid ) :
                yield partition + [ [ A[0] ] ]
This looks a bit complicated, but it's actually quite simple once you realize that it is simply the recursive algorithm for k-way partitioning plus 2 extra lines to guarantee that we only consider valid partitions.
Solution of OP example
Ok now let's prepare the input data using the OP example:
import collections
n = 5
k = 2
#
A = range(n)
# prepare overlap dictionary
pairs = [ (0,1), (1,2), (2,3), (3,4) ] # overlapping meetings
size = dict( ( (0,10), (1,8), (2,6) , (3,10), (4,8) ) )
overlap = collections.defaultdict(set)
for (i,j) in pairs :
    overlap[i].add(j)
    overlap[j].add(i)
defaultdict(<type 'set'>, {0: set([1]), 1: set([0, 2]), 2: set([1, 3]), 3: set([2, 4]), 4: set([3])})
{0: 10, 1: 8, 2: 6, 3: 10, 4: 8}
Now we just iterate over the valid 2-way partitions and print the room sizes. There is only one valid partition, so this is our solution:
for partition in kWay( A, k, overlap ) :
    print partition, [ max( size[x] for x in c ) for c in partition ]
[[3, 1], [4, 2, 0]] [10, 10]
Ok, so meetings 1 and 3 go in a room of size 10, and meetings 0, 2 and 4 go in a room of size 10.
A slightly more complicated example
But there was only one valid 2-way partition, so of course this was also the optimal solution. How boring! Let's add a new meeting 5 and a new room to the OP example to make it more interesting :
|---0------| |---5---| |---------4---------|
|------1-----| |----------3-----------|
|--------2-------|
Corresponding input data:
n = 6
k = 3
#
A = range(n)
pairs = [ (0,1), (1,2), (2,3), (3,4), (5,2), (5,3) ] # overlapping meetings
size = dict( ( (0,10), (1,8), (2,6) , (3,10), (4,8), (5,2) ) )
overlap = collections.defaultdict(set)
for (i,j) in pairs :
    overlap[i].add(j)
    overlap[j].add(i)
defaultdict(<type 'set'>, {0: set([1]), 1: set([0, 2]), 2: set([1, 3, 5]), 3: set([2, 4, 5]), 4: set([3]), 5: set([2, 3])})
{0: 10, 1: 8, 2: 6, 3: 10, 4: 8, 5: 2}
And the result:
for partition in kWay( A, k, overlap ) :
    print partition, [ max( size[x] for x in c ) for c in partition ]
[[3, 1], [4, 2, 0], [5]] [10, 10, 2]
[[3, 1], [4, 2], [5, 0]] [10, 8, 10]
[[3, 0], [4, 2], [5, 1]] [10, 8, 8]
[[3], [4, 2, 0], [5, 1]] [10, 10, 8]
[[4, 5, 1], [3, 0], [2]] [8, 10, 6]
[[4, 5, 1], [3], [2, 0]] [8, 10, 10]
[[4, 5, 0], [3, 1], [2]] [10, 10, 6]
[[4, 5], [3, 1], [2, 0]] [8, 10, 10]
The optimal 3-way partition is [[3, 1], [4, 2, 0], [5]] and the optimal room sizes are [10, 10, 2]. You can also get the minimum size of all rooms directly:
min( sum( [ max( size[x] for x in c ) for c in partition ] ) for partition in kWay( A, k, overlap ) )
22
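If you also want the partition that attains this minimum rather than just its value, something along these lines works with the definitions above:

best = min( kWay( A, k, overlap ),
            key=lambda partition : sum( max( size[x] for x in c ) for c in partition ) )
# best is [[3, 1], [4, 2, 0], [5]], with total capacity 22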
Consider this scenario:
(m1) |-3-|
(m2) |--2--|
(m3) |--1--|
(m4) |-1-|
(m5) |-2-|
Your solution will proceed as such:
{3} (First room created)
{3, 2} (Two meetings at same time, second room needed)
{3, 2, 1} (Three meetings at same time, third room needed)
{3, 2, 1} (m1 is over so m4 goes into the 3-room)
{3, 2, 1, 2} (Four meetings at same time, fourth room needed, create room at same size as newest meeting)
This solution has a cumulative capacity of 8.
Now consider this solution: {3, 2, 1, 1}. It has a cumulative capacity of 7.
At step (4) above, m4 will go into the unoccupied 1-room and the 3-room is still open. Thus, that is where m5 will go.
Assumptions Made
The optimal solution is first ranked on the number of rooms: it will have the lowest number of rooms. The second criterion is that it will have the lowest cumulative capacity: the sum of the capacities of each room.
As your solution is greedy, when you have to create a room you will create one of the size of the meeting being evaluated.
Two meetings cannot be in the same room at one time, regardless of size.
Algorithm Alteration
Update: I just realized that even with this alteration creating a room can still lead to sub-optimal solutions. The reason is that one could resize existing rooms before creating a new room.
As an example, say we have four meetings in four rooms.
m1 (size 4) is in a 4-room
m2 (size 2) is in a 4-room
m3 (size 1) is in a 2-room
m4 (size 1) is in a 2-room
And we seek to add m5 (size 5). My proposed algorithm alteration would create a new 5-room, adding 5 to the cumulative capacity. However, we could resize m2's room to be a 5-room, have m5 go there, and create a new room for m2 of size 2. This would only add 2 to the cumulative capacity.
One may wonder why not put m2 into one of 2-rooms (displacing m3) and create a new 1-room. Resizing rooms is more difficult as we can't guarantee that room will be open when the meeting that needs it starts. Adding rooms is easier because then that room will always have been there; it wasn't being used since we just created it at this step in the algorithm.
Sub-optimal Algorithm Alteration
As noted above, this is proven to be sub-optimal but I'm keeping it here until I can think of a better alternative.
To account for the scenario above you will need to do some extra work anytime you have to create a new room:
Find a list of all meetings currently active (including the one you're currently evaluating).
Start at the largest meeting and assign each meeting to a room.
When you reach a meeting that cannot be assigned, that meeting's size is the size of the room you must create.
Thus, in the example above, this alteration comes into play at step 5 when a new room needs to be created. Explanation per step above:
All meetings currently active: {m2, m3, m4, m5}. For the record, current rooms are {3, 2, 1}
Starting with largest, assign each meeting to a room {m2 goes to 3-room, m5 goes to 2-room, m3 goes to 1-room}
m4 is stuck without a room. Thus we must create a room for it. m4 is size 1 so the new room is also size 1.
To find the minimum number and capacity of meeting rooms needed to conduct all meetings, you first need to schedule those meetings optimally into the rooms (with a score function that minimizes the number and capacity of rooms). That scheduling (similar to course scheduling) is NP-complete or NP-hard, which implies that your problem is too.
That, in turn, implies that there's no known algorithm for your problem that is both optimal and scales out. Greedy algorithms (including your example) won't be consistently optimal (or even near optimal if you have more constraints) - but at least they'll scale :) To get even better results (if needed), look into optimization algorithms, such as metaheuristics.
import java.util.*;

class Codechef
{
    //Sorting by exchange
    public static int[] Sort(int arr[], int n)
    {
        int temp = 0;
        for (int i = 0; i < n - 1; ++i)
        {
            for (int j = i + 1; j < n; ++j)
            {
                if (arr[i] > arr[j])
                {
                    temp = arr[i];
                    arr[i] = arr[j];
                    arr[j] = temp;
                }
            }
        }
        return arr;
    }

    public static void main(String[] args) throws java.lang.Exception
    {
        Scanner sc = new Scanner(System.in);
        int n = 0; //n : Total number of trains arriving on the platform
        n = sc.nextInt();
        String UserInp;
        String[] inp = new String[n]; //inp[] : Accepting the user input ....Arrival time#Departure time
        int[] Ar = new int[n];
        int[] Dp = new int[n];
        for (int i = 0; i < n; ++i)
        {
            UserInp = sc.next();
            inp[i] = UserInp;
        }
        System.out.println("Displaying the input:\n");
        for (int i = 0; i < n; ++i)
        {
            System.out.println("inp[i] : " + inp[i]);
        }
        for (int i = 0; i < n; ++i)
        {
            String temp = inp[i];
            String a = temp.substring(0, 2);
            String b = temp.substring(3, 5);
            String c = temp.substring(6, 8);
            String d = temp.substring(9);
            System.out.println("a : " + a);
            System.out.println("b : " + b);
            String x = a + b;
            Ar[i] = Integer.parseInt(x);
            System.out.println("x : " + x);
            System.out.println("c : " + c);
            System.out.println("d : " + d);
            String y = c + d;
            Dp[i] = Integer.parseInt(y);
            System.out.println("y : " + y);
        }
        System.out.println("Displaying the arrival time : ");
        for (int i = 0; i < n; ++i)
        {
            System.out.println(Ar[i]);
        }
        System.out.println("Displaying the departure time : ");
        for (int i = 0; i < n; ++i)
        {
            System.out.println(Dp[i]);
        }
        Ar = Sort(Ar, n);
        System.out.println("Displaying arrival time in ascending order :");
        for (int i = 0; i < n; ++i)
        {
            System.out.println(Ar[i]);
        }
        Dp = Sort(Dp, n);
        System.out.println("Displaying departure time in ascending order :");
        for (int i = 0; i < n; ++i)
        {
            System.out.println(Dp[i]);
        }
        int count = 0;
        int need = 0;
        int i = 0, j = 0;
        while (i < n && j < n)
        {
            // '<=' so that an arrival at the same time as a departure still needs
            // a separate platform (and the loop cannot stall when the times are equal)
            if (Ar[i] <= Dp[j])
            {
                ++need;
                if (need > count)
                {
                    count = need;
                }
                ++i;
            }
            else
            {
                --need;
                ++j;
            }
            if (need == -1)
            {
                break;
            }
        }
        if (need != -1)
        {
            System.out.println("Required answer : " + count);
        }
        else
        {
            System.out.println("Invalid input");
        }
    }
}
Input:
6
09:00#09:10
12:00#09:40
09:50#11:20
11:00#11:30
15:00#19:00
18:00#20:00
Output:
Displaying the input:
inp[i] : 09:00#09:10
inp[i] : 12:00#09:40
inp[i] : 09:50#11:20
inp[i] : 11:00#11:30
inp[i] : 15:00#19:00
inp[i] : 18:00#20:00
a : 09
b : 00
x : 0900
c : 09
d : 10
y : 0910
a : 12
b : 00
x : 1200
c : 09
d : 40
y : 0940
a : 09
b : 50
x : 0950
c : 11
d : 20
y : 1120
a : 11
b : 00
x : 1100
c : 11
d : 30
y : 1130
a : 15
b : 00
x : 1500
c : 19
d : 00
y : 1900
a : 18
b : 00
x : 1800
c : 20
d : 00
y : 2000
Displaying the arrival time :
900
1200
950
1100
1500
1800
Displaying the departure time :
910
940
1120
1130
1900
2000
Displaying arrival time in ascending order :
900
950
1100
1200
1500
1800
Displaying departure time in ascending order :
910
940
1120
1130
1900
2000
Invalid input
The above is a detailed solution for the approach stated in the link below:
http://www.geeksforgeeks.org/minimum-number-platforms-required-railwaybus-station/
Here is my solution in Java.
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

class Meeting{
    LocalTime start;
    LocalTime end;
    Meeting(LocalTime start, LocalTime end){
        this.start = start;
        this.end = end;
    }
}

//note: assumes `list` is sorted by meeting start time
public static int meetingRoom(List<Meeting> list){
    //max: keep the max of rooms ever occupied
    //occupied: rooms occupied so far
    int max = 1, occupied = 1;
    //rooms currently in use
    List<Meeting> rooms = new ArrayList<Meeting>();
    rooms.add(list.get(0));
    for(int i = 1; i < list.size(); i++){
        Meeting current = list.get(i);
        int roomSize = rooms.size();
        //check all previous rooms and release the finished ones
        for(int j = 0; j < roomSize; j++){
            if(j >= rooms.size()) break;
            Meeting previous = rooms.get(j);
            if(current.start.compareTo(previous.end) >= 0){
                rooms.remove(j);
            }
        }
        rooms.add(current);
        //when all the rooms once occupied have been released, reset occupied
        if(rooms.size() == 1){
            max = Math.max(occupied, max);
            occupied = 1;
        }else{
            occupied = Math.max(occupied, rooms.size());
        }
    }
    //the rooms added since the last reset haven't been checked yet
    return Math.max(occupied, max);
}
I have an array of non-negative values. I want to build an array of values whose sum is 20 so that they are proportional to the first array.
This would be an easy problem, except that I want the proportional array to sum to exactly 20, compensating for any rounding error.
For example, the array
input = [400, 400, 0, 0, 100, 50, 50]
would yield
output = [8, 8, 0, 0, 2, 1, 1]
sum(output) = 20
However, most cases are going to have a lot of rounding errors, like
input = [3, 3, 3, 3, 3, 3, 18]
naively yields
output = [1, 1, 1, 1, 1, 1, 10]
sum(output) = 16 (ouch)
Is there a good way to apportion the output array so that it adds up to 20 every time?
There's a very simple answer to this question: I've done it many times. After each assignment into the new array, you reduce the values you're working with as follows:
Call the first array A, and the new, proportional array B (which starts out empty).
Call the sum of A elements T
Call the desired sum S.
For each element of the array (i) do the following:
a. B[i] = round(A[i] / T * S). (rounding to nearest integer, penny or whatever is required)
b. T = T - A[i]
c. S = S - B[i]
That's it! Easy to implement in any programming language or in a spreadsheet.
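For instance, a minimal Python sketch of these steps (hypothetical function name, not the answerer's code; the exact rounding convention may differ from your language's round):

def apportion(values, target_sum):
    remaining_total = sum(values)       # T
    remaining_target = target_sum       # S
    result = []                         # B
    for v in values:
        # step a: round the proportional share against what is still left
        share = int(round(float(v) / remaining_total * remaining_target)) if remaining_total else 0
        result.append(share)
        remaining_total -= v            # step b
        remaining_target -= share       # step c
    return result

print(apportion([3, 3, 3, 3, 3, 3, 18], 20))  # e.g. [2, 2, 2, 2, 2, 1, 9]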
The solution is optimal in that the resulting array's elements will never be more than 1 away from their ideal, non-rounded values. Let's demonstrate with your example:
T = 36, S = 20. B[1] = round(A[1] / T * S) = 2. (ideally, 1.666....)
T = 33, S = 18. B[2] = round(A[2] / T * S) = 2. (ideally, 1.666....)
T = 30, S = 16. B[3] = round(A[3] / T * S) = 2. (ideally, 1.666....)
T = 27, S = 14. B[4] = round(A[4] / T * S) = 2. (ideally, 1.666....)
T = 24, S = 12. B[5] = round(A[5] / T * S) = 2. (ideally, 1.666....)
T = 21, S = 10. B[6] = round(A[6] / T * S) = 1. (ideally, 1.666....)
T = 18, S = 9. B[7] = round(A[7] / T * S) = 9. (ideally, 10)
Notice that, comparing every value in B with its ideal value in parentheses, the difference is never more than 1.
It's also interesting to note that rearranging the elements in the array can result in different corresponding values in the resulting array. I've found that arranging the elements in ascending order is best, because it results in the smallest average percentage difference between actual and ideal.
Your problem is similar to proportional representation, where you want to share N seats (in your case 20) among parties proportionally to the votes they obtain, in your case [3, 3, 3, 3, 3, 3, 18].
There are several methods used in different countries to handle the rounding problem. My code below uses the Hagenbach-Bischoff quota method used in Switzerland, which basically allocates the seats remaining after an integer division by (N+1) to parties which have the highest remainder:
def proportional(nseats, votes):
    """assign n seats proportionally to votes using the Hagenbach-Bischoff quota
    :param nseats: int number of seats to assign
    :param votes: iterable of int or float weighting each party
    :result: list of ints seats allocated to each party
    """
    quota = sum(votes) / (1. + nseats)  # force float
    frac = [vote / quota for vote in votes]
    res = [int(f) for f in frac]
    n = nseats - sum(res)  # number of seats remaining to allocate
    if n == 0: return res  # done
    if n < 0: return [min(x, nseats) for x in res]  # see siamii's comment
    # give the remaining seats to the n parties with the largest remainder
    remainders = [ai - bi for ai, bi in zip(frac, res)]
    limit = sorted(remainders, reverse=True)[n - 1]
    # the n parties with a remainder larger than limit get an extra seat
    for i, r in enumerate(remainders):
        if r >= limit:
            res[i] += 1
            n -= 1  # attempt to handle perfect equality
            if n == 0: return res  # done
    raise  # should never happen
However, this method doesn't always give the same number of seats to parties with perfectly equal votes, as in your case:
proportional(20,[3, 3, 3, 3, 3, 3, 18])
[2,2,2,2,1,1,10]
You have set 3 incompatible requirements. An integer-valued array proportional to [1,1,1] cannot be made to sum to exactly 20. You must choose to break one of the "sum to exactly 20", "proportional to input", and "integer values" requirements.
If you choose to break the requirement for integer values, then use floating point or rational numbers. If you choose to break the exact sum requirement, then you've already solved the problem. Choosing to break proportionality is a little trickier. One approach you might take is to figure out how far off your sum is, and then distribute corrections randomly through the output array. For example, if your input is:
[1, 1, 1]
then you could first make it sum as well as possible while still being proportional:
[7, 7, 7]
and since 20 - (7+7+7) = -1, choose one element to decrement at random:
[7, 6, 7]
If the error was 4, you would choose four elements to increment.
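A small sketch of that idea in Python (illustrative names; it assumes the error is no larger than the array length):

import random

def proportional_with_random_correction(values, target=20):
    scale = float(target) / sum(values)
    out = [int(round(v * scale)) for v in values]
    error = target - sum(out)                    # how far off the rounded sum is
    step = 1 if error > 0 else -1
    for i in random.sample(range(len(out)), abs(error)):
        out[i] += step                           # distribute corrections randomly
    return out

print(proportional_with_random_correction([1, 1, 1]))  # e.g. [7, 6, 7]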
A naïve solution that doesn't perform well, but will provide the right result...
Write an iterator that, given an array of eight integers (candidate) and the input array, outputs the index of the element that is farthest away from being proportional to the others (pseudocode):
function next_index(candidate, input)
    // Calculate weights (skip entries where input[i] is 0)
    for i in 1 .. 8
        w[i] = candidate[i] / input[i]
    end for
    // find the smallest weight
    min = infinity
    min_index = 1
    for i in 1 .. 8
        if w[i] < min then
            min = w[i]
            min_index = i
        end if
    end for
    return min_index
end function
Then just do this
result = [0, 0, 0, 0, 0, 0, 0, 0]
result[next_index(result, input)]++ for 1 .. 20
If there is no optimal solution, it'll skew towards the beginning of the array.
Using the approach above, you can reduce the number of iterations by rounding down (as you did in your example) and then just use the approach above to add what has been left out due to rounding errors:
result = <<approach using rounding down>>
while sum(result) < 20
result[next_index(result, input)]++
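The same idea translated to a small Python sketch (my names; zero inputs are skipped when choosing the index to bump):

def apportion_by_topping_up(values, target=20):
    total = float(sum(values))
    result = [int(v * target / total) for v in values]   # round down first
    while sum(result) < target:
        # bump the element currently furthest below its proportional share
        i = min((idx for idx, v in enumerate(values) if v > 0),
                key=lambda idx: result[idx] / float(values[idx]))
        result[i] += 1
    return result

print(apportion_by_topping_up([3, 3, 3, 3, 3, 3, 18]))  # [2, 2, 2, 2, 1, 1, 10]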
So the answers and comments above were helpful... particularly the decreasing sum comment from #Frederik.
The solution I came up with takes advantage of the fact that for an input array v, sum(v_i * 20) is divisible by sum(v). So for each value in v, I multiply by 20 and divide by the sum. I keep the quotient and accumulate the remainder. Whenever the accumulator is greater than sum(v), I add one to the value. That way I'm guaranteed that all the remainders get rolled into the results.
Is that legible? Here's the implementation in Python:
def proportion(values, total):
    # set up by getting the sum of the values and starting
    # with an empty result list and accumulator
    sum_values = sum(values)
    new_values = []
    acc = 0
    for v in values:
        # for each value, find quotient and remainder
        q, r = divmod(v * total, sum_values)
        if acc + r < sum_values:
            # if the accumulator plus remainder is too small, just add and move on
            acc += r
        else:
            # we've accumulated enough to go over sum(values), so add 1 to result
            if acc > r:
                # add to previous
                new_values[-1] += 1
            else:
                # add to current
                q += 1
            acc -= sum_values - r
        # save the new value
        new_values.append(q)
    # accumulator is guaranteed to be zero at the end
    print new_values, sum_values, acc
    return new_values
(I added an enhancement that if the accumulator > remainder, I increment the previous value instead of the current value)
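For completeness, calling the function above on the two arrays from the question (Python 2, like the code above) gives exactly-summing results; the first matches the expected output stated earlier:

proportion([400, 400, 0, 0, 100, 50, 50], 20)  # prints [8, 8, 0, 0, 2, 1, 1] 1000 0
proportion([3, 3, 3, 3, 3, 3, 18], 20)         # sums to exactly 20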