Algorithms to deal with apportionment problems

I need an algorithm, or technique, or any guidance to optimize the following problem:
I have two companies:
Company A has 324 employees
Company B has 190 employees
The total of employees (A+B) is 514. I need to randomly select 28% of these 514 employees.
Ok, so let's do it: 28% of 514 is 143.92; Oh... this is bad, we are dealing with people here, so we cannot have decimal places. Ok then, I'll try rounding that up or down.
If I round down: 143 is 27.82101167%, which is not good, since I must have at least 28%, so I must round up to 144.
So now I know that 144 employees must be selected.
The main problem comes now: I need to work out what percentage to apply to each company so that the total comes to 144. How do I do that while keeping each company's percentage as close as possible to 28%?
I'll exemplify:
If I just apply 28% for each company I get:
Company A has 324 employees: 0.28 * 324 = 90.72
Company B has 190 employees: 0.28 * 190 = 53.2
Again, I end up with decimal places. So I must figure out which ones to round up and which ones to round down to get 144 in total.
Note: For this example I only used two companies, but in the real problem I have 30 companies.

There are many methods to perform apportionment, and no objectively best method.
The following is in terms of states and seats rather than companies and people. Credit probably goes to Dr. Larry Bowen who is cited on the base site for the first link.
Hamilton’s Method
Also known as the Method of Largest Remainders and sometimes as Vinton's Method.
Procedure:
Calculate the Standard Divisor.
Calculate each state’s Standard Quota.
Initially assign each state its Lower Quota.
If there are surplus seats, give them, one at a time, to states in descending order of the fractional parts of their Standard Quota.
Here, the Standard Divisor can be found by dividing the total population (the sum of the population of each company) by the number of people you want to sample (144 in this case). The Standard Quota is the company's population divided by the Standard Divisor. The Lower Quota is this value rounded down. However, this method has some flaws.
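To make the procedure concrete, here is a minimal Python sketch of Hamilton's method as described above (the function name and the use of the question's example numbers are my own):

import math

def hamilton(populations, seats):
    # Hamilton / largest-remainders apportionment.
    standard_divisor = sum(populations) / seats            # Standard Divisor
    quotas = [p / standard_divisor for p in populations]   # Standard Quotas
    allocation = [math.floor(q) for q in quotas]           # Lower Quotas
    surplus = seats - sum(allocation)
    # Hand out the surplus seats in descending order of fractional part.
    order = sorted(range(len(quotas)),
                   key=lambda i: quotas[i] - allocation[i],
                   reverse=True)
    for i in order[:surplus]:
        allocation[i] += 1
    return allocation

print(hamilton([324, 190], 144))   # -> [91, 53]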
Problems:
The Alabama Paradox: An increase in the total number of seats to be apportioned causes a state to lose a seat.
The Population Paradox: An increase in a state’s population can cause it to lose a seat.
The New States Paradox: Adding a new state with its fair share of seats can affect the number of seats due other states.
This is probably the most simple method to implement. Below are some other methods with their accompanying implementations and drawbacks.
Jefferson’s Method
Also known as the Method of Greatest Divisors and in Europe as the Method of d'Hondt or the Hagenbach-Bischoff Method.
Procedure:
Calculate the Standard Divisor.
Calculate each state’s Standard Quota.
Initially assign each state its Lower Quota.
Check to see if the sum of the Lower Quotas is equal to the correct number of seats to be apportioned.
If the sum of the Lower Quotas is equal to the correct number of seats to be apportioned, then apportion to each state the number of seats equal to its Lower Quota.
If the sum of the Lower Quotas is NOT equal to the correct number of seats to be apportioned, then, by trial and error, find a number MD, called the Modified Divisor, to use in place of the Standard Divisor, such that when each state's Modified Quota MQ (the state's population divided by MD instead of SD) is rounded DOWN, the rounded-down Modified Quotas sum to exactly the number of seats to be apportioned. (Note: the MD will always be smaller than the Standard Divisor.) These rounded-down Modified Quotas are sometimes called Modified Lower Quotas. Apportion each state its Modified Lower Quota.
Problem:
Violates the Quota Rule. (However, it can only violate Upper Quota—never Lower Quota.)
Webster’s Method
Also known as the Webster-Willcox Method as well as the Method of Major Fractions.
Procedure:
Calculate the Standard Divisor.
Calculate each state’s Standard Quota.
Initially assign a state its Lower Quota if the fractional part of its Standard Quota is less than 0.5. Initially assign a state its Upper Quota if the fractional part of its Standard Quota is greater than or equal to 0.5. [In other words, round down or up based on the arithmetic mean (average).]
Check to see if the sum of the Quotas (Lower and/or Upper from Step 3) is equal to the correct number of seats to be apportioned.
If the sum of the Quotas (Lower and/or Upper from Step 3) is equal to the correct number of seats to be apportioned, then apportion to each state the number of seats equal to its Quota (Lower or Upper from Step 3).
If the sum of the Quotas (Lower and/or Upper from Step 3) is NOT equal to the correct number of seats to be apportioned, then, by trial and error, find a number MD, called the Modified Divisor, to use in place of the Standard Divisor, such that when each state's Modified Quota MQ (the state's population divided by MD instead of SD) is rounded based on the arithmetic mean (average), the rounded Modified Quotas sum to exactly the number of seats to be apportioned. Apportion each state its Modified Rounded Quota.
Problem:
Violates the Quota Rule. (However, violations are rare and are usually associated with contrived situations.)
Huntington-Hill Method
Also known as the Method of Equal Proportions.
Current method used to apportion U.S. House
Developed around 1911 by Joseph A. Hill, Chief Statistician of the Bureau of the Census and Edward V. Huntington, Professor of Mechanics & Mathematics, Harvard
Preliminary terminology: the Geometric Mean of two numbers a and b is √(a·b).
Procedure:
Calculate the Standard Divisor.
Calculate each state’s Standard Quota.
Initially assign a state its Lower Quota if its Standard Quota is less than the Geometric Mean of the two whole numbers that the Standard Quota is immediately between (for example, 16.47 is immediately between 16 and 17, and their Geometric Mean is √(16·17) ≈ 16.49). Initially assign a state its Upper Quota if its Standard Quota is greater than or equal to that Geometric Mean. [In other words, round down or up based on the geometric mean.]
Check to see if the sum of the Quotas (Lower and/or Upper from Step 3) is equal to the correct number of seats to be apportioned.
If the sum of the Quotas (Lower and/or Upper from Step 3) is equal to the correct number of seats to be apportioned, then apportion to each state the number of seats equal to its Quota (Lower or Upper from Step 3).
If the sum of the Quotas (Lower and/or Upper from Step 3) is NOT equal to the correct number of seats to be apportioned, then, by trial and error, find a number MD, called the Modified Divisor, to use in place of the Standard Divisor, such that when each state's Modified Quota MQ (the state's population divided by MD instead of SD) is rounded based on the geometric mean, the rounded Modified Quotas sum to exactly the number of seats to be apportioned. Apportion each state its Modified Rounded Quota.
Problem:
Violates the Quota Rule.
For reference, the Quota Rule:
Quota Rule
An apportionment method that always allocates only lower and/or upper bounds follows the quota rule.
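Jefferson's, Webster's, and Huntington-Hill's methods differ only in how the quotas are rounded, so one Python sketch can cover all three. The trial-and-error search for the Modified Divisor is done here by bisection, which is merely one reasonable way to implement it (my own choice, as is the tie-handling caveat):

import math

def round_jefferson(q):
    return math.floor(q)                      # always round down

def round_webster(q):
    return math.floor(q + 0.5)                # round at the arithmetic mean

def round_huntington_hill(q):
    lower = math.floor(q)
    return lower + (1 if q >= math.sqrt(lower * (lower + 1)) else 0)  # geometric mean

def divisor_method(populations, seats, rounding, iterations=200):
    # Search for a Modified Divisor so that the rounded quotas sum to `seats`.
    lo, hi = 1e-9, float(sum(populations))
    for _ in range(iterations):
        divisor = (lo + hi) / 2
        allocation = [rounding(p / divisor) for p in populations]
        total = sum(allocation)
        if total == seats:
            return allocation
        if total > seats:
            lo = divisor        # too many seats: the divisor must grow
        else:
            hi = divisor        # too few seats: the divisor must shrink
    raise ValueError("no suitable divisor found (exact ties need special handling)")

print(divisor_method([324, 190], 144, round_webster))          # -> [91, 53]
print(divisor_method([324, 190], 144, round_huntington_hill))  # -> [91, 53]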

The problem can be framed as that of finding the closest integer approximation to a set of ratios. For instance, if you want to assign respectively A, B, C ≥ 0 members from 3 groups to match the ratios a, b, c ≥ 0 (with a + b + c = N > 0), where N = A + B + C > 0 is the total allocation desired, then you're approximating (a, b, c) by (A, B, C) with A, B and C restricted to integers.
One way to solve this may be to set it up as a least squares problem - that of minimizing |a - A|² + |b - B|² + |c - C|²; subject to the constraints A + B + C = N and A, B, C ≥ 0.
A necessary condition for the optimum is that it be a local optimum with respect to discrete unit changes. For instance, (A,B,C) → (A+1,B-1,C), if B > 0 ... which entails the condition (A - B ≥ a - b - 1 or B = 0).
For the situation at hand, the optimization problem is to minimize:
|A - a|² + |B - b|²
a = 144×324/(324+190) ≅ 90.770, b = 144×190/(324+190) ≅ 53.230
which leads to the conditions:
A - B ≥ a - b - 1 ≅ +36.541 or B = 0
B - A ≥ b - a - 1 ≅ -38.541 or A = 0
A + B = 144
Since they are integers the inequalities can be strengthened:
A - B ≥ +37 or B = 0
B - A ≥ -38 or A = 0
A + B = 144
The boundary cases A = 0 and B = 0 are ruled out, since they don't satisfy all three conditions. So you're left with 37 ≤ A - B ≤ 38 or, since A + B = 144: 181 ≤ 2A ≤ 182, i.e. A = 91 ... and B = 53.
It is quite possible that this way of framing the problem may be equivalent, in terms of its results, to one of the algorithms cited in an earlier reply.
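For the two-company case this framing is easy to check by brute force over the single free variable; a small Python sketch (with my own variable names) confirms A = 91 and B = 53:

# Minimize (A - a)^2 + (B - b)^2 subject to A + B = 144 and A, B >= 0.
a = 144 * 324 / (324 + 190)   # ~90.770
b = 144 * 190 / (324 + 190)   # ~53.230
error, A = min(((A - a) ** 2 + ((144 - A) - b) ** 2, A) for A in range(145))
print(A, 144 - A)             # -> 91 53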

My suggestion is to just take 28% of each company and round up to the nearest person.
In your case, you would go with 91 and 54. Admittedly, this does result in having a bit over 28%.
The most accurate method is as follows:
Calculate the exact number that you want.
Take 28% for each company and round down.
Sort the companies in descending order by the remainder.
Go through the sorted list and round the top entries up (one extra person each) until the total is exactly the number you want.
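Applied to the example above: 0.28 * 324 = 90.72 and 0.28 * 190 = 53.2 round down to 90 and 53, which sum to 143, one short of 144. Company A has the larger remainder (0.72 versus 0.20), so it gets the extra person, giving 91 and 53.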

Since I originally posted this question I came across a description of this exact problem in Martin Fowler's book "Patterns of Enterprise Application Architecture" (page 489 and 490).
Martin talks about a "Matt Foemmel’s simple conundrum" of dividing 5 cents between two accounts, but must obey the distribution of 70% and 30%. This describes my problem in a much simpler way.
Here are the solutions he presents in his book to that problem:
Perhaps the most common is to ignore it—after all, it’s only a penny here and there. However this tends to make accountants understandably nervous.

When allocating you always do the last allocation by subtracting from what you’ve allocated so far. This avoids losing pennies, but you can get a cumulative amount of pennies on the last allocation.

Allow users of a Money class to declare the rounding scheme when they call the method. This permits a programmer to say that the 70% case rounds up and the 30% rounds down. Things can get complicated when you allocate across ten accounts instead of two. You also have to remember to round. To encourage people to remember I’ve seen some Money classes force a rounding parameter into the multiply operation. Not only does this force the programmer to think about what rounding she needs, it also might remind her of the tests to write. However, it gets messy if you have a lot of tax calculations that all round the same way.

My favorite solution: have an allocator function on the money. The parameter to the allocator is a list of numbers, representing the ratio to be allocated (it would look something like aMoney.allocate([7,3])). The allocator returns a list of monies, guaranteeing that no pennies get dropped by scattering them across the allocated monies in a way that looks pseudo-random from the outside. The allocator has faults: You have to remember to use it and any precise rules about where the pennies go are difficult to enforce.
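For illustration only (this is not Fowler's code, just a minimal Python sketch of the idea), an allocator that works in integer cents and hands the leftover cents to the first few parts:

def allocate(amount_in_cents, ratios):
    # Split an integer amount of cents according to `ratios`, guaranteeing
    # that the parts always sum back to the original amount.
    total_ratio = sum(ratios)
    parts = [amount_in_cents * r // total_ratio for r in ratios]
    for i in range(amount_in_cents - sum(parts)):   # distribute leftover cents
        parts[i] += 1
    return parts

print(allocate(5, [7, 3]))        # -> [4, 1]   (5 cents split 70/30)
print(allocate(500, [1, 1, 1]))   # -> [167, 167, 166]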

Related

Is that proof of optimality correct?

I have been given the following problem: there are n files with lengths z1, ..., zn and usage frequencies u1, ..., un, where u1 + ... + un = 1 and 0 < u_i < 1.
We want an ordering for which the expected time to fetch a file from the store is minimal. For example, if z1 = 12, z2 = 3, u1 = 0.9 and u2 = 0.1, and file 1 is stored first, the expected access time is 12*0.9 + 15*0.1.
My task: Prove that this (greedy) algorithm is optimal.
My Question: Is my answer to that question correct or what should I improve?
My answer:
Suppose the algorithm is not optimal; then there has to exist a more efficient order. Two factors matter here: the usage and the length. The more a file is used, the shorter its access time has to be, which means the files stored before it should be as short as possible. If the ratio z_i/u_i were sorted in descending order, the files with high usage would be placed last; since a file's access time is the sum of all lengths stored before it (weighted by its usage), the most frequently used files would be accessed slowly, which contradicts efficiency. Now suppose the ratio z_i/u_i itself were the wrong criterion. Dividing by u_i means that the more a file is used, the smaller its ratio becomes, so more frequently used files are placed earlier and accessed faster (recall 0 < u_i < 1). Departing from that division would mean files with higher usage are no longer preferred, which contradicts efficiency. Likewise, because z_i is in the numerator, shorter files are preferred first; departing from that would mean longer files are sometimes preferred, and taking longer files first also contradicts efficiency. Since every alternative sorting leads to a contradiction, the ordering by z_i/u_i is optimal.
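As a sanity check (not a substitute for the proof), here is a small Python sketch that compares the expected access time of the greedy order (ascending z_i/u_i) with the reverse order, using the numbers from the example:

def expected_access_time(files):
    # files: list of (length, usage) pairs, in storage order.
    elapsed = expected = 0.0
    for length, usage in files:
        elapsed += length              # must read past all earlier files plus this one
        expected += usage * elapsed
    return expected

files = [(12, 0.9), (3, 0.1)]                        # (z_i, u_i) from the example
greedy = sorted(files, key=lambda f: f[0] / f[1])    # ascending z_i / u_i
print(expected_access_time(greedy))                  # 12*0.9 + 15*0.1 = 12.3
print(expected_access_time(greedy[::-1]))            # 3*0.1 + 15*0.9 = 13.8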

Maximize variable x and minimize variable y

I could frame it like this. A bunch of people enter a pool willing to fund something. They can offer the funds at whatever interest rate they think is right. So basically they make bids. I want to find the best and cheapest route to fund the money that was required by the guy who created the pool. I want maximum funds and minimum interest rate while also minimizing the total number of people the guy will have to pay back to
You have 3 different things you are trying to maximize/minimize. You can use weighting variables to determine what is most important to your case.
Suppose α is the importance you want to put on total capital. β is the weight you want to put on the number of lenders. α & β are subject to the constraint α + β ≤ 1.
A general formula to maximize your pool (where positive numbers are desirable) is:
Result = α*(funds)+β*(-lenders)+(1-α-β)*(-interest rate)
The positive numbers wouldn't be apples to apples comparisons unless you normalized the funds, lenders, and interest rate values.
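A tiny Python sketch of how such a score could be computed; the normalization to [0, 1] and all of the example figures below are assumptions of mine, not values from the question:

def score(funds, lenders, interest_rate, alpha, beta,
          max_funds, max_lenders, max_rate):
    # Normalize each quantity to [0, 1] so the terms are comparable,
    # then combine them with the chosen weights.
    f = funds / max_funds
    l = lenders / max_lenders
    r = interest_rate / max_rate
    return alpha * f + beta * (-l) + (1 - alpha - beta) * (-r)

# Example: funds matter most (alpha = 0.6), lender count a little (beta = 0.1),
# and the remaining weight (0.3) goes to the interest rate.
print(score(funds=9000, lenders=4, interest_rate=0.05,
            alpha=0.6, beta=0.1,
            max_funds=10000, max_lenders=10, max_rate=0.10))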

Determining an algorithm to calculate individual item prices for "deal" listings

Here are a couple of example scenarios about what I'm trying to figure out:
Let's say a grocery store item is listed as 4 for 5.00. How do we go about figuring the unit price for each item, according to the deal that is listed?
A simple solution would be to divide the total price by the quantity listed, and in this case, you would get 1.25.
However, in a situation that is a bit more complicated, such as 3 for 5.00, dividing the price by the quantity gives roughly 1.6666666666666667, which would round to 1.67.
If we round all three items to 1.67, the total price is not 5.00, but in fact 5.01. The individual prices would need to be calculated as 1.67, 1.67, and 1.66 in order to add up correctly.
The same goes for something like 3 for 4.00. The mathematical unit price would be 1.3333333333333333, rounding to 1.33. However, we need to adjust one of them again because the actual price without adjustments would be 3.99. The individual prices would need to be 1.34, 1.33, and 1.33 to add up correctly.
Is there an efficient way to determine how to split up a price deal like this and how to determine adjusted amounts so that the individual prices add up correctly?
If you want to divide an integer number (e.g. of pence) up into as equal portions as possible one way is to mimic dividing up that portion of a line by making marks in it, so portion i of n (counting from 0) when the total is T is of length floor((T * (i + 1)) / n) - floor((T * i) / n).
Whether it makes sense to say the individual prices of 3 items are 1.67, 1.67, and 1.66 is another matter. How do you decide which item is the cheap one?
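A short Python sketch of that line-marking formula, working in integer pence:

def split_price(total_pence, n):
    # Split total_pence into n portions that are as equal as possible.
    return [(total_pence * (i + 1)) // n - (total_pence * i) // n
            for i in range(n)]

print(split_price(500, 3))   # -> [166, 167, 167]  i.e. 1.66, 1.67, 1.67
print(split_price(400, 3))   # -> [133, 133, 134]  i.e. 1.33, 1.33, 1.34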
Since we're talking about basic math here, I'd say efficiency depends more on your implementation than the algorithm. When you say "efficient", do you mean you don't want to add up all prices to check the remainder? That can be done:
The premise is, you're selling x items for a price of y, where x is obviously integer and y is obviously a float rounded to 2 decimals.
First, convert the price to an integer number of cents, T = round(y*100), and compute the remainder: R = T % x.
x - R of the items will cost (T : x) / 100 (where ":" means integer division and "/" means float division)
and R of the items will cost (T : x) / 100 + 0.01
That algorithm is theoretically speaking efficient since it has a complexity of O(1). But I think I remember that in hardware not much is more efficient than adding floats (Don't take my word for it, I didn't pay that much attention to my hardware lectures), so maybe the crude approach is still better.
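Worked through for 3 items at 5.00: T = 500, R = 500 % 3 = 2 and 500 : 3 = 166, so one item costs 1.66 and two items cost 1.67, which adds up to 5.00 exactly.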

Calculate new probability based on past probability per value

I want to calculate a percentage probability based on a list of past occurrences.
The data looks similar to this simplified table, for instance when the first value has been 8 in the past there has been a 72% chance of the event occurring.
1 76%
2 64%
4 80%
6 85%
7 83%
8 72%
11 70%
The full table ranges from 0 to 1030 and has 377 rows but changes daily. I want to pass the function a value such as 3 and be returned a percentage probability of the event occurring. I don't need exact code, but would appreciate being pointed in the right direction.
Thanks
Based on your answers in the comments of the question, I would suggest an interpolation---linear interpolation is the simplest answer. It doesn't look like a probabilistic model would be appropriate based on the series in the spreadsheet (there doesn't appear to be a clear relationship between column 1 and column 3).
To give an example of how this would work: imagine you want the probability for some point p, which is unobserved in the data. The biggest value you observe which is less than p is p_low (with corresponding probability f(p_low)), and the smallest value greater than p is p_high (with probability f(p_high)). Your estimate for p is:
interval = p_high - p_low
f_p_hat = ((p_high - p)/interval*f_p_low) + ((p - p_low)/interval*f_p_high)
This makes your estimate for p a weighted average of the values at p_low and p_high, where each endpoint is weighted by p's distance to the other endpoint, so the nearer point counts more. E.g. if p is equidistant between p_low and p_high, f_p_hat (your estimate for f(p)) is just the mean of f(p_low) and f(p_high).
Now, linear interpolation may not work if you have reason to suspect that the estimates at the endpoints are inaccurate (possibly due to small sample sizes). If so, it would be possible to do a (possibly weighted) least squares fit to a neighbourhood of points around p, and use that as a prediction. If this is the case I can go into a bit more detail.
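A minimal Python sketch of the linear interpolation described above, treating the table as a sorted list of (value, probability) pairs; the use of bisect and the refusal to extrapolate outside the table are my own choices:

import bisect

# (value, observed probability) pairs from the table, sorted by value.
table = [(1, 0.76), (2, 0.64), (4, 0.80), (6, 0.85),
         (7, 0.83), (8, 0.72), (11, 0.70)]

def estimate(p):
    values = [v for v, _ in table]
    i = bisect.bisect_left(values, p)
    if i < len(values) and values[i] == p:     # exact hit in the table
        return table[i][1]
    if i == 0 or i == len(values):             # outside the observed range
        raise ValueError("p is outside the table; extrapolation would be needed")
    (p_low, f_low), (p_high, f_high) = table[i - 1], table[i]
    interval = p_high - p_low
    return ((p_high - p) / interval) * f_low + ((p - p_low) / interval) * f_high

print(estimate(3))   # halfway between 2 and 4 -> (0.64 + 0.80) / 2 = 0.72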

Open-ended tournament pairing algorithm

I'm developing a tournament model for a virtual city commerce game (Urbien.com) and would love to get some algorithm suggestions. Here's the scenario and current "basic" implementation:
Scenario
Entries are paired up duel-style, like on the original Facemash or Pixoto.com.
The "player" is a judge, who gets a stream of dueling pairs and must choose a winner for each pair.
Tournaments never end, people can submit new entries at any time and winners of the day/week/month/millennium are chosen based on the data at that date.
Problems to be solved
Rating algorithm - how to rate tournament entries and how to adjust their ratings after each match?
Pairing algorithm - how to choose the next pair to feed the player?
Current solution
Rating algorithm - the Elo rating system currently used in chess and other tournaments.
Pairing algorithm - our current algorithm recognizes two imperatives:
Give more duels to entries that have had less duels so far
Match people with similar ratings with higher probability
Given:
N = total number of entries in the tournament
D = total number of duels played in the tournament so far by all players
Dx = how many duels player x has had so far
To choose players x and y to duel, we first choose player x with probability:
p(x) = (1 - (Dx / D)) / N
Then choose player y the following way:
Sort the players by rating
Let the probability of choosing player j at index jIdx in the sorted list be:
p(j) = ...
0, if (j == x)
n*r^abs(jIdx - xIdx) otherwise
where 0 < r < 1 is a coefficient to be chosen, and n is a normalization factor.
Basically the probabilities in either direction from x form a geometric series, normalized so they sum to 1.
Concerns
Maximize informational value of a duel - pairing the lowest rated entry against the highest rated entry is very unlikely to give you any useful information.
Speed - we don't want to do massive amounts of calculations just to choose one pair. One alternative is to use something like the Swiss pairing system and pair up all entries at once, instead of choosing new duels one at a time. This has the drawback (?) that all entries submitted in a given timeframe will experience roughly the same amount of duels, which may or may not be desirable.
Equilibrium - Pixoto's ImageDuel algorithm detects when entries are unlikely to further improve their rating and gives them less duels from then on. The benefits of such detection are debatable. On the one hand, you can save on computation if you "pause" half the entries. On the other hand, entries with established ratings may be the perfect matches for new entries, to establish the newbies' ratings.
Number of entries - if there are just a few entries, say 10, perhaps a simpler algorithm should be used.
Wins/Losses - how does the player's win/loss ratio affect the next pairing, if at all?
Storage - what to store about each entry and about the tournament itself? Currently stored:
Tournament Entry: # duels so far, # wins, # losses, rating
Tournament: # duels so far, # entries
Instead of throwing in Elo and ad-hoc probability formulae, you could use a standard approach based on the maximum likelihood method.
The maximum likelihood method is a method for parameter estimation, and it works like this. Every contestant (player) is assigned a parameter s[i] (1 <= i <= N where N is total number of contestants) that measures the strength or skill of that player. You pick a formula that maps the strengths of two players into a probability that the first player wins. For example,
P(i, j) = 1/(1 + exp(s[j] - s[i]))
which is the logistic curve (see http://en.wikipedia.org/wiki/Sigmoid_function). When you then have a table that shows the actual results between the players, you use optimization (e.g. gradient descent) to find those strength parameters s[1] .. s[N] that maximize the probability of the actually observed match results. E.g. if you have three contestants and have observed two results:
Player 1 won over Player 2
Player 2 won over Player 3
then you find parameters s[1], s[2], s[3] that maximize the value of the product
P(1, 2) * P(2, 3)
Incidentally, it can be easier to maximize
log P(1, 2) + log P(2, 3)
Note that if you use something like the logistics curve, it is only the difference of the strength parameters that matters so you need to anchor the values somewhere, e.g. choose arbitrarily
s[1] = 0
In order to have more recent matches "weigh" more, you can adjust the importance of the match results based on their age. If t measures the time since a match took place (in some time units), you can maximize the value of the sum (using the example)
e^-t log P(1, 2) + e^-t' log P(2, 3)
where t and t' are the ages of the matches 1-2 and 2-3, so that those games that occurred more recently weigh more.
The interesting thing in this approach is that when the strength parameters have values, the P(...) formula can be used immediately to calculate the win/lose probability for any future match. To pair contestants, you can pair those where the P(...) value is close to 0.5, and then prefer those contestants whose time-adjusted number of matches (sum of e^-t1 + e^-t2 + ...) for match ages t1, t2, ... is low. The best thing would be to calculate the total impact of a win or loss between two players globally and then prefer those matches that have the largest expected impact on the ratings, but that could require lots of calculations.
You don't need to run the maximum likelihood estimation / global optimization algorithm all the time; you can run it e.g. once a day as a batch run and use the results for the next day for matching people together. The time-adjusted match masses can be updated real time anyway.
On the algorithm side, you can sort the players after the maximum likelihood run based on their s parameter, so it's very easy to find equal-strength players quickly.
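A rough Python sketch of the maximum-likelihood fit with the logistic model, using plain gradient ascent on the log-likelihood; the step size, iteration count and anchoring of the first strength at 0 are assumptions of mine, and the time-weighting described above is omitted for brevity:

import math

def fit_strengths(n_players, results, steps=500, lr=0.1):
    # results: list of (winner, loser) index pairs (0-based).
    # Returns strengths s maximizing the sum of log P(winner beats loser),
    # where P(i beats j) = 1 / (1 + exp(s[j] - s[i])).
    s = [0.0] * n_players
    for _ in range(steps):
        grad = [0.0] * n_players
        for w, l in results:
            p = 1.0 / (1.0 + math.exp(s[l] - s[w]))   # model's win probability
            grad[w] += 1.0 - p                        # d log P / d s[w]
            grad[l] -= 1.0 - p                        # d log P / d s[l]
        s = [si + lr * gi for si, gi in zip(s, grad)]
        s = [si - s[0] for si in s]                   # anchor the first strength at 0
    return s

# Player 0 beat player 1, player 1 beat player 2.
print(fit_strengths(3, [(0, 1), (1, 2)]))   # -> strengths ordered s[0] > s[1] > s[2]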
