Algorithm tipping point

I am a complete algorithm dunce and I have this problem where I need to find the lowest cost of buying an unknown quantity of widgets, either by unit price or in batches of x widgets. An example will help, I'm sure:
1) Widget price per unit is $0.05
2) Batches of widgets are priced at $4.00 per 100 widgets
Say I wish to buy 140 widgets -
a) costing it by unit is 140 × $0.05 => $7.00
b) costing it by batch is 2 batches of 100 @ $4.00 => $8.00 (the excess 60 widgets can be ignored)
so buying by unit in this case is cheaper by $1.00
However if I wish to buy 190 widgets then -
a) costing it by unit is 190 × $0.05 => $9.50
b) costing it by batch is 2 batches of 100 @ $4.00 => $8.00 (the excess 10 widgets can be ignored)
The price is cheaper in this case by batch buying...
So what I need is a way to programmatically find the 'tipping point' between the two methods to get the cheapest price.
I hope I have explained it OK and I'm sure it's a simple answer but my brain has faded fast today!
TIA
EDIT::
Ok sorry -- I realise I haven't been as clear as I should -- as someone has pointed out, a mix of batches and units is also possible, so for the 140 widget example it could also be 1 batch and 40 units.
What I am trying to achieve is to programmatically find the cheapest way to purchase X widgets priced at $XX each, given also a batch price $YY for NN widgets.
Any excess widgets from buying a batch are no problem, i.e. the quantity purchased can be more than X but it cannot be less than X.
So for the 140 example, 1 batch @ $4.00 + 40 units @ $0.05 => $6.00, which is the cheapest I think. And
for the 190 example 2 batches is still the cheapest I think as 1 batch + 90 units is $8.50...
I was hoping there would be some neat equation out there which would do this :)

I wrote a basic brute-force style script in Python to compare the prices of the two options (for quantities up to 1,000). It's neither the fastest nor the most elegant approach, but it appears to work.
The format on the output is: (<unit-count>, (<per-unit-cost>, <per-batch-cost>))
import math
import pprint

unitList = range(1000)
pricePerUnit = 0.05
pricePerBatch = 4.0
numberPerBatch = 100.0

def calculatePerUnit(units):
    """ Calculate the price of buying per unit """
    return units * pricePerUnit

def calculatePerBatch(units):
    """ Calculate the price of buying per batch """
    return math.ceil(units / numberPerBatch) * pricePerBatch

def main():
    """ Execute the script """
    perUnit = [calculatePerUnit(u) for u in unitList]
    perBatch = [calculatePerBatch(u) for u in unitList]
    comparisonList = list(zip(perUnit, perBatch))
    # keep only the quantities where buying per unit is cheaper
    perUnitCheaperPriceList = [x for x in comparisonList if x[0] < x[1]]
    perUnitCheaperUnitList = [int(round(x[0] / pricePerUnit)) for x in perUnitCheaperPriceList]
    pprint.pprint(list(zip(perUnitCheaperUnitList, perUnitCheaperPriceList)))

if __name__ == "__main__":
    main()
And the results:
[gizmo@test ~]$ python TippingPoint.py
[(1, (0.050000000000000003, 4.0)),
... These are sequential values I left out for brevity ...
(79, (3.9500000000000002, 4.0)),
(101, (5.0500000000000007, 8.0)),
... These are sequential values I left out for brevity ...
(159, (7.9500000000000002, 8.0)),
(201, (10.050000000000001, 12.0)),
... These are sequential values I left out for brevity ...
(239, (11.950000000000001, 12.0)),
(301, (15.050000000000001, 16.0)),
... These are sequential values I left out for brevity ...
(319, (15.950000000000001, 16.0))]

The answer to this is astonishingly easy, if you just think of it the other way around. You already know the 'tipping point' in price. It is $8.00. What you do not know is that point in units, which is $8.00 divided by $0.05 (per units) => 160 units.
Please note that there is also the chance of buying 100 units @ $4.00 and 60 units @ $0.05, resulting in $7.00 total. That is a third possibility that you may or may not account for.
Edit: After your edit, things get even simpler:
You have 1 item at $XX, and a batch of NN items at $YY. Assuming that $YY/NN < $XX (meaning the batch actually saves money), all you have to do is this:
Divide your target quantity by NN and round down. You will have to buy this number of batches.
Multiply the remaining quantity by $XX. If that is more than $YY, buy one extra batch; otherwise buy the remaining quantity at $XX.
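The two steps above can be sketched in Python (a minimal illustration; the function name `cheapest_cost` is made up, and it assumes the batch actually saves money, i.e. $YY/NN < $XX):

```python
def cheapest_cost(target, unit_price, batch_price, batch_size):
    """Cheapest way to buy at least `target` widgets, mixing batches and units."""
    full_batches = target // batch_size              # step 1: whole batches, rounded down
    remainder = target - full_batches * batch_size
    # step 2: cover the remainder with units, or one extra batch if that is cheaper
    tail = min(remainder * unit_price, batch_price) if remainder else 0
    return full_batches * batch_price + tail

print(cheapest_cost(140, 0.05, 4.0, 100))  # 6.0 (1 batch + 40 units)
print(cheapest_cost(190, 0.05, 4.0, 100))  # 8.0 (2 batches)
```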

Related

Generate a number within a range while considering a mean value

I want to generate a random number within a range while considering a mean value.
I have a solution for generating the range:
turtles-own [age]

to setup
  crt 2 [
    get-age
  ]
end

to get-age
  let min-age 65
  let max-age 105
  set age ( min-age + random ( max-age - min-age ) )
end
However, if I use this approach every number can be created with the same probability, which doesn't make much sense in this case as way more people are 65 than 105 years old.
Therefore, I want to include a mean value. I found random-normal but as I don't have a standard deviation and my values are not normally distributed, I can't use this approach.
Edit:
An example: I have two agent typologies. Agent typology 1 has the mean age 79 and the age range 67-90. Agent typology 2 has the mean age 77 and the age range 67-92.
If I implement the agent typologies in NetLogo as described above, I get for agent typology 1 the mean age 78 and for agent typology 2 the mean age 79. The reason is that for every age the exact same number of agents is generated. This gives me, in the end, the wrong result for my artificial population.
[Editor's note: Comment from asker added here.]
I want a distribution of values with most values for the min value and fewest values for the max value. However, the curve of the distribution is not necessarily negative linear. Therefore, I need the mean value. I need this approach because there is the possibility that one agent typology has the range for age 65 - 90 and the mean age 70 and another agent typology has the same age range but the mean age is 75. So the real age distribution for the agents would look different.
This is a maths problem rather than a NetLogo problem. You haven't worked out what you want your distribution to look like (lots of different curves can have the same min, max and mean). If you don't know what your curve looks like, it's pretty hard to code it in NetLogo.
However, let's take the simplest curve. This is two uniform distributions, one from the min to the mean and the other from the mean to the max. While it's not decreasing along the length, it will give you the min, max and mean that you want and the higher values will have lower probability as long as the mean is less than the midway point from min to max (as it is if your target is decreasing). The only question is what is the probability to select from each of the two uniform distributions.
If L is your min (low value), H is your max (high value) and M for mean, then you need to find the probability P to select from the lower range, with (1-P) for the upper range. But you know that the total probability of the lower range must equal the total probability of the upper range must equal 0.5 because you want to switch ranges at the mean and the mean must also be the mean of the combined distribution. Therefore, each rectangle is the same size. That is P(M-L) = (1-P)(H-M). Solving for P gets you:
P = (H-M) / (H - L)
Put it into a function:
to-report random-tworange [#min #max #mean]
  let prob (#max - #mean) / (#max - #min)
  ifelse random-float 1 < prob
    [ report #min + random-float (#mean - #min) ]
    [ report #mean + random-float (#max - #mean) ]
end
To test this, try different values in the following code:
to testme
  let testvals []
  let low 77
  let high 85
  let target 80
  repeat 10000 [ set testvals lput (random-tworange low high target) testvals ]
  print mean testvals
end
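For anyone who wants to check the maths outside NetLogo, here is a Python sketch of the same two-range sampler (the function name is mine); the sample mean should land on the target:

```python
import random

def random_tworange(low, high, mean):
    """Draw from [low, mean) with prob (high-mean)/(high-low), else from [mean, high)."""
    prob = (high - mean) / (high - low)
    if random.random() < prob:
        return low + random.random() * (mean - low)
    return mean + random.random() * (high - mean)

random.seed(42)
samples = [random_tworange(65, 105, 75) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to 75
```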
One other thing you should think about - how much does age matter? This is a design question. You only need to include things that change an agent's behaviour. If agents with age 70 make the same decisions as those with age 80, then all you really need is that the age is in this range and not the specific value.

Maximize weekly $ given changing price for parts & wholes

I have a dollar amount I wish to maximize my spend on products to purchase for a given list of projects.
For example $60. I cannot spend more than $60 a week but can hold over any left over amount. So, if I spend $58 I get to spend $62 the next week. If I purchase a product on week 1 I can use the left over amount on week 2 thereby not needing to re-purchase the same item.
The solution needs to be generic so that I can maximize between a lot of products and a lot of projects given that fixed dollar amount per week.
I need to generate a report with the list of products to purchase and the list of projects to do for that week. The prices get updated weekly, so I will need to recalculate the max spend weekly (meaning forecasting is not really part of the solution), and I need to reuse the amount from the products/inventory already purchased.
I have all the data and there aren't any unknown variables. I just need to be able to figure out how to maximize what to purchase given parts and wholes under a fixed dollar amount with history.
To make it more concrete through example (although abstracted from my real data):
I have a database of products (12.5K different ones) and their corresponding prices. I also have a fixed list of projects (let's say 2,500) I wish to do with those products. For each project I have the corresponding products needed. Each project takes a different amount of product, and projects can have overlapping or unique products.
So for example:
project 1 is building my model airplanes;
project 2 is fixing my picture frames;
project 3 is building a bird house;
etc.
Project 1 may need:
glue (1oz)
paint (3 qt)
balsam wood (2 lbs)
Project 2 may need:
glue (2 oz)
nails (10 count)
project 3 may need:
glue (10 oz)
paint (5 qts)
nails (40 count)
wood balsam (3 lbs)
wood pine (50 lbs)
Products:
Glue 4oz - $10
Paint 3qts - $30
Nails 15 count - $7
Wood Balsam 8 pounds - $12
Wood Pine 12 pounds - $8
For example, if I buy a bottle of glue (4 oz) at $10 I can use it for my airplanes and my picture frames but not my bird house. I need to do an exhaustive analysis of all products and all projects weekly given my dollar amount to spend since the prices change (sales/demand/etc.).
How can I best spend the $60 to do as many projects as possible in a given week? In week 2 I get a new $60 to spend, will most likely have leftover money, and will have product (such as glue) left over from the week before.
Any python code / projects that do something similar or exactly this already that I may be able to import and modify for my needs?
Any help in terms of algorithms, sample code, full solution!?!, ideas, etc. would be appreciated...
Thanks in advance!!! (FYI: this is for a personal project.)
This is a problem which is very well suited to being tackled with mathematical programming. With mathematical optimization you can optimize variables (e.g. a variable that says whether a project is conducted at some point) against an objective such as the number of projects conducted, while also respecting a set of constraints. For Python there are several free libraries for optimizing mathematical programs; I will show how to get started with your problem using PuLP. Please note that free software for this kind of problem usually performs far worse than commercial solvers, which can be very expensive. For small or easy problems, though, the free software suffices.
To get started:
pip install pulp
Now, import pulp and, as a little helper, itertools.product. There are many ways to represent your data; I chose to declare some ranges which serve as index sets. So r = 0 would be glue and p = 0 building a model airplane. You have to choose the number of time periods; with 4 periods, all projects can eventually be conducted.
from pulp import *
from itertools import product
R = range(5) # Resources
P = range(3) # Projects
T = range(4) # Periods
Your parameters could be represented as follows. project_consumption[0, 0] expresses that project 0 needs 1/4 of material 0 (glue) to be conducted.
resource_prices = (10, 30, 7, 12, 8) # Price per unit of resource
# Needed percentage of resource r for project p
project_consumption = {(0, 0): 1/4,  (0, 1): 3/3, (0, 2): 0/15,  (0, 3): 2/8, (0, 4): 0/12,
                       (1, 0): 2/4,  (1, 1): 0/3, (1, 2): 10/15, (1, 3): 0/8, (1, 4): 0/12,
                       (2, 0): 10/4, (2, 1): 5/3, (2, 2): 40/15, (2, 3): 3/8, (2, 4): 50/12}
budget = 60
Next, we declare our problem formulation. We want to maximize the number of projects, so we declare the sense LpMaximize. The decision variables are declared next:
planned_project[p, t]: 1 if project p is conducted in period t, else 0
stocked_material[r, t]: Amount of material r which is on stock in t
consumption_material[r, t]: Amount of r that is consumed in period t
purchase_material[r, t]: Amount of r purchased in t
budget[t]: Money balance in t
Declare our problem:
m = LpProblem("Project planning", LpMaximize)
planned_project = LpVariable.dicts('pp', product(P, T), lowBound = 0, upBound = 1, cat = LpInteger)
stocked_material = LpVariable.dicts('ms', product(R, T), lowBound = 0)
consumption_material = LpVariable.dicts('cm', product(R, T), lowBound = 0)
purchase_material = LpVariable.dicts('pm', product(R, T), lowBound = 0, cat = LpInteger)
budget = LpVariable.dicts('b', T, lowBound = 0)
Our objective is added to the problem as follows. I multiply every variable by (len(T) - t), which means a project is worth more when conducted early rather than late.
m += lpSum((len(T) - t) * planned_project[p, t] for p in P for t in T)
Now we can restrict the values of our decision variables by adding the necessary constraints. The first constraint restricts our material stock to the difference of purchased and consumed materials.
for r in R:
    for t in T:
        if t != 0:
            m += stocked_material[r, t] == stocked_material[r, t-1] + purchase_material[r, t] - consumption_material[r, t]
        else:
            m += stocked_material[r, t] == 0 + purchase_material[r, 0] - consumption_material[r, 0]
The second constraint makes sure that the correct amount of materials is consumed for the projects conducted in each period.
for r in R:
    for t in T:
        m += lpSum([project_consumption[p, r] * planned_project[p, t] for p in P]) <= consumption_material[r, t]
The third constraint ensures that we do not spend more than our budget, additionally the leftover amount can be used in future periods.
for t in T:
    if t > 0:
        m += budget[t] == budget[t-1] + 60 - lpSum([resource_prices[r] * purchase_material[r, t] for r in R])
    else:
        m += budget[0] == 60 - lpSum([resource_prices[r] * purchase_material[r, 0] for r in R])
Finally, each project shall only be carried out once.
for p in P:
    m += lpSum([planned_project[p, t] for t in T]) <= 1
We can optimize our problem by calling:
m.solve()
After optimization we can access each optimal decision variable value with its .value() method. To print some useful information about our optimal plan of action:
for (p, t), var in planned_project.items():
    if var.value() == 1:
        print("Project {} is conducted in period {}".format(p, t))

for t, var in budget.items():
    print("At time {} we have a balance of {} $".format(t, var.value()))

for (r, t), var in purchase_material.items():
    if var.value() > 0:
        print("At time {}, we purchase {} of material {}.".format(t, var.value(), r))
Output:
Project 0 is conducted in period 0
Project 2 is conducted in period 3
Project 1 is conducted in period 0
At time 0 we have a balance of 1.0 $
At time 1 we have a balance of 1.0 $
At time 2 we have a balance of 61.0 $
At time 3 we have a balance of 0.0 $
At time 0, we purchase 1.0 of material 0.
At time 3, we purchase 1.0 of material 3.
At time 0, we purchase 1.0 of material 3.
At time 1, we purchase 2.0 of material 1.
At time 0, we purchase 1.0 of material 2.
At time 3, we purchase 3.0 of material 2.
At time 3, we purchase 6.0 of material 4.
At time 0, we purchase 1.0 of material 1.
At time 3, we purchase 4.0 of material 0.
Note that in the solution we purchase 6 units of material 4 (6 × 12 lbs of wood pine) at time 3. We never really use that much, but the solution is still considered optimal: budget is not part of our objective, so buying more or less does not affect the number of projects we can do. There are therefore multiple optimal solutions. As a multi-criteria optimization problem, you could use Big-M values to also minimize budget utilization in the objective.
I hope this gets you started for your problem. You can find countless resources and examples for mathematical programming on the internet.
Something like this could work, creating a list of materials for each project, a list of projects, and a dictionary of prices. Each call of compareAll() would show the cheapest project in the list. You could also add a loop which removes the cheapest project from the list and adds it to a to-do list each time it runs, so that the next run finds the next cheapest.
p1 = ["glue", "wood", "nails"]
p2 = ["screws", "wood"]
p3 = ["screws", "wood", "glue", "nails"]
projects = [p1, p2, p3]
prices = {"glue": 1, "wood": 4, "nails": 2, "screws": 1}

def check(project, prices):
    # Sum the price of every material the project needs
    total = 0
    for item in project:
        total += prices[item]
    print("Total: " + str(total))
    return total

def compareAll(projectList):
    best = None      # cost of the cheapest project found so far
    cheapest = None
    for project in projectList:
        cost = check(project, prices)
        if best is None or cost < best:
            best = cost
            cheapest = project
    return cheapest, best

Algorithm to calculate sum of points for groups with varying member count [closed]

Let's start with an example. In Harry Potter, Hogwarts has 4 houses with students sorted into each house. The same happens on my website, and I don't know how many users are in each house. It could be 20 in one house, 50 in another, and 100 in the third and fourth.
Now, each student can earn points on the website and at the end of the year, the house with the most points will win.
But it's not fair to "only" do a sum of the points, as the house with 100 students will have a much higher chance to win, since they have more users to earn points. So I need to come up with an algorithm which is fair.
You can see an example here: https://worldofpotter.dk/points
What I do now is to sum all the points for a house, and then divide it by the number of users who have earned more than 10 points. This is still not fair, though.
Any ideas on how to make this calculation more fair?
Things we need to take into account:
* The percent of users earning points in each house
* Few users earning LOTS of points
* Many users earning FEW points (It's not bad earning few points. It still counts towards the total points of the house)
Link to MySQL dump(with users, houses and points): https://worldofpotter.dk/wop_points_example.sql
Link to CSV of points only: https://worldofpotter.dk/points.csv
I'd use something like Discounted Cumulative Gain which is used for measuring the effectiveness of search engines.
The concept is as it follows:
FUNCTION evalHouseScore (0_INDEXED_SORTED_ARRAY scores):
    score = 0;
    FOR (int i = 0; i < scores.length; i++):
        score += scores[i]/log2(i);
    END_FOR
    RETURN score;
END_FUNCTION;
This must be modified somehow, both because this way of measuring focuses on the first result and because, as written, log2(0) and log2(1) make the first two terms undefined. As this is subjective, you should decide on the way you would modify it. Below I post the code with some constants which you should try with different values:
FUNCTION evalHouseScore (0_INDEXED_SORTED_ARRAY scores):
    score = 0;
    FOR (int i = 0; i < scores.length; i++):
        score += scores[i]/log2(i+K);
    END_FOR
    RETURN L*score;
END_FUNCTION
Consider changing the logarithm.
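For reference, the modified pseudocode above might look like this in Python (a sketch; `k` and `l` stand in for the constants K and L, and the defaults here are just starting points to experiment with):

```python
import math

def eval_house_score(scores, k=2, l=1.0):
    """DCG-style house score: each score is discounted by the log of its rank.
    The offset k keeps log2 defined and non-zero for the first ranks."""
    ordered = sorted(scores, reverse=True)  # best scores first
    return l * sum(s / math.log2(i + k) for i, s in enumerate(ordered))

print(eval_house_score([10]))  # 10.0 (a single score counts fully with k=2)
```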
Tests:
int[] g = new int[] {758,294,266,166,157,132,129,116,111,88,83,74,62,60,60,52,43,40,28,26,25,24,18,18,17,15,15,15,14,14,12,10,9,5,5,4,4,4,4,3,3,3,2,1,1,1,1,1};
int[] s = new int[] {612,324,301,273,201,182,176,139,130,121,119,114,113,113,106,86,77,76,65,62,60,58,57,54,54,42,42,40,36,35,34,29,28,23,22,19,17,16,14,14,13,11,11,9,9,8,8,7,7,7,6,4,4,3,3,3,3,2,2,2,2,2,2,2,1,1,1};
int[] h = new int[] {813,676,430,382,360,323,265,235,192,170,107,103,80,70,60,57,43,41,21,17,15,15,12,10,9,9,9,8,8,6,6,6,4,4,4,3,2,2,2,1,1,1};
int[] r = new int[] {1398,1009,443,339,242,215,210,205,177,168,164,144,144,92,85,82,71,61,58,47,44,33,21,19,18,17,12,11,11,9,8,7,7,6,5,4,3,3,3,3,2,2,2,1,1,1,1};
The output is for different offsets:
1182
1543
1847
2286
904
1231
1421
1735
813
1120
1272
1557
It sounds like some sort of constraint between the houses may need to be introduced. I might suggest finding the person that earned the most points out of all the houses and using that as the denominator when rolling up the scores. This guarantees the max value of a user's contribution is 1; then all the scores for a house can be summed and divided by the number of users to normalize the house's score. That should give you a reasonable comparison.

It does introduce issues with houses that have a small number of high-achieving users, so you may want to consider a lower limit on the number of house members. Another technique may be to introduce handicap scores for users to balance the scales. The algorithm will most likely flex over time based on the data you receive; to keep it fair it will need some responsive adjustment after the initial iteration, since players can come up with creative ways to make scoring systems work for them.

Here is some pseudo-code in PHP that you may use:
<?php
$mostPointsEarned; // Find the user that earned the most points
$houseScores = [];
foreach ($houses as $house) {
    $numberOfUsers = 0;
    $normalizedScores = [];
    foreach ($house->getUsers() as $user) {
        $normalizedScores[] = $user->getPoints() / $mostPointsEarned;
        $numberOfUsers++;
    }
    $houseScores[] = array_sum($normalizedScores) / $numberOfUsers;
}
var_dump($houseScores);
You haven't given any examples of what the preferred state should be, or which situations you want to be immune against (3,2,1,1 compared to 5,2, etc.).
It's also a pity you haven't provided the dataset in some nice way to play with.
scala> val input = Map( // as seen on 2016-09-09 14:10 UTC on https://worldofpotter.dk/points
  'G' -> Seq(758,294,266,166,157,132,129,116,111,88,83,74,62,60,60,52,43,40,28,26,25,24,18,18,17,15,15,15,14,14,12,10,9,5,5,4,4,4,4,3,3,3,2,1,1,1,1,1),
  'S' -> Seq(612,324,301,273,201,182,176,139,130,121,119,114,113,113,106,86,77,76,65,62,60,58,57,54,54,42,42,40,36,35,34,29,28,23,22,19,17,16,14,14,13,11,11,9,9,8,8,7,7,7,6,4,4,3,3,3,3,2,2,2,2,2,2,2,1,1,1),
  'H' -> Seq(813,676,430,382,360,323,265,235,192,170,107,103,80,70,60,57,43,41,21,17,15,15,12,10,9,9,9,8,8,6,6,6,4,4,4,3,2,2,2,1,1,1),
  'R' -> Seq(1398,1009,443,339,242,215,210,205,177,168,164,144,144,92,85,82,71,61,58,47,44,33,21,19,18,17,12,11,11,9,8,7,7,6,5,4,3,3,3,3,2,2,2,1,1,1,1)
) // and the results on the website were: 1. R 1951, 2. H 1859, 3. S 990, 4. G 954
Here is what I thought of:
def singleValuedScore(individualScores: Seq[Int]) = individualScores
  .sortBy(-_)   // sort from most to least
  .zipWithIndex // add indices e.g. (best, 0), (2nd best, 1), ...
  .map { case (score, index) => score * (1 + index) } // here is the 'logic'
  .max

input.mapValues(singleValuedScore)
res: scala.collection.immutable.Map[Char,Int] =
Map(G -> 1044,
S -> 1590,
H -> 1968,
R -> 2018)
The overall positions would be:
Ravenclaw with 2018 aggregated points
Hufflepuff with 1968
Slytherin with 1590
Gryffindor with 1044
Which corresponds to the ordering on that web: 1. R 1951, 2. H 1859, 3. S 990, 4. G 954.
The algorithm's output is the maximal product of a user's score and that user's rank within the house.
This measure is not affected by "long-tail" of users having low score compared to the active ones.
There are no hand-set cutoffs or thresholds.
You could experiment with the rank attribution (score * index or score * Math.sqrt(index) or score / Math.log(index + 1) ...)
I take it that the fair measure is the number of points divided by the number of house members. Since you have the number of points, the exercise boils down to estimate the number of members.
We are in short supply of data here as the only hint we have on member counts is the answers on the website. This makes us vulnerable to manipulation, members can trick us into underestimating their numbers. If the suggested estimation method to "count respondents with points >10" would be known, houses would only encourage the best to do the test to hide members from our count. This is a real problem and the only thing I will do about it is to present a "manipulation indicator".
How could we then estimate member counts? Since we do not know anything other than test results, we have to infer the propensity to do the test from the actual results. And we have little other to assume than that we would have a symmetric result distribution (of the logarithm of the points) if all members tested. Now let's say the strong would-be respondents are more likely to actually test than weak would-be respondents. Then we could measure the extra dropout ratio for the weak by comparing the numbers of respondents in corresponding weak and strong test-point quantiles.
To be specific, of the 205 answers, there are 27 in the worst half of the overall weakest quartile, while 32 in the strongest half of the best quartile. So an extra 5 respondents of the very weakest have dropped out from an assumed all-testing symmetric population, and to adjust for this, we are going to estimate member count from this quantile by multiplying the number of responses in it by 32/27=about 1.2. Similarly, we have 29/26 for the next less-extreme half quartiles and 41/50 for the two mid quartiles.
So we would estimate members by simply counting the number of respondents but multiplying the number of respondents in the weak quartiles mentioned above by 1.2, 1.1 and 0.8 respectively. If however any result distribution within a house would be conspicuously skewed, which is not the case now, we would have to suspect manipulation and re-design our member count.
For the sample at hand however, these adjustments to member counts are minor, and yields the same house ranks as from just counting the respondents without adjustments.
I amused myself a little with your question and some Python programming on randomly generated data. As some people mentioned in the comments, you need to define what fairness is. If, as you said, you don't know the number of people in each of the houses, you can use the number of participations of each house, thus motivating participation (this can be unfair depending on the number of people in each house, but as you said, you don't have that data in the first place).
The important part of the code is the following.
import numpy as np
from numpy.random import randint  # random integers

# initialize random seed
np.random.seed(4)

houses = ["Gryffindor", "Slytherin", "Hufflepuff", "Ravenclaw"]
houses_points = []

# generate random data for each house
for _ in houses:
    # houses_points.append(randint(0, 100, randint(60, 100)))
    houses_points.append(randint(0, 50, randint(2, 10)))

# count participation
houses_participations = []
houses_total_points = []
for house_id in range(len(houses)):
    houses_total_points.append(np.sum(houses_points[house_id]))
    houses_participations.append(len(houses_points[house_id]))

# sum the total number of participations
total_participations = np.sum(houses_participations)

# proposed model with weighted total participation points
houses_partic_points = []
for house_id in range(len(houses)):
    # integer division, matching the original Python 2 run
    tmp = houses_total_points[house_id] * houses_participations[house_id] // total_participations
    houses_partic_points.append(tmp)
The results of this method are the following:
House Points per Participant
Gryffindor: [46 5 1 40]
Slytherin: [ 8 9 39 45 30 40 36 44 38]
Hufflepuff: [42 3 0 21 21 9 38 38]
Ravenclaw: [ 2 46]
House Number of Participations per House
Gryffindor: 4
Slytherin: 9
Hufflepuff: 8
Ravenclaw: 2
House Total Points
Gryffindor: 92
Slytherin: 289
Hufflepuff: 172
Ravenclaw: 48
House Points weighted by a participation factor
Gryffindor: 16
Slytherin: 113
Hufflepuff: 59
Ravenclaw: 4
You'll find the complete file with printing results here (https://gist.github.com/silgon/5be78b1ea0b55a20d90d9ec3e7c515e5).
You should add some more rules to define the fairness.
Idea 1
You could set up the rule that anyone has to earn at least 10 points to enter the competition.
Then you can calculate the average points for each house.
Positive: Everyone needs to show some motivation.
Idea 2
Another approach would be to set the rule that from each house only the 10 best students will count for the competition.
Positive: Easy rule to calculate the points.
Negative: Students might become uninterested if they see they can't reach the top 10 places of their house.
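Both ideas are simple to compute; here is a minimal Python sketch (the function names are made up):

```python
def idea1_average(points, threshold=10):
    """Idea 1: average over members who earned at least `threshold` points."""
    qualified = [p for p in points if p >= threshold]
    return sum(qualified) / len(qualified) if qualified else 0

def idea2_top_n(points, n=10):
    """Idea 2: sum of the house's n best scores."""
    return sum(sorted(points, reverse=True)[:n])

print(idea1_average([50, 20, 5, 3]))   # 35.0
print(idea2_top_n([5, 1, 9, 7], n=2))  # 16
```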
From my point of view, your problem divides into a few points:
The best thing to do would be to re-assign the players to the different houses so that each house has the same number of players (as explained by #navid-vafaei).
You may not want to do that if you believe it would hurt your game's popularity with players who are in a house they didn't want; note that the choice of the Sorting Hat can be changed, at least in the movies or books.
In that case, you can sum the points of the house's students and divide by the number of students. You might simply exclude students with a very low score, as well as students with very low activity, since students who skip school might be expelled.
The most important part of your algorithm, for me, is whether or not you give points for all valuable things:
In Harry Potter's story, the students earn points in the different subjects they take at school, according to their scores.
At the end of the year, there is a special award event. At that moment, the Director gives points for valuable things which cannot be evaluated in the school subjects, such as personal qualities (bravery, for example).

What is a better way to sort by a 5 star rating?

I'm trying to sort a bunch of products by customer ratings using a 5-star system. The site I'm setting this up for does not have a lot of ratings and continues to add new products, so it will usually have a few products with a low number of ratings.
I tried using average star rating but that algorithm fails when there is a small number of ratings.
For example, a product that has 3× 5-star ratings would show up better than a product that has 100× 5-star ratings and 2× 2-star ratings.
Shouldn't the second product show up higher because it is statistically more trustworthy because of the larger number of ratings?
Prior to 2015, the Internet Movie Database (IMDb) publicly listed the formula used to rank their Top 250 movies list. To quote:
The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:
R = average for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250 (currently 25000)
C = the mean vote across the whole report (currently 7.0)
For the Top 250, only votes from regular voters are considered.
It's not so hard to understand. The formula is:
rating = (v / (v + m)) * R + (m / (v + m)) * C;
Which can be mathematically simplified to:
rating = (R * v + C * m) / (v + m);
The variables are:
R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
v – The number of votes for an item. (To give another example, if 5 people have cast votes on an item, v is 5.)
m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with less votes than m.
All the formula does is: add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulate, the imaginary votes are eventually drowned out by real ones.
In this system, votes don't cause the rating to fluctuate wildly. Instead, they merely perturb it a bit in some direction.
When there are zero votes, only imaginary votes exist, and all of them are C. Thus, each item begins with a rating of C.
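As a sketch of this "imaginary votes" view (a few illustrative lines, not IMDb's actual code; the function name and the m/C values here are made up for the example):

```python
def bayesian_rating(R, v, C, m):
    """Average of v real votes (with mean R) plus m imaginary votes, each worth C."""
    return (R * v + C * m) / (v + m)

# With no real votes, the rating is exactly C (the all-items average).
bayesian_rating(R=0, v=0, C=3.75, m=25)           # 3.75

# As real votes pile up, the imaginary votes are drowned out and the
# rating approaches the item's own average R.
bayesian_rating(R=4.8, v=100000, C=3.75, m=25)    # ~4.7997
```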
See also:
A demo. Click "Solve".
Another explanation of IMDb's system.
An explanation of a similar Bayesian star-rating system.
Evan Miller shows a Bayesian approach to ranking 5-star ratings. The sort criterion (written out in plain text here, matching the Python code below) is:
score = E - z_alpha/2 * sqrt((E2 - E^2) / (N + K + 1))
with E = sum(sk * (nk + 1)) / (N + K) and E2 = sum(sk^2 * (nk + 1)) / (N + K),
where
nk is the number of k-star ratings,
sk is the "worth" (in points) of k stars,
N is the total number of votes
K is the maximum number of stars (e.g. K=5, in a 5-star rating system)
z_alpha/2 is the 1 - alpha/2 quantile of a normal distribution. If you want 95% confidence (based on the Bayesian posterior distribution) that the actual sort criterion is at least as big as the computed sort criterion, choose z_alpha/2 = 1.65.
In Python, the sorting criterion can be calculated with
import math

def starsort(ns):
    """
    http://www.evanmiller.org/ranking-items-with-star-ratings.html
    """
    N = sum(ns)
    K = len(ns)
    s = list(range(K, 0, -1))
    s2 = [sk**2 for sk in s]
    z = 1.65

    def f(s, ns):
        N = sum(ns)
        K = len(ns)
        return sum(sk * (nk + 1) for sk, nk in zip(s, ns)) / (N + K)

    fsns = f(s, ns)
    return fsns - z * math.sqrt((f(s2, ns) - fsns**2) / (N + K + 1))
For example, if an item has 60 five-stars, 80 four-stars, 75 three-stars, 20 two-stars and 25 one-stars, then its overall star rating would be about 3.4:
x = (60, 80, 75, 20, 25)
starsort(x)
# 3.3686975120774694
and you can sort a list of 5-star ratings with
sorted([(60, 80, 75, 20, 25), (10,0,0,0,0), (5,0,0,0,0)], key=starsort, reverse=True)
# [(10, 0, 0, 0, 0), (60, 80, 75, 20, 25), (5, 0, 0, 0, 0)]
This shows the effect that more ratings can have upon the overall star value.
You'll find that this formula tends to give an overall rating which is a bit
lower than the overall rating reported by sites such as Amazon, eBay or Walmart,
particularly when there are few votes (say, fewer than 300). This reflects the
higher uncertainty that comes with fewer votes. As the number of votes increases
(into the thousands), all these overall rating formulas should tend to the
(weighted) average rating.
Since the formula only depends on the frequency distribution of 5-star ratings
for the item itself, it is easy to combine reviews from multiple sources (or,
update the overall rating in light of new votes) by simply adding the frequency
distributions together.
Unlike the IMDb formula, this formula does not depend on the average score
across all items, nor an artificial minimum number of votes cutoff value.
Moreover, this formula makes use of the full frequency distribution -- not just
the average number of stars and the number of votes. And it makes sense that it
should, since an item with ten 5-star and ten 1-star ratings should be treated as
having more uncertainty than (and therefore not be rated as highly as) an item with
twenty 3-star ratings:
In [78]: starsort((10,0,0,0,10))
Out[78]: 2.386028063783418
In [79]: starsort((0,0,20,0,0))
Out[79]: 2.795342687927806
The IMDb formula does not take this into account.
See this page for a good analysis of star-based rating systems, and this one for a good analysis of upvote-/downvote- based systems.
For up and down voting you want to estimate the probability that, given the ratings you have, the "real" score (if you had infinite ratings) is greater than some quantity (like, say, the similar number for some other item you're sorting against).
See the second article for the answer, but the conclusion is that you want to use the lower bound of the Wilson score confidence interval. The article gives the equation and sample Ruby code (easily translated to another language).
Well, depending on how complex you want to make it, you could have ratings additionally be weighted based on how many ratings the person has made, and what those ratings are. If the person has only made one rating, it could be a shill rating, and might count for less. Or if the person has rated many things in category a, but few in category b, and has an average rating of 1.3 out of 5 stars, it sounds like category a may be artificially weighed down by the low average score of this user, and should be adjusted.
But enough of making it complex. Let’s make it simple.
Assuming we're working with just two values, ReviewCount and AverageRating, for a particular item, it would make sense to me to look at ReviewCount as essentially being the "reliability" value. But we don't just want to bring scores down for low-ReviewCount items: a single one-star rating is probably as unreliable as a single 5-star rating. So what we want to do is probably average towards the middle: 3.
So, basically, I’m thinking of an equation something like X * AverageRating + Y * 3 = the-rating-we-want. In order to make this value come out right we need X+Y to equal 1. Also we need X to increase in value as ReviewCount increases...with a review count of 0, x should be 0 (giving us an equation of “3”), and with an infinite review count X should be 1 (which makes the equation = AverageRating).
So what are the X and Y equations? For the X equation, we want the dependent variable to asymptotically approach 1 as the independent variable approaches infinity. A good pair of equations is something like:
Y = 1/(factor^RatingCount)
and (utilizing the fact that X must be equal to 1-Y)
X = 1 – (1/(factor^RatingCount))
Then we can adjust "factor" to fit the range that we're looking for.
I used this simple C# program to try a few factors:
// We can adjust this factor to adjust our curve.
double factor = 1.5;
// Here's some sample data
double RatingAverage1 = 5;
double RatingCount1 = 1;
double RatingAverage2 = 4.5;
double RatingCount2 = 5;
double RatingAverage3 = 3.5;
double RatingCount3 = 50000; // 50000 is not infinite, but it's probably plenty to closely simulate it.
// Do the calculations
double modfactor = Math.Pow(factor, RatingCount1);
double modRating1 = (3 / modfactor)
+ (RatingAverage1 * (1 - 1 / modfactor));
double modfactor2 = Math.Pow(factor, RatingCount2);
double modRating2 = (3 / modfactor2)
+ (RatingAverage2 * (1 - 1 / modfactor2));
double modfactor3 = Math.Pow(factor, RatingCount3);
double modRating3 = (3 / modfactor3)
+ (RatingAverage3 * (1 - 1 / modfactor3));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage1, RatingCount1, modRating1));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage2, RatingCount2, modRating2));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage3, RatingCount3, modRating3));
// Hold up for the user to read the data.
Console.ReadLine();
To save you the bother of copying it in, it gives this output:
RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50
Something like that? You could obviously adjust the "factor" value as needed to get the kind of weighting you want.
You could sort by median instead of arithmetic mean. In this case both examples have a median of 5, so both would have the same weight in a sorting algorithm.
You could use the mode to the same effect, but the median is probably a better idea.
If you want to assign additional weight to the product with 100 5-star ratings, you'll probably want to go with some kind of weighted mode, assigning more weight to ratings with the same median, but with more overall votes.
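As a rough sketch of that weighting idea in Python (the key function below is illustrative, not a standard recipe): sort by median first, then use the number of ratings as the tie-breaker.

```python
from statistics import median

def median_then_count(ratings):
    """Sort key: median star rating, with the number of ratings breaking ties."""
    return (median(ratings), len(ratings))

products = [
    [5, 5, 5],              # three 5-star ratings
    [5, 5, 5, 5, 5, 2, 2],  # five 5-star and two 2-star ratings
]
# Both medians are 5, so the product with more ratings sorts first.
sorted(products, key=median_then_count, reverse=True)
```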
If you just need a fast and cheap solution that will mostly work without a lot of computation, here's one option (assuming a 1-5 rating scale):
SELECT Products.id, Products.title, avg(Ratings.score), etc
FROM
Products INNER JOIN Ratings ON Products.id=Ratings.product_id
GROUP BY
Products.id, Products.title
ORDER BY (SUM(Ratings.score)+25.0)/(COUNT(Ratings.id)+20.0) DESC, COUNT(Ratings.id) DESC
By adding 25 to the sum and dividing by the total ratings + 20, you're effectively adding 20 phantom ratings with an average score of 1.25 to each product before sorting.
This does have known issues. For example, it unfairly rewards low-scoring products with few ratings (as this graph demonstrates, products with an average score of 1 and just one rating score about 1.24, while products with an average score of 1 and 1000+ ratings score closer to 1.0). You could also argue it unfairly punishes high-quality products with few ratings.
This chart shows what happens for all 5 ratings over 1-1000 ratings:
http://www.wolframalpha.com/input/?i=Plot3D%5B%2825%2Bxy%29/%2820%2Bx%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
You can see the dip upwards at the very bottom ratings, but overall it's a fair ranking, I think. You can also look at it this way:
http://www.wolframalpha.com/input/?i=Plot3D%5B6-%28%2825%2Bxy%29/%2820%2Bx%29%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
If you drop a marble on most places in this graph, it will automatically roll towards products with both higher scores and higher ratings.
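The ORDER BY expression is easy to sanity-check outside the database; here is a throwaway Python check (not part of the original SQL answer):

```python
def smoothed(total_score, n_ratings):
    """The SQL sort expression: (SUM(score) + 25.0) / (COUNT(*) + 20.0)."""
    return (total_score + 25.0) / (n_ratings + 20.0)

smoothed(1, 1)        # one 1-star rating     -> ~1.24
smoothed(1000, 1000)  # 1000 1-star ratings   -> ~1.005
smoothed(5, 1)        # one 5-star rating     -> ~1.43
smoothed(5000, 1000)  # 1000 5-star ratings   -> ~4.93
```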
Obviously, the low number of ratings puts this problem at a statistical handicap. Nevertheless...
A key element in improving the quality of an aggregate rating is to "rate the rater", i.e. to keep tabs on the ratings each particular "rater" has supplied (relative to others). This allows their votes to be weighted during the aggregation process.
Another solution, more of a cop-out, is to show end-users the count of votes (or a range indication thereof) for the underlying item.
One option is something like Microsoft's TrueSkill system, where the score is given by mean - 3*stddev, where the constants can be tweaked.
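A quick sketch of that scoring rule in Python (the helper below is illustrative; TrueSkill itself maintains a full Gaussian posterior per player rather than raw sample statistics):

```python
from statistics import mean, pstdev

def conservative_score(ratings, k=3):
    """Mean minus k standard deviations; k is the tweakable constant."""
    return mean(ratings) - k * pstdev(ratings)

# An item with ten 5-stars and ten 1-stars scores far below an item
# with twenty 3-stars, even though both have a mean of 3.
conservative_score([5] * 10 + [1] * 10)  # 3 - 3*2 = -3.0
conservative_score([3] * 20)             # 3 - 3*0 =  3.0
```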
After looking around for a while, I chose the Bayesian system.
If someone is using Ruby, here is a gem for it:
https://github.com/wbotelhos/rating
I'd highly recommend the book Programming Collective Intelligence by Toby Segaran (O'Reilly, ISBN 978-0-596-52932-1), which discusses how to extract meaningful data from crowd behaviour. The examples are in Python, but it's easy enough to convert them.

Algorithm to distribute points between weighted items?

I have 100 points to dole out to a set of items. Each item must receive a proportional number of points relative to others (a weight). Some items may have a weight of 0, but they must receive some points.
I tried to do it by giving each item 5 points, then doling out the remaining points proportionally, but when I have 21 items, the algorithm assigns 0 to one item. Of course, I could hand out 1 point initially, but then the problem remains for 101 items or more. Normally, this algorithm should deal with less than 20 items, but I want the algorithm to be robust in the face of more items.
I know using floats/fractions would be perfect, but the underlying system must receive integers, and the total must be 100.
This is framework / language agnostic, although I will implement in Ruby.
Currently, I have this (pseudo-code):
total_weight = sum_of(items.weight)
if total_weight == 0 then
# Distribute all points equally between each item
items.points = 100 / number_of_items
# Apply remaining points in case of rounding errors (100 / 3 == [33, 33, 34])
else
items.points = 5
points_to_dole_out = 100 - number_of_items * 5
for(item in items)
item.points += item.weight * total_weight / 100
end
end
First, give every item one point. This is to meet your requirement that all items get points. Then get the % of the total weight that each item has, and award it that % of the remaining points (round down).
There will be some portion of points left over. Sort the set of items by the size of their decimal parts, and then dole out the remaining points one at a time in order from biggest decimal part to smallest.
So if an item has a weight of twelve and the total weight of all items is 115, you would first award it 1 point. If there were 4 other items, there would be 95 points left after doling out the minimum scores. The item's share of the total weight is 12/115, about 10.43%, and 10.43% of 95 is about 9.91, so you would award it 9 more points (10 in total). Then you would sort on the fractional part, 0.91, and since that is near the top it might end up getting bumped to 11.
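The steps above can be sketched in Python (the question mentions Ruby, but the logic translates directly); `distribute` is an illustrative name and this is one possible implementation, not a canonical one:

```python
def distribute(total, weights, minimum=1):
    """Give every item `minimum` points, split the rest proportionally by
    weight, then hand leftover points to the largest fractional parts."""
    n = len(weights)
    if total < n * minimum:
        raise ValueError("not enough points to give every item the minimum")
    points = [minimum] * n
    remaining = total - n * minimum
    total_weight = sum(weights)
    # Real-valued proportional shares (equal shares if all weights are 0).
    if total_weight == 0:
        shares = [remaining / n] * n
    else:
        shares = [remaining * w / total_weight for w in weights]
    points = [p + int(s) for p, s in zip(points, shares)]
    # Dole out the rounding leftovers, biggest fractional part first.
    leftover = total - sum(points)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        points[i] += 1
    return points

distribute(100, [12, 30, 40, 20, 13])  # sums to 100; every item gets >= 1
```

If there are more items than points, no assignment satisfies both constraints (total == 100, each item > 0), so the function raises rather than silently handing out zeros.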
The 101-item case cannot be solved given the two constraints { total == 100 } and { each item > 0 }, so in order to be robust you must decide what to do about it. That's a business problem, not a technical one.
The minimum-score case is actually a business problem too. Clearly the meaning of your results with a minimum of 5 points per item is quite different from a minimum of 1: the gaps between the low-weight items are potentially compressed. Hence you really should get clarity from the users of the system rather than just pick a number: how will they use this data?

Resources