I have a fixed dollar amount each week, and I wish to maximize what I can purchase with it for a given list of projects.
For example, $60. I cannot spend more than $60 a week, but I can carry over any leftover amount. So if I spend $58, I get to spend $62 the next week. If I purchase a product in week 1, I can use the leftover amount in week 2, thereby not needing to re-purchase the same item.
The solution needs to be generic so that I can maximize between a lot of products and a lot of projects given that fixed dollar amount per week.
I need to generate a report with the list of products to purchase and the list of projects to do for that week. The prices get updated weekly, so I will need to recalculate the maximum spend weekly (meaning forecasting is not really part of the solution), and I need to reuse the amounts from the products/inventory already purchased.
I have all the data and there aren't any unknown variables. I just need to be able to figure out how to maximize what to purchase given parts and wholes under a fixed dollar amount with history.
To make it more concrete through example (although abstracted from my real data):
I have a database of products (12.5K different ones) and their corresponding prices. I also have a fixed list of projects (let's say 2,500) I wish to do with those products. For each project I have the corresponding products needed. Each project takes a different amount of product, and projects can have overlapping or unique products.
So for example:
project 1 is building my model airplanes;
project 2 is fixing my picture frames;
project 3 is building a bird house;
etc.
Project 1 may need:
glue (1oz)
paint (3 qt)
balsam wood (2 lbs)
Project 2 may need:
glue (2 oz)
nails (10 count)
project 3 may need:
glue (10 oz)
paint (5 qts)
nails (40 count)
wood balsam (3 lbs)
wood pine (50 lbs)
Products:
Glue 4oz - $10
Paint 3qts - $30
Nails 15 count - $7
Wood Balsam 8 pounds - $12
Wood Pine 12 pounds - $8
For example, if I buy a bottle of glue (4 oz) at $10 I can use it for my airplanes and my picture frames but not my bird house. I need to do an exhaustive analysis of all products and all projects weekly given my dollar amount to spend since the prices change (sales/demand/etc.).
How can I best spend the $60 to do as many projects as possible in a given week? In week 2 I get a new $60 to spend, most likely have leftover money, and have product (such as glue) left over from the week before.
Is there any Python code, or are there projects that do something similar or exactly this already, that I may be able to import and modify for my needs?
Any help in terms of algorithms, sample code, full solution!?!, ideas, etc. would be appreciated...
Thanks in advance!!! (FYI: this is for a personal project.)
This is a problem which is very well suited to be tackled with mathematical programming. With mathematical optimization you can optimize variables (e.g. a variable saying whether a project is conducted at some point) against an objective such as the number of projects conducted, while also respecting a set of constraints. There are several free Python libraries for solving mathematical programs; I will show how to get started with your problem using PuLP. Please note that free software for these kinds of problems usually performs far worse than commercial software, which can be very expensive. For small or easy problems, though, the free software suffices.
To get started:
pip install pulp
Now, import pulp and, as a little helper, itertools.product. There are many ways to represent your data; I choose to declare some ranges which serve as index sets. So r = 0 would be glue and p = 0 building a model airplane. You have to choose the number of time periods yourself; with 4 time periods, all projects can eventually be conducted.
from pulp import *
from itertools import product
R = range(5) # Resources
P = range(3) # Projects
T = range(4) # Periods
Your parameters could be represented as follows. project_consumption[0, 0] expresses that project 0 needs 1/4 of a unit of material 0 (one ounce of a 4 oz glue bottle) to be conducted.
resource_prices = (10, 30, 7, 12, 8) # Price per unit of resource
# Needed percentage of resource r for project p
project_consumption = {(0, 0): 1/4,  (0, 1): 3/3, (0, 2): 0/15,  (0, 3): 2/8, (0, 4): 0/12,
                       (1, 0): 2/4,  (1, 1): 0/3, (1, 2): 10/15, (1, 3): 0/8, (1, 4): 0/12,
                       (2, 0): 10/4, (2, 1): 5/3, (2, 2): 40/15, (2, 3): 3/8, (2, 4): 50/12}
budget = 60
Next, we declare our problem formulation. We want to maximize the number of projects, so we declare the sense LpMaximize. The decision variables are declared next:
planned_project[p, t]: 1 if project p is conducted in period t, else 0
stocked_material[r, t]: Amount of material r which is on stock in t
consumption_material[r, t]: Amount of r that is consumed in period t
purchase_material[r, t]: Amount of r purchased in t
budget[t]: Money balance in t
Declare our problem:
m = LpProblem("Project planning", LpMaximize)
planned_project = LpVariable.dicts('pp', product(P, T), lowBound = 0, upBound = 1, cat = LpInteger)
stocked_material = LpVariable.dicts('ms', product(R, T), lowBound = 0)
consumption_material = LpVariable.dicts('cm', product(R, T), lowBound = 0)
purchase_material = LpVariable.dicts('pm', product(R, T), lowBound = 0, cat = LpInteger)
budget = LpVariable.dicts('b', T, lowBound = 0)
Our objective is added to the problem as follows. I multiply every variable by (len(T) - t), which means a project is worth more when conducted early rather than late.
m += lpSum((len(T) - t) * planned_project[p, t] for p in P for t in T)
Now we can restrict the values of our decision variables by adding the necessary constraints. The first constraint restricts our material stock to the difference of purchased and consumed materials.
for r in R:
    for t in T:
        if t != 0:
            m += stocked_material[r, t] == stocked_material[r, t-1] + purchase_material[r, t] - consumption_material[r, t]
        else:
            m += stocked_material[r, t] == 0 + purchase_material[r, 0] - consumption_material[r, 0]
The second constraint makes sure that the correct amount of materials is consumed for the projects conducted in each period.
for r in R:
    for t in T:
        m += lpSum([project_consumption[p, r] * planned_project[p, t] for p in P]) <= consumption_material[r, t]
The third constraint ensures that we do not spend more than our budget, additionally the leftover amount can be used in future periods.
for t in T:
    if t > 0:
        m += budget[t] == budget[t-1] + 60 - lpSum([resource_prices[r] * purchase_material[r, t] for r in R])
    else:
        m += budget[0] == 60 - lpSum([resource_prices[r] * purchase_material[r, 0] for r in R])
Finally, each project shall only be carried out once.
for p in P:
    m += lpSum([planned_project[p, t] for t in T]) <= 1
We can optimize our problem by calling:
m.solve()
After optimization we can access each optimal decision variable value with its .value() method. To print some useful information about our optimal plan of action:
for (p, t), var in planned_project.items():
    if var.value() == 1:
        print("Project {} is conducted in period {}".format(p, t))
for t, var in budget.items():
    print("At time {} we have a balance of {} $".format(t, var.value()))
for (r, t), var in purchase_material.items():
    if var.value() > 0:
        print("At time {}, we purchase {} of material {}.".format(t, var.value(), r))
Output:
Project 0 is conducted in period 0
Project 2 is conducted in period 3
Project 1 is conducted in period 0
At time 0 we have a balance of 1.0 $
At time 1 we have a balance of 1.0 $
At time 2 we have a balance of 61.0 $
At time 3 we have a balance of 0.0 $
At time 0, we purchase 1.0 of material 0.
At time 3, we purchase 1.0 of material 3.
At time 0, we purchase 1.0 of material 3.
At time 1, we purchase 2.0 of material 1.
At time 0, we purchase 1.0 of material 2.
At time 3, we purchase 3.0 of material 2.
At time 3, we purchase 6.0 of material 4.
At time 0, we purchase 1.0 of material 1.
At time 3, we purchase 4.0 of material 0.
Note that in the solution we purchase 6 units of material 4 (6 × 12 lbs of wood pine) at time 3. We never actually use that much, but the solution is still considered optimal: the budget does not appear in the objective, so buying more or less does not change the number of projects we can do. There are therefore multiple optimal solutions. To break such ties, you could treat it as a multi-criteria optimization problem, weighting the project count with a large (Big-M) constant while also rewarding leftover budget in the objective.
I hope this gets you started for your problem. You can find countless resources and examples for mathematical programming on the internet.
Something like this could work: create a list of materials for each project, a list of projects, and a dictionary of prices. Each call of compareAll() then finds the cheapest project in the list. You could also add a loop which removes the cheapest project from the list and adds it to a to-do list on each run, so that the next run finds the next cheapest.
p1 = ["glue","wood","nails"]
p2 = ["screws","wood"]
p3 = ["screws","wood","glue","nails"]
projects = [p1,p2,p3]
prices = {"glue":1,"wood":4,"nails":2,"screws":1}
def check(project, prices):
    # Sum the prices of all products the project needs
    total = 0
    for item in project:
        if item in prices:
            total += prices[item]
    print("Total: " + str(total))
    return total

def compareAll(projectList):
    best = 100  # or some other number which exceeds your budget
    cheapest = None
    for proj in projectList:
        cost = check(proj, prices)
        if cost < best:
            best = cost
            cheapest = proj
    return cheapest, best
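The remove-the-cheapest loop described above could be sketched like this (a standalone sketch that repeats the sample data; the $10 budget is just for illustration):

```python
p1 = ["glue", "wood", "nails"]
p2 = ["screws", "wood"]
p3 = ["screws", "wood", "glue", "nails"]
projects = [p1, p2, p3]
prices = {"glue": 1, "wood": 4, "nails": 2, "screws": 1}

def project_cost(project, prices):
    # Total price of all products the project needs
    return sum(prices.get(item, 0) for item in project)

def plan(projects, prices, budget):
    # Repeatedly move the cheapest still-affordable project onto the to-do list
    todo = []
    remaining = list(projects)
    while remaining:
        cheapest = min(remaining, key=lambda p: project_cost(p, prices))
        cost = project_cost(cheapest, prices)
        if cost > budget:
            break  # even the cheapest remaining project is unaffordable
        budget -= cost
        remaining.remove(cheapest)
        todo.append(cheapest)
    return todo

print(plan(projects, prices, 10))  # [['screws', 'wood']]: only p2 fits a $10 budget
```

Note this greedy order is not guaranteed to do the most projects for a given budget; it just mirrors the iteration described above.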
Currently I am working on an optimization problem in Python. I hope this is the right place to post, because I think the question has more to do with the math than with the actual coding.
The context:
There is a whole group of students. The students have different characteristics.
Social skills
Motivation
Discipline
Intelligence
In addition, each student gets a relative score on the variables:
alpha
beta
gamma
Lastly, each student needs support from a teacher, but some more than the others. Therefore, a cost parameter is introduced. The 'worth' of a student is determined as follows:
$potential = \frac{(0.2 * socialskills + 0.3 * motivation + 0.4 * discipline + 0.1 * intelligence)}{costs}$
The goal is to find a group of the best $n$ students, but with the characteristics alpha, beta and gamma evenly distributed. This is a multi-objective optimization problem.
Important to note: on a personal level, I have a background in engineering and have had no formal training in mathematical optimization. If there happens to be a mathematically more sound way, I am happy to learn.
The following blocks are just there to generate the data. In the first block I create the Student class.
class Student:
    def __init__(self, alpha, beta, gamma, social_skills, motivation, discipline, intelligence, costs):
        potential = 0.2 * social_skills + 0.3 * motivation + 0.4 * discipline + 0.1 * intelligence
        self.alpha_relative = alpha / (alpha + beta + gamma)
        self.beta_relative = beta / (alpha + beta + gamma)
        self.gamma_relative = gamma / (alpha + beta + gamma)
        self.ratio = potential / costs
This is also part of the data generation. Here I create a list of student objects.
def create_student_list(num_students):
    list_students = []
    for i in range(num_students):
        # define the student characteristics
        alpha = r.randint(1, 10)
        beta = r.randint(1, 10)
        gamma = r.randint(1, 10)
        social_skills = r.randint(1, 10)
        motivation = r.randint(1, 10)
        discipline = r.randint(1, 10)
        intelligence = r.randint(1, 10)
        costs = r.randint(1, 2)
        # create a student object
        student = Student(alpha, beta, gamma, social_skills, motivation, discipline, intelligence, costs)
        # append the object to the list
        list_students.append(student)
    return list_students
Now the actual work begins. I take a random sample of students, which forms a group. I add up their 'worth' in the variable ratio_sum.
def create_group(students, group_size):
    # create an array of random integers to pick students randomly
    num_students = len(students)
    random_pick = r.sample(range(0, num_students), group_size)
    # initiate variables
    ratio_sum = 0
    alpha_sum = 0
    beta_sum = 0
    gamma_sum = 0
    # calculate the characteristics of the group
    for j in random_pick:
        ratio_sum += students[j].ratio
        alpha_sum += students[j].alpha_relative
        beta_sum += students[j].beta_relative
        gamma_sum += students[j].gamma_relative
    var = np.var([alpha_sum, beta_sum, gamma_sum])
    return random_pick, ratio_sum, var
I also compute the alpha, beta, gamma score of the group. Because the goal is to have a diverse group, I calculate the variance of the array:
[alpha_sum, beta_sum, gamma_sum]
In the last function the actual simulation is performed. For testing purposes I did the simulation only 10 times and formed groups of 3 students.
def simulate(num_simulations, students, group_size):
    simulation = []
    for i in range(num_simulations):
        simulation.append(create_group(students, group_size))
    return simulation
In the output, the first index of each line is the sample index. The second index is the total 'worth'. The last index is the variance.
[([4, 7, 2], 16.799999999999997, 0.06279772521858522),
([5, 3, 0], 11.0, 0.14400476660092046),
([9, 6, 4], 14.2, 0.04469349176343509),
([6, 0, 4], 13.05, 4.139179301210081e-05),
([8, 0, 2], 14.85, 0.015072541362223899),
([3, 1, 8], 15.8, 0.15864323507180644),
([4, 8, 3], 15.600000000000001, 0.061507936507936546),
([4, 3, 5], 14.250000000000002, 0.08901065357109311),
([5, 9, 4], 13.649999999999999, 0.03338968723584108),
([1, 3, 0], 15.55, 0.053101851851851845)]
One of these lines is the optimal solution. But here I start to reach the limits of my knowledge.
First question: is it okay to use the variance as a measure of how equal the groups are in terms of alpha, beta and gamma?
Second question: I want to maximize the worth/cost ratio and minimize the variance. What is the best approach? I find it rather difficult because the variance is of a different order of magnitude than the ratio. So in the (to be determined) objective function, how do I pick the constants such that both criteria are equally important, or such that one is $x$ times more important than the other?
A whole story, but I hope somebody can advise me on the best way to proceed.
Thank you,
Tim
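Regarding the second question, one common first step is weighted-sum scalarization after normalizing both criteria to [0, 1]. The sketch below illustrates the idea; the weight w and the min-max normalization are assumptions for illustration, not from the original post:

```python
def scalarize(groups, w=0.5):
    # groups: (members, ratio_sum, variance) tuples as produced by simulate()
    ratios = [g[1] for g in groups]
    variances = [g[2] for g in groups]
    r_lo, r_hi = min(ratios), max(ratios)
    v_lo, v_hi = min(variances), max(variances)

    def norm(x, lo, hi):
        # Rescale to [0, 1] so both criteria are comparable
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    def score(g):
        # Reward normalized worth, penalize normalized variance
        return w * norm(g[1], r_lo, r_hi) - (1 - w) * norm(g[2], v_lo, v_hi)

    return max(groups, key=score)
```

Setting w closer to 1 makes worth dominate; closer to 0 makes evenness dominate. Normalizing first answers the scale mismatch: after min-max scaling, a weight of x makes one criterion x times as important as the other within the sampled range.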
Let's say I have different bundles of products, each associated with a price.
Name Price Products
Fruit Overdose 5$ 2 Apples, 1 Orange, 1 Banana
Doctors Darling 1$ 1 Apple
The Exotic 3.50$ 2 Oranges, 1 Banana
Vitamin C 1.5$ 1 Orange
And I have a shopping list, e.g. I want to buy:
4 Apples, 1 Orange, 2 Bananas.
The question
How would I go about finding the cheapest combination of bundles to buy for the given shopping list? Buying more than requested by the shopping list is valid.
I just need a language-agnostic hint for how I could approach this problem most efficiently.
My real world problem is a little bit more complex also including a list of products I already own; but this shouldn't really matter too much.
This is a dual linear programming problem:
The objective you want to minimise is the total price, which is a linear function.
The constraints can be formulated as a matrix equation, where each row corresponds with a bundle, and each column corresponds with a kind of product.
You can solve it by constructing the dual, which will be a standard linear programming problem, applying a standard algorithm such as the simplex algorithm, and then converting the solution back into a solution to the original problem.
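For a shopping list this small, you can also sanity-check any solver's answer by brute force. This sketch enumerates bundle counts directly; the cap of four of each bundle is an assumption that comfortably covers this example:

```python
from itertools import product

# Bundle data from the question: (price, contents)
bundles = [(5.0, {'apple': 2, 'orange': 1, 'banana': 1}),  # Fruit Overdose
           (1.0, {'apple': 1}),                            # Doctors Darling
           (3.5, {'orange': 2, 'banana': 1}),              # The Exotic
           (1.5, {'orange': 1})]                           # Vitamin C
desired = {'apple': 4, 'orange': 1, 'banana': 2}

best_price, best_counts = None, None
for counts in product(range(5), repeat=len(bundles)):  # try 0..4 of each bundle
    bought = {f: sum(c * contents.get(f, 0) for c, (_, contents) in zip(counts, bundles))
              for f in desired}
    if all(bought[f] >= desired[f] for f in desired):
        price = sum(c * p for c, (p, _) in zip(counts, bundles))
        if best_price is None or price < best_price:
            best_price, best_counts = price, counts

print(best_price, best_counts)  # 10.0 (2, 0, 0, 0): two Fruit Overdose bundles
```

This explodes combinatorially as the number of bundles grows, which is exactly why an LP/ILP formulation is the right tool for the general case.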
This kind of problem can be solved handily with a SAT/SMT solver. Z3 is an open source solver with bindings for many languages, including a nicely integrated binding with Python.
In the solver, you just declare a few variables (4 for the weights of each bundle, and 1 for the total price). Then you write down the different constraints. In this case, the constraints are:
The weights should be positive.
The total is calculated by summing the price times the weight of each bundle.
For each of the fruits, the desired minimum number should be bought.
Note that I calculated the price in cents to be able to work with integers, although that is not strictly necessary for Z3.
The Python code then looks like:
from z3 import *
bundles = [["Fruit Overdose", 5, {'apple': 2, 'orange': 1, 'banana': 1}],
           ["Doctors Darling", 1, {'apple': 1}],
           ["The Exotic", 3.50, {'orange': 2, 'banana': 1}],
           ["Vitamin C", 1.5, {'orange': 1}]]
desired = {'apple': 4, 'orange': 1, 'banana': 2}
num_bundles = len(bundles)
W = [Int(f'W_{i}') for i in range(len(bundles))] # weight of each bundle: how many to buy of each bundle
TotalPrice = Int('Total')
s = Optimize()
s.add(TotalPrice == Sum([W[i] * int(b[1] * 100) for i, b in enumerate(bundles)]))
s.add([W[i] >= 0 for i in range(len(W))]) # weights can not be negative
for f in desired:
    s.add(Sum([W[i] * b[2][f] for i, b in enumerate(bundles) if f in b[2]]) >= desired[f])
h1 = s.minimize(TotalPrice)
result = s.check()
print("optimizer result:", result)
if result == sat:
    s.lower(h1)
    m = s.model()
    print(f"The lowest price is: {m[TotalPrice].as_long()/100:.2f}")
    for i, b in enumerate(bundles):
        print(f" Buying {m[W[i]].as_long()} of {b[0]}")
Output:
The lowest price is: 10.00
Buying 2 of Fruit Overdose
Buying 0 of Doctors Darling
Buying 0 of The Exotic
Buying 0 of Vitamin C
If you simply change the price of the Fruit Overdose to 6, the result would be:
Buying 0 of Fruit Overdose
Buying 4 of Doctors Darling
Buying 2 of The Exotic
Buying 0 of Vitamin C
The algorithm guarantees to find the best solution. In case there are multiple equally good solutions, just one of them is returned.
I am trying to find a solution in which a given resource (eg. budget) will be best distributed to different options which yields different results on the resource provided.
Let's say I have N = 1200 and some functions. (a, b, c, d are some unknown variables)
f1(x) = a * x
f2(x) = b * x^c
f3(x) = a*x + b*x^2 + c*x^3
f4(x) = d^x
f5(x) = log x^d
...
And let's say there are n of these functions, each yielding a different result based on its input x, where x = 0 or x >= m, with m a constant.
Although I am not able to find an exact formula for the given functions, I am able to evaluate their output. This means that I can compute:
X = f1(N1) + f2(N2) + f3(N3) + ... + fn(Nn), where (N1 + ... + Nn) = N, for as many ways as there are of distributing N into n numbers, and find the specific case where X is the greatest.
How would I actually go about finding the best distribution of N with the least computation power, using whatever libraries currently available?
If you are happy with allocations constrained to be whole numbers then there is a dynamic programming solution of cost O(nN²) (one table of N+1 entries per function, each entry considering up to N+1 splits) - so you can increase accuracy by scaling if you want, but this will increase cpu time.
For each i=1 to n maintain an array where element j gives the maximum yield using only the first i functions giving them a total allowance of j.
For i=1 this is simply the result of f1().
For i=k+1, when working out the result for j, consider each possible way of splitting the j units between f_{k+1}() and the first k functions; the table created for i=k tells you the best return from whatever remains, so you can calculate the table for i=k+1 from the table for i=k.
At the end you get the best possible return for n functions and N resources. It is easier to recover what that best allocation actually is if you keep all the tables, i.e. the best way to distribute k units among the first i functions, for all possible values of i and k. Then you can look up the best allocation for the last function (say f100()), subtract the amount allocated to it from N, look up the best allocation for f99() given the remaining resources, and carry on like this until you have worked out the best allocations for all f().
As an example suppose f1(x) = 2x, f2(x) = x^2 and f3(x) = 3 if x>0 and 0 otherwise. Suppose we have 3 units of resource.
The first table is just f1(x) which is 0, 2, 4, 6 for 0,1,2,3 units.
The second table is the best you can do using f1(x) and f2(x) for 0,1,2,3 units and is 0, 2, 4, 9, switching from f1 to f2 at x=2.
The third table is 0, 3, 5, 9. I can get 3 and 5 by using 1 unit for f3() and the rest for the best solution in the second table. 9 is simply the best solution in the second table - there is no better solution using 3 resources that gives any of them to f3().
So 9 is the best answer here. One way to work out how to get there is to keep the tables around and retrace that answer. The 9 comes from f3(0) + 9 from the second table, so all 3 units remain available to f2() and f1(). The second table's 9 comes from f2(3), so there are no units left for f1(), and we end up with f1(0) + f2(3) + f3(0).
When you are working out the resources to use at stage i=k+1, you have a table from i=k that tells you exactly the result to expect from whatever resources are left over after stage i=k+1. The overall distribution stays correct because at stage i=k you have already worked out the best result for every possible number of remaining resources.
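The tables described above can be sketched in Python. For the worked example (f1(x) = 2x, f2(x) = x², f3(x) = 3 for x > 0) with 3 units, it reproduces the best return of 9 via f2(3):

```python
def best_allocation(functions, N):
    # best[j] = maximum yield over the functions seen so far, given j units
    best = [0] * (N + 1)
    choices = []  # choices[i][j] = units given to function i in the optimum for j units
    for f in functions:
        new_best = [0] * (N + 1)
        choice = [0] * (N + 1)
        for j in range(N + 1):
            for k in range(j + 1):  # k units handed to the current function
                value = f(k) + best[j - k]
                if value > new_best[j]:
                    new_best[j], choice[j] = value, k
        best = new_best
        choices.append(choice)
    # Walk the tables backwards to recover the allocation itself
    alloc = [0] * len(functions)
    j = N
    for i in range(len(functions) - 1, -1, -1):
        alloc[i] = choices[i][j]
        j -= alloc[i]
    return best[N], alloc

fs = [lambda x: 2 * x, lambda x: x ** 2, lambda x: 3 if x > 0 else 0]
print(best_allocation(fs, 3))  # (9, [0, 3, 0]): all 3 units go to f2()
```

Each function's table is built only from the previous one, so memory stays at O(N) per table while the saved choice arrays make the backtracking step straightforward.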
I'm trying to sort a bunch of products by customer ratings using a 5-star system. The site I'm setting this up for does not have a lot of ratings yet and continues to add new products, so it will usually have a few products with a low number of ratings.
I tried using the average star rating, but that algorithm fails when there is a small number of ratings.
For example, a product with 3× 5-star ratings would rank higher than a product with 100× 5-star ratings and 2× 2-star ratings.
Shouldn't the second product rank higher because it is statistically more trustworthy, thanks to the larger number of ratings?
Prior to 2015, the Internet Movie Database (IMDb) publicly listed the formula used to rank their Top 250 movies list. To quote:
The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:
R = average for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250 (currently 25000)
C = the mean vote across the whole report (currently 7.0)
For the Top 250, only votes from regular voters are considered.
It's not so hard to understand. The formula is:
rating = (v / (v + m)) * R +
         (m / (v + m)) * C;
Which can be mathematically simplified to:
rating = (R * v + C * m) / (v + m);
The variables are:
R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
v – The number of votes for an item. (To give another example, if 5 people have cast votes on an item, v is 5.)
m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with fewer votes than m.
All the formula does is: add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulates, eventually the imaginary votes will be drowned out by real ones.
In this system, votes don't cause the rating to fluctuate wildly. Instead, they merely perturb it a bit in some direction.
When there are zero votes, only imaginary votes exist, and all of them are C. Thus, each item begins with a rating of C.
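Translated directly into Python (a sketch; R, v, C, m as defined above, with the C = 3.75 example database average):

```python
def bayesian_rating(R, v, C, m):
    # v real votes averaging R, plus m imaginary votes each worth C
    return (R * v + C * m) / (v + m)

print(bayesian_rating(0, 0, 3.75, 25))      # 3.75: with no votes, an item starts at C
print(bayesian_rating(5.0, 100, 3.75, 25))  # 4.75: real votes pull the rating away from C
```

Sorting a product list by this value instead of the raw average gives low-vote items a rating near C until they earn enough votes to stand on their own.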
See also:
A demo. Click "Solve".
Another explanation of IMDb's system.
An explanation of a similar Bayesian star-rating system.
Evan Miller shows a Bayesian approach to ranking 5-star ratings, using the sort criterion
sum_k(s_k * (n_k + 1)) / (N + K) - z_alpha/2 * sqrt((sum_k(s_k^2 * (n_k + 1)) / (N + K) - (sum_k(s_k * (n_k + 1)) / (N + K))^2) / (N + K + 1))
where
nk is the number of k-star ratings,
sk is the "worth" (in points) of k stars,
N is the total number of votes
K is the maximum number of stars (e.g. K=5, in a 5-star rating system)
z_alpha/2 is the 1 - alpha/2 quantile of a normal distribution. If you want 95% confidence (based on the Bayesian posterior distribution) that the actual sort criterion is at least as big as the computed sort criterion, choose z_alpha/2 = 1.65.
In Python, the sorting criterion can be calculated with
import math

def starsort(ns):
    """
    http://www.evanmiller.org/ranking-items-with-star-ratings.html
    """
    N = sum(ns)
    K = len(ns)
    s = list(range(K, 0, -1))
    s2 = [sk**2 for sk in s]
    z = 1.65
    def f(s, ns):
        N = sum(ns)
        K = len(ns)
        return sum(sk * (nk + 1) for sk, nk in zip(s, ns)) / (N + K)
    fsns = f(s, ns)
    return fsns - z * math.sqrt((f(s2, ns) - fsns**2) / (N + K + 1))
For example, if an item has 60 five-stars, 80 four-stars, 75 three-stars, 20 two-stars and 25 one-stars, then its overall star rating would be about 3.4:
x = (60, 80, 75, 20, 25)
starsort(x)
# 3.3686975120774694
and you can sort a list of 5-star ratings with
sorted([(60, 80, 75, 20, 25), (10,0,0,0,0), (5,0,0,0,0)], key=starsort, reverse=True)
# [(10, 0, 0, 0, 0), (60, 80, 75, 20, 25), (5, 0, 0, 0, 0)]
This shows the effect that more ratings can have upon the overall star value.
You'll find that this formula tends to give an overall rating which is a bit
lower than the overall rating reported by sites such as Amazon, Ebay or Wal-mart,
particularly when there are few votes (say, less than 300). This reflects the
higher uncertainty that comes with fewer votes. As the number of votes increases
(into the thousands), all these overall rating formulas tend to the
(weighted) average rating.
Since the formula only depends on the frequency distribution of 5-star ratings
for the item itself, it is easy to combine reviews from multiple sources (or,
update the overall rating in light of new votes) by simply adding the frequency
distributions together.
Unlike the IMDb formula, this formula does not depend on the average score
across all items, nor an artificial minimum number of votes cutoff value.
Moreover, this formula makes use of the full frequency distribution -- not just
the average number of stars and the number of votes. And it makes sense that it
should since an item with ten 5-stars and ten 1-stars should be treated as
having more uncertainty than (and therefore not rated as highly as) an item with
twenty 3-star ratings:
In [78]: starsort((10,0,0,0,10))
Out[78]: 2.386028063783418
In [79]: starsort((0,0,20,0,0))
Out[79]: 2.795342687927806
The IMDb formula does not take this into account.
See this page for a good analysis of star-based rating systems, and this one for a good analysis of upvote-/downvote- based systems.
For up and down voting you want to estimate the probability that, given the ratings you have, the "real" score (if you had infinite ratings) is greater than some quantity (like, say, the similar number for some other item you're sorting against).
See the second article for the answer, but the conclusion is you want to use the Wilson confidence. The article gives the equation and sample Ruby code (easily translated to another language).
Well, depending on how complex you want to make it, you could have ratings additionally be weighted based on how many ratings the person has made, and what those ratings are. If the person has only made one rating, it could be a shill rating, and might count for less. Or if the person has rated many things in category a, but few in category b, and has an average rating of 1.3 out of 5 stars, it sounds like category a may be artificially weighed down by the low average score of this user, and should be adjusted.
But enough of making it complex. Let’s make it simple.
Assuming we’re working with just two values, ReviewCount and AverageRating, for a particular item, it would make sense to me to look at ReviewCount as essentially being the “reliability” value. But we don’t just want to bring scores down for low-ReviewCount items: a single one-star rating is probably as unreliable as a single 5-star rating. So what we want to do is probably average towards the middle: 3.
So, basically, I’m thinking of an equation something like X * AverageRating + Y * 3 = the-rating-we-want. In order to make this value come out right we need X+Y to equal 1. Also we need X to increase in value as ReviewCount increases...with a review count of 0, x should be 0 (giving us an equation of “3”), and with an infinite review count X should be 1 (which makes the equation = AverageRating).
So what are the X and Y equations? For the X equation, we want the dependent variable to asymptotically approach 1 as the independent variable approaches infinity. A good pair of equations is something like:
Y = 1/(factor^RatingCount)
and (utilizing the fact that X must be equal to 1-Y)
X = 1 – (1/(factor^RatingCount))
Then we can adjust "factor" to fit the range that we're looking for.
I used this simple C# program to try a few factors:
// We can adjust this factor to adjust our curve.
double factor = 1.5;
// Here's some sample data
double RatingAverage1 = 5;
double RatingCount1 = 1;
double RatingAverage2 = 4.5;
double RatingCount2 = 5;
double RatingAverage3 = 3.5;
double RatingCount3 = 50000; // 50000 is not infinite, but it's probably plenty to closely simulate it.
// Do the calculations
double modfactor = Math.Pow(factor, RatingCount1);
double modRating1 = (3 / modfactor)
+ (RatingAverage1 * (1 - 1 / modfactor));
double modfactor2 = Math.Pow(factor, RatingCount2);
double modRating2 = (3 / modfactor2)
+ (RatingAverage2 * (1 - 1 / modfactor2));
double modfactor3 = Math.Pow(factor, RatingCount3);
double modRating3 = (3 / modfactor3)
+ (RatingAverage3 * (1 - 1 / modfactor3));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage1, RatingCount1, modRating1));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage2, RatingCount2, modRating2));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage3, RatingCount3, modRating3));
// Hold up for the user to read the data.
Console.ReadLine();
So that you don’t have to bother copying it in and running it, it gives this output:
RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50
Something like that? You could obviously adjust the "factor" value as needed to get the kind of weighting you want.
You could sort by median instead of arithmetic mean. In this case both examples have a median of 5, so both would have the same weight in a sorting algorithm.
You could use the mode to the same effect, but the median is probably a better idea.
If you want to assign additional weight to the product with 100 5-star ratings, you'll probably want to go with some kind of weighted mode, assigning more weight to ratings with the same median, but with more overall votes.
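A median-based sort is nearly a one-liner with the statistics module. This sketch also implements the weighting idea above as a tie-break on vote count (the sample data mirrors the two products from the question):

```python
from statistics import median

# Each product maps to the list of its individual star ratings
products = {
    "A": [5, 5, 5],           # three 5-star ratings
    "B": [5] * 100 + [2, 2],  # one hundred 5-stars plus two 2-stars
}

# Sort by median rating, breaking ties by number of ratings (both descending)
ranked = sorted(products, key=lambda p: (median(products[p]), len(products[p])), reverse=True)
print(ranked)  # ['B', 'A']: the medians tie at 5, so the vote count decides
```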
If you just need a fast and cheap solution that will mostly work without using a lot of computation here's one option (assuming a 1-5 rating scale)
SELECT Products.id, Products.title, avg(Ratings.score), etc
FROM
Products INNER JOIN Ratings ON Products.id=Ratings.product_id
GROUP BY
Products.id, Products.title
ORDER BY (SUM(Ratings.score)+25.0)/(COUNT(Ratings.id)+20.0) DESC, COUNT(Ratings.id) DESC
By adding in 25 and dividing by the total ratings + 20 you're basically adding 10 worst scores and 10 best scores to the total ratings and then sorting accordingly.
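The effect of those constants is easy to check outside SQL (a Python sketch of the same ORDER BY expression):

```python
def adjusted(score_sum, rating_count):
    # Same smoothing as the ORDER BY clause: 20 phantom ratings worth 25 points total
    return (score_sum + 25.0) / (rating_count + 20.0)

print(adjusted(1, 1))        # a single 1-star rating scores about 1.24
print(adjusted(1000, 1000))  # a thousand 1-star ratings converge towards 1.0
```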
This does have known issues. For example, it unfairly rewards low-scoring products with few ratings (as this graph demonstrates, products with an average score of 1 and just one rating score a 1.2 while products with an average score of 1 and 1k+ ratings score closer to 1.05). You could also argue it unfairly punishes high-quality products with few ratings.
This chart shows what happens for all 5 ratings over 1-1000 ratings:
http://www.wolframalpha.com/input/?i=Plot3D%5B%2825%2Bxy%29/%2820%2Bx%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
You can see the dip upwards at the very bottom ratings, but overall it's a fair ranking, I think. You can also look at it this way:
http://www.wolframalpha.com/input/?i=Plot3D%5B6-%28%2825%2Bxy%29/%2820%2Bx%29%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
If you drop a marble on most places in this graph, it will automatically roll towards products with both higher scores and higher ratings.
Obviously, the low number of ratings puts this problem at a statistical handicap. Nevertheless...
A key element to improving the quality of an aggregate rating is to "rate the rater", i.e. to keep tabs on the ratings each particular "rater" has supplied (relative to others). This allows weighting their votes during the aggregation process.
Another solution, more of a cop-out, is to show end-users a count (or a range indication) of the votes behind each item.
One option is something like Microsoft's TrueSkill system, where the score is given by mean - 3*stddev, where the constants can be tweaked.
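A loose sketch of that idea, substituting the sample standard deviation for TrueSkill's posterior uncertainty (the constant 3 is the tweakable part):

```python
from statistics import mean, stdev

def conservative_score(ratings, k=3.0):
    # Penalize uncertainty: a wide spread of ratings lowers the score.
    # With fewer than two ratings there is no spread to measure, so
    # return the lowest possible score as the most conservative guess.
    if len(ratings) < 2:
        return 0.0
    return mean(ratings) - k * stdev(ratings)
```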
After looking around for a while, I chose the Bayesian system.
If you're using Ruby, here's a gem for it:
https://github.com/wbotelhos/rating
I'd highly recommend the book Programming Collective Intelligence by Toby Segaran (O'Reilly) ISBN 978-0-596-52932-1, which discusses how to extract meaningful data from crowd behaviour. The examples are in Python, but it's easy enough to convert.
Here is my problem. Imagine I am buying 3 different items, and I have up to 5 coupons. The coupons are interchangeable, but worth different amounts when used on different items.
Here is the matrix which gives the result of spending different numbers of coupons on different items:
coupons:    1        2        3        4        5
item 1   $10 off  $15 off
item 2             $5 off  $15 off  $25 off  $35 off
item 3    $2 off
I have manually worked out the best actions for this example:
If I have 1 coupon, item 1 gets it for $10 off
If I have 2 coupons, item 1 gets them for $15 off
If I have 3 coupons, item 1 gets 2, and item 3 gets 1, for $17 off
If I have 4 coupons, then either:
Item 1 gets 1 and item 2 gets 3 for a total of $25 off, or
Item 2 gets all 4 for $25 off.
If I have 5 coupons, then item 2 gets all 5 for $35 off.
However, I need to develop a general algorithm which will handle different matrices and any number of items and coupons.
I suspect I will need to iterate through every possible combination to find the best price for n coupons. Does anyone here have any ideas?
This seems like a good candidate for dynamic programming:
// int[,] discountTable = new int[NumItems, NumCoupons + 1];
// discountTable[i, c] is the discount for spending exactly c coupons on item i
// (with discountTable[i, 0] == 0)
// bestDiscount[i, c] means the best discount if you can spend c coupons on items 0..i
int[,] bestDiscount = new int[NumItems, NumCoupons + 1];
// the best discount for a set of one item is just use the all of the coupons on it
for (int c = 1; c <= NumCoupons; c++)
    bestDiscount[0, c] = discountTable[0, c];
// the best discount for [i, c] is spending x coupons on items 0..i-1, and c-x coupons on item i
for (int i=1; i<NumItems; i++)
for (int c=1; c<=NumCoupons; c++)
        for (int x = 0; x <= c; x++) // x == c means item i gets no coupons
            bestDiscount[i, c] = Math.Max(bestDiscount[i, c], bestDiscount[i - 1, x] + discountTable[i, c - x]);
At the end of this, the best discount will be the highest value of bestDiscount[NumItems - 1, x]. To rebuild the choices, follow the table backwards:
edit to add algorithm:
// Walk the table backwards to recover how many coupons go to each item.
int couponsLeft = NumCoupons;
for (int i = NumItems - 1; i >= 1; i--)
{
    int bestSpend = 0;
    for (int c = 1; c <= couponsLeft; c++)
        if (bestDiscount[i - 1, couponsLeft - c] + discountTable[i, c] >
            bestDiscount[i - 1, couponsLeft - bestSpend] + discountTable[i, bestSpend])
            bestSpend = c;
    Console.WriteLine("Spent {0} coupons on item {1}", bestSpend, i);
    couponsLeft -= bestSpend;
}
Console.WriteLine("Spent {0} coupons on item 0", couponsLeft);
Storing the chosen split in the table as you go would also work, but this was the way I thought of it.
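The same recurrence is easy to sanity-check against the question's example in a quick Python sketch (the dict-of-offers representation is my own):

```python
def best_discounts(offers, max_coupons):
    # offers: one dict per item mapping {coupons_spent: dollars_off}
    # best[c] = best total discount using at most c coupons on items so far
    best = [0] * (max_coupons + 1)
    for item in offers:
        new = best[:]  # covers spending 0 coupons on this item
        for c in range(1, max_coupons + 1):
            for spend, off in item.items():
                if spend <= c:
                    new[c] = max(new[c], best[c - spend] + off)
        # keep "at most c coupons" monotone (extra coupons may go unused)
        for c in range(1, max_coupons + 1):
            new[c] = max(new[c], new[c - 1])
        best = new
    return best

offers = [{1: 10, 2: 15},               # item 1
          {2: 5, 3: 15, 4: 25, 5: 35},  # item 2
          {1: 2}]                       # item 3
# best_discounts(offers, 5) -> [0, 10, 15, 17, 25, 35],
# matching the manually worked answers for 1-5 coupons.
```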
It's a variation of the Knapsack problem. Doing some research on algorithms related to it will point you in the right direction.
I think dynamic programming should do this. Basically, you keep an array A[n, c] whose value is the optimal discount after buying the first n items with c coupons spent. A[n, 0] should be 0 for all n, so that is a good start. Also, A[0, c] is 0 for all c.
When you evaluate A[n, c], you loop over all discount offers for item n, and add the discount for that particular offer to A[n-1, c-p], where p is the price in coupons for this particular discount. A[n-1, c-p] must of course be calculated (in the same way) prior to this. Keep the best combination and store it in the array.
A recursive implementation would probably give the cleanest implementation. In that case, you should find the answer in A[N,C] where N is the total number of items and C is the total number of available coupons.
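A memoized recursive sketch of A[n, c] in Python (the offers format is my own; a missing coupon count means no discount at that level):

```python
from functools import lru_cache

def max_discount(offers, coupons):
    # offers[n - 1] maps {coupons_spent: dollars_off} for item n
    items = [tuple(o.items()) for o in offers]  # hashable for the cache

    @lru_cache(maxsize=None)
    def A(n, c):
        # best discount over the first n items with c coupons available
        if n == 0 or c == 0:
            return 0
        best = A(n - 1, c)  # option: spend nothing on item n
        for spend, off in items[n - 1]:
            if spend <= c:
                best = max(best, A(n - 1, c - spend) + off)
        return best

    return A(len(offers), coupons)

offers = [{1: 10, 2: 15}, {2: 5, 3: 15, 4: 25, 5: 35}, {1: 2}]
```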
This can be written as a linear programming problem. For most 'typical' problems, the simplex method is a fast, relatively simple way to solve such problems, or there are open source LP solvers available.
For your example:
Let 0 <= xi <= 1
x1 = One if one coupon is spent on item 1, zero otherwise
x2 = One if two coupons are spent on item 1, zero otherwise
x3 = One if one coupon is spent on item 2, zero otherwise
x4 = One if two coupons are spent on item 2, zero otherwise
x5 = One if three coupons are spent on item 2, zero otherwise
...
Note that if I spend two coupons on item 1, then both x1 and x2 must be one. This is enforced by the constraint
x1 >= x2
With similar constraints for the other items, e.g.,
x3 >= x4
x4 >= x5
The amount saved is
Saved = 10 x1 + 5 x2 + 0 x3 + 5 x4 + 10 x5 + ...
If you want to find the most money saved with a fixed number of coupons, then you want to maximize Saved subject to the constraints above and the additional constraint:
coupon count = x1 + x2 + x3 + ...
This works for any matrix and number of items. Changing notation (and feeling sad that I can't do subscripts), let 0 <= y_ij <= 1 be one if j coupons are spent on item number i. Then we have the constraints
y_i(j-1) >= y_ij
If the amount saved from spending j coupons on item i is M_ij, where we define M_i0 = 0, then maximize
Saved = Sum_ij (M_ij - M_i(j-1)) y_ij
subject to the above constraints and
coupon count = Sum_ij y_ij
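One caveat worth flagging: because item 2's marginal savings increase with each extra coupon (0, 5, 10, ...), the plain LP relaxation can return fractional y values, so the y_ij really need to be integer variables (making this an integer program in general). Here's a brute-force check of the 0/1 formulation in Python, enumerating the y_ij directly instead of calling a solver:

```python
from itertools import product

# Cumulative savings: M[i][j] = dollars off for spending j+1 coupons on item i
M = [[10, 15, 15, 15, 15],   # item 1
     [0, 5, 15, 25, 35],     # item 2
     [2, 2, 2, 2, 2]]        # item 3

# Marginal savings M_ij - M_i(j-1): the objective coefficients for y_ij
marg = [[row[0]] + [row[j] - row[j - 1] for j in range(1, 5)] for row in M]

def best_saving(coupons):
    best = 0
    # Enumerate every 0/1 assignment of the 15 y_ij (a real solver prunes this)
    for y in product((0, 1), repeat=15):
        ys = [y[5 * i:5 * i + 5] for i in range(3)]
        # chain constraints: y_i(j-1) >= y_ij
        if any(r[j - 1] < r[j] for r in ys for j in range(1, 5)):
            continue
        if sum(y) != coupons:   # coupon count constraint
            continue
        saved = sum(marg[i][j] * ys[i][j] for i in range(3) for j in range(5))
        best = max(best, saved)
    return best
# best_saving(3) -> 17 and best_saving(5) -> 35, matching the worked answers.
```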
I suspect some kind of sorted list memoizing the best choices for each coupon count could help here.
for example, if you have 4 coupons, the optimum is possibly:
using all 4 on something. You have to check all these new prices.
using 3 and 1. Either the best 3-coupon item doesn't clash with the best 1-coupon item, or it overlaps one of the top three 1-coupon choices, in which case you need to find the best non-overlapping pairing among the top few 1-coupon and 3-coupon items.
using 2 and 2. Find the top three 2-coupon items. If #1 and #2 overlap, try #1 and #3; if those also overlap, fall back to #2 and #3.
this answer is pretty vague.. I need to put more thought into it.
This problem is similar in concept to the Traveling Salesman problem, where brute force is O(n!) for finding the optimal solution. There are several shortcuts that can be taken, but they require lots and lots of time to discover, which I doubt you have.
Checking each possible combination is going to be the best use of your time, assuming we are dealing with small numbers of coupons. Make the client wait a bit instead of you spending years on it.