How to get a gaussian distributed random variable? - random

I need to do something like this
personstrength(P) ~ gaussian(10,3) :- person(P).
winner(A,B) :- personstrength(A) > personstrength(B).
I want the strength of the person to be Gaussian distributed with expected value 10 and variance 3. How do I do this in Prolog?

I don't know Prolog, but it looks like the plrand library may be of interest to you.
Note that since all persons' strengths use the same scaling, you can just use standard normals unless you want to keep the strength attribute around for later. Person A will have strength SA = sqrt(3) * GA + 10 and person B will have SB = sqrt(3) * GB + 10, so SA > SB if and only if GA > GB.
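Not Prolog, but here is a quick Python sketch of the same point, just to illustrate the arithmetic (note that a variance of 3 corresponds to a standard deviation of sqrt(3)):

import math
import random

def strength():
    # Gaussian with mean 10 and variance 3 (standard deviation sqrt(3)).
    return random.gauss(10, math.sqrt(3))

# Equivalent view: an affine transform of a standard normal.
# Since sqrt(3)*g + 10 is strictly increasing in g, comparing transformed
# strengths gives the same winner as comparing the raw standard normals.
ga, gb = random.gauss(0, 1), random.gauss(0, 1)
sa, sb = math.sqrt(3) * ga + 10, math.sqrt(3) * gb + 10
assert (sa > sb) == (ga > gb)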

Related

Searching for proper optimization algorithm

I'm developing a mobile application with an algorithm. It's about food.
So I have a database that contains some meals. A meal contains several foods. A food is an item like an Apple, and it contains the nutrient values of the apple.
A Meal for example could be an apple pie, which contains apples, sugar, etc...
Each food in a meal has a minimum and maximum amount, to make meals more flexible. So when I'm making pasta with sauce, you can put in 150-250g of pasta and 50-75ml of sauce.
GOAL: The algorithm is supposed to take in some values for total daily intake of carbohydrates, protein and fats. It should then figure out a good combination of meals that best matches the expected daily carbohydrates, proteins and fats.
While finding a solution, the algorithm can swap out meals, change the scaling of the foods inside the meals, and with n meals per day, and n foods per meal, that's quite a lot of variables.
I've tried a genetic algorithm approach, but I couldn't find a way to properly pair 2 solutions to make a child, and I couldn't figure out a good way for mutation.
Any help is appreciated. I don't want any code, just some inspiration on what I should look at; I'm not afraid of reading something (if it's not a book^^).
Or if anyone has an approach for how to handle this problem, that would be very nice as well!
You can solve this problem using Mixed Integer Programming (MIP/MILP).
A handy tutorial for this is here.
The idea will be to introduce a binary variable for each meal indicating whether or not that meal is used. For each food item in the meal, there will be a variable indicating what portion of that food item is used.
The portions are constrained by user specified values, as are the health values.
Thus, for two meals m1 and m2 such that m1 has a portion f1 of food 1 and f2 of food 2 and m2 has f3 of food 3 and f4 of food 4, if we take m1 and m2 to be the aforementioned binary variables, then we have constraints like the following:
daily_protein_min<=m1*(f1*protein_f1+f2*protein_f2)+m2*(f3*protein_f3+f4*protein_f4)<=daily_protein_max
f1_min<=f1<=f1_max
f2_min<=f2<=f2_max
f3_min<=f3<=f3_max
f4_min<=f4<=f4_max
We can also constrain the number of meals per day:
m1+m2=1
Since multiplying a variable by a variable produces a potentially non-convex, non-linear result, we refer to the MIP tutorial for a way to transform such constraints into a usable form.
Finally, we use disciplined convex programming to build a model using cvxpy to pass to a solver. GLPK is a free MIP solver. Commercial solvers like Gurobi are generally faster, allow larger problems, are more stable, and quite expensive, though possibly free for students.
Having done all this, putting the problem into code isn't too terrible. Since I wasn't sure what values to use or what sort of food database was reasonable, I've used essentially random inputs (garbage in, garbage out!).
#!/usr/bin/env python3
import cvxpy as cp
import csv
from io import StringIO
import random
#CSV file for food data
#Drawn from: https://think.cs.vt.edu/corgis/csv/food/food.html
food_data = """1st Household Weight,1st Household Weight Description,2nd Household Weight,2nd Household Weight Description,Alpha Carotene,Ash,Beta Carotene,Beta Cryptoxanthin,Calcium,Carbohydrate,Category,Cholesterol,Choline,Copper,Description,Fiber,Iron,Kilocalories,Lutein and Zeaxanthin,Lycopene,Magnesium,Manganese,Monosaturated Fat,Niacin,Nutrient Data Bank Number,Pantothenic Acid,Phosphorus,Polysaturated Fat,Potassium,Protein,Refuse Percentage,Retinol,Riboflavin,Saturated Fat,Selenium,Sodium,Sugar Total,Thiamin,Total Lipid,Vitamin A - IU,Vitamin A - RAE,Vitamin B12,Vitamin B6,Vitamin C,Vitamin E,Vitamin K,Water,Zinc
28.35,1 oz,454,1 lb,0,0.75,0,0,13,0.0,LAMB,68,0,0.12,"LAMB,AUS,IMP,FRSH,RIB,LN&FAT,1/8""FAT,RAW",0.0,1.34,289,0,0,18,0.009,9.765,5.103,17314,0.501,156,0.985,254,16.46,26,0,0.232,11.925,6.9,68,0.0,0.145,24.2,0,0,1.62,0.347,0.0,0.0,0.0,59.01,2.51
0.0,,0,,0,0.9,27,0,141,7.0,SOUR CREAM,35,19,0.01,"SOUR CREAM,REDUCED FAT",0.0,0.06,181,0,0,11,0.0,4.1,0.07,1178,0.0,85,0.5,211,7.0,0,117,0.24,8.7,4.1,70,0.300000012,0.04,14.1,436,119,0.3,0.02,0.9,0.4,0.7,71.0,0.27
78.0,"1 bar, 2.8 oz",0,,0,0.0,0,0,192,46.15,CANDIES,13,0,0.0,"CANDIES,HERSHEY'S POT OF GOLD ALMOND BAR",3.799999952,1.85,577,0,0,0,0.0,0.0,0.0,19130,0.0,0,0.0,0,12.82,0,0,0.0,16.667,0.0,64,38.45999908,0.0,38.46,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
425.0,"1 package, yields",228,1 serving,0,1.4,0,0,0,9.5,OLD EL PASO CHILI W/BNS,16,0,0.0,"OLD EL PASO CHILI W/BNS,CND ENTREE",4.300000191,1.18,109,0,0,0,0.0,1.87,0.0,22514,0.0,0,0.985,0,7.7,0,0,0.0,0.904,0.0,258,0.0,0.0,4.5,0,0,0.0,0.0,0.0,0.0,0.0,76.9,0.0
28.0,1 slice,0,,0,3.6,0,0,0,1.31,TURKEY,48,0,0.0,"TURKEY,BREAST,SMOKED,LEMON PEPPER FLAVOR,97% FAT-FREE",0.0,0.0,95,0,0,0,0.0,0.25,0.0,7943,0.0,0,0.19,0,20.9,0,0,0.0,0.22,0.0,1160,0.0,0.0,0.69,0,0,0.0,0.0,0.0,0.0,0.0,73.5,0.0
140.0,"1 cup, diced",113,"1 cup, shredded",0,5.85,63,0,772,2.1,CHEESE,85,36,0.027,"CHEESE,PAST PROCESS,SWISS,W/DI NA PO4",0.0,0.61,334,0,0,29,0.014,7.046,0.038,1044,0.26,762,0.622,216,24.73,0,192,0.276,16.045,15.9,1370,1.230000019,0.014,25.01,746,198,1.23,0.036,0.0,0.34,2.2,42.31,3.61
263.0,"1 piece, cooked, excluding refuse (yield from 1 lb raw meat with refuse)",85,3 oz,0,1.22,0,0,20,0.0,LAMB,91,0,0.108,"LAMB,DOM,SHLDR,WHL (ARM&BLD),LN&FAT,1/8""FAT,CHOIC,CKD,RSTD",0.0,1.97,269,0,0,23,0.022,7.79,6.04,17245,0.7,185,1.56,251,22.7,24,0,0.24,7.98,26.4,66,0.0,0.09,19.08,0,0,2.65,0.13,0.0,0.0,0.0,56.98,5.44
17.0,1 piece,0,,0,0.7,10,0,49,76.44,CANDIES,14,10,0.329,"CANDIES,FUDGE,CHOC,PREPARED-FROM-RECIPE",1.700000048,1.77,411,4,0,36,0.422,2.943,0.176,19100,0.14,71,0.373,134,2.39,0,43,0.085,6.448,2.5,45,73.12000275,0.026,10.41,159,44,0.09,0.012,0.0,0.18,1.4,9.81,1.11
124.0,"1 serving, 1/2 cup",0,,0,1.98,0,0,32,9.68,CAMPBELL SOUP,4,0,0.0,"CAMPBELL SOUP,CAMPBELL'S RED & WHITE,BROCCOLI CHS SOUP,COND",0.0,0.0,81,0,0,0,0.0,0.0,0.0,6014,0.0,0,0.0,0,1.61,0,0,0.0,1.613,0.0,661,1.610000014,0.0,3.63,806,0,0.0,0.0,1.0,0.0,0.0,83.1,0.0
142.0,"1 item, 5 oz",0,,0,3.23,0,0,109,22.26,MCDONALD'S,167,0,0.073,"MCDONALD'S,BACON EGG & CHS BISCUIT",0.899999976,2.13,304,0,0,12,0.137,5.546,1.959,21360,0.642,335,2.625,121,13.45,0,0,0.416,8.262,0.0,863,2.180000067,0.262,18.77,399,0,0.0,0.093,2.1,0.0,0.0,42.29,0.9"""
#Convert to dictionary
food_data = [dict(x) for x in csv.DictReader(StringIO(food_data))]
food_data = [x for x in food_data if float(x['1st Household Weight'])!=0]
#Values to track
quantities = ['Protein', 'Carbohydrate', 'Total Lipid', 'Kilocalories']
#Create random meals
meals = []
for mealnum in range(10):
    meal = {"name": "Meal #{0}".format(mealnum), "foods": []}
    for x in range(random.randint(2,4)):      #Choose a random number of foods to be in the meal
        food = dict(random.choice(food_data)) #Copy so a food reused across meals isn't converted twice
        food['min'] = 10                      #Use large bounds to ensure we have a feasible solution
        food['max'] = 1000
        weight = float(food['1st Household Weight']) #Number of units in a standard serving
        for q in quantities:
            #Convert quantities to per-unit measures
            food[q] = float(food[q])/weight
        meal['foods'].append(food)
    meals.append(meal)
#Create an optimization problem from the meals
total_daily_carbs_min = 225
total_daily_carbs_max = 325
total_daily_protein_min = 46
total_daily_protein_max = 56
total_daily_lipids_min = 44
total_daily_lipids_max = 78
#Construct variables, totals, and some constraints
constraints = []
for meal in meals:
    #Create a binary variable indicating whether we are using the meal
    meal['use_meal'] = cp.Variable(boolean=True)
    for q in quantities:
        meal[q] = 0
    for food in meal['foods']:
        food['portion'] = cp.Variable(pos=True)
        #Ensure that we only use an appropriate amount of this food
        constraints.append( food['min'] <= food['portion'] )
        constraints.append( food['portion'] <= food['max'] )
        #Calculate this meal's contributions to the totals.
        #Each item's contribution is the portion times the per-unit quantity times a
        #boolean (0, 1) variable indicating whether or not we use the meal
        for q in quantities:
            meal[q] += food['portion']*food[q]
#Dictionary with no sums of meals, yet
totals = {q:0 for q in quantities}
#See: "http://www.ie.boun.edu.tr/~taskin/pdf/IP_tutorial.pdf", "Nonlinear Product Terms"
#Since multiplying two variables produces a non-convex, nonlinear function, we
#have to use some trickery.
#Let w stand in for the value of `meal['use_meal']*meal['Protein']`
#Let x=meal['use_meal'] and y=meal['Protein']
#Let u be an upper bound on the value of meal['Protein']
#We will make constraints such that
# w <= u*x         if we don't use the meal (x=0), this forces `w` to zero
# w >= 0
# w <= y           if we do use the meal, `w` must not be larger than the meal's value
# w >= u*(x-1)+y   if we use the meal, then `w>=y` and `w<=y`, so `w==y`; otherwise this is just `w>=-u+y`
u = 9999 #An upper bound on the value of any meal quantity
for meal in meals:
    for q in quantities:
        w = cp.Variable()
        constraints.append( w<=u*meal['use_meal'] )
        constraints.append( w>=0 )
        constraints.append( w<=meal[q] )
        constraints.append( w>=u*(meal['use_meal']-1)+meal[q] )
        totals[q] += w
#Construct constraints. The totals must be within the ranges given
constraints.append( total_daily_protein_min <= totals['Protein'] )
constraints.append( total_daily_carbs_min <= totals['Carbohydrate'] )
constraints.append( total_daily_lipids_min <= totals['Total Lipid'] )
constraints.append( totals['Protein'] <= total_daily_protein_max)
constraints.append( totals['Carbohydrate'] <= total_daily_carbs_max )
constraints.append( totals['Total Lipid'] <= total_daily_lipids_max )
#Ensure that we're doing three meals because this is the number of meals people
#in some countries eat.
constraints.append( sum([meal['use_meal'] for meal in meals])==3 )
#With an objective value of 1 we are asking the solver to identify any feasible
#solution. We don't care which one.
objective = cp.Minimize(1)
#We could also use:
#objective = cp.Minimize(totals['Kilocalories'])
#to meet nutritional needs while minimizing calories intake
problem = cp.Problem(objective, constraints)
val = problem.solve(solver=cp.GLPK_MI, verbose=True)
if val==1:
    for q in quantities:
        print("{0} = {1}".format(q, totals[q].value))
    for m in meals:
        if m['use_meal'].value!=1:
            continue
        print(m['name'])
        for f in m['foods']:
            print("\t{0} units of {1} (min={2}, max={3})".format(f['portion'].value, f['Description'], f['min'], f['max']))
The result looks like this:
Protein = 46.0
Carbohydrate = 224.99999999999997
Total Lipid = 69.68255808922696
Kilocalories = 1719.4249142101326
Meal #1
    10.0 units of TURKEY,BREAST,SMOKED,LEMON PEPPER FLAVOR,97% FAT-FREE (min=10, max=1000)
    10.0 units of MCDONALD'S,BACON EGG & CHS BISCUIT (min=10, max=1000)
    1000.0 units of CANDIES,FUDGE,CHOC,PREPARED-FROM-RECIPE (min=10, max=1000)
    1000.0 units of OLD EL PASO CHILI W/BNS,CND ENTREE (min=10, max=1000)
Meal #2
    999.9999999960221 units of CAMPBELL SOUP,CAMPBELL'S RED & WHITE,BROCCOLI CHS SOUP,COND (min=10, max=1000)
    10.0 units of LAMB,AUS,IMP,FRSH,RIB,LN&FAT,1/8"FAT,RAW (min=10, max=1000)
    10.0 units of TURKEY,BREAST,SMOKED,LEMON PEPPER FLAVOR,97% FAT-FREE (min=10, max=1000)
    10.0 units of MCDONALD'S,BACON EGG & CHS BISCUIT (min=10, max=1000)
Meal #5
    1000.0 units of OLD EL PASO CHILI W/BNS,CND ENTREE (min=10, max=1000)
    221.06457214122466 units of CHEESE,PAST PROCESS,SWISS,W/DI NA PO4 (min=10, max=1000)
    10.0 units of LAMB,DOM,SHLDR,WHL (ARM&BLD),LN&FAT,1/8"FAT,CHOIC,CKD,RSTD (min=10, max=1000)
For reasons that are unclear to me, GLPK fails pretty often, though it does find a solution about 1 out of 20 times. This is disappointing since failing to find a solution looks the same as if it had proven there is no solution.
At the moment, I guess I'd recommend running it a number of times before ruling out the possibility of a solution.
Note that adding more foods and meals should make it easier for the solver to find a solution and reduce the frequency of the problem above.
Another way to approach this problem would be to enumerate subsets of meals, such as all subsets of three meals a day. Determining whether a given subset of meals satisfies the portion and nutrient constraints is then a linear program which can be solved quickly and easily. The downside is that there may be a large number of meal combinations to enumerate. MIP solvers may have techniques to reduce the number of combinations that need to be searched.
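A rough sketch of that enumeration idea, reusing the meals, quantities, and total_daily_* bounds defined in the code above; once the meals are fixed, the binary variables disappear and only a plain LP feasibility check remains:

from itertools import combinations

import cvxpy as cp

def combination_is_feasible(chosen_meals):
    #With the meals fixed, only the continuous portion variables remain,
    #so this is an ordinary LP feasibility check.
    constraints = []
    totals = {q: 0 for q in quantities}
    for meal in chosen_meals:
        for food in meal['foods']:
            portion = cp.Variable(pos=True)
            constraints += [food['min'] <= portion, portion <= food['max']]
            for q in quantities:
                totals[q] += portion * food[q]
    constraints += [total_daily_protein_min <= totals['Protein'],
                    totals['Protein'] <= total_daily_protein_max,
                    total_daily_carbs_min <= totals['Carbohydrate'],
                    totals['Carbohydrate'] <= total_daily_carbs_max,
                    total_daily_lipids_min <= totals['Total Lipid'],
                    totals['Total Lipid'] <= total_daily_lipids_max]
    problem = cp.Problem(cp.Minimize(1), constraints)
    problem.solve(solver=cp.GLPK)  #Plain LP: no integer variables left
    return problem.status == cp.OPTIMAL

feasible_days = [c for c in combinations(meals, 3) if combination_is_feasible(c)]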
It sounds like a variation on the theme of the Diet Problem, and Mathematical Programming is an effective approach, as Richard clearly pointed out. However, this may not be feasible in your case, since there are not many libraries for mathematical programming on mobile devices (maybe some can be found here), and I am not aware of any Dart libraries for linear programming.
As an alternative, a metaheuristic approach might work well, and you have plenty of options, including genetic algorithms.
Independently of the algorithm you are going to adopt, one choice you need to make is how to encode solutions. The most natural choice would be to use non-negative continuous (i.e., real) decision variables: let's call them x_fm, representing the amount of food f at meal m (it does not matter whether m is the 3rd meal of the 5th day of the 2nd week: just number them progressively). Two subscripts, so they form a matrix with F rows and M columns. One natural way to encode this is to unroll that matrix into a vector x_i, i.e., write each row of the matrix one after the other in an array, so that each subscript i identifies a specific pair (f, m).
Next you need to clearly define what the objective function is and have a procedure for calculating it from your encoded solutions.
Next you need a procedure for generating new solutions to be evaluated (look at point 2 here for some ideas).
If you wish to stick with genetic algorithms, here are some ideas for mutation and crossover, just make sure to preserve non-negativity.
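To make that last point concrete, here is one possible sketch (in Python) of crossover and mutation operators on such an unrolled non-negative vector; these operator choices are only examples, not the only option:

import random

def crossover(parent_a, parent_b):
    # Arithmetic (blend) crossover: each gene of the child is a convex
    # combination of the parents' genes, so non-negativity is preserved.
    child = []
    for a, b in zip(parent_a, parent_b):
        w = random.random()
        child.append(w * a + (1 - w) * b)
    return child

def mutate(solution, rate=0.1, sigma=0.2):
    # Multiplicative mutation: occasionally scale a gene by a log-normal
    # factor close to 1, which also keeps the amounts non-negative.
    return [x * random.lognormvariate(0, sigma) if random.random() < rate else x
            for x in solution]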

Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attribute

I am having a hard time with the question below. I am not sure if I got it correct, but either way, I need some help further understanding it; if anyone has time to explain, please do.
Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attributes:
− Age (customer’s age, a real number with a maximum of 90 years and a minimum of 15 years)
− Cr (“credit rating”), which is an ordinal attribute with values ‘very good’, ‘good’, ‘medium’, ‘poor’, and ‘very poor’.
− Av_bal (average account balance, a real number with mean 7000 and standard deviation 4000)
Using the L1 distance function, compute the distance between the following 2 customers: c1 = (55, good, 7000) and c2 = (25, poor, 1000). [15 points]
Using the L2 distance function, compute the distance between the above-mentioned 2 customers.
Answer with L1
d(c1,c2) = (c1.cr - c2.cr)/4 + (c1.avg.bal - c2.avg.bal/4000) * (c1.age - mean.age/std.age) - (c2.age - mean.age/std.age)
The question as posed leaves some room for interpretation, mainly because similarity is not specified exactly. I will try to explain what the standard approach would be.
Usually, before you start, you want to normalize values so that they are roughly in the same range. Otherwise, your similarity will be dominated by the feature with the largest variance.
If you have no information about the distribution but just the range of the values, you want to normalize them to [0,1]. For your example this means
norm_age = (age-15)/(90-15)
For nominal values you want to find a mapping to ordinal values if you want to use Lp-norms. Note: this is not always possible (e.g., colors cannot intuitively be mapped to ordinal values). In your case you can transform the credit rating like this:
cr = {0 if ‘very good’, 1 if ‘good’, 2 if ‘medium’, 3 if ‘poor’, 4 if ‘very poor’}
Afterwards you can do the same normalization as for age:
norm_cr = cr/4
Lastly, for normally distributed values you usually perform standardization by subtracting the mean and dividing by the standard deviation.
norm_av_bal = (av_bal-7000)/4000
Now that you have normalized your values, you can go ahead and define the distance functions:
L1(c1, c2) = |c1.norm_age - c2.norm_age| + |c1.norm_cr - c2.norm_cr| + |c1.norm_av_bal - c2.norm_av_bal|
and
L2(c1, c2) = sqrt((c1.norm_age - c2.norm_age)^2 + (c1.norm_cr - c2.norm_cr)^2 + (c1.norm_av_bal - c2.norm_av_bal)^2)
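A small Python check of the two distances for c1 = (55, good, 7000) and c2 = (25, poor, 1000), using the normalizations above:

import math

def normalize(age, cr, av_bal):
    cr_map = {'very good': 0, 'good': 1, 'medium': 2, 'poor': 3, 'very poor': 4}
    return ((age - 15) / (90 - 15), cr_map[cr] / 4, (av_bal - 7000) / 4000)

c1 = normalize(55, 'good', 7000)   # (0.533..., 0.25,  0.0)
c2 = normalize(25, 'poor', 1000)   # (0.133..., 0.75, -1.5)

l1 = sum(abs(a - b) for a, b in zip(c1, c2))
l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))
print(l1, l2)  # roughly 2.4 and 1.63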

random choice, specific distributions, variable applicability

Imagine I have persons 1, 2, 3 and 4, and shirt styles A, B, C, D. I want to distribute the shirt styles to the people such that 25% of them get style A, 25% get style B, 25% get style C and 25% get style D, but some of the people refuse to wear certain styles; these refusals are represented by Fs below. How can I randomly match all the people with the styles they are willing to wear and get approximately the desired distribution?
A B C D
1 T F T T
2 T F F F
3 T T T T
4 T T T F
In this case it is easy and 25% can be fully achieved: just give each person a different style. However, I intend to take this problem beyond this simple situation; my solution has to be generic. The number of styles, the number of people, and the distribution are all variable. Sometimes the distribution will be impossible to achieve 100% accurately, so an approximate/close/best-effort result is expected. The selection process should be random while attempting to maintain the distribution.
I'm pretty agnostic to the language here, I'm just seeking the algorithm. Though preferably it would be able to be distributed.
Finding a solution when you are hampered by the Fs is the assignment problem (https://en.wikipedia.org/wiki/Assignment_problem). One way to select an arbitrary assignment when there are many would be to set random costs where a style is acceptable to a person and then let the solver find the assignment with the lowest possible cost. However, it is not obvious that this will fit any natural definition of random. One (very inefficient) natural definition of random would be to select from all possible assignments at random until you get one that is acceptable to everybody. The distribution you get from this might not be the same as the one you would get by setting up random costs and then solving the resulting assignment problem.
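As a sketch of that random-cost idea (assuming SciPy is available), using the 4x4 table from the question; for uneven distributions or more people than styles, you would repeat each style as many "slots" as its target share implies:

import numpy as np
from scipy.optimize import linear_sum_assignment

# acceptable[i][j]: does person i accept style j? (the T/F table above)
acceptable = np.array([[True, False, True,  True],
                       [True, False, False, False],
                       [True, True,  True,  True],
                       [True, True,  True,  False]])

BIG = 1e6  # prohibitive cost for styles a person refuses
cost = np.where(acceptable, np.random.rand(*acceptable.shape), BIG)
people, styles = linear_sum_assignment(cost)

if (cost[people, styles] >= BIG).any():
    print("no assignment satisfies everyone")
else:
    for p, s in zip(people, styles):
        print(f"person {p + 1} gets style {'ABCD'[s]}")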
You are using the term 'randomly match', which should be used with caution. The correct interpretation, I believe, is a random selection from the set of all valid solutions; so basically, if we could enumerate all valid solutions, we could trivially solve the problem.
You are looking for a close-enough solution, so we need to better define what a valid solution is. I suggest defining some threshold (say, 1% error at most).
In your example, there are 4 groups (those assigned shirt style A/B/C/D). Therefore, there are 2^4-1 = 15 possible person archetypes (love/hate each of A/B/C/D, with the assumption that everyone loves at least one shirt style). Each archetype has a given population size, and each archetype can be assigned to some of the 4 groups (1 or more).
The goal is to divide the population of each archetype between the 4 groups, such that say, the size of each group is between L and H.
Let's formalize it.
Problem statement:
Denote A(0001),...,A(1111): the population size of each of the 15 archetypes
Denote G1(0001): the size of A(0001) assigned to G1, etc.
Given
L,H: constants
A(0001),...,A(1111): 15 constants
Our goal is to find all integer solutions for
G1(0001),G1(0011),G1(0101),G1(0111),G1(1001),G1(1011),G1(1101),G1(1111),
G2(0010),G2(0011),G2(0110),G2(0111),G2(1010),G2(1011),G2(1110),G2(1111),
G3(0100),G3(0101),G3(0110),G3(0111),G3(1100),G3(1101),G3(1110),G3(1111),
G4(1000),G4(1001),G4(1010),G4(1011),G4(1100),G4(1101),G4(1110),G4(1111)
subject to:
G1(0001) = A(0001)
G2(0010) = A(0010)
G2(0011) + G1(0011) = A(0011)
G3(0100) = A(0100)
G3(0101) + G1(0101) = A(0101)
G3(0110) + G2(0110) = A(0110)
G3(0111) + G2(0111) + G1(0111) = A(0111)
G4(1000) = A(1000)
G4(1001) + G1(1001) = A(1001)
G4(1010) + G2(1010) = A(1010)
G4(1011) + G2(1011) + G1(1011) = A(1011)
G4(1100) + G3(1100) = A(1100)
G4(1101) + G3(1101) + G1(1101) = A(1101)
G4(1110) + G3(1110) + G2(1110) = A(1110)
G4(1111) + G3(1111) + G2(1111) + G1(1111) = A(1111)
L < G1(0001)+G1(0011)+G1(0101)+G1(0111)+G1(1001)+G1(1011)+G1(1101)+G1(1111) < H
L < G2(0010)+G2(0011)+G2(0110)+G2(0111)+G2(1010)+G2(1011)+G2(1110)+G2(1111) < H
L < G3(0100)+G3(0101)+G3(0110)+G3(0111)+G3(1100)+G3(1101)+G3(1110)+G3(1111) < H
L < G4(1000)+G4(1001)+G4(1010)+G4(1011)+G4(1100)+G4(1101)+G4(1110)+G4(1111) < H
Now we can use an integer programming solver for the job.
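A compact sketch of that integer program, here using cvxpy (any MIP solver interface would do); the data structures are hypothetical, and the random objective is just one way to pick different valid solutions on different runs:

import random

import cvxpy as cp

def split_archetypes(archetypes, n_groups, L, H):
    """archetypes: dict mapping a tuple of acceptable group indices to a
    population size, e.g. {(0,): 3, (0, 2, 3): 5, ...}."""
    constraints = []
    group_size = [cp.Constant(0)] * n_groups
    variables = []
    for groups, size in archetypes.items():
        parts = [cp.Variable(integer=True) for _ in groups]
        variables += parts
        constraints += [p >= 0 for p in parts]
        constraints.append(sum(parts) == size)  # every member of the archetype is assigned
        for g, p in zip(groups, parts):
            group_size[g] = group_size[g] + p
    for g in range(n_groups):
        constraints += [L <= group_size[g], group_size[g] <= H]
    # Random costs break ties between the many valid solutions.
    objective = cp.Minimize(sum(random.random() * v for v in variables))
    problem = cp.Problem(objective, constraints)
    problem.solve(solver=cp.GLPK_MI)
    return problem.status, [v.value for v in variables]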

How to generate correlated Uniform[0,1] variables

(This question is related to how to generate a dataset of correlated variables with different distributions?)
In Stata, say that I create a random variable following a Uniform[0,1] distribution:
set seed 100
gen random1 = runiform()
I now want to create a second random variable that is correlated with the first (the correlation should be .75, say), but is bounded by 0 and 1. I would like this second variable to also be more-or-less Uniform[0,1]. How can I do this?
This won't be exact, but the NORTA/copula method should be pretty close and easy to implement.
The relevant citation is:
Cario, Marne C., and Barry L. Nelson. Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois, 1997.
The paper can be found here.
The general recipe to generate correlated random variables from any distribution is:
1. Draw two (or more) correlated variables from a joint standard normal distribution using corr2data.
2. Calculate the univariate normal CDF of each of these variables using normal().
3. Apply the inverse CDF of any distribution to simulate draws from that distribution.
The third step is pretty easy with the [0,1] uniform: you don't even need it. Typically, the magnitude of the correlations you get will be less than the magnitudes of the original (normal) correlations, so it might be useful to bump those up a bit.
Stata Code for 2 uniformish variables that have a correlation of 0.75:
clear
// Step 1
matrix C = (1, .75 \ .75, 1)
corr2data x y, n(10000) corr(C) double
corr x y, means
// Steps 2-3
replace x = normal(x)
replace y = normal(y)
// Make sure things worked
corr x y, means
stack x y, into(z) clear
lab define vars 1 "x" 2 "y"
lab val _stack vars
capture ssc install bihist
bihist z, by(_stack) density tw1(yline(-1 0 1))
If you want to improve the approximation for the uniform case, you can transform the correlations like this (see section 5 of the linked paper):
matrix C = (1,2*sin(.75*_pi/6)\2*sin(.75*_pi/6),1)
This is 0.76536686 instead of the 0.75.
Code for the question in the comments
Here the correlation matrix C is written more compactly, and I am applying the transformation:
clear
matrix C = ( 1, ///
2*sin(-.46*_pi/6), 1, ///
2*sin(.53*_pi/6), 2*sin(-.80*_pi/6), 1, ///
2*sin(0*_pi/6), 2*sin(-.41*_pi/6), 2*sin(.48*_pi/6), 1 )
corr2data v1 v2 v3 v4, n(10000) corr(C) cstorage(lower)
forvalues i=1/4 {
    replace v`i' = normal(v`i')
}
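For readers not using Stata, the same recipe is easy to sketch in Python (assuming NumPy and SciPy are available):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(100)

target_rho = 0.75
# Adjust the normal correlation so the uniforms come out near the target
# (the 2*sin(rho*pi/6) correction from Section 5 of the paper).
rho_z = 2 * np.sin(target_rho * np.pi / 6)

# Step 1: draw correlated standard normals.
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho_z], [rho_z, 1.0]], size=10_000)

# Steps 2-3: the normal CDF maps each margin to Uniform[0,1]; for any other
# target marginal, apply that distribution's inverse CDF to these uniforms.
u = norm.cdf(z)

print(np.corrcoef(u, rowvar=False))  # off-diagonal entry should be near 0.75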

What is a better way to sort by a 5 star rating?

I'm trying to sort a bunch of products by customer ratings using a 5-star system. The site I'm setting this up for does not have a lot of ratings and continues to add new products, so it will usually have a few products with a low number of ratings.
I tried using the average star rating, but that algorithm fails when there is a small number of ratings.
For example, a product that has 3 five-star ratings would show up higher than a product that has 100 five-star ratings and 2 two-star ratings.
Shouldn't the second product show up higher because it is statistically more trustworthy due to the larger number of ratings?
Prior to 2015, the Internet Movie Database (IMDb) publicly listed the formula used to rank their Top 250 movies list. To quote:
The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:
R = average for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250 (currently 25000)
C = the mean vote across the whole report (currently 7.0)
For the Top 250, only votes from regular voters are considered.
It's not so hard to understand. The formula is:
rating = (v / (v + m)) * R + (m / (v + m)) * C;
Which can be mathematically simplified to:
rating = (R * v + C * m) / (v + m);
The variables are:
R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
v – The number of votes for an item. (To give another example, if 5 people have cast votes on an item, v is 5.)
m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with less votes than m.
All the formula does is: add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulate, eventually the imaginary votes will be drowned out by real ones.
In this system, votes don't cause the rating to fluctuate wildly. Instead, they merely perturb it a bit in some direction.
When there are zero votes, only imaginary votes exist, and all of them are C. Thus, each item begins with a rating of C.
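As a tiny illustration of that "imaginary votes" reading (the m and C values here are just examples, not IMDb's):

def weighted_rating(R, v, m=25, C=3.75):
    # v real votes averaging R, plus m imaginary votes each worth C.
    return (R * v + C * m) / (v + m)

print(weighted_rating(R=5.0, v=3))    # 3 five-star votes: pulled strongly toward C
print(weighted_rating(R=5.0, v=300))  # many votes: close to 5.0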
See also:
A demo. Click "Solve".
Another explanation of IMDb's system.
An explanation of a similar Bayesian star-rating system.
Evan Miller shows a Bayesian approach to ranking 5-star ratings. The sort criterion (matching the code below) is
sort criterion = S - z_(alpha/2) * sqrt((S2 - S^2) / (N + K + 1)),
with S = sum_k s_k*(n_k + 1) / (N + K) and S2 = sum_k s_k^2*(n_k + 1) / (N + K),
where
n_k is the number of k-star ratings,
s_k is the "worth" (in points) of k stars,
N is the total number of votes,
K is the maximum number of stars (e.g. K=5, in a 5-star rating system),
z_(alpha/2) is the 1 - alpha/2 quantile of a normal distribution. If you want 95% confidence (based on the Bayesian posterior distribution) that the actual sort criterion is at least as big as the computed sort criterion, choose z_(alpha/2) = 1.65.
In Python, the sorting criterion can be calculated with
import math

def starsort(ns):
    """
    http://www.evanmiller.org/ranking-items-with-star-ratings.html
    """
    N = sum(ns)
    K = len(ns)
    s = list(range(K, 0, -1))
    s2 = [sk**2 for sk in s]
    z = 1.65
    def f(s, ns):
        N = sum(ns)
        K = len(ns)
        return sum(sk*(nk+1) for sk, nk in zip(s, ns)) / (N+K)
    fsns = f(s, ns)
    return fsns - z*math.sqrt((f(s2, ns) - fsns**2)/(N+K+1))
For example, if an item has 60 five-stars, 80 four-stars, 75 three-stars, 20 two-stars and 25 one-stars, then its overall star rating would be about 3.4:
x = (60, 80, 75, 20, 25)
starsort(x)
# 3.3686975120774694
and you can sort a list of 5-star ratings with
sorted([(60, 80, 75, 20, 25), (10,0,0,0,0), (5,0,0,0,0)], key=starsort, reverse=True)
# [(10, 0, 0, 0, 0), (60, 80, 75, 20, 25), (5, 0, 0, 0, 0)]
This shows the effect that more ratings can have upon the overall star value.
You'll find that this formula tends to give an overall rating which is a bit
lower than the overall rating reported by sites such as Amazon, Ebay or Wal-mart
particularly when there are few votes (say, less than 300). This reflects the
higher uncertainy that comes with fewer votes. As the number of votes increases
(into the thousands) all overall these rating formulas should tend to the
(weighted) average rating.
Since the formula only depends on the frequency distribution of 5-star ratings for the item itself, it is easy to combine reviews from multiple sources (or update the overall rating in light of new votes) by simply adding the frequency distributions together.
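For example (with hypothetical counts), combining two sources is just element-wise addition of the star-count tuples before calling starsort:

site_a = (60, 80, 75, 20, 25)   # (n5, n4, n3, n2, n1)
site_b = (12, 3, 0, 1, 4)       # hypothetical counts from a second source
combined = tuple(a + b for a, b in zip(site_a, site_b))
print(starsort(combined))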
Unlike the IMDb formula, this formula does not depend on the average score across all items, nor on an artificial minimum-number-of-votes cutoff value. Moreover, this formula makes use of the full frequency distribution -- not just the average number of stars and the number of votes. And it makes sense that it should, since an item with ten 5-stars and ten 1-stars should be treated as having more uncertainty than (and therefore not rated as highly as) an item with twenty 3-star ratings:
In [78]: starsort((10,0,0,0,10))
Out[78]: 2.386028063783418
In [79]: starsort((0,0,20,0,0))
Out[79]: 2.795342687927806
The IMDb formula does not take this into account.
See this page for a good analysis of star-based rating systems, and this one for a good analysis of upvote-/downvote- based systems.
For up and down voting you want to estimate the probability that, given the ratings you have, the "real" score (if you had infinite ratings) is greater than some quantity (like, say, the similar number for some other item you're sorting against).
See the second article for the answer, but the conclusion is that you want to use the lower bound of the Wilson score confidence interval. The article gives the equation and sample Ruby code (easily translated to another language).
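A minimal Python sketch of that Wilson score lower bound for up/down votes (z = 1.96 corresponds to 95% confidence):

import math

def wilson_lower_bound(upvotes, total, z=1.96):
    # Lower bound of the Wilson score interval for the true upvote fraction.
    if total == 0:
        return 0.0
    phat = upvotes / total
    centre = phat + z * z / (2 * total)
    spread = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - spread) / (1 + z * z / total)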
Well, depending on how complex you want to make it, you could have ratings additionally be weighted based on how many ratings the person has made, and what those ratings are. If the person has only made one rating, it could be a shill rating, and might count for less. Or if the person has rated many things in category a, but few in category b, and has an average rating of 1.3 out of 5 stars, it sounds like category a may be artificially weighed down by the low average score of this user, and should be adjusted.
But enough of making it complex. Let’s make it simple.
Assuming we’re working with just two values, ReviewCount and AverageRating, for a particular item, it would make sense to me to look at ReviewCount as essentially being the “reliability” value. But we don’t just want to bring scores down for low-ReviewCount items: a single one-star rating is probably as unreliable as a single 5-star rating. So what we want to do is probably average towards the middle: 3.
So, basically, I’m thinking of an equation something like X * AverageRating + Y * 3 = the-rating-we-want. In order to make this value come out right we need X+Y to equal 1. Also we need X to increase in value as ReviewCount increases...with a review count of 0, x should be 0 (giving us an equation of “3”), and with an infinite review count X should be 1 (which makes the equation = AverageRating).
So what are the X and Y equations? For the X equation, we want the dependent variable to asymptotically approach 1 as the independent variable approaches infinity. A good set of equations is something like:
Y = 1/(factor^RatingCount)
and (utilizing the fact that X must be equal to 1-Y)
X = 1 - (1/(factor^RatingCount))
Then we can adjust "factor" to fit the range that we're looking for.
I used this simple C# program to try a few factors:
using System;

class RatingDemo
{
    static void Main()
    {
        // We can adjust this factor to adjust our curve.
        double factor = 1.5;

        // Here's some sample data
        double RatingAverage1 = 5;
        double RatingCount1 = 1;

        double RatingAverage2 = 4.5;
        double RatingCount2 = 5;

        double RatingAverage3 = 3.5;
        double RatingCount3 = 50000; // 50000 is not infinite, but it's probably plenty to closely simulate it.

        // Do the calculations
        double modfactor = Math.Pow(factor, RatingCount1);
        double modRating1 = (3 / modfactor)
            + (RatingAverage1 * (1 - 1 / modfactor));

        double modfactor2 = Math.Pow(factor, RatingCount2);
        double modRating2 = (3 / modfactor2)
            + (RatingAverage2 * (1 - 1 / modfactor2));

        double modfactor3 = Math.Pow(factor, RatingCount3);
        double modRating3 = (3 / modfactor3)
            + (RatingAverage3 * (1 - 1 / modfactor3));

        Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
            RatingAverage1, RatingCount1, modRating1));
        Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
            RatingAverage2, RatingCount2, modRating2));
        Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
            RatingAverage3, RatingCount3, modRating3));

        // Hold up for the user to read the data.
        Console.ReadLine();
    }
}
So you don’t have to bother copying it in, it gives this output:
RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50
Something like that? You could obviously adjust the "factor" value as needed to get the kind of weighting you want.
You could sort by median instead of arithmetic mean. In this case both examples have a median of 5, so both would have the same weight in a sorting algorithm.
You could use a mode to the same effect, but median is probably a better idea.
If you want to assign additional weight to the product with 100 5-star ratings, you'll probably want to go with some kind of weighted mode, assigning more weight to ratings with the same median, but with more overall votes.
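For instance, a small Python sketch of the plain median idea, using the same (n5, n4, n3, n2, n1) count tuples as the starsort example earlier:

from statistics import median

def median_star(ns):
    # Expand the counts back into individual star values and take the median.
    return median(star for star, n in zip(range(len(ns), 0, -1), ns)
                  for _ in range(n))

items = [(3, 0, 0, 0, 0), (100, 0, 0, 2, 0)]
print(sorted(items, key=median_star, reverse=True))  # both have median 5, so they tie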
If you just need a fast and cheap solution that will mostly work without using a lot of computation, here's one option (assuming a 1-5 rating scale):
SELECT Products.id, Products.title, avg(Ratings.score), etc
FROM
Products INNER JOIN Ratings ON Products.id=Ratings.product_id
GROUP BY
Products.id, Products.title
ORDER BY (SUM(Ratings.score)+25.0)/(COUNT(Ratings.id)+20.0) DESC, COUNT(Ratings.id) DESC
By adding 25 to the sum of scores and 20 to the count of ratings, you're effectively padding every product with 20 phantom ratings averaging 1.25 and then sorting accordingly.
This does have known issues. For example, it unfairly rewards low-scoring products with few ratings (as this graph demonstrates, products with an average score of 1 and just one rating score a 1.2 while products with an average score of 1 and 1k+ ratings score closer to 1.05). You could also argue it unfairly punishes high-quality products with few ratings.
This chart shows what happens for all 5 ratings over 1-1000 ratings:
http://www.wolframalpha.com/input/?i=Plot3D%5B%2825%2Bxy%29/%2820%2Bx%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
You can see the dip upwards at the very bottom ratings, but overall it's a fair ranking, I think. You can also look at it this way:
http://www.wolframalpha.com/input/?i=Plot3D%5B6-%28%2825%2Bxy%29/%2820%2Bx%29%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
If you drop a marble on most places in this graph, it will automatically roll towards products with both higher scores and higher ratings.
Obviously, the low number of ratings puts this problem at a statistical handicap. Nevertheless...
A key element in improving the quality of an aggregate rating is to "rate the rater", i.e. to keep tabs on the ratings each particular "rater" has supplied (relative to others). This allows weighting their votes during the aggregation process.
Another solution, more of a cop-out, is to supply the end users with a count (or a range indication thereof) of votes for the underlying item.
One option is something like Microsoft's TrueSkill system, where the score is given by mean - 3*stddev, where the constants can be tweaked.
After looking for a while, I chose the Bayesian system.
If someone is using Ruby, here is a gem for it:
https://github.com/wbotelhos/rating
I'd highly recommend the book Programming Collective Intelligence by Toby Segaran (O'Reilly), ISBN 978-0-596-52932-1, which discusses how to extract meaningful data from crowd behaviour. The examples are in Python, but it's easy enough to convert them.
