I am having a hard time with the question below. I am not sure if I got it correct, but either way, I need some help futher understanding it if anyone has time to explain, please do.
Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attributes:
− Age (customer’s age, which is a real number with the maximum age is 90 years and minimum age 15 years)
− Cr (“credit rating”) which is ordinal attribute with values ‘very good’, ‘good, ‘medium’, ‘poor’, and ‘very poor’.
− Av_bal (avg account balance, which is a real number with mean 7000, standard deviation is 4000)
Using the L1 distance function computes the distance between the following 2 customers: c1 = (55, good, 7000) and c2 = (25, poor, 1000). [15 points]
Using the L2 distance function computes the distance between the above mentioned 2 customers
Using the L2 distance function computes the distance between the above mentioned 2 customers.
Answer with L1
d(c1,c2) = (c1.cr-c2.cr)/4 +(c1.avg.bal –c2.avg.bal/4000)* (c1.age-mean.age/std.age)-( c2.age-mean.age/std.age)
The question as is, leaves some room for interpretation. Mainly because similarity is not specified exactly. I will try to explain what the standard approach would be.
Usually, before you start, you want to normalize values such that they are rougly in the same range. Otherwise, your similarity will be dominated by the feature with the largest variance.
If you have no information about the distribution but just the range of the values you want to try to nomalize them to [0,1]. For your example this means
norm_age = (age-15)/(90-15)
For nominal values you want to find a mapping to ordinal values if you want to use Lp-Norms. Note: this is not always possible (e.g., colors cannot intuitively be mapped to ordinal values). In you case you can transform the credit rating like this
cr = {0 if ‘very good’, 1 if ‘good, 2 if ‘medium’, 3 if ‘poor’, 4 if ‘very poor’}
afterwards you can do the same normalization as for age
norm_cr = cr/4
Lastly, for normally distributed values you usually perform standardization by subtracting the mean and dividing by the standard deviation.
norm_av_bal = (av_bal-7000)/4000
Now that you have normalized your values, you can go ahead and define the distance functions:
L1(c1, c2) = |c1.norm_age - c2.norm_age| + |c1.norm_cr - c2.norm_cr |
+ |c1.norm_av_bal - c2.norm_av_bal|
and
L2(c1, c2) = sqrt((c1.norm_age - c2.norm_age)2 + (c1.norm_cr -
c2.norm_cr)2 + (c1.norm_av_bal -
c2.norm_av_bal)2)
I have a list of 6500 items that I would like to trade or invest in. (Not for real money, but for a certain game.) Each item has 5 numbers that will be used to rank it among the others.
Total quantity of item traded per day: The higher this number, the better.
The Donchian Channel of the item over the last 5 days: The higher this number, the better.
The median spread of the price: The lower this number, the better.
The spread of the 20 day moving average for the item: The lower this number, the better.
The spread of the 5 day moving average for the item: The higher this number, the better.
All 5 numbers have the same 'weight', or in other words, they should all affect the final number in the with the same worth or value.
At the moment, I just multiply all 5 numbers for each item, but it doesn't rank the items the way I would them to be ranked. I just want to combine all 5 numbers into a weighted number that I can use to rank all 6500 items, but I'm unsure of how to do this correctly or mathematically.
Note: The total quantity of the item traded per day and the donchian channel are numbers that are much higher then the spreads, which are more of percentage type numbers. This is probably the reason why multiplying them all together didn't work for me; the quantity traded per day and the donchian channel had a much bigger role in the final number.
The reason people are having trouble answering this question is we have no way of comparing two different "attributes". If there were just two attributes, say quantity traded and median price spread, would (20million,50%) be worse or better than (100,1%)? Only you can decide this.
Converting everything into the same size numbers could help, this is what is known as "normalisation". A good way of doing this is the z-score which Prasad mentions. This is a statistical concept, looking at how the quantity varies. You need to make some assumptions about the statistical distributions of your numbers to use this.
Things like spreads are probably normally distributed - shaped like a normal distribution. For these, as Prasad says, take z(spread) = (spread-mean(spreads))/standardDeviation(spreads).
Things like the quantity traded might be a Power law distribution. For these you might want to take the log() before calculating the mean and sd. That is the z score is z(qty) = (log(qty)-mean(log(quantities)))/sd(log(quantities)).
Then just add up the z-score for each attribute.
To do this for each attribute you will need to have an idea of its distribution. You could guess but the best way is plot a graph and have a look. You might also want to plot graphs on log scales. See wikipedia for a long list.
You can replace each attribute-vector x (of length N = 6500) by the z-score of the vector Z(x), where
Z(x) = (x - mean(x))/sd(x).
This would transform them into the same "scale", and then you can add up the Z-scores (with equal weights) to get a final score, and rank the N=6500 items by this total score. If you can find in your problem some other attribute-vector that would be an indicator of "goodness" (say the 10-day return of the security?), then you could fit a regression model of this predicted attribute against these z-scored variables, to figure out the best non-uniform weights.
Start each item with a score of 0. For each of the 5 numbers, sort the list by that number and add each item's ranking in that sorting to its score. Then, just sort the items by the combined score.
You would usually normalize your data entries to their respective range. Since there is no fixed range for them, you'll have to use a sliding range - or, to keep it simpler, normalize them to the daily ranges.
For each day, get all entries for a given type, get the highest and the lowest of them, determine the difference between them. Let Bottom=value of the lowest, Range=difference between highest and lowest. Then you calculate for each entry (value - Bottom)/Range, which will result in something between 0.0 and 1.0. These are the numbers you can continue to work with, then.
Pseudocode (brackets replaced by indentation to make easier to read):
double maxvalues[5];
double minvalues[5];
// init arrays with any item
for(i=0; i<5; i++)
maxvalues[i] = items[0][i];
minvalues[i] = items[0][i];
// find minimum and maximum values
foreach (items as item)
for(i=0; i<5; i++)
if (minvalues[i] > item[i])
minvalues[i] = item[i];
if (maxvalues[i] < item[i])
maxvalues[i] = item[i];
// now scale them - in this case, to the range of 0 to 1.
double scaledItems[sizeof(items)][5];
double t;
foreach(i=0; i<5; i++)
double delta = maxvalues[i] - minvalues[i];
foreach(j=sizeof(items)-1; j>=0; --j)
scaledItems[j][i] = (items[j][i] - minvalues[i]) / delta;
// linear normalization
something like that. I'll be more elegant with a good library (STL, boost, whatever you have on the implementation platform), and the normalization should be in a separate function, so you can replace it with other variations like log() as the need arises.
Total quantity of item traded per day: The higher this number, the better. (a)
The Donchian Channel of the item over the last 5 days: The higher this number, the better. (b)
The median spread of the price: The lower this number, the better. (c)
The spread of the 20 day moving average for the item: The lower this number, the better. (d)
The spread of the 5 day moving average for the item: The higher this number, the better. (e)
a + b -c -d + e = "score" (higher score = better score)
I'm trying to make a randomizer that will use the Monte Carlo Hit or Miss Simulation.
I have a Key-Value pair that represents the ID and the probability value:
ID - Value
2 - 0.37
1 - 0.35
4 - 0.14
3 - 0.12
When you add all of those values, you will get a total of 1.0.
You can imagine those values as the total area of a "slice" on the "wheel" (EG: ID 2 occupies 37% of the wheel, while ID 3 only occupies 12% of the wheel). When converted to "range" it will look like this:
ID - Value - Range
2 - 0.37 - 0 to 37
1 - 0.35 - 37 to 72
4 - 0.14 - 72 to 86
3 - 0.12- 86 to 100
Now, I am using Random.NextDouble() to generate a random value that is between 0.0 and 1.0. That random value will be considered as the "spin" on the wheel. Say, the randomizer returns 0.35, then ID 2 will be selected.
What is the best way to implement this given that I have an array of doubles?
The simplest solutions are often the best, if your range is 0 - 100 by design (or another manageebly small number), you can allocate an int[] and use the table of ranges you created to fill in the ID at the corresponding index, your "throw" will then look like:
int randomID = rangesToIDs[random.nextInt(rangesToIDs.length)];
Btw, it is not necessary to sort the ID's on range size, as the randoms are assumed to be distributed uniformly it does not matter where in the lookup table a range is placed. It only matters that the number of entries is proportional to the chance to throw an ID.
Let's assume your initial data is represented as array D[n], where D[i] = (id, p) and sum(D[i].p for i=0..n-1) == 1.
Build a second array P[n] such that P[i] = (q, id): P[i] = (sum(D[j].p for j in 0..i), D[j].id) -- i.e., convert individual probablity of each slice i into cumulative probability of all slices preceding i (inclusive). Note that, by definition, this array P is ordered by field q (i.e. by cumulative probability).
Now you can use binary search to find the slice chosen by the random number r (0 <= r <= 1):
find highest i such that P[i].q <= r; then P[i].id is your slice.
It is possible to speed up the lookup further by hashing the probability range with a fixed grid. I can write more details on this if anybody is interested.
As jk wrote sorted dictionary of should be fine.
let's say you got dictionary like this:
0.37 2
0.72 1
0.86 4
1.00 3
You roll xx = 0.66..
Iterate through dictionary starting from lowest number (that's 0.37)
if xx < dict[i].key
return dict[i].value
Or another solution which comes to my mind is List of custom objects containing lower and upper bound and value. You iterate then through list and check if rolled number is in range of up and low bounds.
a sorted map/dictionary with the 'Value' as the key and the 'ID' as the value would allow you to quickly find the upper bound of the range you are in and then look up the ID for that range
assuming your dictionary allows it, a binary search would be better to find the upper bound than interating throught the entire dictionary
boundaries = [37, 72, 86, 100]
num = 100 * random
for i in boundaries:
if num < i then return i
I'm trying to sort a bunch of products by customer ratings using a 5 star system. The site I'm setting this up for does not have a lot of ratings and continue to add new products so it will usually have a few products with a low number of ratings.
I tried using average star rating but that algorithm fails when there is a small number of ratings.
Example a product that has 3x 5 star ratings would show up better than a product that has 100x 5 star ratings and 2x 2 star ratings.
Shouldn't the second product show up higher because it is statistically more trustworthy because of the larger number of ratings?
Prior to 2015, the Internet Movie Database (IMDb) publicly listed the formula used to rank their Top 250 movies list. To quote:
The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:
R = average for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250 (currently 25000)
C = the mean vote across the whole report (currently 7.0)
For the Top 250, only votes from regular voters are considered.
It's not so hard to understand. The formula is:
rating = (v / (v + m)) * R +
(m / (v + m)) * C;
Which can be mathematically simplified to:
rating = (R * v + C * m) / (v + m);
The variables are:
R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
v – The number of votes for an item. (To given another example, if 5 people have cast votes on an item, v is 5.)
m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with less votes than m.
All the formula does is: add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulates, eventually the imaginary votes will be drowned out by real ones.
In this system, votes don't cause the rating to fluctuate wildly. Instead, they merely perturb it a bit in some direction.
When there are zero votes, only imaginary votes exist, and all of them are C. Thus, each item begins with a rating of C.
See also:
A demo. Click "Solve".
Another explanation of IMDb's system.
An explanation of a similar Bayesian star-rating system.
Evan Miller shows a Bayesian approach to ranking 5-star ratings:
where
nk is the number of k-star ratings,
sk is the "worth" (in points) of k stars,
N is the total number of votes
K is the maximum number of stars (e.g. K=5, in a 5-star rating system)
z_alpha/2 is the 1 - alpha/2 quantile of a normal distribution. If you want 95% confidence (based on the Bayesian posterior distribution) that the actual sort criterion is at least as big as the computed sort criterion, choose z_alpha/2 = 1.65.
In Python, the sorting criterion can be calculated with
def starsort(ns):
"""
http://www.evanmiller.org/ranking-items-with-star-ratings.html
"""
N = sum(ns)
K = len(ns)
s = list(range(K,0,-1))
s2 = [sk**2 for sk in s]
z = 1.65
def f(s, ns):
N = sum(ns)
K = len(ns)
return sum(sk*(nk+1) for sk, nk in zip(s,ns)) / (N+K)
fsns = f(s, ns)
return fsns - z*math.sqrt((f(s2, ns)- fsns**2)/(N+K+1))
For example, if an item has 60 five-stars, 80 four-stars, 75 three-stars, 20 two-stars and 25 one-stars, then its overall star rating would be about 3.4:
x = (60, 80, 75, 20, 25)
starsort(x)
# 3.3686975120774694
and you can sort a list of 5-star ratings with
sorted([(60, 80, 75, 20, 25), (10,0,0,0,0), (5,0,0,0,0)], key=starsort, reverse=True)
# [(10, 0, 0, 0, 0), (60, 80, 75, 20, 25), (5, 0, 0, 0, 0)]
This shows the effect that more ratings can have upon the overall star value.
You'll find that this formula tends to give an overall rating which is a bit
lower than the overall rating reported by sites such as Amazon, Ebay or Wal-mart
particularly when there are few votes (say, less than 300). This reflects the
higher uncertainy that comes with fewer votes. As the number of votes increases
(into the thousands) all overall these rating formulas should tend to the
(weighted) average rating.
Since the formula only depends on the frequency distribution of 5-star ratings
for the item itself, it is easy to combine reviews from multiple sources (or,
update the overall rating in light of new votes) by simply adding the frequency
distributions together.
Unlike the IMDb formula, this formula does not depend on the average score
across all items, nor an artificial minimum number of votes cutoff value.
Moreover, this formula makes use of the full frequency distribution -- not just
the average number of stars and the number of votes. And it makes sense that it
should since an item with ten 5-stars and ten 1-stars should be treated as
having more uncertainty than (and therefore not rated as highly as) an item with
twenty 3-star ratings:
In [78]: starsort((10,0,0,0,10))
Out[78]: 2.386028063783418
In [79]: starsort((0,0,20,0,0))
Out[79]: 2.795342687927806
The IMDb formula does not take this into account.
See this page for a good analysis of star-based rating systems, and this one for a good analysis of upvote-/downvote- based systems.
For up and down voting you want to estimate the probability that, given the ratings you have, the "real" score (if you had infinite ratings) is greater than some quantity (like, say, the similar number for some other item you're sorting against).
See the second article for the answer, but the conclusion is you want to use the Wilson confidence. The article gives the equation and sample Ruby code (easily translated to another language).
Well, depending on how complex you want to make it, you could have ratings additionally be weighted based on how many ratings the person has made, and what those ratings are. If the person has only made one rating, it could be a shill rating, and might count for less. Or if the person has rated many things in category a, but few in category b, and has an average rating of 1.3 out of 5 stars, it sounds like category a may be artificially weighed down by the low average score of this user, and should be adjusted.
But enough of making it complex. Let’s make it simple.
Assuming we’re working with just two values, ReviewCount and AverageRating, for a particular item, it would make sense to me to look ReviewCount as essentially being the “reliability” value. But we don’t just want to bring scores down for low ReviewCount items: a single one-star rating is probably as unreliable as a single 5 star rating. So what we want to do is probably average towards the middle: 3.
So, basically, I’m thinking of an equation something like X * AverageRating + Y * 3 = the-rating-we-want. In order to make this value come out right we need X+Y to equal 1. Also we need X to increase in value as ReviewCount increases...with a review count of 0, x should be 0 (giving us an equation of “3”), and with an infinite review count X should be 1 (which makes the equation = AverageRating).
So what are X and Y equations? For the X equation want the dependent variable to asymptotically approach 1 as the independent variable approaches infinity. A good set of equations is something like:
Y = 1/(factor^RatingCount)
and (utilizing the fact that X must be equal to 1-Y)
X = 1 – (1/(factor^RatingCount)
Then we can adjust "factor" to fit the range that we're looking for.
I used this simple C# program to try a few factors:
// We can adjust this factor to adjust our curve.
double factor = 1.5;
// Here's some sample data
double RatingAverage1 = 5;
double RatingCount1 = 1;
double RatingAverage2 = 4.5;
double RatingCount2 = 5;
double RatingAverage3 = 3.5;
double RatingCount3 = 50000; // 50000 is not infinite, but it's probably plenty to closely simulate it.
// Do the calculations
double modfactor = Math.Pow(factor, RatingCount1);
double modRating1 = (3 / modfactor)
+ (RatingAverage1 * (1 - 1 / modfactor));
double modfactor2 = Math.Pow(factor, RatingCount2);
double modRating2 = (3 / modfactor2)
+ (RatingAverage2 * (1 - 1 / modfactor2));
double modfactor3 = Math.Pow(factor, RatingCount3);
double modRating3 = (3 / modfactor3)
+ (RatingAverage3 * (1 - 1 / modfactor3));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage1, RatingCount1, modRating1));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage2, RatingCount2, modRating2));
Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
RatingAverage3, RatingCount3, modRating3));
// Hold up for the user to read the data.
Console.ReadLine();
So you don’t bother copying it in, it gives this output:
RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50
Something like that? You could obviously adjust the "factor" value as needed to get the kind of weighting you want.
You could sort by median instead of arithmetic mean. In this case both examples have a median of 5, so both would have the same weight in a sorting algorithm.
You could use a mode to the same effect, but median is probably a better idea.
If you want to assign additional weight to the product with 100 5-star ratings, you'll probably want to go with some kind of weighted mode, assigning more weight to ratings with the same median, but with more overall votes.
If you just need a fast and cheap solution that will mostly work without using a lot of computation here's one option (assuming a 1-5 rating scale)
SELECT Products.id, Products.title, avg(Ratings.score), etc
FROM
Products INNER JOIN Ratings ON Products.id=Ratings.product_id
GROUP BY
Products.id, Products.title
ORDER BY (SUM(Ratings.score)+25.0)/(COUNT(Ratings.id)+20.0) DESC, COUNT(Ratings.id) DESC
By adding in 25 and dividing by the total ratings + 20 you're basically adding 10 worst scores and 10 best scores to the total ratings and then sorting accordingly.
This does have known issues. For example, it unfairly rewards low-scoring products with few ratings (as this graph demonstrates, products with an average score of 1 and just one rating score a 1.2 while products with an average score of 1 and 1k+ ratings score closer to 1.05). You could also argue it unfairly punishes high-quality products with few ratings.
This chart shows what happens for all 5 ratings over 1-1000 ratings:
http://www.wolframalpha.com/input/?i=Plot3D%5B%2825%2Bxy%29/%2820%2Bx%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
You can see the dip upwards at the very bottom ratings, but overall it's a fair ranking, I think. You can also look at it this way:
http://www.wolframalpha.com/input/?i=Plot3D%5B6-%28%2825%2Bxy%29/%2820%2Bx%29%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D
If you drop a marble on most places in this graph, it will automatically roll towards products with both higher scores and higher ratings.
Obviously, the low number of ratings puts this problem at a statistical handicap. Never the less...
A key element to improving the quality of an aggregate rating is to "rate the rater", i.e. to keep tabs of the ratings each particular "rater" has supplied (relative to others). This allows weighing their votes during the aggregation process.
Another solution, more of a cope out, is to supply the end-users with a count (or a range indication thereof) of votes for the underlying item.
One option is something like Microsoft's TrueSkill system, where the score is given by mean - 3*stddev, where the constants can be tweaked.
After look for a while, I choose the Bayesian system.
If someone is using Ruby, here a gem for it:
https://github.com/wbotelhos/rating
I'd highly recommend the book Programming Collective Intelligence by Toby Segaran (OReilly) ISBN 978-0-596-52932-1 which discusses how to extract meaningful data from crowd behaviour. The examples are in Python, but its easy enough to convert.
I have 100 points to dole out to a set of items. Each item must receive a proportional number of points relative to others (a weight). Some items may have a weight of 0, but they must receive some points.
I tried to do it by giving each item 5 points, then doling out the remaining points proportionally, but when I have 21 items, the algorithm assigns 0 to one item. Of course, I could hand out 1 point initially, but then the problem remains for 101 items or more. Normally, this algorithm should deal with less than 20 items, but I want the algorithm to be robust in the face of more items.
I know using floats/fractions would be perfect, but the underlying system must receive integers, and the total must be 100.
This is framework / language agnostic, although I will implement in Ruby.
Currently, I have this (pseudo-code):
total_weight = sum_of(items.weight)
if total_weight == 0 then
# Distribute all points equally between each item
items.points = 100 / number_of_items
# Apply remaining points in case of rounding errors (100 / 3 == [33, 33, 34])
else
items.points = 5
points_to_dole_out = 100 - number_of_items * 5
for(item in items)
item.points += item.weight * total_weight / 100
end
end
First, give every item one point. This is to meet your requirement that all items get points. Then get the % of the total weight that each item has, and award it that % of the remaining points (round down).
There will be some portion of points left over. Sort the set of items by the size of their decimal parts, and then dole out the remaining points one at a time in order from biggest decimal part to smallest.
So if an item has a weight of twelve and the total weight of all items is 115, you would first award it 1 point. If there were 4 other items, there would be 110 points left after doling out the minimum scores. You would then award the item 10 points because its percentage of the total weight is 9.58 and 10.538 9.58% of 110. Then you would sort it based on the .538 and if it were near the top end, it might end up getting bumped to a 11.
The 101 case cannot be solved given the two constraints {total == 100 } { each item > 0 } So in order to be robust you must get a solution to that. That's a business problem not a technical one.
The minimum score case is actually a business problem too. Clearly the meaning of your results if you dole out a min of 5 per item is quite different from a min score of 1 - the gap between the low, non-zero score and the low scores is potentially compressed. Hence you really should get clarity from the users of the system rather than just pick a number: how will they use this data?