Flooding Bayesian rating creates values out of range - algorithm

I'm trying to apply the Bayesian rating formula, but if I flood an item with 1-star (out of 5) votes hundreds of thousands of times, the final rating ends up greater than 5.
For example, a given item has no votes, and after voting 170,000 times with 1 star, its final rating is 5.23. If I only vote 100 times, the value stays in the normal range.
Here is what I have in PHP.
<?php
// these values came from the DB
$total_votes  = 2936;    // total number of votes for all items
$total_rating = 582.955; // sum of all ratings
$total_items  = 202;

// now the specific item; it has no votes yet
$this_num_votes = 0;
$this_score     = 0;
$this_rating    = 0;

// simulating a lot of votes with 1 star
for ($i = 0; $i < 170000; $i++) {
    $rating_sent = 1; // the new rating, always 1

    $total_votes++;                               // adding 1 to the total
    $total_rating = $total_rating + $rating_sent; // adding the rating to the total

    $avg_num_votes = ($total_votes / $total_items);  // average number of votes across all items
    $avg_rating    = ($total_rating / $total_items); // average rating across all items

    $this_num_votes = $this_num_votes + 1;           // number of votes for this item
    $this_score     = $this_score + $rating_sent;    // sum of all votes for this item
    $this_rating    = $this_score / $this_num_votes; // rating for this item

    $bayesian_rating = (($avg_num_votes * $avg_rating) + ($this_num_votes * $this_rating)) / ($avg_num_votes + $this_num_votes);
}
echo $bayesian_rating;
?>
Even if I flood with ratings of 1 or 2:
$rating_sent = rand(1,2);
the final rating after 100,000 votes is still over 5.
I just did a new test using
$rating_sent = rand(1,5);
and after 100,000 votes I got a value completely out of range (10.53). I know that in a normal situation no item will get 170,000 votes while all the other items get none, but I wonder if there is something wrong with my code or if this is expected behavior of the Bayesian formula under massive vote counts.
Edit
Just to make it clear, here is a better explanation for some variables.
$avg_num_votes // SUM(votes given to all items)/COUNT(all items)
$avg_rating // SUM(rating of all items)/COUNT(all items)
$this_num_votes // COUNT(votes given for this item)
$this_score // SUM(rating for this item)
$bayesian_rating // is the formula itself
The formula is: ( (avg_num_votes * avg_rating) + (this_num_votes * this_rating) ) / (avg_num_votes + this_num_votes). Taken from here

You need to divide by total_votes rather than total_items when calculating avg_rating.
I made the changes and got something that behaves much better here.
http://codepad.org/gSdrUhZ2
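For illustration, here is a minimal sketch of the corrected calculation (JavaScript for brevity; seed numbers are taken from the question and variable names mirror the PHP):
// corrected averages: avg_rating is the total rating divided by total votes, not by total items
var totalItems   = 202;
var totalVotes   = 2936 + 170000;    // all votes, including the simulated flood of 1-star votes
var totalRating  = 582.955 + 170000; // sum of all ratings
var thisNumVotes = 170000;
var thisRating   = 1;

var avgNumVotes = totalVotes / totalItems;  // average number of votes per item
var avgRating   = totalRating / totalVotes; // divide by votes, not by items
var bayesianRating =
    (avgNumVotes * avgRating + thisNumVotes * thisRating) /
    (avgNumVotes + thisNumVotes);
console.log(bayesianRating); // ~1.0 here; a weighted average of two in-scale ratings can no longer exceed 5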

Related

dc.js and crossfilter second level aggregation to average count per hour

I am trying to slightly extend the problem described in this question:
dc.js and crossfilter reduce average counts per day of week
I would like to chart average counts per hour of the day. I have followed the solution above, counting the values by day in the custom reduce with the only change being to dimension by hour of day. This seems to work well and can be seen in the following fiddle:
http://jsfiddle.net/dolomite/6eeahs6z/73/
The top bar chart shows the average counts by hour, the lower chart the total counts by hour. So hour 22 has a total count of 47 and average count of 4.2727... There are 11 days in the data so this is correct.
However, when I click on the weekday row chart and filter for Sunday, I get a total count for hour 22 of 4 and an average of 0.3636... The denominator used to calculate the average still includes all days in the data, irrespective of the weekday I filter by. So while the total count correctly filters down to 4 for Sunday, it is being divided by the total number of days in the data, whereas it should only be divided by the number of days matching the current filter.
I know the solution lies in modifying the custom reduce but I am stuck! Any pointers on where I am going wrong would be gratefully received.
hourAvgGroup = hourDim.group().reduce(
    function (p, v) { // add
        var day = d3.time.day(v.EventDate).getTime();
        p.map.set(day, p.map.has(day) ? p.map.get(day) + 1 : 1);
        p.avg = average_map(p.map);
        return p;
    },
    function (p, v) { // remove
        var day = d3.time.day(v.EventDate).getTime();
        p.map.set(day, p.map.has(day) ? p.map.get(day) - 1 : 0);
        p.avg = average_map(p.map);
        return p;
    },
    function () { // init
        return { map: d3.map(), avg: 0 };
    }
);

function average_map(m) {
    var sum = 0;
    m.forEach(function (k, v) {
        sum += v;
    });
    return m.size() ? sum / m.size() : 0;
}
m.size() counts up the number of keys in the map. The problem is that even if a day has 0 records assigned to it, the key is still there, so m.size() counts it in the denominator. The solution is to remove the key when the count gets to 0. There are probably more efficient ways to do this, but the simplest solution is to add one line to your remove function in the custom reducer so that the function looks like this:
function (p, v) { // remove
    var day = d3.time.day(v.EventDate).getTime();
    p.map.set(day, p.map.has(day) ? p.map.get(day) - 1 : 0);
    // If the day has 0 records, remove the key
    if (p.map.has(day) && p.map.get(day) == 0) p.map.remove(day);
    p.avg = average_map(p.map);
    return p;
},
By the way, I would also recommend not including the actual average and average calculation in your group. Calculate it in the dc.js chart valueAccessor instead. The reducer is run once for every record added or removed. The valueAccessor is only run once per filter operation.
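As a rough sketch of that approach (the chart variable name here is assumed, not from the question), the group would keep only the per-day map and the chart would compute the average when it draws:
hourAvgChart
    .dimension(hourDim)
    .group(hourAvgGroup)
    .valueAccessor(function (d) {
        // d.value is the { map: ..., avg: ... } object produced by the reducer above
        return average_map(d.value.map);
    });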

Highstock custom min max approximation works properly only at some ranges

I am feeding a Highstock chart with data from MySQL. Values are stored every minute, so if you want to look at data over the past 3 months, the points are grouped. Highstock's default approximation functions only provide low, high, average and sum values. The min and max values are the most important ones for me, so I made my own approximation function:
approximation: function (arr) {
    // first time, or the point was precalculated
    if (!gInfo || gInfo.nextPoint) {
        // first time: return the first value (arr[0])
        var point = gInfo ? gInfo.nextPoint : arr[0];
        // save current data for the next iteration
        gInfo = { prev: arr, nextPoint: null };
        return point;
    } else {
        var prev = gInfo.prev,
            // concat current group with the previous one
            data = prev.concat(arr),
            // get min, max and their positions
            min = Math.min.apply(null, data),
            max = Math.max.apply(null, data),
            minIdx = data.indexOf(min),
            maxIdx = data.indexOf(max),
            // order min and max
            aprox = minIdx < maxIdx ? [min, max] : [max, min];
        // save the next approximation and return the current one
        gInfo.nextPoint = aprox[1];
        return aprox[0];
    }
},
Actually I didn't write it myself; I found it here on the forum.
The problem is that it gives the right results only at some ranges, as shown in the pictures below:
First picture, at max range (not OK): you can't see every min value.
As I change the range to a smaller one, I can see every min value; the second picture shows how it should look at max range.
It also happens when I zoom in so that the data is grouped into two-minute intervals and I just scroll to the left or to the right.
At first I thought it had something to do with the way groups are made, but changing groupPixelWidth to any value did not help.
Having the min and max is really important for me, and I hope this is something I can solve in Highstock.
There seems to be an error in the approximation function.
In the third line:
if ( !gInfo || gInfo.nextPoint) {
the if should evaluate to false when !gInfo is false (which it is after the first call) AND gInfo.nextPoint is falsy, but gInfo.nextPoint is falsy not only when it is null (as set in the function), but also when it is zero. Changing the condition to:
if (!gInfo || gInfo.nextPoint !== null) {
fixes the problem.
Example with the error (before the fix): http://jsfiddle.net/p2qvx24a/1/
Example with the fix: http://jsfiddle.net/p2qvx24a/

Efficient way to generate a seemingly random permutation from a very large set without repeating?

I have a very large set (billions of elements or more; it's expected to grow exponentially to some level), and I want to generate seemingly random elements from it without repeating. I know I could pick a random number each time and record the elements I have already generated, but that takes more and more memory as numbers are generated and wouldn't be practical after a couple million elements.
I mean, I could emit 1, 2, 3 and so on up to billions, each in constant time and without remembering all the previous values, or 1, 3, 5, 7, 9, ... followed by 2, 4, 6, 8, 10, ..., but is there a more sophisticated way to do this that eventually yields a seemingly random permutation of the set?
Update
1. The set does not change size during the generation process. I meant that as the user's input increases linearly, the size of the set increases exponentially.
2. In short, the set is like the set of every integer from 1 to 10 billion or more.
3. In long: it goes up that far because each element carries the information of many independent choices. Imagine an RPG character that has 10 attributes, each of which can go from 1 to 100 (for my problem, different choices can have different ranges), so there are 10^20 possible characters; the number "10873456879326587345" would correspond to a character with attributes "11, 88, 35, ...". I would like an algorithm to generate them one by one without repeating, but make the sequence look random.
Thanks for the interesting question. You can create a "pseudorandom"* (cyclic) permutation with only a few bytes of state using modular exponentiation. Say we have n elements. Search for a prime p that's bigger than n+1. Then find a primitive root g modulo p. By the definition of a primitive root, the map x --> (g * x) % p is a cyclic permutation of {1, ..., p-1}, and so x --> ((g * (x+1)) % p) - 1 is a cyclic permutation of {0, ..., p-2}. We can get a cyclic permutation of {0, ..., n-1} by repeating the previous permutation whenever it gives a value bigger than or equal to n.
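To make the idea concrete, here is a tiny JavaScript sketch with hand-picked numbers (n = 10 elements, p = 13 is a prime bigger than n + 1, and g = 2 is a primitive root modulo 13):
const n = 10n, p = 13n, g = 2n;
function next(x) {                     // x is in 0 .. n-1
    do {
        x = ((g * (x + 1n)) % p) - 1n; // cyclic permutation of {0, ..., p-2}
    } while (x >= n);                  // "walk" past values that fall outside {0, ..., n-1}
    return x;
}
let x = 0n;
for (let i = 0; i < 10; i++) { console.log(Number(x)); x = next(x); }
// prints each of 0..9 exactly once, in a fixed but scrambled-looking order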
I implemented this idea as a Go package. https://github.com/bwesterb/powercycle
package main

import (
    "fmt"

    "github.com/bwesterb/powercycle"
)

func main() {
    var x uint64
    cycle := powercycle.New(10)
    for i := 0; i < 10; i++ {
        fmt.Println(x)
        x = cycle.Apply(x)
    }
}
This outputs something like
0
6
4
1
2
9
3
5
8
7
but that might vary, of course, depending on the generator chosen.
It's fast, but not super-fast: on my five-year-old i7 it takes less than 210 ns to compute one application of a cycle on 1000000000000000 elements. More details:
BenchmarkNew10-8 1000000 1328 ns/op
BenchmarkNew1000-8 500000 2566 ns/op
BenchmarkNew1000000-8 50000 25893 ns/op
BenchmarkNew1000000000-8 200000 7589 ns/op
BenchmarkNew1000000000000-8 2000 648785 ns/op
BenchmarkApply10-8 10000000 170 ns/op
BenchmarkApply1000-8 10000000 173 ns/op
BenchmarkApply1000000-8 10000000 172 ns/op
BenchmarkApply1000000000-8 10000000 169 ns/op
BenchmarkApply1000000000000-8 10000000 201 ns/op
BenchmarkApply1000000000000000-8 10000000 204 ns/op
Why did I say "pseudorandom"? Well, we are always creating a very specific kind of cycle: namely one that uses modular exponentiation. It looks pretty pseudorandom though.
I would use a random number and swap it with an element at the beginning of the set.
Here's some pseudo code
set = [1, 2, 3, 4, 5, 6]
picked = 0

Function PickNext(set, picked)
    If picked > Len(set) - 1 Then
        Return Nothing
    End If

    // random number between picked (inclusive) and length (exclusive)
    r = RandomInt(picked, Len(set))

    // swap the picked element to the beginning of the set
    result = set[r]
    set[r] = set[picked]
    set[picked] = result

    // update picked
    picked++

    // return your next random element
    Return result
End Function
Every time you pick an element there is one swap, and the only extra memory being used is the picked variable. The swap can happen whether the elements are in a database or in memory.
EDIT Here's a jsfiddle of a working implementation http://jsfiddle.net/sun8rw4d/
JavaScript
var set = [];
set.picked = 0;

function pickNext(set) {
    if (set.picked > set.length - 1) { return null; }

    var r = set.picked + Math.floor(Math.random() * (set.length - set.picked));

    var result = set[r];
    set[r] = set[set.picked];
    set[set.picked] = result;

    set.picked++;
    return result;
}

// testing
for (var i = 0; i < 100; i++) {
    set.push(i);
}
while (pickNext(set) !== null) { }
document.body.innerHTML += set.toString();
EDIT 2 Finally, a random binary walk of the set. This can be accomplished with O(log2(N)) stack space (memory), which for 10 billion elements is only about 33 entries. There is no shuffling or swapping involved. Using ternary instead of binary might yield even better pseudo-random results.
// on-the-fly set generator
var count = 0;
var maxValue = 64;

function nextElement() {
    // restart the generation
    if (count == maxValue) {
        count = 0;
    }
    return count++;
}

// code to pseudo-randomly select elements
var current = 0;
var stack = [0, maxValue - 1];

function randomBinaryWalk() {
    if (stack.length == 0) { return null; }

    var high = stack.pop();
    var low = stack.pop();
    var mid = ((high + low) / 2) | 0;

    // pseudo-randomly choose the next path
    if (Math.random() > 0.5) {
        if (low <= mid - 1) {
            stack.push(low);
            stack.push(mid - 1);
        }
        if (mid + 1 <= high) {
            stack.push(mid + 1);
            stack.push(high);
        }
    } else {
        if (mid + 1 <= high) {
            stack.push(mid + 1);
            stack.push(high);
        }
        if (low <= mid - 1) {
            stack.push(low);
            stack.push(mid - 1);
        }
    }

    // how many elements to skip
    var toMid = (current < mid ? mid - current : (maxValue - current) + mid);

    // skip elements
    for (var i = 0; i < toMid - 1; i++) {
        nextElement();
    }
    current = mid;

    // get the result
    return nextElement();
}

// test
var result;
var list = [];
do {
    result = randomBinaryWalk();
    list.push(result);
} while (result !== null);
document.body.innerHTML += '<br/>' + list.toString();
Here are the results from a couple of runs with a small set of 64 elements. JSFiddle: http://jsfiddle.net/yooLjtgu/
30,46,38,34,36,35,37,32,33,31,42,40,41,39,44,45,43,54,50,52,53,51,48,47,49,58,60,59,61,62,56,57,55,14,22,18,20,19,21,16,15,17,26,28,29,27,24,25,23,6,2,4,5,3,0,1,63,10,8,7,9,12,11,13
30,14,22,18,16,15,17,20,19,21,26,28,29,27,24,23,25,6,10,8,7,9,12,13,11,2,0,63,1,4,5,3,46,38,42,44,45,43,40,41,39,34,36,35,37,32,31,33,54,58,56,55,57,60,59,61,62,50,48,49,47,52,51,53
As I mentioned in my comment, unless you have an efficient way to skip to a specific point in your "on the fly" generation of the set this will not be very efficient.
If it is enumerable, then use a pseudo-random integer generator adjusted to the period 0 .. 2^n - 1, where the upper bound 2^n is just greater than the size of your set, and generate pseudo-random integers, discarding those greater than the size of your set. Use those integers to index items from your set.
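As an illustration (not from the answer itself), a rough sketch using a full-period linear congruential generator; the constants are the common Numerical Recipes LCG parameters, which give full period for a power-of-two modulus by the Hull-Dobell theorem:
const M = 2n ** 34n;      // smallest power of two above ~10 billion
const A = 1664525n;       // multiplier, A ≡ 1 (mod 4)
const C = 1013904223n;    // odd increment
let state = 0n;
function nextIndex(setSize) {           // setSize is assumed to be <= 2^34
    do {
        state = (A * state + C) % M;    // visits every value in 0 .. M-1 once per period
    } while (state >= BigInt(setSize)); // discard values outside the set
    return state;
}
// e.g. nextIndex(10000000000) yields each index below 10 billion exactly once per full period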
Pre-compute a series of indices (e.g. in a file) that has the properties you need, then randomly choose a start index for your enumeration and use the series in a round-robin manner.
The length of your pre-computed series should be > the maximum size of the set.
If you combine this (depending on your programming language etc.) with file mappings, your final nextIndex(INOUT state) function is (nearly) as simple as return mappedIndices[state++ % PERIOD];, if you have a fixed size of each entry (e.g. 8 bytes -> uint64_t).
Of course, the returned value could be > your current set size. Simply draw indices until you get one which is <= your set's current size.
Update (In response to question-update):
There is another option to achieve your goal if it is about creating 10 billion unique characters in your RPG: generate a GUID and write yourself a function which computes your number from the GUID. man uuid if you are on a Unix system, else google it. Some parts of the UUID are not random but contain meta-info; some parts are either systematic (such as your network card's MAC address) or random, depending on the generator algorithm. But they are very, very likely to be unique. So, whenever you need a new unique number, generate a UUID and transform it to your number by means of some algorithm which basically maps the UUID bytes to your number in a non-trivial way (e.g. use hash functions).
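A very small sketch of that idea (assuming crypto.randomUUID is available, i.e. a modern browser or recent Node; the mapping here is deliberately simple and would be replaced by whatever transform you prefer):
function uuidToNumber() {
    const hex = crypto.randomUUID().replace(/-/g, ''); // the 128 bits of the UUID as hex
    return BigInt('0x' + hex);                         // one big integer per generated character
}
// the caller would then decode this integer into the individual attribute values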

Algorithm: Determine if a combination of min/max values fall within a given range

Imagine you have 3 buckets, but each of them has a hole in it, and you're trying to fill a bathtub. The bathtub has a minimum level of water it needs and a maximum level of water it can contain. By the time you reach the tub with a bucket, it is not clear how much water will be left in it, but you have a range of possible values.
Is it possible to adequately fill the tub with water?
Pretty much: you have 3 ranges (min, max); is there some sum of them that is guaranteed to fall within a 4th range?
For example:
Bucket 1 : 5-10L
Bucket 2 : 15-25L
Bucket 3 : 10-50L
Bathtub 100-150L
Is there some guaranteed combination of 1 2 and 3 that will fill the bathtub within the requisite range? Multiples of each bucket can be used.
EDIT: Now imagine there are 50 different buckets?
If the capacity of the tub is not very large (say, not greater than 10^6), we can solve it using dynamic programming.
Approach:
Initialization: memo[X][Y] is an array to memoize the results, where X = number of buckets and Y = maximum capacity of the tub. Initialize memo[][] with -1.
Code:
bool dp(int bucketNum, int curVolume){
    if(curVolume > maxCap) return false; // pruning extra branches
    if(curVolume >= minCap && curVolume <= maxCap){ // base case on success
        return true;
    }
    int &ret = memo[bucketNum][curVolume];
    if(ret != -1){ // this state has been visited earlier; return the memoized result
        return ret;
    }
    ret = false;
    for(int i = minC[bucketNum]; i <= maxC[bucketNum]; i++){
        int newVolume = curVolume + i;
        for(int j = bucketNum; j <= 3; j++){
            ret |= dp(j, newVolume);
            if(ret == true) return ret;
        }
    }
    return ret;
}
Warning: Code not tested
Here's a naïve recursive solution in python that works just fine (although it doesn't find an optimal solution):
def match_helper(lower, upper, units, least_difference, fail = dict()):
    if upper < lower + least_difference:
        return None
    if fail.get((lower,upper)):
        return None

    exact_match = [ u for u in units if u['lower'] >= lower and u['upper'] <= upper ]
    if exact_match:
        return [ exact_match[0] ]

    for unit in units:
        if unit['upper'] > upper:
            continue

        recursive_match = match_helper(lower - unit['lower'], upper - unit['upper'], units, least_difference)

        if recursive_match:
            return [unit] + recursive_match
        else:
            fail[(lower,upper)] = 1

    return None

def match(lower, upper):
    units = [
        { 'name': 'Bucket 1', 'lower': 5, 'upper': 10 },
        { 'name': 'Bucket 2', 'lower': 15, 'upper': 25 },
        { 'name': 'Bucket 3', 'lower': 10, 'upper': 50 }
    ]
    least_difference = min([ u['upper'] - u['lower'] for u in units ])

    return match_helper(
        lower = lower,
        upper = upper,
        units = sorted(units, key = lambda u: u['upper']),
        least_difference = min([ u['upper'] - u['lower'] for u in units ]),
    )

result = match(100, 175)

if result:
    lower = sum([ u['lower'] for u in result ])
    upper = sum([ u['upper'] for u in result ])
    names = [ u['name'] for u in result ]
    print lower, "-", upper
    print names
else:
    print "No solution"
It prints "No solution" for 100-150, but for 100-175 it comes up with a solution of 5x bucket 1, 5x bucket 2.
Assuming you are saying that the "range" for each bucket is the amount of water that it may have when it reaches the tub, and all you care about is if they could possibly fill the tub...
Just take the "max" of each bucket and sum them. If that is in the range of what you consider the tub to be "filled", then it can be.
Updated:
Given that buckets can be used multiple times, this seems to me like we're looking for solutions to a pair of equations.
Given buckets x, y and z we want to find a, b and c:
a*x.min + b*y.min + c*z.min >= bathtub.min
and
a*x.max + b*y.max + c*z.max <= bathtub.max
Re: http://en.wikipedia.org/wiki/Diophantine_equation
If bathtub.min and bathtub.max are both multiples of the greatest common divisor of the corresponding bucket values (the coefficients of a, b and c above), then there are infinitely many solutions (i.e. we can fill the tub); otherwise there are no solutions (i.e. we can never fill the tub).
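For the three-bucket example, a small brute-force sketch of that pair of inequalities could look like this (the search bound is an arbitrary assumption):
function findCounts(buckets, tub, limit) {
    limit = limit || 50; // arbitrary upper bound on how many times each bucket is used
    for (var a = 0; a <= limit; a++)
        for (var b = 0; b <= limit; b++)
            for (var c = 0; c <= limit; c++) {
                var min = a * buckets[0].min + b * buckets[1].min + c * buckets[2].min;
                var max = a * buckets[0].max + b * buckets[1].max + c * buckets[2].max;
                if (min >= tub.min && max <= tub.max) return { a: a, b: b, c: c };
            }
    return null; // no combination satisfies both inequalities
}
// findCounts([{min:5,max:10},{min:15,max:25},{min:10,max:50}], {min:100,max:175}) -> { a: 0, b: 7, c: 0 }
// findCounts([{min:5,max:10},{min:15,max:25},{min:10,max:50}], {min:100,max:150}) -> null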
This can be solved with multiple applications of the change making problem.
Each Bucket.Min value is a currency denomination, and Bathtub.Min is the target value.
When you find a solution via a change-making algorithm, then apply one more constraint:
sum(each Bucket.Max in your solution) <= Bathtub.max
If this constraint is not met, throw out this solution and look for another. This will probably require a change to a standard change-making algorithm that allows you to try other solutions when one is found to not be suitable.
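One way to fold that extra constraint directly into a coin-change style DP (a sketch of one possible variation, not the only option) is to track, for every exact sum of bucket minimums, the smallest achievable sum of maximums:
function canFill(buckets, tub) {
    // best[v] = smallest possible sum of maxes over multisets of buckets whose mins sum to exactly v
    var best = new Array(tub.min + 1).fill(Infinity);
    best[0] = 0;
    for (var v = 1; v <= tub.min; v++) {
        for (var i = 0; i < buckets.length; i++) {
            var b = buckets[i];
            if (b.min <= v && best[v - b.min] + b.max < best[v]) {
                best[v] = best[v - b.min] + b.max;
            }
        }
    }
    return best[tub.min] <= tub.max; // some exact-target solution also keeps the maxes low enough
}
// canFill([{min:5,max:10},{min:15,max:25},{min:10,max:50}], {min:100,max:175}) -> true
// canFill([{min:5,max:10},{min:15,max:25},{min:10,max:50}], {min:100,max:150}) -> false
Like the change-making framing above, this targets Bathtub.Min exactly; solutions whose minimums overshoot the target are not considered.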
Initially, your target range is Bathtub.Range.
Each time you add an instance of a bucket to the solution, you reduce the target range for the remaining buckets.
For example, using your example buckets and tub:
Target Range = 100..150
Let's say we want to add a Bucket1 to the candidate solution. That then gives us
Target Range = 95..140
because if the rest of the buckets in the solution total < 95, then this Bucket1 might not be sufficient to fill the tub to 100, and if the rest of the buckets in the solution total > 140, then this Bucket1 might fill the tub over 150.
So, this gives you a quick way to check if a candidate solution is valid:
TargetRange = Bathtub.Range
foreach Bucket in CandidateSolution
    TargetRange.Min -= Bucket.Min
    TargetRange.Max -= Bucket.Max
if TargetRange.Min <= 0 AND TargetRange.Max >= 0 then solution found
if TargetRange.Max < 0 then solution is invalid
This still leaves the question - How do you come up with the set of candidate solutions?
Brute force would try all possible combinations of buckets.
Here is my solution for finding the optimal solution (least number of buckets). It compares the ratio of the maximums to the ratio of the minimums, to figure out the optimal number of buckets to fill the tub.
private static void BucketProblem()
{
    Range bathTub = new Range(100, 175);
    List<Range> buckets = new List<Range> { new Range(5, 10), new Range(15, 25), new Range(10, 50) };
    Dictionary<Range, int> result;
    bool canBeFilled = SolveBuckets(bathTub, buckets, out result);
}

private static bool BucketHelper(Range tub, List<Range> buckets, Dictionary<Range, int> results)
{
    Range bucket;
    int startBucket = -1;
    int fills = -1;

    for (int i = buckets.Count - 1; i >= 0; i--)
    {
        bucket = buckets[i];
        double maxRatio = (double)tub.Maximum / bucket.Maximum;
        double minRatio = (double)tub.Minimum / bucket.Minimum;
        if (maxRatio >= minRatio)
        {
            startBucket = i;
            if (maxRatio - minRatio > 1)
                fills = (int)minRatio + 1;
            else
                fills = (int)maxRatio;
            break;
        }
    }

    if (startBucket < 0)
        return false;

    bucket = buckets[startBucket];
    tub.Maximum -= bucket.Maximum * fills;
    tub.Minimum -= bucket.Minimum * fills;
    results.Add(bucket, fills);

    return tub.Maximum == 0 || tub.Minimum <= 0 || startBucket == 0 || BucketHelper(tub, buckets.GetRange(0, startBucket), results);
}

public static bool SolveBuckets(Range tub, List<Range> buckets, out Dictionary<Range, int> results)
{
    results = new Dictionary<Range, int>();
    buckets = buckets.OrderBy(b => b.Minimum).ToList();
    return BucketHelper(new Range(tub.Minimum, tub.Maximum), buckets, results);
}

Algorithm to order 'tag line' campaigns based on resulting sales

I want to be able to introduce new 'tag lines' into a database that are shown 'randomly' to users. (These tag lines are shown as an introduction, as animated text.)
Based upon the number of sales that result from those taglines I'd like the good ones to trickle to the top, but still show the others less frequently.
I could come up with a basic algorithm quite easily, but I want something that's a little more 'statistically accurate'.
I don't really know where to start. It's been a while since I've done anything more than basic statistics. My model would need to be sensitive to tolerances, but obviously it doesn't need to be worthy of a PhD.
Edit: I am currently tracking a 'conversion rate' - i.e. hits per order. This value would probably be best calculated as a cumulative 'all time' conversion rate to be fed into the algorithm.
Looking at your problem, I would modify the requirements a bit -
1) The most popular one should be shown most often.
2) Taglines should "age", so one that got a lot of votes (purchase) in the past, but none recently should be shown less often
3) Brand new taglines should be shown more often during their first days.
If you agree on those, then an algorithm could be something like:
START:
    x = random(1, 3);
    if x = 3 goto NEW else goto NORMAL

NEW:
    TagVec = Taglines.filterYounger(5 days); // I'm taking a LOT of liberties with the pseudo code...
    x = random(1, TagVec.Length);
    return TagVec[x-1]; // 0-indexed vectors, even in a made-up language

NORMAL:
    // Similar to EBGREEN above
    sum = 0;
    ForEach(TagLine in TagLines) {
        sum += TagLine.noOfPurchases;
    }
    x = random(1, sum);
    ForEach(TagLine in TagLines) {
        x -= TagLine.noOfPurchases;
        if (x <= 0) return TagLine; // find the TagLine that represents our random number
    }
Now, as a setup I would give every new tagline 10 purchases, to avoid a really big skew from one single purchase.
For the aging process, I would count a purchase older than a week as 0.8 of a purchase per week of age. So 1 week old gives 0.8 points, 2 weeks gives 0.8*0.8 = 0.64, and so forth...
You would have to play around with the initial purchases parameter (10 in my example), the aging speed (1 week here) and the aging factor (0.8 here) to find something that suits you.
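A small sketch of that decay, with the three tuning knobs from the paragraph above pulled out as constants (names are mine, not from the answer):
var INITIAL_PURCHASES = 10;  // head start given to every tagline
var AGING_FACTOR = 0.8;      // multiplier applied per week of age
var WEEK_MS = 7 * 24 * 3600 * 1000;

function taglineScore(purchaseDates, now) {
    now = now || Date.now();
    return purchaseDates.reduce(function (sum, date) {
        var ageInWeeks = Math.floor((now - date) / WEEK_MS);
        return sum + Math.pow(AGING_FACTOR, ageInWeeks); // 1 week old -> 0.8, 2 weeks -> 0.64, ...
    }, INITIAL_PURCHASES);
}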
I would suggest randomly choosing with a weighting factor based on previous sales. So let's say you had this:
tag1 = 1 sale
tag2 = 0 sales
tag3 = 1 sale
tag4 = 2 sales
tag5 = 3 sales
A simple weighting formula would be 1 + number of sales, so this would be the probability of selecting each tag:
tag1 = 2/12 = 16.7%
tag2 = 1/12 = 8.3%
tag3 = 2/12 = 16.7%
tag4 = 3/12 = 25%
tag5 = 4/12 = 33.3%
You could easily change the weighting formula to get just the distribution that you want.
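A minimal sketch of picking a tag with that 1 + sales weighting (the function and field names are mine, not from the question):
function pickTag(tags) {                       // tags: array of { name, sales }
    var weights = tags.map(function (t) { return 1 + t.sales; });
    var total = weights.reduce(function (a, b) { return a + b; }, 0);
    var r = Math.random() * total;
    for (var i = 0; i < tags.length; i++) {
        r -= weights[i];
        if (r <= 0) return tags[i];
    }
    return tags[tags.length - 1];              // guard against floating-point edge cases
}
// pickTag([{name:'tag1',sales:1},{name:'tag2',sales:0},{name:'tag3',sales:1},{name:'tag4',sales:2},{name:'tag5',sales:3}])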
You have to come up with a weighting formula based on sales.
I don't think there's any such thing as a "statistically accurate" formula here - it's all based on your preference.
No one can say "this is the correct weighting and the other weighting is wrong" because there isn't a final outcome you are attempting to model - this isn't like trying to weigh responses to a poll about an upcoming election (where you are trying to model results to represent something that will happen in the future).
Here's an example in JavaScript. Note that I'm not suggesting running this client side...
Also, there is a lot of optimization that can be done.
Note: createMemberInNormalDistribution() is implemented here: Converting a Uniform Distribution to a Normal Distribution
/*
 * an example set of taglines
 * hits are sales
 * views are times it has been shown
 */
var taglines = [
    {"tag":"tagline 1","hits":1,"views":234},
    {"tag":"tagline 2","hits":5,"views":566},
    {"tag":"tagline 3","hits":3,"views":421},
    {"tag":"tagline 4","hits":1,"views":120},
    {"tag":"tagline 5","hits":7,"views":200}
];

/* set up our stat model for the tags */
var TagModel = function(set){
    var hits, views, sumOfDiff, sumOfSqDiff;
    hits = views = sumOfDiff = sumOfSqDiff = 0;

    /* find average */
    for (n in set){
        hits += set[n].hits;
        views += set[n].views;
    }
    this.avg = hits/views;

    /* find standard deviation and variance */
    for (n in set){
        var diff = ((set[n].hits/set[n].views)-this.avg);
        sumOfDiff += diff;
        sumOfSqDiff += diff*diff;
    }
    this.variance = sumOfSqDiff/set.length;
    this.std_dev = Math.sqrt(sumOfSqDiff/set.length);

    /* return tag to use; fChooser determines likelihood of tag */
    this.getTag = function(fChooser){
        var m = this;
        set.sort(function(a,b){
            return fChooser((a.hits/a.views),(b.hits/b.views), m);
        });
        return set[0];
    };
};

var config = {
    "uniformDistribution":function(a,b,model){
        return Math.random()*b-Math.random()*a;
    },
    "normalDistribution":function(a,b,model){
        var a1 = createMemberInNormalDistribution(model.avg,model.std_dev)* a;
        var b1 = createMemberInNormalDistribution(model.avg,model.std_dev)* b;
        return b1-a1;
    },
    // say weight = 10^n... the higher n is, the more even the distribution will be.
    "weight": .5,
    "weightedDistribution":function(a,b,model){
        var a1 = createMemberInNormalDistribution(model.avg,model.std_dev*config.weight)* a;
        var b1 = createMemberInNormalDistribution(model.avg,model.std_dev*config.weight)* b;
        return b1-a1;
    }
};

var model = new TagModel(taglines);

// to use
model.getTag(config.uniformDistribution).tag;
// running 10000 times: ({'tagline 4':836, 'tagline 5':7608, 'tagline 1':100, 'tagline 2':924, 'tagline 3':532})

model.getTag(config.normalDistribution).tag;
// running 10000 times: ({'tagline 4':1775, 'tagline 5':3471, 'tagline 1':1273, 'tagline 2':1857, 'tagline 3':1624})

model.getTag(config.weightedDistribution).tag;
// running 10000 times: ({'tagline 4':1514, 'tagline 5':5045, 'tagline 1':577, 'tagline 2':1627, 'tagline 3':1237})

config.weight = 2;
model.getTag(config.weightedDistribution).tag;
// running 10000 times: ({'tagline 4':1941, 'tagline 5':2715, 'tagline 1':1559, 'tagline 2':1957, 'tagline 3':1828})
