I am working on a compression algorithm in which I want to replace each run of consecutive identical digits with a shorthand form. The shorthand is not meaningful mathematically, but my algorithm will recognize it and convert it back to the original form.
Suppose I have this string:
string input = "732183900000000000002389288888888888888";
As you can see, it contains long runs of 0s and 8s; those are the major consecutive duplicates.
And now I want to convert those to:
//convert 000000000 to 0*9, meaning 9 times 0.
//convert 888888888 to 8*9, meaning 9 times 8.
string output = "7321839" +
"0*13" +
"23892" +
"8*14";
//or
string output = "7321839-0*13-23892-8*14";
Points to consider:
Any language that works on Windows is acceptable; for me the main thing is the algorithm.
Please keep performance in mind, as it will be used on big files.
To be honest, this is as simple as it gets:
Parse through the string one character at a time.
Check whether the current character is the same as the previous one.
If it is the same, increment a counter; otherwise reset the counter to 0.
If the counter is greater than one when you reset it, write the character followed by * and the run length to the result.
A sketch of this scan follows.
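For illustration, here is a minimal sketch of that scan in Python (the dash placement follows the question's example output; treat it as a starting point rather than a finished encoder):

def shorten(s):
    # Scan once; flush each run of identical characters when it ends.
    out = []
    run_start = 0
    for i in range(1, len(s) + 1):
        if i == len(s) or s[i] != s[run_start]:
            length = i - run_start
            if length > 1:
                # Emit the run as digit*count, separated by dashes.
                if out and out[-1] != "-":
                    out.append("-")
                out.append(f"{s[run_start]}*{length}")
                if i != len(s):
                    out.append("-")
            else:
                out.append(s[run_start])
            run_start = i
    return "".join(out)

print(shorten("732183900000000000002389288888888888888"))
# prints: 7321839-0*13-23892-8*14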
Regex might be a bit convoluted for this given the rules for dashes (although not impossible by any means).
Seemingly, you want the following:
Runs of the same digit with a count greater than 1 are collapsed
No prefix dash
No suffix dash
No double dashes (speculative)
Here is a fairly efficient C# O(n) implementation using StringBuilder, which in turn should allow you to work with exceedingly large strings with minimal allocations.
Given
public static string Shorten(string value)
{
    var sb = new StringBuilder(value.Length);

    int i, last;
    var isLastGroup = false;

    // Flushes the pending run [last, i) to the builder.
    void Write()
    {
        var isGroup = i - last > 1;
        var getDash = last == 0 || isLastGroup ? "" : "-";
        sb.Append(isGroup
            ? $"{getDash}{value[last]}*{i - last}{(i != value.Length ? "-" : "")}"
            : value[last].ToString());
        isLastGroup = isGroup;
        last = i;
    }

    for (i = 0, last = 0; i < value.Length; i++)
        if (value[last] != value[i])
            Write();

    Write();

    return sb.ToString();
}
Tests
Console.WriteLine(Shorten("1"));
Console.WriteLine(Shorten("111"));
Console.WriteLine(Shorten("1112"));
Console.WriteLine(Shorten("1222"));
Console.WriteLine(Shorten("12233344445555512345"));
Results
1
1*3
1*3-2
1-2*3
1-2*2-3*3-4*4-5*5-12345
I am trying to find the best algorithm for my particular application. I have searched around on SO, Google, read various articles about Levenshtein distance, etc., but honestly it's a bit outside my area of expertise. And most approaches seem to measure how similar two input strings are, like a Hamming distance between strings.
What I'm looking for is different, more of a fuzzy record search (I'm sure there is a name for it that I don't know to Google). I am sure someone has solved this problem before, and I'm looking for a recommendation to point me in the right direction for further research.
In my case I am needing a fuzzy search of a database of entries of music artists and their albums. As you can imagine, the database will have millions of entries so an algorithm that scales well is crucial. It's not important to my question that Artist and Album are in different columns, the database could just store all words in one column if that helped the search.
The database to search:
|-------------------|---------------------|
| Artist | Album |
|-------------------|---------------------|
| Alanis Morissette | Jagged Little Pill |
| Moby | Everything is Wrong |
| Air | Moon Safari |
| Pearl Jam | Ten |
| Nirvana | Nevermind |
| Radiohead | OK Computer |
| Beck | Odelay |
|-------------------|---------------------|
The query text will contain from just one word in the entire Artist_Album concatenation up to the entire thing. The query text is coming from OCR and is likely to have single character transpositions but the most likely thing is the words are not guaranteed to have the right order. Additionally, there could be extra words in the search that aren't a part of the album (like cover art text). For example, "OK Computer" might be at the top of the album and "Radiohead" below it, or some albums have text arranged in columns which intermixes the word orders.
Possible search strings:
C0mputer Rad1ohead
Pearl Ten Jan
Alanis Jagged Morisse11e Litt1e Pi11
Air Moon Virgin Records
Moby Everything
Note that with OCR, some letters will look like numbers, or the wrong letter completely (Jan instead of Jam). And in the case of Radiohead's OK Computer and Moby's Everything Is Wrong, the query text doesn't even have all of the words. In the case of Air's Moon Safari, the extra words Virgin Records are searched, but Safari is missing.
Is there a general algorithm that could return the single likeliest result from the database, and if none meet some "likeliness" score threshold, it returns nothing? I'm actually developing this in Python, but that's just a bonus, I'm looking more for where to get started researching.
Let's break the problem down into two parts.
First, you want to define some measure of likeness (this is called a metric). This metric should return a small number if the query text closely matches the album/artist cover, and return a larger number otherwise.
Second, you want a data structure that speeds up this process. Obviously, you don't want to compute this metric every single time a query is run.
part 1: the metric
You already mentioned Levenshtein distance, which is a great place to start.
Think outside the box though.
LD makes certain assumptions (each character replacement is equally likely, deletion is as likely as insertion, etc.). You can obviously improve the performance of this metric by taking into account what faults OCR is likely to introduce.
E.g. turning a '1' into an 'i' should not be penalized as harshly as turning a '0' into an '_'.
I would implement the metric in two stages. For any given two strings:
split both strings in tokens (assume space as the separator)
look for the most similar words (using a modified version of LD)
assign a final score based on 'matching words', 'missing words' and 'added words' (preferably weighted)
This is an example implementation (fiddle around with the constants):
static double m(String a, String b){
    String[] aParts = a.split(" ");
    String[] bParts = b.split(" ");
    boolean[] bUsed = new boolean[bParts.length];

    int matchedTokens = 0;
    int tokensInANotInB = 0;
    int tokensInBNotInA = 0;
    for(int i=0;i<aParts.length;i++){
        String a0 = aParts[i];
        boolean wasMatched = false;
        for(int j=0;j<bParts.length;j++){
            String b0 = bParts[j];
            double d = levenshtein(a0, b0);
            /* If we match the token a0 with a token from b,
             * update the number of matchedTokens and
             * escape the loop.
             */
            if(d < 2){
                bUsed[j] = true;
                wasMatched = true;
                matchedTokens++;
                break;
            }
        }
        if(!wasMatched){
            tokensInANotInB++;
        }
    }
    for(boolean partUsed : bUsed){
        if(!partUsed){
            tokensInBNotInA++;
        }
    }
    return (matchedTokens
            + tokensInANotInB * -0.3 // the query is allowed to contain extra words at minimal cost
            + tokensInBNotInA * -0.5 // the album title should not contain too many extra words
           ) / java.lang.Math.max(aParts.length, bParts.length);
}
This function uses a modified levenshtein function:
static double levenshtein(String x, String y) {
    double[][] dp = new double[x.length() + 1][y.length() + 1];

    for (int i = 0; i <= x.length(); i++) {
        for (int j = 0; j <= y.length(); j++) {
            if (i == 0) {
                dp[i][j] = j;
            }
            else if (j == 0) {
                dp[i][j] = i;
            }
            else {
                dp[i][j] = min(dp[i - 1][j - 1]
                        + costOfSubstitution(x.charAt(i - 1), y.charAt(j - 1)),
                        dp[i - 1][j] + 1,
                        dp[i][j - 1] + 1);
            }
        }
    }
    return dp[x.length()][y.length()];
}

// helper: minimum of three values
static double min(double a, double b, double c) {
    return Math.min(a, Math.min(b, c));
}
This in turn uses the function costOfSubstitution (which works as explained above):
static double costOfSubstitution(char a, char b){
    if(a == b)
        return 0.0;
    else{
        // 1 and i
        if(a == '1' && b == 'i')
            return 0.5;
        if(a == 'i' && b == '1')
            return 0.5;
        // 0 and O
        if(a == '0' && b == 'o')
            return 0.5;
        if(a == 'o' && b == '0')
            return 0.5;
        if(a == '0' && b == 'O')
            return 0.5;
        if(a == 'O' && b == '0')
            return 0.5;
        // default
        return 1.0;
    }
}
I only included a couple of examples (turning '1' into 'i' or '0' into 'o').
But I'm sure you get the idea.
part 2: the datastructure
Look into BK-trees. They are a data structure specifically designed to hold this kind of metric information. Your metric needs to be a genuine metric (in the mathematical sense of the word), but that's easily arranged.
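As a concrete starting point, here is a minimal BK-tree sketch in Python. It uses the plain (integer-valued) Levenshtein distance; your weighted variant plugs in the same way as long as it remains a true metric, because the search prunes children purely via the triangle inequality:

def levenshtein(a, b):
    # standard dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, metric):
        self.metric = metric
        self.root = None  # each node is (word, {distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.metric(word, node[0])
            if d in node[1]:
                node = node[1][d]  # descend along the edge with this distance
            else:
                node[1][d] = (word, {})
                return

    def search(self, query, max_dist):
        # returns all (distance, word) pairs within max_dist of query
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.metric(query, word)
            if d <= max_dist:
                results.append((d, word))
            # triangle inequality: a match can only live under edges
            # whose distance lies in [d - max_dist, d + max_dist]
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(levenshtein)
for w in ["radiohead", "nirvana", "moby", "beck"]:
    tree.add(w)
print(tree.search("rad1ohead", 2))  # [(1, 'radiohead')]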
I am trying to solve the knapsack problem in Scala using dynamic programming. As part of the requirement I also need to show which items are picked to fill the knapsack, but I am getting an "ArrayIndexOutOfBoundsException".
So far my code looks like this:
(availableMoney is equivalent to the weight capacity of the knapsack, products.channels is equivalent to value[], and products.price is equivalent to weight[].)
def knapSack(availableMoney: Int, products: List[Product]): Int = {
  var wt = List[Int](products.length)
  var value = List[Int](products.length)
  for (product <- products) {
    value ::= product.channels.length
    wt ::= product.price
  }
  val matrix = Array.fill(2, 2)(0)
  val picks = Array.fill(2, 2)(0)
  for (i <- 1 to products.length) {
    for (j <- 0 to availableMoney) {
      if (wt(i - 1) <= j) {
        matrix(i)(j) = max(matrix(i - 1)(j), value(i - 1) + matrix(i - 1)(j - wt(i - 1)));
        if (value(i - 1) + matrix(i - 1)(j - wt(i - 1)) > matrix(i - 1)(j))
          picks(i)(j) = 1;
        else
          picks(i)(j) = -1;
      }
      else {
        picks(i)(j) = -1;
        matrix(i)(j) = matrix(i - 1)(j);
      }
    }
  }
  matrix(products.length)(availableMoney)
}
There are a couple of issues, I think:
j runs from 0 to availableMoney and is then used as an index into picks and matrix, which have been initialised to fixed 2x2 sizes, so if availableMoney exceeds those dimensions it will fail.
i runs from 1 to products.length but is also used as an index into picks and matrix, so it skips index 0, and if there are more products than the second dimension size it will fail.
A sketch of correctly sized tables is below.
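For illustration, here is a minimal sketch of the same algorithm in Python (not your Scala), with both tables sized (n+1) x (capacity+1) and a walk back through picks to recover the chosen items:

def knapsack(capacity, weights, values):
    n = len(weights)
    # both tables need n+1 rows and capacity+1 columns
    matrix = [[0] * (capacity + 1) for _ in range(n + 1)]
    picks = [[False] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(capacity + 1):
            matrix[i][j] = matrix[i - 1][j]  # default: skip item i-1
            if weights[i - 1] <= j:
                take = values[i - 1] + matrix[i - 1][j - weights[i - 1]]
                if take > matrix[i][j]:
                    matrix[i][j] = take
                    picks[i][j] = True       # item i-1 was taken
    # walk backwards through picks to list the chosen item indices
    chosen, j = [], capacity
    for i in range(n, 0, -1):
        if picks[i][j]:
            chosen.append(i - 1)
            j -= weights[i - 1]
    return matrix[n][capacity], chosen[::-1]

print(knapsack(10, [3, 4, 5], [30, 50, 60]))  # (110, [1, 2])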
Use some println debugging to check more closely what is going on. Looks like an interesting algorithm. Post us a solution once you get it working :)
I am trying to create random lines and select some of them, which are really rare. My code is rather simple, but to get something usable I need to create very large vectors (i.e. up to 100000000 x 1; the tracks variable in my code). Is there any way to create larger vectors and to reduce the time needed for all those calculations?
My code is:
%Initial line values
tracks=input('Give me the number of muon tracks: ');
width=1e-4;
height=2e-4;
Ystart=15.*ones(tracks,1);
Xstart=-40+80.*rand(tracks,1);
%Xend=-40+80.*rand(tracks,1);
Xend=laprnd(tracks,1,Xstart,15);
X=[Xstart';Xend'];
Y=[Ystart';zeros(1,tracks)];
b=(Ystart.*Xend)./(Xend-Xstart);
hot=0;
cold=0;
for i=1:tracks
    if ((Xend(i,1)<width/2 && Xend(i,1)>-width/2)||(b(i,1)<height && b(i,1)>0))
        plot(X(:, i),Y(:, i),'r'); %the chosen ones!
        hold all
        hot=hot+1;
    else
        %plot(X(:, i),Y(:, i),'b'); %the rest of them
        %hold all
        cold=cold+1;
    end
end
I am also using and calling a Laplace distribution generator made by Elvis Chen, which can be found here:
function y = laprnd(m, n, mu, sigma)
%LAPRND generate i.i.d. laplacian random number drawn from laplacian distribution
%   with mean mu and standard deviation sigma.
%   mu    : mean
%   sigma : standard deviation
%   [m, n]: the dimension of y.
%   Default mu = 0, sigma = 1.
%   For more information, refer to
%   http://en.wikipedia.org/wiki/Laplace_distribution
%   Author : Elvis Chen (bee33@sjtu.edu.cn)
%   Date   : 01/19/07

% Check inputs
if nargin < 2
    error('At least two inputs are required');
end
if nargin == 2
    mu = 0; sigma = 1;
end
if nargin == 3
    sigma = 1;
end

% Generate Laplacian noise
u = rand(m, n) - 0.5;
b = sigma / sqrt(2);
y = mu - b * sign(u) .* log(1 - 2 * abs(u));
The resulting plot is: [figure: plot of the generated muon tracks, with the selected tracks drawn in red]
As you indicate, your problem is two-fold. On the one hand, you have memory issues because you need to do so many trials. On the other hand, you have performance issues, because you have to process all those trials.
Solutions to each issue often have a negative impact on the other issue. IMHO, the best approach would be to find a compromise.
More trials are only possible if you get rid of those gargantuan arrays that are required for vectorization, and use a different strategy to do the loop. I will give priority to the possibility of using more trials, possibly at the cost of optimal performance.
When I execute your code as-is in the Matlab profiler, it immediately shows that the initial memory allocation for all your variables takes a lot of time. It also shows that the plot and hold all commands are the most time-consuming lines of them all. Some more trial-and-error shows that there is a disappointingly low maximum value for the trials you can do before OUT OF MEMORY errors start appearing.
The loop can be accelerated tremendously if you know a few things about its limitations in Matlab. In older versions of Matlab, it used to be true that loops should be avoided completely in favor of 'vectorized' code. In recent versions (I believe R2008a and up), the Mathworks introduced a piece of technology called the JIT accelerator (Just-in-Time compiler) which translates M-code into machine language on the fly during execution. Simply put, the JIT accelerator allows your code to bypass Matlab's interpreter and talk much more directly with the underlying hardware, which can save a lot of time.
The advice you'll hear a lot that loops should be avoided in Matlab, is no longer generally true. While vectorization still has its value, any procedure of sizable complexity that is implemented using only vectorized code is often illegible, hard to understand, hard to change and hard to upkeep. An implementation of the same procedure that uses loops, often has none of these drawbacks, and moreover, it will quite often be faster and require less memory.
Unfortunately, the JIT accelerator has a few nasty (and IMHO, unnecessary) limitations that you'll have to learn about.
One such thing is plot; it's generally a better idea to let a loop do nothing other than collect and manipulate data, and delay any plotting commands etc. until after the loop.
Another such thing is hold; the hold function is not a Matlab built-in function, meaning, it is implemented in M-language. Matlab's JIT accelerator is not able to accelerate non-builtin functions when used in a loop, meaning, your entire loop will run at Matlab's interpretation speed, rather than machine-language speed! Therefore, also delay this command until after the loop :)
Now, in case you're wondering, this last step can make a HUGE difference; I know of one case where copy-pasting a function body into the upper-level loop caused a 1200x performance improvement. Days of execution time were reduced to minutes!
There is actually another minor issue in your loop (which is really small and rather inconvenient, I will immediately agree): the name of the loop variable should not be i. The name i is the name of the imaginary unit in Matlab, and its name resolution will unnecessarily consume time on each iteration. It's small, but non-negligible.
Now, considering all this, I've come to the following implementation:
function [hot, cold, h] = MuonTracks(tracks)
    % NOTE: no variables larger than 1x1 are initialized
    width  = 1e-4;
    height = 2e-4;
    % constant used for Laplacian noise distribution
    bL = 15 / sqrt(2);
    % Loop through all tracks
    X   = [];
    hot = 0;
    ii  = 0;
    while ii < tracks
        ii = ii + 1;
        % Note that I've inlined (== copy-pasted) the original laprnd()
        % function call. This was necessary to work around limitations
        % in loops in Matlab, and prevent the necessity of those HUGE
        % variables.
        %
        % Of course, you can still easily generalize all of this:
        % the new data
        u      = rand - 0.5;
        Ystart = 15;
        Xstart = 800*rand - 400;
        Xend   = Xstart - bL*sign(u)*log(1 - 2*abs(u));
        b      = (Ystart*Xend)/(Xend - Xstart);
        % the test
        if ((b < height && b > 0)) || ...
           (Xend < width/2 && Xend > -width/2)
            hot = hot + 1;
            % growing an array is perfectly fine when the chances of it
            % happening are so slim
            X = [X [Xstart; Xend]]; %#ok
        end
    end
    % This is trivial to do here, and prevents an 'else' in the loop
    cold = tracks - hot;
    % Now plot the chosen ones
    h = figure;
    hold all
    Y = repmat([15;0], 1, size(X,2));
    plot(X, Y, 'r');
end
With this implementation, I can do this:
>> tic, MuonTracks(1e8); toc
Elapsed time is 24.738725 seconds.
with a completely negligible memory footprint.
The profiler now also shows a nice and even distribution of effort along the code; no lines that really stand out because of their memory use or performance.
It's possibly not the fastest possible implementation (if anyone sees obvious improvements, please, feel free to edit them in). But, if you're willing to wait, you'll be able to do MuonTracks(1e23) (or higher :)
I've also done an implementation in C, which can be compiled into a Matlab MEX file:
/* DoMuonCounting.c */
#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>
#include <stdlib.h>

void CountMuons(
    unsigned long long tracks,
    unsigned long long *hot, unsigned long long *cold, double *Xout);

/* simple little helper functions */
double sign(double x) { return (x>0)-(x<0); }
double rand_double() { return (double)rand()/(double)RAND_MAX; }

/* the gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int
        dims[] = {1,1};

    const mxArray
        /* Output arguments */
        *hot_out  = plhs[0] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
        *cold_out = plhs[1] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
        *X_out    = plhs[2] = mxCreateDoubleMatrix(2,10000, mxREAL);

    const unsigned long long
        tracks = (const unsigned long long)mxGetPr(prhs[0])[0];

    unsigned long long
        *hot  = (unsigned long long*)mxGetPr(hot_out),
        *cold = (unsigned long long*)mxGetPr(cold_out);

    double
        *Xout = mxGetPr(X_out);

    /* call the actual function, and return */
    CountMuons(tracks, hot,cold, Xout);
}

// The actual muon counting
void CountMuons(
    unsigned long long tracks,
    unsigned long long *hot, unsigned long long *cold, double *Xout)
{
    const double
        width  = 1.0e-4,
        height = 2.0e-4,
        bL     = 15.0/sqrt(2.0),
        Ystart = 15.0;

    double
        Xstart,
        Xend,
        u,
        b;

    unsigned long long
        i = 0ul;

    *hot  = 0ul;
    *cold = tracks;

    /* seed the RNG */
    srand((unsigned)time(NULL));

    /* aaaand start! */
    while (i++ < tracks)
    {
        u      = rand_double() - 0.5;
        Xstart = 800.0*rand_double() - 400.0;
        Xend   = Xstart - bL*sign(u)*log(1.0-2.0*fabs(u));
        b      = (Ystart*Xend)/(Xend-Xstart);

        if ((b < height && b > 0.0) || (Xend < width/2.0 && Xend > -width/2.0))
        {
            Xout[0 + *hot*2] = Xstart;
            Xout[1 + *hot*2] = Xend;
            ++(*hot);
            --(*cold);
        }
    }
}
compile in Matlab with
mex DoMuonCounting.c
(after having run mex -setup :) and then use it in conjunction with a small M-wrapper like this:
function [hot, cold, h] = MuonTrack2(tracks)
    % call the MEX function
    [hot, cold, Xtmp] = DoMuonCounting(tracks);

    % process outputs, and generate plots
    hot = uint32(hot); % circumvents limitations in 32-bit Matlab
    X = Xtmp(:, 1:hot);
    clear Xtmp

    h = NaN;
    if ~isempty(X)
        h = figure;
        hold all
        Y = repmat([15;0], 1, hot);
        plot(X, Y, 'r');
    end
end
which allows me to do
>> tic, MuonTrack2(1e8); toc
Elapsed time is 14.496355 seconds.
Note that the memory footprint of the MEX version is slightly larger, but I think that's nothing to worry about.
The only flaw I see is the fixed maximum number of muon counts (hard-coded as 10000 as the initial array size of Xout, needed because there are no dynamically growing arrays in standard C). If you're worried this limit could be broken, simply increase it, change it to be equal to a fraction of tracks, or do some smarter (but more painful) dynamic array-growing tricks.
In Matlab, it is sometimes faster to vectorize rather than use a for loop. For example, this expression:
(Xend(i,1) < width/2 && Xend(i,1) > -width/2) || (b(i,1) < height && b(i,1) > 0)
which is defined for each value of i, can be rewritten in a vectorised manner like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0)
Expressions like Xend(:,1) will give you a column vector, so Xend(:,1) < width/2 will give you a column vector of boolean values. Note that I have used & rather than && here; this is because & performs an element-wise logical AND, unlike &&, which only works on scalar values. In this way you can build the entire expression, such that the variable isChosen holds a column vector of boolean values, one for each row of your Xend/b vectors.
Getting counts is now as simple as this:
hot = sum(isChosen);
since true is represented by 1. And:
cold = sum(~isChosen);
Finally, you can get the data points by using the boolean vector to select rows:
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
hold all;
plot(X(:, ~isChosen),Y(:, ~isChosen),'b'); % Plot unchosen values
EDIT: The code should look like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0);
hot = sum(isChosen);
cold = sum(~isChosen);
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
I want to be able to introduce new 'tag lines' into a database that are shown 'randomly' to users. (These tag lines are shown as an introduction as animated text.)
Based upon the number of sales that result from those taglines I'd like the good ones to trickle to the top, but still show the others less frequently.
I could come up with a basic algorithm quite easily, but I want something that's a little more 'statistically accurate'.
I don't really know where to start. It's been a while since I've done anything more than basic statistics. My model would need to be sensitive to tolerances, but obviously it doesn't need to be worthy of a PhD.
Edit: I am currently tracking a 'conversion rate', i.e. hits per order. This value would probably be best calculated as a cumulative 'all time' conversion rate to be fed into the algorithm.
Looking at your problem, I would modify the requirements a bit:
1) The most popular one should be shown most often.
2) Taglines should "age", so one that got a lot of votes (purchases) in the past but none recently should be shown less often.
3) Brand new taglines should be shown more often during their first days.
If you agree on those, then an algorithm could be something like:
START:
    x = random(1, 3);
    if x = 3 goto NEW else goto NORMAL

NEW:
    TagVec = Taglines.filterYounger(5 days); // I'm taking a LOT of liberties with the pseudo code...
    x = random(1, TagVec.Length);
    return TagVec[x-1]; // 0-indexed vectors, even in a made-up language

NORMAL:
    // Similar to EBGREEN above
    sum = 0;
    ForEach(TagLine in TagLines) {
        sum += TagLine.noOfPurchases;
    }
    x = random(1, sum);
    ForEach(TagLine in TagLines) {
        x -= TagLine.noOfPurchases;
        if (x <= 0) return TagLine; // the random draw landed in this tagline's slot
    }
Now, as a setup, I would give every new tagline 10 purchases, to avoid getting really big slanting from one single purchase.
For the aging process I would count a purchase older than a week as 0.8 purchases per week of age. So 1 week old gives 0.8 points, 2 weeks gives 0.8*0.8 = 0.64, and so forth...
You would have to play around with the initial-purchases parameter (10 in my example), the aging speed (1 week here) and the aging factor (0.8 here) to find something that suits you. A rough sketch of the whole scheme follows.
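Here is a rough Python sketch of that scheme. The field names and the timestamp bookkeeping are my own assumptions; the one-in-three chance of showing a new tagline, the 10 initial purchases and the 0.8-per-week decay come from the answer above:

import random
import time

WEEK = 7 * 24 * 3600
INITIAL_PURCHASES = 10         # virtual purchases for every tagline, to damp early swings
AGING_FACTOR = 0.8             # each week of age multiplies a purchase's contribution by 0.8
NEW_AGE_LIMIT = 5 * 24 * 3600  # "younger than 5 days" counts as new

def score(tagline, now):
    # aged purchase count: recent purchases count fully, old ones decay
    total = INITIAL_PURCHASES
    for ts in tagline["purchase_times"]:
        weeks_old = int((now - ts) // WEEK)
        total += AGING_FACTOR ** weeks_old
    return total

def pick_tagline(taglines):
    now = time.time()
    # roughly one time in three, show a brand-new tagline
    new = [t for t in taglines if now - t["created"] < NEW_AGE_LIMIT]
    if new and random.randint(1, 3) == 3:
        return random.choice(new)
    weights = [score(t, now) for t in taglines]
    return random.choices(taglines, weights=weights, k=1)[0]

taglines = [
    {"tag": "tagline 1", "created": time.time() - 30 * 24 * 3600,
     "purchase_times": [time.time() - 3 * WEEK, time.time() - 1 * WEEK]},
    {"tag": "tagline 2", "created": time.time() - 2 * 24 * 3600,
     "purchase_times": []},
]
print(pick_tagline(taglines)["tag"])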
I would suggest randomly choosing with a weighting factor based on previous sales. So let's say you had this:
tag1 = 1 sale
tag2 = 0 sales
tag3 = 1 sale
tag4 = 2 sales
tag5 = 3 sales
A simple weighting formula would be 1 + number of sales, so this would be the probability of selecting each tag:
tag1 = 2/12 = 16.7%
tag2 = 1/12 = 8.3%
tag3 = 2/12 = 16.7%
tag4 = 3/12 = 25%
tag5 = 4/12 = 33.3%
You could easily change the weighting formula to get just the distribution that you want; a quick sketch of the selection is below.
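For instance, a quick Python sketch of that draw (random.choices handles the weighting):

import random

sales = {"tag1": 1, "tag2": 0, "tag3": 1, "tag4": 2, "tag5": 3}

# weight = 1 + sales, so even a tagline with no sales keeps a chance
weights = {tag: 1 + n for tag, n in sales.items()}
total = sum(weights.values())
for tag, w in weights.items():
    print(tag, f"{w}/{total} = {w / total:.1%}")

# draw one tagline according to those weights
chosen = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
print("chosen:", chosen)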
You have to come up with a weighting formula based on sales.
I don't think there's any such thing as a "statistically accurate" formula here - it's all based on your preference.
No one can say "this is the correct weighting and the other weighting is wrong" because there isn't a final outcome you are attempting to model - this isn't like trying to weigh responses to a poll about an upcoming election (where you are trying to model results to represent something that will happen in the future).
Here's an example in JavaScript. Note that I'm not suggesting running this client side...
Also, there is a lot of optimization that could be done.
Note: createMemberInNormalDistribution() is implemented here Converting a Uniform Distribution to a Normal Distribution
/*
 * an example set of taglines
 * hits are sales
 * views are times it's been shown
 */
var taglines = [
    {"tag":"tagline 1","hits":1,"views":234},
    {"tag":"tagline 2","hits":5,"views":566},
    {"tag":"tagline 3","hits":3,"views":421},
    {"tag":"tagline 4","hits":1,"views":120},
    {"tag":"tagline 5","hits":7,"views":200}
];

/* set up our stat model for the tags */
var TagModel = function(set){
    var hits, views, sumOfDiff, sumOfSqDiff;
    hits = views = sumOfDiff = sumOfSqDiff = 0;

    /* find average */
    for (var n in set){
        hits += set[n].hits;
        views += set[n].views;
    }
    this.avg = hits/views;

    /* find standard deviation and variance */
    for (var n in set){
        var diff = ((set[n].hits/set[n].views) - this.avg);
        sumOfDiff += diff;
        sumOfSqDiff += diff*diff;
    }
    this.variance = sumOfSqDiff/set.length;
    this.std_dev = Math.sqrt(sumOfSqDiff/set.length);

    /* return a tag to use; fChooser determines the likelihood of each tag */
    this.getTag = function(fChooser){
        var m = this;
        set.sort(function(a,b){
            return fChooser((a.hits/a.views),(b.hits/b.views), m);
        });
        return set[0];
    };
};
var config = {
    "uniformDistribution":function(a,b,model){
        return Math.random()*b - Math.random()*a;
    },
    "normalDistribution":function(a,b,model){
        var a1 = createMemberInNormalDistribution(model.avg,model.std_dev)*a;
        var b1 = createMemberInNormalDistribution(model.avg,model.std_dev)*b;
        return b1-a1;
    },
    // say weight = 10^n... the higher n is, the more even the distribution will be.
    "weight": .5,
    "weightedDistribution":function(a,b,model){
        var a1 = createMemberInNormalDistribution(model.avg,model.std_dev*config.weight)*a;
        var b1 = createMemberInNormalDistribution(model.avg,model.std_dev*config.weight)*b;
        return b1-a1;
    }
};
var model = new TagModel(taglines);
//to use
model.getTag(config.uniformDistribution).tag;
//running 10000 times: ({'tagline 4':836, 'tagline 5':7608, 'tagline 1':100, 'tagline 2':924, 'tagline 3':532})
model.getTag(config.normalDistribution).tag;
//running 10000 times: ({'tagline 4':1775, 'tagline 5':3471, 'tagline 1':1273, 'tagline 2':1857, 'tagline 3':1624})
model.getTag(config.weightedDistribution).tag;
//running 10000 times: ({'tagline 4':1514, 'tagline 5':5045, 'tagline 1':577, 'tagline 2':1627, 'tagline 3':1237})
config.weight = 2;
model.getTag(config.weightedDistribution).tag;
//running 10000 times: {'tagline 4':1941, 'tagline 5':2715, 'tagline 1':1559, 'tagline 2':1957, 'tagline 3':1828})