Related
I'm using the caret::train() function to develop a weighted knn classification model (kknn) with 10-fold cross-validation and a tuneGrid containing 15 values for kmax, one value for distance, and 3 values for kernel.
That’s 450 total iterations if I understand the process correctly (an iteration being the computation of the probability of a given outcome for a given combination of kmax, distance, and kernel). x has about 480,000 data points (6 predictors each having about 80,000 observations), and y has about 80,000 data points.
Understanding that there are innumerable variables affecting performance, how long can I reasonably expect the train function to take if run on a pc with an 8-core 3GHz Intel processor and 32GB of RAM?
It currently takes about 70 minutes per fold, which is about 1.5 minutes per iteration. Is this reasonable, or excessive?
This is a kknn learning exercise. I realize there are other types of algorithms that produce better results more efficiently.
Here is the essential code:
x <- as.matrix(train_set2[, c("n_launch_angle", "n_launch_speed", "n_spray_angle_Kolp", "n_spray_angle_adj", "n_hp_to_1b", "n_if_alignment")])
y <- train_set2$events
set.seed(1)
fitControl <- trainControl(method = "cv", number = 10, p = 0.8, returnData = TRUE,
returnResamp = "all", savePredictions = "all",
summaryFunction = twoClassSummary, classProbs = TRUE,
verboseIter = TRUE)
tuneGrid <- expand.grid(kmax = seq(11, 39, 2),
distance = 2,
kernel = c("triangular", "gaussian", "optimal"))
kknn_train <- train(x, y, method = "kknn",
tuneGrid = tuneGrid, trControl = fitControl)
As we have established in the comments, it is reasonable to expect this type of runtime. There are a few step to reduce this;
Running your code in parallel
Using a more efficient OS; like Linux
Be more efficient in your trainControl(), is it really necessary to have returnResamps=TRUE? There is small gains in controlling these.
Clearly, the first one is a no-brainer. For the second one, I can find as many computer-engineers who swears to linux as those who swears to windows. What convinced me to switch to Linux, was this particular test, which I hope will give you what it gave me.
# Calculate distance matrix
test_data <- function(dim, num, seed = 1903) {
set.seed(seed)
dist(
matrix(
rnorm(dim * num), nrow = num
)
)
}
# Benchmarking
microbenchmark::microbenchmark(test_data(120,4500))
This piece of code simply just runs faster on the exact same machine that runs Linux. At least this was my experience.
My task:
Create a program to copy a picture (given as input) using primitives only (like triangle or something). The program should use evolutionary algorithm to create output picture.
My question:
I need to invent an algorithm to create populations and check them (how much - in % - they match the input picture).
I have an idea; you can find it below.
So what I want from you: advice (if you find my idea not so bad) or inspiration (maybe you have a better idea?)
My idea:
Let's say that I'll use only triangles to build the output picture.
My first population is P pictures (generated by using T randomly generated triangles - called Elements).
I check by my fitness function every pictures in population and choose E of them as elite and rest of population just remove:
To compare 2 pictures we check every pixel in picture A and compare his R,G,B with
the same pixel (the same coordinates) in picture B.
I use this:
SingleDif = sqrt[ (Ar - Br)^2 + (Ag - Bg)^2 + (Ab - Bb)^2]
then i sum all differences (from all pixels) - lets call it SumDif
and use:
PictureDif = (DifMax - SumDif)/DifMax
where
DifMax = pictureHeight * pictureWidth * 255*3
The best are used to create the next population in this way:
picture MakeChild(picture Mother, picture Father)
{
picture child;
for( int i = 0; i < T; ++i )
{
j //this is a random number from 0 to 1 - created now
if( j < 0.5 ) child.element(i) = Mother.element(i);
else child.element(i) = Father.element(i)
if( j < some small % ) mutate( child.element(i) );
}
return child;
}
So it's quite simple. Only the mutation needs a comment: So there is always some small probability that element X in child will be different than X in his parent. To do this we make random changes in element in child (change his colour by random number, or add random number to his (x,y) coordinate - or his node).
So this is my idea... I didn't test it, didn't code it.
Please check my idea - what do you think about it?
I would make the number of patches of each child dynamic and get the mutation operation to insert/delete patches with some (low) probability. Of course this could result in a lot of redundancy and bloat in the child's genome. In these situations, it is usually a good idea to use the length of an individual's genome as a parameter of the fitness function so that individuals get rewarded (with a higher fitness value) for using fewer patches. So for example if the PictureDif of individuals A and B are the same but the A has fewer patches than B, then A has a higher fitness.
Another issue is the reproductive operator that you proposed (namely, the crossover operation). In order for the evolutionary process to work efficiently, you need to achieve a reasonable exploration and exploitation balance. One way of doing this is by having a set of reproductive operators that exhibit a good fitness correlation [1] which means the fitness of a child must be close to the fitness of its parent(s).
In the case of single parent reproduction you only need to find the right mutation parameters. However, when it comes to multi-parent reproduction (crossover) one of the frequently used techniques is to produce 2 children (instead of 1) from the same 2 parents. For the first child, each gene comes from the mother with the probability of 0.2 and from the father with the probability of 0.8, and for the second child the other way around. Of course after the crossover, you can do the mutation.
Oh and one more thing, for the mutation operators, when you say
... make random changes in element in child (change his colour by random number, or add random number to his (x,y) coordinate - or his node)
it's a good idea to use a Gaussian distribution to change the colour, coordinate etc.
[1] Evolutionary Computation: A unified approach by Kenneth A. De Jong, page 69
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
I am finding it hard to understand the process of Naive Bayes, and I was wondering if someone could explain it with a simple step by step process in English. I understand it takes comparisons by times occurred as a probability, but I have no idea how the training data is related to the actual dataset.
Please give me an explanation of what role the training set plays. I am giving a very simple example for fruits here, like banana for example
training set---
round-red
round-orange
oblong-yellow
round-red
dataset----
round-red
round-orange
round-red
round-orange
oblong-yellow
round-red
round-orange
oblong-yellow
oblong-yellow
round-red
The accepted answer has many elements of k-NN (k-nearest neighbors), a different algorithm.
Both k-NN and NaiveBayes are classification algorithms. Conceptually, k-NN uses the idea of "nearness" to classify new entities. In k-NN 'nearness' is modeled with ideas such as Euclidean Distance or Cosine Distance. By contrast, in NaiveBayes, the concept of 'probability' is used to classify new entities.
Since the question is about Naive Bayes, here's how I'd describe the ideas and steps to someone. I'll try to do it with as few equations and in plain English as much as possible.
First, Conditional Probability & Bayes' Rule
Before someone can understand and appreciate the nuances of Naive Bayes', they need to know a couple of related concepts first, namely, the idea of Conditional Probability, and Bayes' Rule. (If you are familiar with these concepts, skip to the section titled Getting to Naive Bayes')
Conditional Probability in plain English: What is the probability that something will happen, given that something else has already happened.
Let's say that there is some Outcome O. And some Evidence E. From the way these probabilities are defined: The Probability of having both the Outcome O and Evidence E is:
(Probability of O occurring) multiplied by the (Prob of E given that O happened)
One Example to understand Conditional Probability:
Let say we have a collection of US Senators. Senators could be Democrats or Republicans. They are also either male or female.
If we select one senator completely randomly, what is the probability that this person is a female Democrat? Conditional Probability can help us answer that.
Probability of (Democrat and Female Senator)= Prob(Senator is Democrat) multiplied by Conditional Probability of Being Female given that they are a Democrat.
P(Democrat & Female) = P(Democrat) * P(Female | Democrat)
We could compute the exact same thing, the reverse way:
P(Democrat & Female) = P(Female) * P(Democrat | Female)
Understanding Bayes Rule
Conceptually, this is a way to go from P(Evidence| Known Outcome) to P(Outcome|Known Evidence). Often, we know how frequently some particular evidence is observed, given a known outcome. We have to use this known fact to compute the reverse, to compute the chance of that outcome happening, given the evidence.
P(Outcome given that we know some Evidence) = P(Evidence given that we know the Outcome) times Prob(Outcome), scaled by the P(Evidence)
The classic example to understand Bayes' Rule:
Probability of Disease D given Test-positive =
P(Test is positive|Disease) * P(Disease)
_______________________________________________________________
(scaled by) P(Testing Positive, with or without the disease)
Now, all this was just preamble, to get to Naive Bayes.
Getting to Naive Bayes'
So far, we have talked only about one piece of evidence. In reality, we have to predict an outcome given multiple evidence. In that case, the math gets very complicated. To get around that complication, one approach is to 'uncouple' multiple pieces of evidence, and to treat each of piece of evidence as independent. This approach is why this is called naive Bayes.
P(Outcome|Multiple Evidence) =
P(Evidence1|Outcome) * P(Evidence2|outcome) * ... * P(EvidenceN|outcome) * P(Outcome)
scaled by P(Multiple Evidence)
Many people choose to remember this as:
P(Likelihood of Evidence) * Prior prob of outcome
P(outcome|evidence) = _________________________________________________
P(Evidence)
Notice a few things about this equation:
If the Prob(evidence|outcome) is 1, then we are just multiplying by 1.
If the Prob(some particular evidence|outcome) is 0, then the whole prob. becomes 0. If you see contradicting evidence, we can rule out that outcome.
Since we divide everything by P(Evidence), we can even get away without calculating it.
The intuition behind multiplying by the prior is so that we give high probability to more common outcomes, and low probabilities to unlikely outcomes. These are also called base rates and they are a way to scale our predicted probabilities.
How to Apply NaiveBayes to Predict an Outcome?
Just run the formula above for each possible outcome. Since we are trying to classify, each outcome is called a class and it has a class label. Our job is to look at the evidence, to consider how likely it is to be this class or that class, and assign a label to each entity.
Again, we take a very simple approach: The class that has the highest probability is declared the "winner" and that class label gets assigned to that combination of evidences.
Fruit Example
Let's try it out on an example to increase our understanding: The OP asked for a 'fruit' identification example.
Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange or some Other Fruit.
We know 3 characteristics about each fruit:
Whether it is Long
Whether it is Sweet and
If its color is Yellow.
This is our 'training set.' We will use this to predict the type of any new fruit we encounter.
Type Long | Not Long || Sweet | Not Sweet || Yellow |Not Yellow|Total
___________________________________________________________________
Banana | 400 | 100 || 350 | 150 || 450 | 50 | 500
Orange | 0 | 300 || 150 | 150 || 300 | 0 | 300
Other Fruit | 100 | 100 || 150 | 50 || 50 | 150 | 200
____________________________________________________________________
Total | 500 | 500 || 650 | 350 || 800 | 200 | 1000
___________________________________________________________________
We can pre-compute a lot of things about our fruit collection.
The so-called "Prior" probabilities. (If we didn't know any of the fruit attributes, this would be our guess.) These are our base rates.
P(Banana) = 0.5 (500/1000)
P(Orange) = 0.3
P(Other Fruit) = 0.2
Probability of "Evidence"
p(Long) = 0.5
P(Sweet) = 0.65
P(Yellow) = 0.8
Probability of "Likelihood"
P(Long|Banana) = 0.8
P(Long|Orange) = 0 [Oranges are never long in all the fruit we have seen.]
....
P(Yellow|Other Fruit) = 50/200 = 0.25
P(Not Yellow|Other Fruit) = 0.75
Given a Fruit, how to classify it?
Let's say that we are given the properties of an unknown fruit, and asked to classify it. We are told that the fruit is Long, Sweet and Yellow. Is it a Banana? Is it an Orange? Or Is it some Other Fruit?
We can simply run the numbers for each of the 3 outcomes, one by one. Then we choose the highest probability and 'classify' our unknown fruit as belonging to the class that had the highest probability based on our prior evidence (our 1000 fruit training set):
P(Banana|Long, Sweet and Yellow)
P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) * P(banana)
= _______________________________________________________________
P(Long) * P(Sweet) * P(Yellow)
= 0.8 * 0.7 * 0.9 * 0.5 / P(evidence)
= 0.252 / P(evidence)
P(Orange|Long, Sweet and Yellow) = 0
P(Other Fruit|Long, Sweet and Yellow)
P(Long|Other fruit) * P(Sweet|Other fruit) * P(Yellow|Other fruit) * P(Other Fruit)
= ____________________________________________________________________________________
P(evidence)
= (100/200 * 150/200 * 50/200 * 200/1000) / P(evidence)
= 0.01875 / P(evidence)
By an overwhelming margin (0.252 >> 0.01875), we classify this Sweet/Long/Yellow fruit as likely to be a Banana.
Why is Bayes Classifier so popular?
Look at what it eventually comes down to. Just some counting and multiplication. We can pre-compute all these terms, and so classifying becomes easy, quick and efficient.
Let z = 1 / P(evidence). Now we quickly compute the following three quantities.
P(Banana|evidence) = z * Prob(Banana) * Prob(Evidence1|Banana) * Prob(Evidence2|Banana) ...
P(Orange|Evidence) = z * Prob(Orange) * Prob(Evidence1|Orange) * Prob(Evidence2|Orange) ...
P(Other|Evidence) = z * Prob(Other) * Prob(Evidence1|Other) * Prob(Evidence2|Other) ...
Assign the class label of whichever is the highest number, and you are done.
Despite the name, Naive Bayes turns out to be excellent in certain applications. Text classification is one area where it really shines.
Your question as I understand it is divided in two parts, part one being you need a better understanding of the Naive Bayes classifier & part two being the confusion surrounding Training set.
In general all of Machine Learning Algorithms need to be trained for supervised learning tasks like classification, prediction etc. or for unsupervised learning tasks like clustering.
During the training step, the algorithms are taught with a particular input dataset (training set) so that later on we may test them for unknown inputs (which they have never seen before) for which they may classify or predict etc (in case of supervised learning) based on their learning. This is what most of the Machine Learning techniques like Neural Networks, SVM, Bayesian etc. are based upon.
So in a general Machine Learning project basically you have to divide your input set to a Development Set (Training Set + Dev-Test Set) & a Test Set (or Evaluation set). Remember your basic objective would be that your system learns and classifies new inputs which they have never seen before in either Dev set or test set.
The test set typically has the same format as the training set. However, it is very important that the test set be distinct from the training corpus: if we simply
reused the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores.
In general, for an example, 70% of our data can be used as training set cases. Also remember to partition the original set into the training and test sets randomly.
Now I come to your other question about Naive Bayes.
To demonstrate the concept of Naïve Bayes Classification, consider the example given below:
As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and often used to predict outcomes before they actually happen.
Thus, we can write:
Prior Probability of GREEN: number of GREEN objects / total number of objects
Prior Probability of RED: number of RED objects / total number of objects
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:
Prior Probability for GREEN: 40 / 60
Prior Probability for RED: 20 / 60
Having formulated our prior probability, we are now ready to classify a new object (WHITE circle in the diagram below). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely that the new cases belong to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label. From this we calculate the likelihood:
From the illustration above, it is clear that Likelihood of X given GREEN is smaller than Likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED) the likelihood indicates otherwise; that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).
Finally, we classify X as RED since its class membership achieves the largest posterior probability.
Naive Bayes comes under supervising machine learning which used to make classifications of data sets.
It is used to predict things based on its prior knowledge and independence assumptions.
They call it naive because it’s assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications.
It is classification algorithm which makes the decision for the unknown data set. It is based on Bayes Theorem which describe the probability of an event based on its prior knowledge.
Below diagram shows how naive Bayes works
Formula to predict NB:
How to use Naive Bayes Algorithm ?
Let's take an example of how N.B woks
Step 1: First we find out Likelihood of table which shows the probability of yes or no in below diagram.
Step 2: Find the posterior probability of each class.
Problem: Find out the possibility of whether the player plays in Rainy condition?
P(Yes|Rainy) = P(Rainy|Yes) * P(Yes) / P(Rainy)
P(Rainy|Yes) = 2/9 = 0.222
P(Yes) = 9/14 = 0.64
P(Rainy) = 5/14 = 0.36
Now, P(Yes|Rainy) = 0.222*0.64/0.36 = 0.39 which is lower probability which means chances of the match played is low.
For more reference refer these blog.
Refer GitHub Repository Naive-Bayes-Examples
Ram Narasimhan explained the concept very nicely here below is an alternative explanation through the code example of Naive Bayes in action
It uses an example problem from this book on page 351
This is the data set that we will be using
In the above dataset if we give the hypothesis = {"Age":'<=30', "Income":"medium", "Student":'yes' , "Creadit_Rating":'fair'} then what is the probability that he will buy or will not buy a computer.
The code below exactly answers that question.
Just create a file called named new_dataset.csv and paste the following content.
Age,Income,Student,Creadit_Rating,Buys_Computer
<=30,high,no,fair,no
<=30,high,no,excellent,no
31-40,high,no,fair,yes
>40,medium,no,fair,yes
>40,low,yes,fair,yes
>40,low,yes,excellent,no
31-40,low,yes,excellent,yes
<=30,medium,no,fair,no
<=30,low,yes,fair,yes
>40,medium,yes,fair,yes
<=30,medium,yes,excellent,yes
31-40,medium,no,excellent,yes
31-40,high,yes,fair,yes
>40,medium,no,excellent,no
Here is the code the comments explains everything we are doing here! [python]
import pandas as pd
import pprint
class Classifier():
data = None
class_attr = None
priori = {}
cp = {}
hypothesis = None
def __init__(self,filename=None, class_attr=None ):
self.data = pd.read_csv(filename, sep=',', header =(0))
self.class_attr = class_attr
'''
probability(class) = How many times it appears in cloumn
__________________________________________
count of all class attribute
'''
def calculate_priori(self):
class_values = list(set(self.data[self.class_attr]))
class_data = list(self.data[self.class_attr])
for i in class_values:
self.priori[i] = class_data.count(i)/float(len(class_data))
print "Priori Values: ", self.priori
'''
Here we calculate the individual probabilites
P(outcome|evidence) = P(Likelihood of Evidence) x Prior prob of outcome
___________________________________________
P(Evidence)
'''
def get_cp(self, attr, attr_type, class_value):
data_attr = list(self.data[attr])
class_data = list(self.data[self.class_attr])
total =1
for i in range(0, len(data_attr)):
if class_data[i] == class_value and data_attr[i] == attr_type:
total+=1
return total/float(class_data.count(class_value))
'''
Here we calculate Likelihood of Evidence and multiple all individual probabilities with priori
(Outcome|Multiple Evidence) = P(Evidence1|Outcome) x P(Evidence2|outcome) x ... x P(EvidenceN|outcome) x P(Outcome)
scaled by P(Multiple Evidence)
'''
def calculate_conditional_probabilities(self, hypothesis):
for i in self.priori:
self.cp[i] = {}
for j in hypothesis:
self.cp[i].update({ hypothesis[j]: self.get_cp(j, hypothesis[j], i)})
print "\nCalculated Conditional Probabilities: \n"
pprint.pprint(self.cp)
def classify(self):
print "Result: "
for i in self.cp:
print i, " ==> ", reduce(lambda x, y: x*y, self.cp[i].values())*self.priori[i]
if __name__ == "__main__":
c = Classifier(filename="new_dataset.csv", class_attr="Buys_Computer" )
c.calculate_priori()
c.hypothesis = {"Age":'<=30', "Income":"medium", "Student":'yes' , "Creadit_Rating":'fair'}
c.calculate_conditional_probabilities(c.hypothesis)
c.classify()
output:
Priori Values: {'yes': 0.6428571428571429, 'no': 0.35714285714285715}
Calculated Conditional Probabilities:
{
'no': {
'<=30': 0.8,
'fair': 0.6,
'medium': 0.6,
'yes': 0.4
},
'yes': {
'<=30': 0.3333333333333333,
'fair': 0.7777777777777778,
'medium': 0.5555555555555556,
'yes': 0.7777777777777778
}
}
Result:
yes ==> 0.0720164609053
no ==> 0.0411428571429
I try to explain the Bayes rule with an example.
What is the chance that a random person selected from the society is a smoker?
You may reply 10%, and let's assume that's right.
Now, what if I say that the random person is a man and is 15 years old?
You may say 15 or 20%, but why?.
In fact, we try to update our initial guess with new pieces of evidence ( P(smoker) vs. P(smoker | evidence) ). The Bayes rule is a way to relate these two probabilities.
P(smoker | evidence) = P(smoker)* p(evidence | smoker)/P(evidence)
Each evidence may increase or decrease this chance. For example, this fact that he is a man may increase the chance provided that this percentage (being a man) among non-smokers is lower.
In the other words, being a man must be an indicator of being a smoker rather than a non-smoker. Therefore, if an evidence is an indicator of something, it increases the chance.
But how do we know that this is an indicator?
For each feature, you can compare the commonness (probability) of that feature under the given conditions with its commonness alone. (P(f | x) vs. P(f)).
P(smoker | evidence) / P(smoker) = P(evidence | smoker)/P(evidence)
For example, if we know that 90% of smokers are men, it's not still enough to say whether being a man is an indicator of being smoker or not. For example if the probability of being a man in the society is also 90%, then knowing that someone is a man doesn't help us ((90% / 90%) = 1. But if men contribute to 40% of the society, but 90% of the smokers, then knowing that someone is a man increases the chance of being a smoker (90% / 40%) = 2.25, so it increases the initial guess (10%) by 2.25 resulting 22.5%.
However, if the probability of being a man was 95% in the society, then regardless of the fact that the percentage of men among smokers is high (90%)! the evidence that someone is a man decreases the chance of him being a smoker! (90% / 95%) = 0.95).
So we have:
P(smoker | f1, f2, f3,... ) = P(smoker) * contribution of f1* contribution of f2 *...
=
P(smoker)*
(P(being a man | smoker)/P(being a man))*
(P(under 20 | smoker)/ P(under 20))
Note that in this formula we assumed that being a man and being under 20 are independent features so we multiplied them, it means that knowing that someone is under 20 has no effect on guessing that he is man or woman. But it may not be true, for example maybe most adolescence in a society are men...
To use this formula in a classifier
The classifier is given with some features (being a man and being under 20) and it must decide if he is an smoker or not (these are two classes). It uses the above formula to calculate the probability of each class under the evidence (features), and it assigns the class with the highest probability to the input. To provide the required probabilities (90%, 10%, 80%...) it uses the training set. For example, it counts the people in the training set that are smokers and find they contribute 10% of the sample. Then for smokers checks how many of them are men or women .... how many are above 20 or under 20....In the other words, it tries to build the probability distribution of the features for each class based on the training data.
I have a problem involving a collection of continuous probability distribution functions, most of which are determined empirically (e.g. departure times, transit times). What I need is some way of taking two of these PDFs and doing arithmetic on them. E.g. if I have two values x taken from PDF X, and y taken from PDF Y, I need to get the PDF for (x+y), or any other operation f(x,y).
An analytical solution is not possible, so what I'm looking for is some representation of PDFs that allows such things. An obvious (but computationally expensive) solution is monte-carlo: generate lots of values of x and y, and then just measure f(x, y). But that takes too much CPU time.
I did think about representing the PDF as a list of ranges where each range has a roughly equal probability, effectively representing the PDF as the union of a list of uniform distributions. But I can't see how to combine them.
Does anyone have any good solutions to this problem?
Edit: The goal is to create a mini-language (aka Domain Specific Language) for manipulating PDFs. But first I need to sort out the underlying representation and algorithms.
Edit 2: dmckee suggests a histogram implementation. That is what I was getting at with my list of uniform distributions. But I don't see how to combine them to create new distributions. Ultimately I need to find things like P(x < y) in cases where this may be quite small.
Edit 3: I have a bunch of histograms. They are not evenly distributed because I'm generating them from occurance data, so basically if I have 100 samples and I want ten points in the histogram then I allocate 10 samples to each bar, and make the bars variable width but constant area.
I've figured out that to add PDFs you convolve them, and I've boned up on the maths for that. When you convolve two uniform distributions you get a new distribution with three sections: the wider uniform distribution is still there, but with a triangle stuck on each side the width of the narrower one. So if I convolve each element of X and Y I'll get a bunch of these, all overlapping. Now I'm trying to figure out how to sum them all and then get a histogram that is the best approximation to it.
I'm beginning to wonder if Monte-Carlo wasn't such a bad idea after all.
Edit 4: This paper discusses convolutions of uniform distributions in some detail. In general you get a "trapezoid" distribution. Since each "column" in the histograms is a uniform distribution, I had hoped that the problem could be solved by convolving these columns and summing the results.
However the result is considerably more complex than the inputs, and also includes triangles. Edit 5: [Wrong stuff removed]. But if these trapezoids are approximated to rectangles with the same area then you get the Right Answer, and reducing the number of rectangles in the result looks pretty straightforward too. This might be the solution I've been trying to find.
Edit 6: Solved! Here is the final Haskell code for this problem:
-- | Continuous distributions of scalars are represented as a
-- | histogram where each bar has approximately constant area but
-- | variable width and height. A histogram with N bars is stored as
-- | a list of N+1 values.
data Continuous = C {
cN :: Int,
-- ^ Number of bars in the histogram.
cAreas :: [Double],
-- ^ Areas of the bars. #length cAreas == cN#
cBars :: [Double]
-- ^ Boundaries of the bars. #length cBars == cN + 1#
} deriving (Show, Read)
{- | Add distributions. If two random variables #vX# and #vY# are
taken from distributions #x# and #y# respectively then the
distribution of #(vX + vY)# will be #(x .+. y).
This is implemented as the convolution of distributions x and y.
Each is a histogram, which is to say the sum of a collection of
uniform distributions (the "bars"). Therefore the convolution can be
computed as the sum of the convolutions of the cross product of the
components of x and y.
When you convolve two uniform distributions of unequal size you get a
trapezoidal distribution. Let p = p2-p1, q - q2-q1. Then we get:
> | |
> | ______ |
> | | | with | _____________
> | | | | | |
> +-----+----+------- +--+-----------+-
> p1 p2 q1 q2
>
> gives h|....... _______________
> | /: :\
> | / : : \ 1
> | / : : \ where h = -
> | / : : \ q
> | / : : \
> +--+-----+-------------+-----+-----
> p1+q1 p2+q1 p1+q2 p2+q2
However we cannot keep the trapezoid in the final result because our
representation is restricted to uniform distributions. So instead we
store a uniform approximation to the trapezoid with the same area:
> h|......___________________
> | | / \ |
> | |/ \|
> | | |
> | /| |\
> | / | | \
> +-----+-------------------+--------
> p1+q1+p/2 p2+q2-p/2
-}
(.+.) :: Continuous -> Continuous -> Continuous
c .+. d = C {cN = length bars - 1,
cBars = map fst bars,
cAreas = zipWith barArea bars (tail bars)}
where
-- The convolve function returns a list of two (x, deltaY) pairs.
-- These can be sorted by x and then sequentially summed to get
-- the new histogram. The "b" parameter is the product of the
-- height of the input bars, which was omitted from the diagrams
-- above.
convolve b c1 c2 d1 d2 =
if (c2-c1) < (d2-d1) then convolve1 b c1 c2 d1 d2 else convolve1 b d1
d2 c1 c2
convolve1 b p1 p2 q1 q2 =
[(p1+q1+halfP, h), (p2+q2-halfP, (-h))]
where
halfP = (p2-p1)/2
h = b / (q2-q1)
outline = map sumGroup $ groupBy ((==) `on` fst) $ sortBy (comparing fst)
$ concat
[convolve (areaC*areaD) c1 c2 d1 d2 |
(c1, c2, areaC) <- zip3 (cBars c) (tail $ cBars c) (cAreas c),
(d1, d2, areaD) <- zip3 (cBars d) (tail $ cBars d) (cAreas d)
]
sumGroup pairs = (fst $ head pairs, sum $ map snd pairs)
bars = tail $ scanl (\(_,y) (x2,dy) -> (x2, y+dy)) (0, 0) outline
barArea (x1, h) (x2, _) = (x2 - x1) * h
Other operators are left as an exercise for the reader.
No need for histograms or symbolic computation: everything can be done at the language level in closed form, if the right point of view is taken.
[I shall use the term "measure" and "distribution" interchangeably. Also, my Haskell is rusty and I ask you to forgive me for being imprecise in this area.]
Probability distributions are really codata.
Let mu be a probability measure. The only thing you can do with a measure is integrate it against a test function (this is one possible mathematical definition of "measure"). Note that this is what you will eventually do: for instance integrating against identity is taking the mean:
mean :: Measure -> Double
mean mu = mu id
another example:
variance :: Measure -> Double
variance mu = (mu $ \x -> x ^ 2) - (mean mu) ^ 2
another example, which computes P(mu < x):
cdf :: Measure -> Double -> Double
cdf mu x = mu $ \z -> if z < x then 1 else 0
This suggests an approach by duality.
The type Measure shall therefore denote the type (Double -> Double) -> Double. This allows you to model results of MC simulation, numerical/symbolic quadrature against a PDF, etc. For instance, the function
empirical :: [Double] -> Measure
empirical h:t f = (f h) + empirical t f
returns the integral of f against an empirical measure obtained by eg. MC sampling. Also
from_pdf :: (Double -> Double) -> Measure
from_pdf rho f = my_favorite_quadrature_method rho f
construct measures from (regular) densities.
Now, the good news. If mu and nu are two measures, the convolution mu ** nu is given by:
(mu ** nu) f = nu $ \y -> (mu $ \x -> f $ x + y)
So, given two measures, you can integrate against their convolution.
Also, given a random variable X of law mu, the law of a * X is given by:
rescale :: Double -> Measure -> Measure
rescale a mu f = mu $ \x -> f(a * x)
Also, the distribution of phi(X) is given by the image measure phi_* X, in our framework:
apply :: (Double -> Double) -> Measure -> Measure
apply phi mu f = mu $ f . phi
So now you can easily work out an embedded language for measures. There are much more things to do here, particularly with respect to sample spaces other than the real line, dependencies between random variables, conditionning, but I hope you get the point.
In particular, the pushforward is functorial:
newtype Measure a = (a -> Double) -> Double
instance Functor Measure a where
fmap f mu = apply f mu
It is a monad too (exercise -- hint: this very much looks like the continuation monad. What is return ? What is the analog of call/cc ?).
Also, combined with a differential geometry framework, this can probably be turned into something which compute Bayesian posterior distributions automatically.
At the end of the day, you can write stuff like
m = mean $ apply cos ((from_pdf gauss) ** (empirical data))
to compute the mean of cos(X + Y) where X has pdf gauss and Y has been sampled by a MC method whose results are in data.
Probability distributions form a monad; see eg the work of Claire Jones and also the LICS 1989 paper, but the ideas go back to a 1982 paper by Giry (DOI 10.1007/BFb0092872) and to a 1962 note by Lawvere that I cannot track down (http://permalink.gmane.org/gmane.science.mathematics.categories/6541).
But I don't see the comonad: there's no way to get an "a" out of an "(a->Double)->Double". Perhaps if you make it polymorphic - (a->r)->r for all r? (That's the continuation monad.)
Is there anything that stops you from employing a mini-language for this?
By that I mean, define a language that lets you write f = x + y and evaluates f for you just as written. And similarly for g = x * z, h = y(x), etc. ad nauseum. (The semantics I'm suggesting call for the evaluator to select a random number on each innermost PDF appearing on the RHS at evaluation time, and not to try to understand the composted form of the resulting PDFs. This may not be fast enough...)
Assuming that you understand the precision limits you need, you can represent a PDF fairly simply with a histogram or spline (the former being a degenerate case of the later). If you need to mix analytically defined PDFs with experimentally determined ones, you'll have to add a type mechanism.
A histogram is just an array, the contents of which represent the incidence in a particular region of the input range. You haven't said if you have a language preference, so I'll assume something c-like. You need to know the bin-structure (uniorm sizes are easy, but not always best) including the high and low limits and possibly the normalizatation:
struct histogram_struct {
int bins; /* Assumed to be uniform */
double low;
double high;
/* double normalization; */
/* double *errors; */ /* if using, intialize with enough space,
* and store _squared_ errors
*/
double contents[];
};
This kind of thing is very common in scientific analysis software, and you might want to use an existing implementation.
I worked on similar problems for my dissertation.
One way to compute approximate convolutions is to take the Fourier transform of the density functions (histograms in this case), multiply them, then take the inverse Fourier transform to get the convolution.
Look at Appendix C of my dissertation for formulas for various special cases of operations on probability distributions. You can find the dissertation at: http://riso.sourceforge.net
I wrote Java code to carry out those operations. You can find the code at: https://sourceforge.net/projects/riso
Autonomous mobile robotics deals with similar issue in localization and navigation, in particular the Markov localization and Kalman filter (sensor fusion). See An experimental comparison of localization methods continued for example.
Another approach you could borrow from mobile robots is path planning using potential fields.
A couple of responses:
1) If you have empirically determined PDFs they either you have histograms or you have an approximation to a parametric PDF. A PDF is a continuous function and you don't have infinite data...
2) Let's assume that the variables are independent. Then if you make the PDF discrete then P(f(x,y)) = f(x,y)p(x,y) = f(x,y)p(x)p(y) summed over all the combinations of x and y such that f(x,y) meets your target.
If you are going to fit the empirical PDFs to standard PDFs, e.g. the normal distribution, then you can use already-determined functions to figure out the sum, etc.
If the variables are not independent, then you have more trouble on your hands and I think you have to use copulas.
I think that defining your own mini-language, etc., is overkill. you can do this with arrays...
Some initial thoughts:
First, Mathematica has a nice facility for doing this with exact distributions.
Second, representation as histograms (ie, empirical PDFs) is problematic since you have to make choices about bin size. That can be avoided by storing a cumulative distribution instead, ie, an empirical CDF. (In fact, you then retain the ability to recreate the full data set of samples that the empirical distribution is based on.)
Here's some ugly Mathematica code to take a list of samples and return an empirical CDF, namely a list of value-probability pairs. Run the output of this through ListPlot to see a plot of the empirical CDF.
empiricalCDF[t_] :=
Flatten[{{#[[2,1]],#[[1,2]]},#[[2]]}&/#Partition[Prepend[Transpose[{#[[1]],
Rest[FoldList[Plus,0,#[[2]]]]/Length[t]}&[Transpose[{First[#],Length[#]}&/#
Split[Sort[t]]]]],{Null,0}],2,1],1]
Finally, here's some information on combining discrete probability distributions:
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter7.pdf
I think the histograms or the list of 1/N area regions is a good idea. For the sake of argument, I'll assume that you'll have a fixed N for all distributions.
Use the paper you linked edit 4 to generate the new distribution. Then, approximate it with a new N-element distribution.
If you don't want N to be fixed, it's even easier. Take each convex polygon (trapezoid or triangle) in the new generated distribution and approximate it with a uniform distribution.
Another suggestion is to use kernel densities. Especially if you use Gaussian kernels, then they can be relatively easy to work with... except that the distributions quickly explode in size without care. Depending on the application, there are additional approximation techniques like importance sampling that can be used.
If you want some fun, try representing them symbolically like Maple or Mathemetica would do. Maple uses directed acyclic graphs, while Matematica uses a list/lisp like appoach (I believe, but it's been a loooong time, since I even thought about this).
Do all your manipulations symbolically, then at the end push through numerical values. (Or just find a way to launch off in a shell and do the computations).
Paul.
How can I convert a uniform distribution (as most random number generators produce, e.g. between 0.0 and 1.0) into a normal distribution? What if I want a mean and standard deviation of my choosing?
There are plenty of methods:
Do not use Box Muller. Especially if you draw many gaussian numbers. Box Muller yields a result which is clamped between -6 and 6 (assuming double precision. Things worsen with floats.). And it is really less efficient than other available methods.
Ziggurat is fine, but needs a table lookup (and some platform-specific tweaking due to cache size issues)
Ratio-of-uniforms is my favorite, only a few addition/multiplications and a log 1/50th of the time (eg. look there).
Inverting the CDF is efficient (and overlooked, why ?), you have fast implementations of it available if you search google. It is mandatory for Quasi-Random numbers.
The Ziggurat algorithm is pretty efficient for this, although the Box-Muller transform is easier to implement from scratch (and not crazy slow).
Changing the distribution of any function to another involves using the inverse of the function you want.
In other words, if you aim for a specific probability function p(x) you get the distribution by integrating over it -> d(x) = integral(p(x)) and use its inverse: Inv(d(x)). Now use the random probability function (which have uniform distribution) and cast the result value through the function Inv(d(x)). You should get random values cast with distribution according to the function you chose.
This is the generic math approach - by using it you can now choose any probability or distribution function you have as long as it have inverse or good inverse approximation.
Hope this helped and thanks for the small remark about using the distribution and not the probability itself.
Here is a javascript implementation using the polar form of the Box-Muller transformation.
/*
* Returns member of set with a given mean and standard deviation
* mean: mean
* standard deviation: std_dev
*/
function createMemberInNormalDistribution(mean,std_dev){
return mean + (gaussRandom()*std_dev);
}
/*
* Returns random number in normal distribution centering on 0.
* ~95% of numbers returned should fall between -2 and 2
* ie within two standard deviations
*/
function gaussRandom() {
var u = 2*Math.random()-1;
var v = 2*Math.random()-1;
var r = u*u + v*v;
/*if outside interval [0,1] start over*/
if(r == 0 || r >= 1) return gaussRandom();
var c = Math.sqrt(-2*Math.log(r)/r);
return u*c;
/* todo: optimize this algorithm by caching (v*c)
* and returning next time gaussRandom() is called.
* left out for simplicity */
}
Where R1, R2 are random uniform numbers:
NORMAL DISTRIBUTION, with SD of 1:
sqrt(-2*log(R1))*cos(2*pi*R2)
This is exact... no need to do all those slow loops!
Reference: dspguide.com/ch2/6.htm
Use the central limit theorem wikipedia entry mathworld entry to your advantage.
Generate n of the uniformly distributed numbers, sum them, subtract n*0.5 and you have the output of an approximately normal distribution with mean equal to 0 and variance equal to (1/12) * (1/sqrt(N)) (see wikipedia on uniform distributions for that last one)
n=10 gives you something half decent fast. If you want something more than half decent go for tylers solution (as noted in the wikipedia entry on normal distributions)
I would use Box-Muller. Two things about this:
You end up with two values per iteration
Typically, you cache one value and return the other. On the next call for a sample, you return the cached value.
Box-Muller gives a Z-score
You have to then scale the Z-score by the standard deviation and add the mean to get the full value in the normal distribution.
It seems incredible that I could add something to this after eight years, but for the case of Java I would like to point readers to the Random.nextGaussian() method, which generates a Gaussian distribution with mean 0.0 and standard deviation 1.0 for you.
A simple addition and/or multiplication will change the mean and standard deviation to your needs.
The standard Python library module random has what you want:
normalvariate(mu, sigma)
Normal distribution. mu is the mean, and sigma is the standard deviation.
For the algorithm itself, take a look at the function in random.py in the Python library.
The manual entry is here
This is a Matlab implementation using the polar form of the Box-Muller transformation:
Function randn_box_muller.m:
function [values] = randn_box_muller(n, mean, std_dev)
if nargin == 1
mean = 0;
std_dev = 1;
end
r = gaussRandomN(n);
values = r.*std_dev - mean;
end
function [values] = gaussRandomN(n)
[u, v, r] = gaussRandomNValid(n);
c = sqrt(-2*log(r)./r);
values = u.*c;
end
function [u, v, r] = gaussRandomNValid(n)
r = zeros(n, 1);
u = zeros(n, 1);
v = zeros(n, 1);
filter = r==0 | r>=1;
% if outside interval [0,1] start over
while n ~= 0
u(filter) = 2*rand(n, 1)-1;
v(filter) = 2*rand(n, 1)-1;
r(filter) = u(filter).*u(filter) + v(filter).*v(filter);
filter = r==0 | r>=1;
n = size(r(filter),1);
end
end
And invoking histfit(randn_box_muller(10000000),100); this is the result:
Obviously it is really inefficient compared with the Matlab built-in randn.
This is my JavaScript implementation of Algorithm P (Polar method for normal deviates) from Section 3.4.1 of Donald Knuth's book The Art of Computer Programming:
function normal_random(mean,stddev)
{
var V1
var V2
var S
do{
var U1 = Math.random() // return uniform distributed in [0,1[
var U2 = Math.random()
V1 = 2*U1-1
V2 = 2*U2-1
S = V1*V1+V2*V2
}while(S >= 1)
if(S===0) return 0
return mean+stddev*(V1*Math.sqrt(-2*Math.log(S)/S))
}
I thing you should try this in EXCEL: =norminv(rand();0;1). This will product the random numbers which should be normally distributed with the zero mean and unite variance. "0" can be supplied with any value, so that the numbers will be of desired mean, and by changing "1", you will get the variance equal to the square of your input.
For example: =norminv(rand();50;3) will yield to the normally distributed numbers with MEAN = 50 VARIANCE = 9.
Q How can I convert a uniform distribution (as most random number generators produce, e.g. between 0.0 and 1.0) into a normal distribution?
For software implementation I know couple random generator names which give you a pseudo uniform random sequence in [0,1] (Mersenne Twister, Linear Congruate Generator). Let's call it U(x)
It is exist mathematical area which called probibility theory.
First thing: If you want to model r.v. with integral distribution F then you can try just to evaluate F^-1(U(x)). In pr.theory it was proved that such r.v. will have integral distribution F.
Step 2 can be appliable to generate r.v.~F without usage of any counting methods when F^-1 can be derived analytically without problems. (e.g. exp.distribution)
To model normal distribution you can cacculate y1*cos(y2), where y1~is uniform in[0,2pi]. and y2 is the relei distribution.
Q: What if I want a mean and standard deviation of my choosing?
You can calculate sigma*N(0,1)+m.
It can be shown that such shifting and scaling lead to N(m,sigma)
I have the following code which maybe could help:
set.seed(123)
n <- 1000
u <- runif(n) #creates U
x <- -log(u)
y <- runif(n, max=u*sqrt((2*exp(1))/pi)) #create Y
z <- ifelse (y < dnorm(x)/2, -x, NA)
z <- ifelse ((y > dnorm(x)/2) & (y < dnorm(x)), x, z)
z <- z[!is.na(z)]
It is also easier to use the implemented function rnorm() since it is faster than writing a random number generator for the normal distribution. See the following code as prove
n <- length(z)
t0 <- Sys.time()
z <- rnorm(n)
t1 <- Sys.time()
t1-t0
function distRandom(){
do{
x=random(DISTRIBUTION_DOMAIN);
}while(random(DISTRIBUTION_RANGE)>=distributionFunction(x));
return x;
}