word2vec training procedure clarification - gensim

I'm trying to learn the skip-gram model within word2vec, however I'm confused by some of the basic concepts. To start, here is my current understanding of the model motivated with an example. I am using Python gensim as I go.
Here I have a corpus with three sentences.
sentences = [
['i', 'like', 'cats', 'and', 'dogs'],
['i', 'like', 'dogs'],
['dogs', 'like', 'dogs']
]
From this, I can determine my vocabulary, V = ['and', 'cats', 'dogs', 'i', 'like'].
Following this paper by Tomas Mikolov (and others)
The basic Skip-gram formulation defines p(w_t+j |w_t) using the softmax
function:
where v_w and v′_w are the “input” and “output” vector representations
of w, and W is the number of words in the vocabulary.
To my understanding, the skip-gram model involves two matrices (I'll call them I and O) which are the vector representations of "input/center" words and the vector representation of "output/context" words. Assuming d = 2 (vector dimension or 'size' as its called in genism), I should be a 2x5 matrix and O should be a 5x2 matrix. At the start of the training procedure, these matrices are filled with random values (yes?). So we might have
import numpy as np
np.random.seed(2017)
I = np.random.rand(5,2).round(2) # 5 rows by 2 cols
[[ 0.02 0.77] # and
[ 0.45 0.12] # cats
[ 0.93 0.65] # dogs
[ 0.14 0.23] # i
[ 0.23 0.26]] # like
O = np.random.rand(2,5).round(2) # 2 rows by 5 cols
#and #cats #dogs #i #like
[[ 0.11 0.63 0.39 0.32 0.63]
[ 0.29 0.94 0.15 0.08 0.7 ]]
Now if I want to calculate the probability that the word "dogs" appears in the context of "cats" I should do
exp([0.39, 0.15] * [0.45 0.12])/(...) = (0.1125)/(...)
A few questions on this:
Is my understanding of the algorithm correct thus far?
Using genism, I can train a model on this data using
import gensim
model = gensim.models.Word2Vec(sentences, sg = 1, size=2, window=1, min_count=1)
model.wv['dogs'] # array([ 0.06249372, 0.22618999], dtype=float32)
For the array given, is that the vector for "dogs" in the Input matrix or the Output matrix? Is there a way to view both matrices in the final model?
Why does model.wv.similarity('cats','cats') = 1? I thought this should be closer to 0, since the data would indicate that the word "cats" is unlikely to occur in the context of the word "cats".

(1) Generally, yes, but:
The O output matrix – more properly understood as the weights from the neural-network's hidden layer, to a number of output nodes – is interpreted differently whether using 'negative sampling' ('NS') or 'hierarchical softmax' ('HS') training.
In practice in both I and O are len(vocab) rows and vector-size columns. (I is the Word2Vec model instance's model.wv.syn0 array; O is its model.syn1neg array in NS or model.syn1 in HS.)
I find NS a bit easier to think about: each predictable word corresponds to a single output node. For training data where (context)-indicates->(word), training tries to drive that word's node value toward 1.0, and the other randomly-chosen word node values toward 0.0.
In HS, each word is represented by a huffman-code of a small subset of the output nodes – those 'points' are driven to 1.0 or 0.0 to make the network more indicative of a single word after a (context)-indicates->(word) example.
Only the I matrix, initial word values, are randomized to low-magnitude vectors at the beginning. (The hidden-to-output weights O are left zeros.)
(2) Yes, that'll train things - just note that tiny toy-sized examples won't necessarily generate the useful constellations-of-vector-coordinates that are valued from word2vec.
(3) Note, model.similarity('cats', 'cats') is actually checking the cosine-similarity between the (input) vectors for those two words. Those are the same word, thus they definitionally have the same vector, and the similarity between identical vectors is 1.0.
That is, similarity() is not asking the model for a prediction, it's retrieving learned words by key and comparing those vectors. (Recent versions of gensim do have a predict_output_word() function, but it only works in NS mode, and making predictions isn't really the point of word2vec, and many implementations don't offer any prediction API at all. Rather, the point is using those attempted predictions during training to induce word-vectors that turn out to be useful for various other tasks.)
But even if you were reading predictions, 'cats' might still be a reasonable-although-bad prediction from the model in the context of 'cats'. The essence of forcing large vocabularies into the smaller dimensionality of 'dense' embeddings is compression – the model has no choice but to cluster related words together, because there's not enough internal complexity (learnable parameters) to simply memorize all details of the input. (And for the most part, that's a good thing, because it results in generalizable patterns, rather than just overfit idiosyncrasies of the training corpus.)
The word 'cats' will wind up close to 'dogs' and 'pets' – because they all co-occur with similar words, or each other. And thus the model will be forced to make similar output-predictions for each, because their input-vectors don't vary that much. And a few predictions that are nonsensical in logical language use – like a repeating word - may be made, but only because taking a larger error there still gives less error over the whole training set, compared to other weighting alternatives.

Related

What techniques are effective to find periodicity in arbitrary data points?

By "arbitrary" I mean that I don't have a signal sampled on a grid that is amenable to taking an FFT. I just have points (e.g. in time) where events happened, and I'd like an estimate of the rate, for example:
p = [0, 1.1, 1.9, 3, 3.9, 6.1 ...]
...could be hits from a process with a nominal periodicity (repetition interval) of 1.0, but with noise and some missed detections.
Are there well known methods for processing such data?
A least square algorithm may do the trick, if correctly initialized. A clustering method can be applied to this end.
As an FFT is performed, the signal is depicted as a sum of sine waves. The amplitude of the frequencies may be depicted as resulting from a least square fit on the signal. Hence, if the signal is unevenly sampled, resolving the same least square problem may make sense if the Fourier transform is to be estimated. If applied to a evenly sampled signal, it boils down to the same result.
As your signal is descrete, you may want to fit it as a sum of Dirac combs. It seems more sound to minimize the sum of squared distance to the nearest Dirac of the Dirac comb. This is a non-linear optimization problem where Dirac combs are described by their period and offset. This non-linear least-square problem can be solved by mean of the Levenberg-Marquardt algorithm. Here is an python example making use of the scipy.optimize.leastsq() function. Moreover, the error on the estimated period and offset can be estimated as depicted in How to compute standard deviation errors with scipy.optimize.least_squares . It is also documented in the documentation of curve_fit() and Getting standard errors on fitted parameters using the optimize.leastsq method in python
Nevertheless, half the period, or the thrid of the period, ..., would also fit, and multiples of the period are local minima that are to be avoided by a refining the initialization of the Levenberg-Marquardt algorithm. To this end, the differences between times of events can be clustered, the cluster featuring the smallest value being that of the expected period. As proposed in Clustering values by their proximity in python (machine learning?) , the clustering function sklearn.cluster.MeanShift() is applied.
Notice that the procedure can be extended to multidimentionnal data to look for periodic patterns or mixed periodic patterns featuring different fundamental periods.
import numpy as np
from scipy.optimize import least_squares
from scipy.optimize import leastsq
from sklearn.cluster import MeanShift, estimate_bandwidth
ticks=[0,1.1,1.9,3,3.9,6.1]
import scipy
print scipy.__version__
def crudeEstimate():
# loooking for the period by looking at differences between values :
diffs=np.zeros(((len(ticks))*(len(ticks)-1))/2)
k=0
for i in range(len(ticks)):
for j in range(i):
diffs[k]=ticks[i]-ticks[j]
k=k+1
#see https://stackoverflow.com/questions/18364026/clustering-values-by-their-proximity-in-python-machine-learning
X = np.array(zip(diffs,np.zeros(len(diffs))), dtype=np.float)
bandwidth = estimate_bandwidth(X, quantile=1.0/len(ticks))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
print cluster_centers
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
for k in range(n_clusters_):
my_members = labels == k
print "cluster {0}: {1}".format(k, X[my_members, 0])
estimated_period=np.min(cluster_centers[:,0])
return estimated_period
def disttoDiracComb(x):
residual=np.zeros((len(ticks)))
for i in range(len(ticks)):
mindist=np.inf
for j in range(len(x)/2):
offset=x[2*j+1]
period=x[2*j]
#print period, offset
index=np.floor((ticks[i]-offset)/period)
#print 'index', index
currdist=ticks[i]-(index*period+offset)
if currdist>0.5*period:
currdist=period-currdist
index=index+1
#print 'event at ',ticks[i], 'not far from index ',index, '(', currdist, ')'
#currdist=currdist*currdist
#print currdist
if currdist<mindist:
mindist=currdist
residual[i]=mindist
#residual=residual-period*period
#print x, residual
return residual
estimated_period=crudeEstimate()
print 'crude estimate by clustering :',estimated_period
xp=np.array([estimated_period,0.0])
#res_1 = least_squares(disttoDiracComb, xp,method='lm',xtol=1e-15,verbose=1)
p,pcov,infodict,mesg,ier=leastsq(disttoDiracComb, x0=xp,ftol=1e-18, full_output=True)
#print ' p is ',p, 'covariance is ', pcov
# see https://stackoverflow.com/questions/14581358/getting-standard-errors-on-fitted-parameters-using-the-optimize-leastsq-method-i
s_sq = (disttoDiracComb(p)**2).sum()/(len(ticks)-len(p))
pcov=pcov *s_sq
perr = np.sqrt(np.diag(pcov))
#print 'estimated standard deviation on parameter :' , perr
print 'estimated period is ', p[0],' +/- ', 1.96*perr[0]
print 'estimated offset is ', p[1],' +/- ', 1.96*perr[1]
Applied to your sample, it prints :
crude estimate by clustering : 0.975
estimated period is 1.0042857141346768 +/- 0.04035792507868619
estimated offset is -0.011428571139828817 +/- 0.13385206912205957
It sounds like you need to decide what exactly you want to determine. If you want to know the average interval in a set of timestamps, then that's easy (just take the mean or median).
If you expect that the interval could be changing, then you need to have some idea about how fast it is changing. Then you can find a windowed moving average. You need to have an idea of how fast it is changing so that you can select your window size appropriately - a larger window will give you a smoother result, but a smaller window will be more responsive to a faster-changing rate.
If you have no idea whether the data is following any sort of pattern, then you are probably in the territory of data exploration. In that case, I would start by plotting the intervals, to see if a pattern appears to the eye. This might also benefit from applying a moving average if the data is quite noisy.
Essentially, whether or not there is something in the data and what it means is up to you and your knowledge of the domain. That is, in any set of timestamps there will be an average (and you can also easily calculate the variance to give an indication of variability in the data), but it is up to you whether that average carries any meaning.

Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attribute

I am having a hard time with the question below. I am not sure if I got it correct, but either way, I need some help futher understanding it if anyone has time to explain, please do.
Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attributes:
− Age (customer’s age, which is a real number with the maximum age is 90 years and minimum age 15 years)
− Cr (“credit rating”) which is ordinal attribute with values ‘very good’, ‘good, ‘medium’, ‘poor’, and ‘very poor’.
− Av_bal (avg account balance, which is a real number with mean 7000, standard deviation is 4000)
Using the L1 distance function computes the distance between the following 2 customers: c1 = (55, good, 7000) and c2 = (25, poor, 1000). [15 points]
Using the L2 distance function computes the distance between the above mentioned 2 customers
Using the L2 distance function computes the distance between the above mentioned 2 customers.
Answer with L1
d(c1,c2) = (c1.cr-c2.cr)/4 +(c1.avg.bal –c2.avg.bal/4000)* (c1.age-mean.age/std.age)-( c2.age-mean.age/std.age)
The question as is, leaves some room for interpretation. Mainly because similarity is not specified exactly. I will try to explain what the standard approach would be.
Usually, before you start, you want to normalize values such that they are rougly in the same range. Otherwise, your similarity will be dominated by the feature with the largest variance.
If you have no information about the distribution but just the range of the values you want to try to nomalize them to [0,1]. For your example this means
norm_age = (age-15)/(90-15)
For nominal values you want to find a mapping to ordinal values if you want to use Lp-Norms. Note: this is not always possible (e.g., colors cannot intuitively be mapped to ordinal values). In you case you can transform the credit rating like this
cr = {0 if ‘very good’, 1 if ‘good, 2 if ‘medium’, 3 if ‘poor’, 4 if ‘very poor’}
afterwards you can do the same normalization as for age
norm_cr = cr/4
Lastly, for normally distributed values you usually perform standardization by subtracting the mean and dividing by the standard deviation.
norm_av_bal = (av_bal-7000)/4000
Now that you have normalized your values, you can go ahead and define the distance functions:
L1(c1, c2) = |c1.norm_age - c2.norm_age| + |c1.norm_cr - c2.norm_cr |
+ |c1.norm_av_bal - c2.norm_av_bal|
and
L2(c1, c2) = sqrt((c1.norm_age - c2.norm_age)2 + (c1.norm_cr -
c2.norm_cr)2 + (c1.norm_av_bal -
c2.norm_av_bal)2)

Why does Perlin noise use a hash function rather than computing random values?

I'm reading through this explanation of Perlin noise which describes a hash function that is calculates random points for all x, y coordinates.
If the x, y coordinate hashes are generated randomly which are eventually used for computing the gradient's and such, why couldn't I just generate random numbers on the fly?
Is it simply a question of optimization that we use a permutation on hash maps to find our random values? The only reason I could think of is that permutations through our hash map some how generates a smoothening effect but I fail to see how.
Just for clarification, I'm refering to this section in the code:
private static readonly int[] p = { 151,160,137,91,90,15, // Hash lookup table as defined by Ken Perlin. This is a randomly
131,13,201,95,96,53,194,233,7,225,140,36,103,30,69,142,8,99,37,240,21,10,23, // arranged array of all numbers from 0-255 inclusive.
190, 6,148,247,120,234,75,0,26,197,62,94,252,219,203,117,35,11,32,57,177,33,
88,237,149,56,87,174,20,125,136,171,168, 68,175,74,165,71,134,139,48,27,166,
77,146,158,231,83,111,229,122,60,211,133,230,220,105,92,41,55,46,245,40,244,
102,143,54, 65,25,63,161, 1,216,80,73,209,76,132,187,208, 89,18,169,200,196,
135,130,116,188,159,86,164,100,109,198,173,186, 3,64,52,217,226,250,124,123,
5,202,38,147,118,126,255,82,85,212,207,206,59,227,47,16,58,17,182,189,28,42,
223,183,170,213,119,248,152, 2,44,154,163, 70,221,153,101,155,167, 43,172,9,
129,22,39,253, 19,98,108,110,79,113,224,232,178,185, 112,104,218,246,97,228,
251,34,242,193,238,210,144,12,191,179,162,241, 81,51,145,235,249,14,239,107,
49,192,214, 31,181,199,106,157,184, 84,204,176,115,121,50,45,127, 4,150,254,
138,236,205,93,222,114,67,29,24,72,243,141,128,195,78,66,215,61,156,180
};
int aaa, aba, aab, abb, baa, bba, bab, bbb;
aaa = p[p[p[ xi ]+ yi ]+ zi ];
aba = p[p[p[ xi ]+inc(yi)]+ zi ];
aab = p[p[p[ xi ]+ yi ]+inc(zi)];
abb = p[p[p[ xi ]+inc(yi)]+inc(zi)];
baa = p[p[p[inc(xi)]+ yi ]+ zi ];
bba = p[p[p[inc(xi)]+inc(yi)]+ zi ];
bab = p[p[p[inc(xi)]+ yi ]+inc(zi)];
bbb = p[p[p[inc(xi)]+inc(yi)]+inc(zi)];
Why don't we just initialize the values as follows?
aaa = random(255)
aab = random(255)
// ...
The key idea behind Perlin noise generation is to create a grid of points, each of which is assigned some vector value, and then to interpolate between those points in a specific way.
I checked out Ken Perlin's original paper on Perlin noise and it seems like as far back as the original paper he recommends using a hash function to do this:
Associate with each point in the integer lattice a pseudorandom value and x, y, and z gradient values. More precisely, map each ordered sequence of three integers into an uncorrelated ordered sequence of four real numbers [a,b,c,d] = H([x,y,z]), where [a,b,e,d] define a linear equation with gradient [a,b,c] and value d at [x,y,z]. H is best implemented as a hash function.
(Emphasis mine).
I suspect that the reason for this has to do with memory concerns. Perlin noise generation requires that the gradient function at different points in space be reevaluated multiple times over the course of the run of the algorithm. Accordingly, you could either
have some formula that, given a point in space, evaluates to the gradient, or
explicitly create a table and store all of the random values that you need.
Option (1) is what Ken Perlin is proposing. The advantage of this approach is that the memory usage required to store the gradients is minimal; you just need to use a hash function.
Option (2) is what you're proposing. This works just fine, but it uses a ton of memory (you need multiple values stored for each point in the integer lattice you're working with). Remember that Perlin's paper was written back in 1985 (!) when memory was much, much scarcer than it is today.
My suspicion is that you can get away with either approach, but given that you don't need true randomness, the pseudorandomness afforded by a good hash function should be sufficient.
I can't explain why the author of that article you read chose to use the particular hash function that they did, though. My guess is that it's "random enough" and sufficiently fast that it doesn't end up being the bottleneck in the computation; remember that the hash function gets called a lot of times in the noise generation code. This seems to be the standard approach to implementing Perlin noise; even Ken Perlin mentions using this hash function on his site.
What you can't do is the approach you're proposing of just letting the variables aaa, aab, aba, etc. be random. The reason why is that the Perlin noise algorithm requires you to reevaluate the noise term at a given point multiple times and expects that it will give back the same values every time. If you wanted to compute truly random values, you could do so, but you'd need to cache your results so that you give back consistent answers of the noise terms at each point.

OCR'ed and real string similarity

The problem:
There is a set of word S = {W1,W2.. Wn} where n < 10. This set just exists, we do not know its content.
These words are drawn on some image and then recognized. The OCR algorytm is poor as well as dpi and as a result there are mistakes. So we have a second set of errorneous words S' = {W1',W2'..Wn'}
Now we have a word W that is a member of original set S. And now I need and algorythm which, given W and S', return index of the word in S'. most similar to W.
Example. S is {"alpha", "bravo", "charlie"}, S' is for example {"alPha","hravc","onarlio"} (these are real possible ocr erros).
So the target function should return F("alpha") => 0, F("bravo") => 1, F("charlie") => 2
I tried Levenshtein distance, but it does not work well, because it returns small numbers on small strings and OCRed string can be longer than original.
Example if W' is {'hornist','cornrnunist'} and the given word is 'communist' the Levenshtein distance is 4 for the both words, but the right one is second.
Any suggestions?
As a zero approach, I'd suggest you to use the modification of Levenshtein distance algorithm with conditional cost of replacing/deleting/adding characters:
Distance(i, j) = min(Distance(i-1, j-1) + replace_cost(a.charAt(i), b.charAt(j)),
Distance(i-1, j ) + insert_cost(b.charAt(j)),
Distance(i , j-1) + delete_cost(a.charAt(i)))
You can implement function replace_cost in such way, that it will returns small values for visually similar characters (and high values for visually different characters), e.g.:
// visually similar characters
replace_cost('o', '0') = 0.1
replace_cost('o', 'O') = 0.1
replace_cost('O', '0') = 0.1
...
// visually different characters
replace_cost('O', 'K') = 0.9
...
And the similar approach can be used for insert_cost and delete_cost (e.g. you may notice, that during the OCR - some characters are more likely to disappear than others).
Also, in case when approach from above is not enough for you, I'd suggest you to look at Noisy channel model - which is widely used for spelling correction (this subject described very well in Natural Language Processing course by Dan Jurafsky, Christopher Manning - "Week 2 - Spelling Correction").
This appears to be quite difficult to do because the misread strings are not necessarily textually similar to the input, which is why Levinshtein distance won't work for you. The words are visually corrupted, not simply mistyped. You could try creating a dataset of common errors (o => 0, l -> 1, e => o) and then do some sort of comparison based on that.
If you have access to the OCR algorithm, you could run that algorithm again on a much broader set of inputs (with known outputs) and train a neural network to recognize common errors. Then you could use that model to predict mistakes in your original dataset (maybe overkill for an array of only ten items).

A simple explanation of Naive Bayes Classification [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
I am finding it hard to understand the process of Naive Bayes, and I was wondering if someone could explain it with a simple step by step process in English. I understand it takes comparisons by times occurred as a probability, but I have no idea how the training data is related to the actual dataset.
Please give me an explanation of what role the training set plays. I am giving a very simple example for fruits here, like banana for example
training set---
round-red
round-orange
oblong-yellow
round-red
dataset----
round-red
round-orange
round-red
round-orange
oblong-yellow
round-red
round-orange
oblong-yellow
oblong-yellow
round-red
The accepted answer has many elements of k-NN (k-nearest neighbors), a different algorithm.
Both k-NN and NaiveBayes are classification algorithms. Conceptually, k-NN uses the idea of "nearness" to classify new entities. In k-NN 'nearness' is modeled with ideas such as Euclidean Distance or Cosine Distance. By contrast, in NaiveBayes, the concept of 'probability' is used to classify new entities.
Since the question is about Naive Bayes, here's how I'd describe the ideas and steps to someone. I'll try to do it with as few equations and in plain English as much as possible.
First, Conditional Probability & Bayes' Rule
Before someone can understand and appreciate the nuances of Naive Bayes', they need to know a couple of related concepts first, namely, the idea of Conditional Probability, and Bayes' Rule. (If you are familiar with these concepts, skip to the section titled Getting to Naive Bayes')
Conditional Probability in plain English: What is the probability that something will happen, given that something else has already happened.
Let's say that there is some Outcome O. And some Evidence E. From the way these probabilities are defined: The Probability of having both the Outcome O and Evidence E is:
(Probability of O occurring) multiplied by the (Prob of E given that O happened)
One Example to understand Conditional Probability:
Let say we have a collection of US Senators. Senators could be Democrats or Republicans. They are also either male or female.
If we select one senator completely randomly, what is the probability that this person is a female Democrat? Conditional Probability can help us answer that.
Probability of (Democrat and Female Senator)= Prob(Senator is Democrat) multiplied by Conditional Probability of Being Female given that they are a Democrat.
P(Democrat & Female) = P(Democrat) * P(Female | Democrat)
We could compute the exact same thing, the reverse way:
P(Democrat & Female) = P(Female) * P(Democrat | Female)
Understanding Bayes Rule
Conceptually, this is a way to go from P(Evidence| Known Outcome) to P(Outcome|Known Evidence). Often, we know how frequently some particular evidence is observed, given a known outcome. We have to use this known fact to compute the reverse, to compute the chance of that outcome happening, given the evidence.
P(Outcome given that we know some Evidence) = P(Evidence given that we know the Outcome) times Prob(Outcome), scaled by the P(Evidence)
The classic example to understand Bayes' Rule:
Probability of Disease D given Test-positive =
P(Test is positive|Disease) * P(Disease)
_______________________________________________________________
(scaled by) P(Testing Positive, with or without the disease)
Now, all this was just preamble, to get to Naive Bayes.
Getting to Naive Bayes'
So far, we have talked only about one piece of evidence. In reality, we have to predict an outcome given multiple evidence. In that case, the math gets very complicated. To get around that complication, one approach is to 'uncouple' multiple pieces of evidence, and to treat each of piece of evidence as independent. This approach is why this is called naive Bayes.
P(Outcome|Multiple Evidence) =
P(Evidence1|Outcome) * P(Evidence2|outcome) * ... * P(EvidenceN|outcome) * P(Outcome)
scaled by P(Multiple Evidence)
Many people choose to remember this as:
P(Likelihood of Evidence) * Prior prob of outcome
P(outcome|evidence) = _________________________________________________
P(Evidence)
Notice a few things about this equation:
If the Prob(evidence|outcome) is 1, then we are just multiplying by 1.
If the Prob(some particular evidence|outcome) is 0, then the whole prob. becomes 0. If you see contradicting evidence, we can rule out that outcome.
Since we divide everything by P(Evidence), we can even get away without calculating it.
The intuition behind multiplying by the prior is so that we give high probability to more common outcomes, and low probabilities to unlikely outcomes. These are also called base rates and they are a way to scale our predicted probabilities.
How to Apply NaiveBayes to Predict an Outcome?
Just run the formula above for each possible outcome. Since we are trying to classify, each outcome is called a class and it has a class label. Our job is to look at the evidence, to consider how likely it is to be this class or that class, and assign a label to each entity.
Again, we take a very simple approach: The class that has the highest probability is declared the "winner" and that class label gets assigned to that combination of evidences.
Fruit Example
Let's try it out on an example to increase our understanding: The OP asked for a 'fruit' identification example.
Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange or some Other Fruit.
We know 3 characteristics about each fruit:
Whether it is Long
Whether it is Sweet and
If its color is Yellow.
This is our 'training set.' We will use this to predict the type of any new fruit we encounter.
Type Long | Not Long || Sweet | Not Sweet || Yellow |Not Yellow|Total
___________________________________________________________________
Banana | 400 | 100 || 350 | 150 || 450 | 50 | 500
Orange | 0 | 300 || 150 | 150 || 300 | 0 | 300
Other Fruit | 100 | 100 || 150 | 50 || 50 | 150 | 200
____________________________________________________________________
Total | 500 | 500 || 650 | 350 || 800 | 200 | 1000
___________________________________________________________________
We can pre-compute a lot of things about our fruit collection.
The so-called "Prior" probabilities. (If we didn't know any of the fruit attributes, this would be our guess.) These are our base rates.
P(Banana) = 0.5 (500/1000)
P(Orange) = 0.3
P(Other Fruit) = 0.2
Probability of "Evidence"
p(Long) = 0.5
P(Sweet) = 0.65
P(Yellow) = 0.8
Probability of "Likelihood"
P(Long|Banana) = 0.8
P(Long|Orange) = 0 [Oranges are never long in all the fruit we have seen.]
....
P(Yellow|Other Fruit) = 50/200 = 0.25
P(Not Yellow|Other Fruit) = 0.75
Given a Fruit, how to classify it?
Let's say that we are given the properties of an unknown fruit, and asked to classify it. We are told that the fruit is Long, Sweet and Yellow. Is it a Banana? Is it an Orange? Or Is it some Other Fruit?
We can simply run the numbers for each of the 3 outcomes, one by one. Then we choose the highest probability and 'classify' our unknown fruit as belonging to the class that had the highest probability based on our prior evidence (our 1000 fruit training set):
P(Banana|Long, Sweet and Yellow)
P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) * P(banana)
= _______________________________________________________________
P(Long) * P(Sweet) * P(Yellow)
= 0.8 * 0.7 * 0.9 * 0.5 / P(evidence)
= 0.252 / P(evidence)
P(Orange|Long, Sweet and Yellow) = 0
P(Other Fruit|Long, Sweet and Yellow)
P(Long|Other fruit) * P(Sweet|Other fruit) * P(Yellow|Other fruit) * P(Other Fruit)
= ____________________________________________________________________________________
P(evidence)
= (100/200 * 150/200 * 50/200 * 200/1000) / P(evidence)
= 0.01875 / P(evidence)
By an overwhelming margin (0.252 >> 0.01875), we classify this Sweet/Long/Yellow fruit as likely to be a Banana.
Why is Bayes Classifier so popular?
Look at what it eventually comes down to. Just some counting and multiplication. We can pre-compute all these terms, and so classifying becomes easy, quick and efficient.
Let z = 1 / P(evidence). Now we quickly compute the following three quantities.
P(Banana|evidence) = z * Prob(Banana) * Prob(Evidence1|Banana) * Prob(Evidence2|Banana) ...
P(Orange|Evidence) = z * Prob(Orange) * Prob(Evidence1|Orange) * Prob(Evidence2|Orange) ...
P(Other|Evidence) = z * Prob(Other) * Prob(Evidence1|Other) * Prob(Evidence2|Other) ...
Assign the class label of whichever is the highest number, and you are done.
Despite the name, Naive Bayes turns out to be excellent in certain applications. Text classification is one area where it really shines.
Your question as I understand it is divided in two parts, part one being you need a better understanding of the Naive Bayes classifier & part two being the confusion surrounding Training set.
In general all of Machine Learning Algorithms need to be trained for supervised learning tasks like classification, prediction etc. or for unsupervised learning tasks like clustering.
During the training step, the algorithms are taught with a particular input dataset (training set) so that later on we may test them for unknown inputs (which they have never seen before) for which they may classify or predict etc (in case of supervised learning) based on their learning. This is what most of the Machine Learning techniques like Neural Networks, SVM, Bayesian etc. are based upon.
So in a general Machine Learning project basically you have to divide your input set to a Development Set (Training Set + Dev-Test Set) & a Test Set (or Evaluation set). Remember your basic objective would be that your system learns and classifies new inputs which they have never seen before in either Dev set or test set.
The test set typically has the same format as the training set. However, it is very important that the test set be distinct from the training corpus: if we simply
reused the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores.
In general, for an example, 70% of our data can be used as training set cases. Also remember to partition the original set into the training and test sets randomly.
Now I come to your other question about Naive Bayes.
To demonstrate the concept of Naïve Bayes Classification, consider the example given below:
As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and often used to predict outcomes before they actually happen.
Thus, we can write:
Prior Probability of GREEN: number of GREEN objects / total number of objects
Prior Probability of RED: number of RED objects / total number of objects
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:
Prior Probability for GREEN: 40 / 60
Prior Probability for RED: 20 / 60
Having formulated our prior probability, we are now ready to classify a new object (WHITE circle in the diagram below). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely that the new cases belong to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label. From this we calculate the likelihood:
From the illustration above, it is clear that Likelihood of X given GREEN is smaller than Likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED) the likelihood indicates otherwise; that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).
Finally, we classify X as RED since its class membership achieves the largest posterior probability.
Naive Bayes comes under supervising machine learning which used to make classifications of data sets.
It is used to predict things based on its prior knowledge and independence assumptions.
They call it naive because it’s assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications.
It is classification algorithm which makes the decision for the unknown data set. It is based on Bayes Theorem which describe the probability of an event based on its prior knowledge.
Below diagram shows how naive Bayes works
Formula to predict NB:
How to use Naive Bayes Algorithm ?
Let's take an example of how N.B woks
Step 1: First we find out Likelihood of table which shows the probability of yes or no in below diagram.
Step 2: Find the posterior probability of each class.
Problem: Find out the possibility of whether the player plays in Rainy condition?
P(Yes|Rainy) = P(Rainy|Yes) * P(Yes) / P(Rainy)
P(Rainy|Yes) = 2/9 = 0.222
P(Yes) = 9/14 = 0.64
P(Rainy) = 5/14 = 0.36
Now, P(Yes|Rainy) = 0.222*0.64/0.36 = 0.39 which is lower probability which means chances of the match played is low.
For more reference refer these blog.
Refer GitHub Repository Naive-Bayes-Examples
Ram Narasimhan explained the concept very nicely here below is an alternative explanation through the code example of Naive Bayes in action
It uses an example problem from this book on page 351
This is the data set that we will be using
In the above dataset if we give the hypothesis = {"Age":'<=30', "Income":"medium", "Student":'yes' , "Creadit_Rating":'fair'} then what is the probability that he will buy or will not buy a computer.
The code below exactly answers that question.
Just create a file called named new_dataset.csv and paste the following content.
Age,Income,Student,Creadit_Rating,Buys_Computer
<=30,high,no,fair,no
<=30,high,no,excellent,no
31-40,high,no,fair,yes
>40,medium,no,fair,yes
>40,low,yes,fair,yes
>40,low,yes,excellent,no
31-40,low,yes,excellent,yes
<=30,medium,no,fair,no
<=30,low,yes,fair,yes
>40,medium,yes,fair,yes
<=30,medium,yes,excellent,yes
31-40,medium,no,excellent,yes
31-40,high,yes,fair,yes
>40,medium,no,excellent,no
Here is the code the comments explains everything we are doing here! [python]
import pandas as pd
import pprint
class Classifier():
data = None
class_attr = None
priori = {}
cp = {}
hypothesis = None
def __init__(self,filename=None, class_attr=None ):
self.data = pd.read_csv(filename, sep=',', header =(0))
self.class_attr = class_attr
'''
probability(class) = How many times it appears in cloumn
__________________________________________
count of all class attribute
'''
def calculate_priori(self):
class_values = list(set(self.data[self.class_attr]))
class_data = list(self.data[self.class_attr])
for i in class_values:
self.priori[i] = class_data.count(i)/float(len(class_data))
print "Priori Values: ", self.priori
'''
Here we calculate the individual probabilites
P(outcome|evidence) = P(Likelihood of Evidence) x Prior prob of outcome
___________________________________________
P(Evidence)
'''
def get_cp(self, attr, attr_type, class_value):
data_attr = list(self.data[attr])
class_data = list(self.data[self.class_attr])
total =1
for i in range(0, len(data_attr)):
if class_data[i] == class_value and data_attr[i] == attr_type:
total+=1
return total/float(class_data.count(class_value))
'''
Here we calculate Likelihood of Evidence and multiple all individual probabilities with priori
(Outcome|Multiple Evidence) = P(Evidence1|Outcome) x P(Evidence2|outcome) x ... x P(EvidenceN|outcome) x P(Outcome)
scaled by P(Multiple Evidence)
'''
def calculate_conditional_probabilities(self, hypothesis):
for i in self.priori:
self.cp[i] = {}
for j in hypothesis:
self.cp[i].update({ hypothesis[j]: self.get_cp(j, hypothesis[j], i)})
print "\nCalculated Conditional Probabilities: \n"
pprint.pprint(self.cp)
def classify(self):
print "Result: "
for i in self.cp:
print i, " ==> ", reduce(lambda x, y: x*y, self.cp[i].values())*self.priori[i]
if __name__ == "__main__":
c = Classifier(filename="new_dataset.csv", class_attr="Buys_Computer" )
c.calculate_priori()
c.hypothesis = {"Age":'<=30', "Income":"medium", "Student":'yes' , "Creadit_Rating":'fair'}
c.calculate_conditional_probabilities(c.hypothesis)
c.classify()
output:
Priori Values: {'yes': 0.6428571428571429, 'no': 0.35714285714285715}
Calculated Conditional Probabilities:
{
'no': {
'<=30': 0.8,
'fair': 0.6,
'medium': 0.6,
'yes': 0.4
},
'yes': {
'<=30': 0.3333333333333333,
'fair': 0.7777777777777778,
'medium': 0.5555555555555556,
'yes': 0.7777777777777778
}
}
Result:
yes ==> 0.0720164609053
no ==> 0.0411428571429
I try to explain the Bayes rule with an example.
What is the chance that a random person selected from the society is a smoker?
You may reply 10%, and let's assume that's right.
Now, what if I say that the random person is a man and is 15 years old?
You may say 15 or 20%, but why?.
In fact, we try to update our initial guess with new pieces of evidence ( P(smoker) vs. P(smoker | evidence) ). The Bayes rule is a way to relate these two probabilities.
P(smoker | evidence) = P(smoker)* p(evidence | smoker)/P(evidence)
Each evidence may increase or decrease this chance. For example, this fact that he is a man may increase the chance provided that this percentage (being a man) among non-smokers is lower.
In the other words, being a man must be an indicator of being a smoker rather than a non-smoker. Therefore, if an evidence is an indicator of something, it increases the chance.
But how do we know that this is an indicator?
For each feature, you can compare the commonness (probability) of that feature under the given conditions with its commonness alone. (P(f | x) vs. P(f)).
P(smoker | evidence) / P(smoker) = P(evidence | smoker)/P(evidence)
For example, if we know that 90% of smokers are men, it's not still enough to say whether being a man is an indicator of being smoker or not. For example if the probability of being a man in the society is also 90%, then knowing that someone is a man doesn't help us ((90% / 90%) = 1. But if men contribute to 40% of the society, but 90% of the smokers, then knowing that someone is a man increases the chance of being a smoker (90% / 40%) = 2.25, so it increases the initial guess (10%) by 2.25 resulting 22.5%.
However, if the probability of being a man was 95% in the society, then regardless of the fact that the percentage of men among smokers is high (90%)! the evidence that someone is a man decreases the chance of him being a smoker! (90% / 95%) = 0.95).
So we have:
P(smoker | f1, f2, f3,... ) = P(smoker) * contribution of f1* contribution of f2 *...
=
P(smoker)*
(P(being a man | smoker)/P(being a man))*
(P(under 20 | smoker)/ P(under 20))
Note that in this formula we assumed that being a man and being under 20 are independent features so we multiplied them, it means that knowing that someone is under 20 has no effect on guessing that he is man or woman. But it may not be true, for example maybe most adolescence in a society are men...
To use this formula in a classifier
The classifier is given with some features (being a man and being under 20) and it must decide if he is an smoker or not (these are two classes). It uses the above formula to calculate the probability of each class under the evidence (features), and it assigns the class with the highest probability to the input. To provide the required probabilities (90%, 10%, 80%...) it uses the training set. For example, it counts the people in the training set that are smokers and find they contribute 10% of the sample. Then for smokers checks how many of them are men or women .... how many are above 20 or under 20....In the other words, it tries to build the probability distribution of the features for each class based on the training data.

Resources