What does a Bayesian Classifier score represent? - ruby

I'm using the ruby classifier gem whose classifications method returns the scores for a given string classified against the trained model.
Is the score a percentage? If so, is the maximum difference 100 points?

It's the logarithm of a probability. With a large trained set, the actual probabilities are very small numbers, so the logarithms are easier to compare. Theoretically, scores will range from infinitesimally close to zero down to negative infinity. 10**score * 100.0 will give you the actual probability, which indeed has a maximum difference of 100.
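A quick numeric illustration of that conversion (Python here, with made-up scores; this is only the arithmetic described above, not the gem's API):

    # Hypothetical scores treated as base-10 log-probabilities; converting back
    # to a percentage is just 10**score * 100.0, as described above.
    scores = {"spam": -5.2, "ham": -9.7}

    for label, score in scores.items():
        probability_pct = 10 ** score * 100.0
        print(f"{label}: score={score}, probability={probability_pct:.8f}%")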

Actually, to calculate the probability from a typical naive Bayes classifier where b is the base of the logarithm, the conversion is b^score / (1 + b^score); this is the inverse logit (http://en.wikipedia.org/wiki/Logit). However, given the independence assumptions of the NBC, these scores tend to be too high or too low, and probabilities calculated this way will accumulate at the boundaries. It is better to calculate the scores on a holdout set and regress whether each prediction was accurate (1 or 0) on the score, to get a better feel for the relationship between score and probability.
From a Jason Rennie paper:
2.7 Naive Bayes Outputs Are Often Overconfident
Text databases frequently have 10,000 to 100,000 distinct vocabulary words; documents often contain 100 or more terms. Hence, there is great opportunity for duplication.
To get a sense of how much duplication there is, we trained a MAP Naive Bayes model with 80% of the 20 Newsgroups documents. We produced p(c|d;D) (posterior) values on the remaining 20% of the data and show statistics on max_c p(c|d;D) in table 2.3. The values are highly overconfident. 60% of the test documents are assigned a posterior of 1 when rounded to 9 decimal digits. Unlike logistic regression, Naive Bayes is not optimized to produce reasonable probability values. Logistic regression performs joint optimization of the linear coefficients, converging to the appropriate probability values with sufficient training data. Naive Bayes optimizes the coefficients one-by-one. It produces realistic outputs only when the independence assumption holds true. When the features include significant duplicate information (as is usually the case with text), the posteriors provided by Naive Bayes are highly overconfident.
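A minimal sketch of the calibration idea suggested above (logistic regression of correctness on the raw score, fitted on a held-out set), assuming Python with scikit-learn and stand-in data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # Stand-in data; in practice this would be your documents' term counts.
    X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
    X = np.abs(X)  # MultinomialNB expects non-negative features

    X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
    nb = MultinomialNB().fit(X_train, y_train)

    # Raw NB score (log-probability of the predicted class) on the hold-out set.
    log_probs = nb.predict_log_proba(X_hold)
    score = log_probs.max(axis=1).reshape(-1, 1)
    correct = (log_probs.argmax(axis=1) == y_hold).astype(int)

    # Regress correctness on score: the fitted model maps a raw score to a
    # more realistic probability that the prediction is right.
    calibrator = LogisticRegression().fit(score, correct)
    print(calibrator.predict_proba(score[:5])[:, 1])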

Related

How to form precision-recall curve using one test dataset for my algorithm?

I'm working on a knowledge graph, more precisely in the natural language processing field. To evaluate the components of my algorithm, it is necessary to be able to classify the good and the poor candidates. For this purpose, we manually classified pairs in a dataset.
My system returns the relevant pairs according to the implementation logic. Now I'm able to calculate:
Precision = X
Recall = Y
To establish a complete curve I need the rest of the (X, Y) points. What should I do:
build another dataset for testing?
split my dataset?
or any other solution?
Neither of your two proposed methods. In short, a precision-recall or ROC curve is designed for classifiers with probabilistic output. That is, instead of simply producing a 0 or 1 (in the case of binary classification), you need a classifier that can provide a probability in the [0, 1] range. In sklearn, precision_recall_curve is the function to do it; note how its 2nd parameter is called probas_pred.
To turn these probabilities into concrete class predictions, you can then set a threshold, say at 0.5. Setting such a threshold is problematic, however, since you can trade off precision against recall by varying the threshold, and an arbitrary choice can give a false impression of a classifier's performance. To circumvent this, threshold-independent measures like the area under the ROC or precision-recall curve are used. These sweep the threshold over different values, say 0.1, 0.2, 0.3, ..., 0.9, turn the probabilities into binary classes at each threshold, and then compute precision and recall for each such threshold.
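For concreteness, here is a small sketch of that workflow with synthetic stand-in data (precision_recall_curve sweeps the threshold internally):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import auc, precision_recall_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    probas_pred = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

    # One (precision, recall) point per threshold, no manual thresholding needed.
    precision, recall, thresholds = precision_recall_curve(y_test, probas_pred)
    print("area under the PR curve:", auc(recall, precision))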

Number of neighbours in KNN random subspace classifier

I built a classifier model using KNN as learners for an ensemble based on the random subspace method.
I have three predictors, each with 541 samples, and I developed an optimization procedure to find the best k (number of neighbours).
I chose the k that maximizes the AUC of the classifier, whose performance is computed with 10-fold cross-validation.
The best k turned out to be 269 for each single weak learner (there are 60 of them, as the result of a similar optimization).
Now, my question is:
Are 269 neighbours too many? I trust the results of the optimization, but I have never used so many neighbours and I am worried about overfitting.
Thank you in advance,
MP
The choice of k in k-NN is rather data dependent. We can argue about the general characteristics of smaller or bigger values of k, but labelling a specific number as good or bad is not very meaningful. Because of this, if your CV implementation is correct, you can trust the results and move forward with them, because CV will give the optimal value for your specific case. For a more general discussion, we can say this about the choice of k:
1- Smaller k: a smaller k might increase the overall accuracy and is less costly to implement, but it will make the system less robust to noisy input.
2- Bigger k: a bigger k will make the system more robust against noisy input, but it will be more costly to execute and will have weaker decision boundaries compared to smaller values of k.
You can always weigh these general characteristics when choosing k for your application. For choosing the optimal value, however, an algorithm like CV will give you a definite answer.
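As an illustration of that CV-based selection (synthetic data and an arbitrary candidate grid, not your actual ensemble), a sketch in Python/scikit-learn:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier

    # Stand-in for the 541-sample, 3-predictor problem described above.
    X, y = make_classification(n_samples=541, n_features=3, n_informative=3,
                               n_redundant=0, random_state=0)

    # Pick k by maximizing 10-fold cross-validated AUC.
    search = GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": np.arange(1, 272, 10)},
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    )
    search.fit(X, y)
    print("best k:", search.best_params_["n_neighbors"], "CV AUC:", search.best_score_)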

What is a bad, decent, good, and excellent F1-measure range?

I understand the F1-measure is a harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
Consider sklearn.dummy.DummyClassifier(strategy='uniform'), which is a classifier that makes random guesses (i.e. a bad classifier). We can view DummyClassifier as a benchmark to beat; now let's see its f1-score.
In a binary classification problem with a balanced dataset (6,198 total samples: 3,099 labelled 0 and 3,099 labelled 1), the f1-score is 0.5 for both classes, and the weighted average is 0.5.
Second example: using DummyClassifier(strategy='constant'), i.e. guessing the same label every time (label 1 in this case), the average of the f1-scores is 0.33, while the f1 for label 0 is 0.00.
I consider these to be bad f1-scores, given the balanced dataset.
PS. summary generated using sklearn.metrics.classification_report
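A sketch reproducing that baseline (balanced synthetic labels; the random feature matrix is irrelevant to a dummy classifier, and classification_report prints the summary):

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import classification_report

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6198, 5))               # features are ignored by DummyClassifier
    y = np.array([0] * 3099 + [1] * 3099)        # balanced binary labels

    uniform = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
    print(classification_report(y, uniform.predict(X)))   # f1 around 0.5 for both classes

    constant = DummyClassifier(strategy="constant", constant=1).fit(X, y)
    print(classification_report(y, constant.predict(X)))  # f1 = 0.0 for class 0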
You did not find any reference for an F1-measure range because there is no such range. The F1 measure is a combined metric of precision and recall.
Let's say you have two algorithms: one has higher precision and the other higher recall. From this observation alone, you cannot tell which algorithm is better, unless your goal is simply to maximize precision.
So, given this ambiguity about how to select the superior of the two algorithms (one with higher recall, the other with higher precision), we use the f1-measure to choose between them.
The f1-measure is a relative term, which is why there is no absolute range that defines how good your algorithm is.
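A toy illustration of that point, with made-up precision/recall numbers: the harmonic mean gives one number to compare algorithms, but neither value is "good" or "bad" in the absolute.

    # F1 is the harmonic mean of precision and recall.
    def f1(precision, recall):
        return 2 * precision * recall / (precision + recall)

    print(f1(0.90, 0.60))  # algorithm A: high precision, lower recall -> ~0.72
    print(f1(0.70, 0.80))  # algorithm B: lower precision, higher recall -> ~0.75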

Algorithm for finding the optimal weights for prediction

Suppose I have the following weights and quantitative parameters: w_1..w_n and p_1..p_n, with 0 <= w <= 1. I also have a collection of cases, each with parameter values and an associated outcome value.
What algorithms exist for finding the optimal weights to minimize the errors of predicting the value given the parameters? And what algorithms have typically achieved the best results?
I am trying to predict the quality of an apple based on the parameters p_1 = transport_time and p_2 = days_since_picking. The quality is measured on a subjective Likert scale.
Fifty people have rated apples with scores from 1 to 5, and I know p_1 and p_2 for all those apples. How do I make predictions and find the weights for p_1 and p_2 that minimize the total error over these cases?
I agree with the comment that you should run a web search on "linear regression". At least three other sources for lists of algorithms come to mind:
NLopt: http://ab-initio.mit.edu/wiki/index.php/NLopt_Algorithms (and my C# wrapper for it: https://github.com/BrannonKing/NLoptNet)
S. Boyd's book: http://stanford.edu/~boyd/cvxbook/
You could probably use a supervised AI algorithm. Neural networks are typically made up of "weights": https://en.wikipedia.org/wiki/Supervised_learning
You could also use a genetic algorithm in conjunction with gray code weight encoding.
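As a sketch of the linear-regression suggestion applied to the apple example (the ratings below are invented; the fitted coefficients play the role of the weights for p_1 = transport_time and p_2 = days_since_picking):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical cases: columns are [transport_time, days_since_picking].
    P = np.array([[1, 2], [3, 5], [2, 1], [5, 7], [4, 3], [6, 8]], dtype=float)
    quality = np.array([5, 3, 5, 1, 4, 2], dtype=float)   # 1-5 Likert ratings

    model = LinearRegression().fit(P, quality)             # ordinary least squares
    print("weights:", model.coef_, "intercept:", model.intercept_)
    print("predicted quality:", model.predict([[2.0, 4.0]]))

    # Note: plain least squares does not enforce 0 <= w <= 1; a constrained
    # solver (e.g. one of the NLopt algorithms above) would be needed for that.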

"Covering" the space of all possible histogram shapes

There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1 .1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look up the precomputed answer that corresponds to the histogram with the minimum Earth Mover's Distance (EMD) to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram , computation result) pairs in my look up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
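(For reference, the lookup step itself could look like the following sketch; the stored table and results here are random placeholders, and scipy's 1-D Wasserstein distance stands in for the EMD.)

    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    bins = np.arange(20)                              # bin positions of the 20-entry histograms
    table = rng.dirichlet(np.ones(20), size=100)      # ~100 precomputed histograms (placeholder)
    results = rng.normal(size=100)                    # their precomputed answers (placeholder)

    def lookup(hist):
        # Return the stored result whose histogram has minimum EMD to the input.
        dists = [wasserstein_distance(bins, bins, hist, stored) for stored in table]
        return results[int(np.argmin(dists))]

    print(lookup(rng.dirichlet(np.ones(20))))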
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on #jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
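A sketch of that clustering idea (Dirichlet draws stand in for representative histograms; the nearest-center lookup here is Euclidean for simplicity, and EMD could be substituted at query time):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    histograms = rng.dirichlet(np.ones(20), size=50_000)   # random 20-bin histograms, each sums to 1

    # Keep ~100 cluster centers as the precomputed exemplars.
    kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(histograms)
    exemplars = kmeans.cluster_centers_ / kmeans.cluster_centers_.sum(axis=1, keepdims=True)

    # At run time, map an incoming histogram to its nearest exemplar.
    query = rng.dirichlet(np.ones(20))
    print("use precomputed result for exemplar", kmeans.predict(query.reshape(1, -1))[0])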
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to choose your ranges for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) solution wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still produce the same shape, it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parameters, b/a and b/c.
Here is an 81-histogram "covering" using a quartic model, again with parameters chosen from linspace(-1, 1, 3).
Edit: Since you said your histograms are described by arrays of ~20 elements, I figured fitting the parameters would be very fast.
Edit 2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
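A rough sketch of that parameter-grid covering for the cubic, no-constant model (my own reconstruction of the approach, not the original code):

    import itertools
    import numpy as np

    x = np.linspace(0, 1, 20)          # 20 bins mapped to x in [0, 1]
    grid = np.linspace(-1, 1, 3)       # 3 values per parameter -> up to 3**3 = 27 shapes

    covering = []
    for a, b, c in itertools.product(grid, repeat=3):
        y = a * x**3 + b * x**2 + c * x
        y = y - y.min()                # shift so every bin is non-negative
        if y.sum() > 0:                # drop the degenerate all-zero shape
            covering.append(y / y.sum())   # normalize so the histogram sums to 1

    print(len(covering), "histograms in the covering")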
