How do I calculate the accuracy? - algorithm

I am in the final stage of SVM implementation in Java.
I have 10 data points, and among them 3 points are misclassified.
So what will be the equation to find accuracy?

Accuracy is (TP + TN) / #samples, where:
TP are the true positives (actual value is +1, and it is classified as +1)
TN are the true negatives (actual value is -1, and it is classified as -1)
In classification tasks, beyond accuracy, many other measures are used to express the performance of a classifier, such as precision, recall, ROC, area under the ROC and F1 score.
You can find further information and equations here: http://en.wikipedia.org/wiki/Receiver_operating_characteristic and here: http://en.wikipedia.org/wiki/Accuracy_and_precision

The accuracy of the algorithm is: #samples_classified_correctly / #samples.
If you have 10 samples and 7 of them are correctly classified, your accuracy is 0.7.
However, note that 10 samples are not statistically enough to estimate the expected accuracy on samples whose classification you don't know (in the "real world").
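As a minimal sketch of that computation in Python (the label values here are invented to match the asker's 10-sample, 3-misclassified case):

```python
# Accuracy = correctly classified samples / total samples.
# Hypothetical labels: 3 of the 10 predictions disagree with the actual labels.
actual    = [+1, +1, +1, +1, +1, -1, -1, -1, -1, -1]
predicted = [+1, +1, +1, -1, +1, -1, +1, -1, +1, -1]

correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)
print(accuracy)  # 7 correct out of 10 -> 0.7
```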

Related

Large Residual-Online Outlier Detection for Kalman Filter

I am trying to find outliers in the residual. I used three algorithms: basically, if the residual magnitudes are small, the algorithms perform well, but if the residual magnitudes are large, they do not.
1) X^2 = (y − h(x))^T S^(−1) (y − h(x)) - Chi-Square Test
If the matrix is 3x3, the degrees of freedom are 4.
X^2 > 13.277
2) Residual(i) > 3 √(H P H^T + R) - Measurement Covariance Noise
3) Residual(i) > 3-Sigma
I have applied these three algorithms to find the outliers: the first is the Chi-Square Test, the second checks the measurement noise covariance, and the third looks at 3-sigma.
Can you give any suggestions about the algorithms, or suggest a new approach I could implement?
The third check cannot be correct in all cases, because it will fail whenever there is a large residual. The second one is more stable because it is tied to the measurement noise covariance, so your residual threshold changes according to the measurement covariance error.
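As a concrete sketch of the first check (pure Python; the 2-dimensional residual, identity innovation covariance, and the 13.277 threshold are illustrative values, not taken from the asker's filter):

```python
# Sketch of the chi-square innovation test: flag a residual r as an outlier
# when r^T S^-1 r exceeds a chi-square critical value for the chosen
# degrees of freedom and significance level.

def chi_square_stat(residual, S_inv):
    """Compute r^T S^-1 r for residual vector r and inverse innovation covariance S_inv."""
    n = len(residual)
    return sum(residual[i] * S_inv[i][j] * residual[j]
               for i in range(n) for j in range(n))

S_inv = [[1.0, 0.0], [0.0, 1.0]]  # identity: already-whitened residual (illustrative)
threshold = 13.277                # e.g. a chi-square critical value at alpha = 0.01

print(chi_square_stat([1.0, 2.0], S_inv) > threshold)  # 5.0  -> False (inlier)
print(chi_square_stat([3.0, 4.0], S_inv) > threshold)  # 25.0 -> True (outlier)
```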

What is a bad, decent, good, and excellent F1-measure range?

I understand the F1-measure is a harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
Consider sklearn.dummy.DummyClassifier(strategy='uniform'), a classifier that makes random guesses (i.e., a bad classifier). We can view DummyClassifier as a benchmark to beat; now let's look at its f1-score.
In a binary classification problem with a balanced dataset (6198 total samples, 3099 samples labelled 0 and 3099 labelled 1), the f1-score is 0.5 for both classes, and the weighted average is 0.5.
As a second example, using DummyClassifier(strategy='constant'), i.e., guessing the same label every time (label 1 in this case), the average of the f1-scores is 0.33, while the f1 for label 0 is 0.00.
I consider these to be bad f1-scores, given the balanced dataset.
PS. summary generated using sklearn.metrics.classification_report
You did not find any reference for an f1-measure range because there is no such range. The F1 measure is a combined metric of precision and recall.
Say you have two algorithms: one has higher precision, the other higher recall. From this observation alone you cannot tell which algorithm is better, unless your goal is to maximize precision.
So, given this ambiguity about how to select the superior of the two algorithms (one with higher recall, the other with higher precision), we use the f1-measure to choose between them.
The f1-measure is a relative term; that's why there is no absolute range to define how good your algorithm is.
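To make the trade-off concrete, here is a minimal from-scratch sketch (the confusion-matrix counts are invented): two hypothetical classifiers with opposite precision/recall trade-offs end up with the same F1.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Classifier A: high precision (0.8), lower recall (~0.67)
# Classifier B: lower precision (~0.67), high recall (0.8)
print(f1_score(tp=80, fp=20, fn=40))  # ~0.727
print(f1_score(tp=80, fp=40, fn=20))  # ~0.727 -- same F1, opposite trade-off
```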

How to optimize finding similarities?

I have a set of 30 000 documents represented by vectors of floats. All vectors have 100 elements. I can find the similarity of two documents by comparing them using the cosine measure between their vectors. The problem is that it takes too much time to find the most similar documents. Is there any algorithm which can help me speed this up?
EDIT
Right now, my code just computes the cosine similarity between the first vector and all the others. It takes about 3 seconds. I would like to speed it up ;) The algorithm doesn't have to be exact, but it should give results similar to a full search.
The sum of the elements of each vector is equal to 1.
import time

start = time.time()
first = allVectors[0]
for vec in allVectors[1:]:
    cosine_measure(vec[1:], first[1:])
print(time.time() - start)
Would locality-sensitive hashing (LSH) help?
With LSH, the hashing function maps similar items near each other with a probability of your choice. It is claimed to be especially well suited for high-dimensional similarity search / nearest-neighbor search / near-duplicate detection, and it looks to me like that's exactly what you are trying to achieve.
See also How to understand Locality Sensitive Hashing?
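A minimal random-hyperplane LSH sketch in pure Python (the 16-bit signature length and the noise level for the near-duplicate are arbitrary choices): similar vectors agree on most signature bits, so candidate neighbors can be found by comparing short signatures instead of full 100-float vectors.

```python
import random

def lsh_signature(vec, hyperplanes):
    """One bit per random hyperplane: the sign of the dot product."""
    return tuple(int(sum(v * h for v, h in zip(vec, hp)) >= 0)
                 for hp in hyperplanes)

random.seed(0)
dim, n_bits = 100, 16
hyperplanes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

a = [random.gauss(0, 1) for _ in range(dim)]
b = [x + 0.01 * random.gauss(0, 1) for x in a]  # near-duplicate of a
c = [random.gauss(0, 1) for _ in range(dim)]    # unrelated vector

sig_a = lsh_signature(a, hyperplanes)
sig_b = lsh_signature(b, hyperplanes)
sig_c = lsh_signature(c, hyperplanes)

# The near-duplicate agrees on (almost) all bits; the unrelated vector on ~half.
print(sum(x == y for x, y in zip(sig_a, sig_b)))
print(sum(x == y for x, y in zip(sig_a, sig_c)))
```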
There is a paper, How to Approximate the Inner-product: Fast Dynamic Algorithms for Euclidean Similarity, describing how to perform a fast approximation of the inner product. If this is not good or fast enough, I suggest building an index containing all your documents. A structure similar to a quadtree but based on a geodesic grid would probably work really well; see Indexing the Sphere with the Hierarchical Triangular Mesh.
UPDATE: I completely forgot that you are dealing with 100 dimensions. Indexing high dimensional data is notoriously hard and I am not sure how well indexing a sphere will generalize to 100 dimensions.
If your vectors are normalized, the cosine is related to the Euclidean distance: ||a - b||² = (a - b)·(a - b) = ||a||² + ||b||² - 2 ||a|| ||b|| cos(t) = 1 + 1 - 2 cos(t) = 2 - 2 cos(t). So you can recast your problem in terms of Euclidean nearest neighbors.
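A quick numeric check of that identity for unit vectors (pure Python; the two example vectors are arbitrary):

```python
import math

def unit(v):
    """Normalize a vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = unit([3.0, 4.0])
b = unit([5.0, 12.0])

cos_t = sum(x * y for x, y in zip(a, b))           # cosine similarity
dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 cos(t)
print(dist_sq, 2 - 2 * cos_t)  # the two values agree
```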
A nice approach is that of kD-trees, a spatial data structure that generalizes binary search (http://en.wikipedia.org/wiki/K-d_tree). However, kD-trees are known to be inefficient in high dimensions (your case), so the so-called best-bin-first search is preferred (http://en.wikipedia.org/wiki/Best-bin-first_search).

Reporting/visualizing scalability results from MPI code... which would be the best way?

As part of my research, I have computed the parallel solution to different banded systems using ScaLAPACK. I am interested in reporting the achieved speedup as a function of both the rank for the matrix, r, and its bandwidth, b.
How would this be better achieved?
Here's my selected universes for both values:
r in {10,000 25,000 50,000 75,000 100,000 500,000 1,000,000 5,000,000 10,000,000}
b in {2 4 8 16 32 64 128 256 512 1024}
The cluster I am using has 64 cores total, so p is in {1, ..., 64}.
I have computed both the speedup and the efficiency, s and e, as a function of p, r and b.
My goal is to somehow show how the speedup behaves as a function of r and b. I was thinking of creating some kind of surface projection of the (r,b)-space. But how can I summarize the behavior of the speedup in one value?
A suggestion I had was to compute the Pearson correlation coefficient using both the attained and ideal (linear) speedup, however, this does NOT seem to work, since it does not take into account the existence of "speedup sweet-spots" that arise for smaller values of r.
Any hint?
Thanks in advance!
After having had some time to think about this, I have decided to report the best achieved speedup multiplied by the Pearson linear correlation coefficient.
Such a plot looks as follows:
The best achieved speedup per instance of (r,b) is weighted by how "close to linear" it is, information contained in the Pearson linear correlation coefficient. Since the latter is a value defined in [-1,1], speedups far from linear will be weighted toward 0, while negative values will indicate slowdown, where this is expected. In the attached plot, we can see that the parallel solver does indeed show proper scalability for small values of the bandwidth, and that scalability gets worse as the bandwidth increases.
If you guys have any hint, or any corrections, please let me know ;)

What does a Bayesian Classifier score represent?

I'm using the ruby classifier gem whose classifications method returns the scores for a given string classified against the trained model.
Is the score a percentage? If so, is the maximum difference 100 points?
It's the logarithm of a probability. With a large trained set, the actual probabilities are very small numbers, so the logarithms are easier to compare. Theoretically, scores will range from infinitesimally close to zero down to negative infinity. 10**score * 100.0 will give you the actual probability as a percentage, which indeed has a maximum difference of 100.
Actually, to calculate the probability from the score of a typical naive Bayes classifier, where b is the base of the logarithm, it is b^score / (1 + b^score). This is the inverse logit (http://en.wikipedia.org/wiki/Logit). However, given the independence assumptions of the NBC, these scores tend to be too high or too low, and probabilities calculated this way will accumulate at the boundaries. It is better to calculate the scores on a holdout set and do a logistic regression of accuracy (1 or 0) on score to get a better feel for the relationship between score and probability.
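As a sketch of that conversion (assuming, as above, that the score behaves like a base-b log-odds; b = 10 here is an illustrative choice):

```python
def score_to_probability(score, base=10.0):
    """Inverse logit: map a base-`base` log-odds score to a probability."""
    odds = base ** score
    return odds / (1.0 + odds)

print(score_to_probability(0.0))   # log-odds 0 -> 0.5
print(score_to_probability(2.0))   # 100/101, very close to 1
print(score_to_probability(-2.0))  # 1/101, very close to 0
```

Note how quickly the probabilities saturate toward 0 or 1 as the score moves away from zero, which is exactly the boundary-accumulation behavior described above.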
From a Jason Rennie paper:
2.7 Naive Bayes Outputs Are Often Overconfident
Text databases frequently have 10,000 to 100,000 distinct vocabulary words; documents often contain 100 or more terms. Hence, there is great opportunity for duplication.
To get a sense of how much duplication there is, we trained a MAP Naive Bayes model with 80% of the 20 Newsgroups documents. We produced p(c|d;D) (posterior) values on the remaining 20% of the data and show statistics on max_c p(c|d;D) in table 2.3. The values are highly overconfident. 60% of the test documents are assigned a posterior of 1 when rounded to 9 decimal digits. Unlike logistic regression, Naive Bayes is not optimized to produce reasonable probability values. Logistic regression performs joint optimization of the linear coefficients, converging to the appropriate probability values with sufficient training data. Naive Bayes optimizes the coefficients one-by-one. It produces realistic outputs only when the independence assumption holds true. When the features include significant duplicate information (as is usually the case with text), the posteriors provided by Naive Bayes are highly overconfident.