Once we have finished collecting sentiment data, we want to label it for fine-grained analysis using five sentiment labels: very negative, negative, neutral, positive, and very positive. How can we distinguish very negative from negative sentiment when labeling if we have no star or rating feature? Likewise, how can we distinguish positive from very positive sentiment?
I would like to know whether the precision-recall curve is relevant for clustering algorithms, for example unsupervised learning techniques such as Mean Shift or DBSCAN, or whether it is relevant only for classification algorithms. If it is, how do I get the plot points for low recall values? Is it allowed to change the model parameters to get low recall rates for a model?
PR curves (and ROC curves) require a ranking.
E.g. a classifier score that can be used to rank objects by how likely they are to belong to class A.
In clustering, you usually do not have such a ranking.
Without a ranking, you don't get a curve. Also, what would precision and recall even mean in clustering? Use measures such as the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) for evaluation instead.
But there are unsupervised methods such as outlier detection where, e.g., the ROC curve is a fairly common evaluation method. The PR curve is more problematic, because precision is not defined at recall 0, and you shouldn't linearly interpolate. Thus, the popular "area under curve" is not well defined for PR curves. Since there are a dozen other measures, I'd avoid PR-AUC because of this.
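To make the "you need a ranking" point concrete, here is a minimal sketch in plain Python of how ROC points fall out of a list of scores, for example outlier scores. The function names `roc_points` and `auc` and the toy data are assumptions made for illustration, not part of any particular library.

```python
def roc_points(scores, labels):
    """Sweep a threshold down the ranking and collect (FPR, TPR) points.

    scores: higher means "more likely positive / more outlying"
    labels: 1 for positive, 0 for negative
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Handle tied scores as one group so ties do not create fake steps.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (made up): five objects, two of them true outliers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

A clustering result that only yields hard cluster labels gives no `scores` list to sort, which is why no curve can be drawn from it.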
I am using the ELKI library to compute distances between features.
Among other features, I am planning to implement Tamura features. From the research that I have done, this algorithm returns a vector that represents three 'unrelated' features (1st element: coarseness, 2nd element: contrast, 3rd-18th elements: directionality). Should the distance between two Tamura feature vectors be measured as a whole, or is it better to measure the distance for each of these three features independently (possibly with different distance functions)?
Besides, I read that the Chi-square and quadratic-form distances are good choices for measuring the distance between histograms, since they use information across bins to retrieve more perceptually desirable results. However, I am still not sure whether they are adequate for the directionality histogram part of the Tamura feature. Can someone suggest a good distance function for this situation?
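To illustrate the "measure each part independently" option, here is a rough sketch in plain Python. It assumes the 18-element layout described above, uses one common variant of the Chi-square distance (definitions differ by a constant factor), and combines the parts with hypothetical weights that would need tuning. It is not ELKI code.

```python
def chi_square_distance(h1, h2):
    """Chi-square distance between two histograms of equal length.

    One common variant: 0.5 * sum((a - b)^2 / (a + b)).
    """
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def tamura_distance(t1, t2, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-component distances.

    Assumed layout: t[0] coarseness, t[1] contrast, t[2:] directionality histogram.
    The weights are hypothetical; the scalar parts likely need rescaling first.
    """
    w_coarse, w_contrast, w_dir = weights
    coarse = abs(t1[0] - t2[0])
    contrast = abs(t1[1] - t2[1])
    direction = chi_square_distance(t1[2:], t2[2:])
    return w_coarse * coarse + w_contrast * contrast + w_dir * direction


# Toy vectors (made up): same directionality histogram, different scalars.
hist = [1.0 / 16] * 16
a = [1.0, 2.0] + hist
b = [2.0, 4.0] + hist
d = tamura_distance(a, b)
```

Keeping the three parts separate makes it possible to swap in a different histogram distance for the directionality bins without touching the coarseness and contrast terms.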
Thanks!
I want to evaluate my KNN classifier with K = 1 against Support Vector Machine classifiers etc., but I'm not sure if the way I am computing the ROC plot is correct. The classifier is constructed for a two-class problem (positive and negative class).
If I understand correctly, to compute the ROC for a KNN with K=20, we get the first point on the plot from the true positive and false positive rates when a test sample is predicted positive if 1 or more of its 20 nearest neighbors are of the positive class. To get the second point, we evaluate the true positive and false positive rates when a test sample is predicted positive if 2 or more of its 20 nearest neighbors are of the positive class. This is repeated until the threshold reaches 20 out of 20 nearest neighbors.
For the case where K=1, does the ROC curve simply have only 1 point on the plot? Is there a better way to compute the ROC for the 1NN case? How can we fairly compare the performance of the 1NN classifier with that of an SVM classifier? Can we compare the classifiers only at the single false positive rate of the 1NN classifier?
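The thresholding procedure described above can be sketched in plain Python as follows. The function name `knn_roc_points` and the toy vote counts are assumptions for illustration. With k=1 the sweep yields a single non-trivial operating point between the trivial corners (0, 0) and (1, 1).

```python
def knn_roc_points(pos_votes, labels, k):
    """ROC points for a k-NN classifier, scored by positive-neighbor count.

    pos_votes[i]: how many of the k nearest neighbors of test sample i are positive
    labels[i]:    true label of test sample i (1 positive, 0 negative)
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    # Threshold t: predict positive when at least t neighbors are positive.
    # t = k + 1 predicts nothing positive; t = 0 predicts everything positive.
    for t in range(k + 1, -1, -1):
        tp = sum(1 for v, y in zip(pos_votes, labels) if v >= t and y == 1)
        fp = sum(1 for v, y in zip(pos_votes, labels) if v >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy 1-NN example (made up): the single neighbor is positive (1) or not (0).
votes_1nn = [1, 1, 0, 0, 0]
truth = [1, 1, 1, 0, 0]
curve_1nn = knn_roc_points(votes_1nn, truth, k=1)
```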
Is SIFT a matching approach meant to replace ZNCC and NCC, or does SIFT just provide input to NCC? In other words, is SIFT proposed as an alternative to the Harris corner detection algorithm?
SIFT is actually a detection, description, and matching pipeline proposed by David Lowe. The reason for its popularity is that it works quite well out of the box.
The detection step of SIFT (which points in the image are interesting), comparable to the Harris corner detector that you mentioned, consists of a Difference of Gaussians detector. This detector is a center surround filter and is applied to a scale space pyramid (also applied in things like pyramidal LK tracking) to detect a maximal scale space response.
The description step (what distinguishes this region) then builds histograms of gradients in rectangular bins with several scales centered around the maximal response scale. This is meant as more descriptive and robust to illumination changes etc. than things like raw pixel values, color histograms, etc. There is also a normalization of dominant orientation to get in-plane rotational invariance.
The matching step (for a given descriptor/patch, which out of a pile of descriptors/patches is closest) for SIFT consists of a nearest-neighbor distance ratio test, which checks the ratio of the distances to the closest match and to the second closest match. The idea is that if the ratio is low, then the first is much better than the second, so you should make the match. Otherwise, the first and second are about equal and you should reject the match, since noise etc. can easily generate a false match in this scenario. This works better than a plain Euclidean distance threshold in practice. For large databases, though, you'll need vector quantization etc. to keep this working accurately and efficiently.
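As a rough illustration of the ratio test described above, here is a minimal sketch in plain Python with brute-force search. The function names and the default `max_ratio=0.8` are assumptions for the example (0.8 is a commonly used value); real descriptors would be 128-dimensional rather than the 2-D toy vectors used here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_match(query, candidates, max_ratio=0.8):
    """Return the index of the accepted match in candidates, or None.

    The closest candidate is accepted only if it is clearly closer than the
    second closest one, i.e. d1 < max_ratio * d2.
    """
    if len(candidates) < 2:
        return None  # the ratio needs a second-best match to compare against
    dists = sorted((euclidean(query, c), i) for i, c in enumerate(candidates))
    (d1, best), (d2, _) = dists[0], dists[1]
    return best if d1 < max_ratio * d2 else None


# Toy descriptors (made up).
clear = ratio_test_match([0, 0], [[0, 1], [5, 5], [10, 0]])  # one clear winner
ambiguous = ratio_test_match([0, 0], [[0, 1], [1, 0]])       # two equally close
```

Writing the check as `d1 < max_ratio * d2` instead of dividing avoids a division by zero when the two best candidates are both exact duplicates of the query.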
Overall, I'd argue that the SIFT descriptor/match is a much better and more robust approach than NCC/ZNCC, though you do pay for it in computational load.
Given a linearly separable dataset, is it necessarily better to use a hard-margin SVM over a soft-margin SVM?
I would expect a soft-margin SVM to be better even when the training dataset is linearly separable. The reason is that in a hard-margin SVM, a single outlier can determine the boundary, which makes the classifier overly sensitive to noise in the data.
In the diagram below, a single red outlier essentially determines the boundary, which is the hallmark of overfitting.
To get a sense of what the soft-margin SVM is doing, it's better to look at it in the dual formulation, where you can see that it has the same margin-maximizing objective (margin could be negative) as the hard-margin SVM, but with an additional constraint that each Lagrange multiplier associated with a support vector is bounded by C. Essentially this bounds the influence of any single point on the decision boundary. For the derivation, see Proposition 6.12 in Cristianini/Shawe-Taylor's "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods".
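For reference, the standard dual of the (1-norm) soft-margin SVM can be written as below. The kernel notation $K$ and the index conventions are assumptions of this sketch rather than the book's exact notation. The hard-margin dual is the same problem with the upper bound $C$ removed.

```latex
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C \;\; (i = 1, \dots, n),
\qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .
```

The box constraint $0 \le \alpha_i \le C$ is what caps the weight any single training point can receive.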
The result is that a soft-margin SVM could choose a decision boundary that has non-zero training error even if the dataset is linearly separable, and is less likely to overfit.
Here's an example using libSVM on a synthetic problem. Circled points show support vectors. You can see that decreasing C causes the classifier to sacrifice linear separability in order to gain stability, in the sense that the influence of any single datapoint is now bounded by C.
Meaning of support vectors:
For a hard-margin SVM, support vectors are the points which are "on the margin". In the picture above, C=1000 is pretty close to a hard-margin SVM, and you can see the circled points are the ones that will touch the margin (the margin is almost 0 in that picture, so it's essentially the same as the separating hyperplane).
For a soft-margin SVM, it's easier to explain them in terms of dual variables. Your support vector predictor in terms of dual variables is the following function.
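In standard notation, that predictor can be written as follows. The kernel $K$ is an assumption of this sketch; for a linear SVM, $K(x_i, x) = x_i \cdot x$.

```latex
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b \right)
```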
Here, the alphas and b are parameters found during the training procedure, the xi's and yi's are your training set, and x is the new datapoint. Support vectors are the datapoints from the training set which are included in the predictor, i.e., the ones with a non-zero alpha parameter.
In my opinion, a hard-margin SVM overfits to a particular dataset and thus cannot generalize. Even in a linearly separable dataset (as shown in the above diagram), outliers well within the boundaries can influence the margin. A soft-margin SVM is more versatile because we have control over the choice of support vectors by tweaking C.