I would like to know whether the precision-recall curve is relevant for clustering algorithms, for example when using unsupervised learning techniques such as Mean Shift or DBSCAN (or is it relevant only for classification algorithms?). If yes, how do I get the plot points for low recall values? Is it allowed to change the model parameters in order to obtain low recall rates for a model?
PR curves (and ROC curves) require a ranking.
E.g. a classifier score that can be used to rank objects by how likely they belong to class A or not.
In clustering, you usually do not have such a ranking.
Without a ranking, you don't get a curve. Besides, what would precision and recall even mean in clustering? Use ARI and NMI for evaluation instead.
But there are unsupervised methods such as outlier detection where, e.g., the ROC curve is a fairly common evaluation method. The PR curve is more problematic, because it is not defined at recall 0, and you shouldn't linearly interpolate. Thus, the popular "area under the curve" is not well defined for PR curves. Since there are a dozen other measures, I'd avoid PR-AUC because of this.
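For instance, here is a minimal sketch of evaluating a clustering (DBSCAN, as in the question) against ground-truth labels with ARI and NMI in scikit-learn; the toy dataset and the eps/min_samples values are placeholder choices:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy data with known labels, purely for illustration
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

# Cluster without using the labels
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# Compare the clustering to the ground truth
print("ARI:", adjusted_rand_score(y_true, labels))
print("NMI:", normalized_mutual_info_score(y_true, labels))
```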
Say we are making a program to render the plot of a function (a black box provided by the user) as a sequence of line segments. We want to use the minimum number of samples of the function so that the resulting image "looks" like the function (the exact meaning of "looks" here is part of the question). A naive approach might be to sample at fixed intervals, but we can probably do better than that, e.g. by sampling the "curvy bits" more densely than the "linear bits". Are there systematic approaches/research on this problem?
This reference, which uses a combined sampling method, can be helpful. Its related-work section explains more about other sampling methods:
There are several strategies for plotting the function y = f(x) on the interval Ω = [a, b]. The naive approach based on sampling of f in a fixed amount of equally spaced points is described in [20]. The simple functions suffer from oversampling, while the oscillating curves are under-sampled; these issues are mentioned in [14]. Another approach, based on the interval constraint plot constructing a hull of the curve, was described in [6], [13], [20]. The automated detection of a useful domain and a range of the function is mentioned in [41]; the generalized interval arithmetic approach is described in [40].

A significant refinement is represented by adaptive sampling, providing a higher sampling density in the higher-curvature regions. There are several algorithms for the curve interpolation preserving the speed, for example: [37], [42], [43]. The adaptive feed rate technique is described in [44]. An early implementation in the Mathematica software is presented in [39]. By reducing data, these methods are very efficient for the curve plotting. The polygonal approximation of the parametric curve based on adaptive sampling is mentioned in several papers. The refinement criteria, as well as the recursive approach, are discussed in [15]. An approximation by the polygonal curves is described in [7], the robust method for the geometric and spatial approximation of the implicit curves can be found in [27], [10], and the affine arithmetic working in the triangulated models in [32]. However, the map projections are never defined by the implicit equations. Similar approaches can be used for graph drawing [21].

Other techniques based on the approximation by the breakpoints can be found in many papers: [33], [9], [3]; these approaches are used for the polygonal approximation of the closed curves and applied in computer vision.
Hence, these are the reference methods that define some measure of a "good" plot and introduce an approach to optimize the plot based on that measure:
constructing a hull of the curve
automated detection of a useful domain and a range of the function
adaptive sampling: providing a higher sampling density in the higher-curvature regions (see the sketch after this list)
approximation by the polygonal curves
affine arithmetic working in the triangulated models
combined sampling: a polygonal approximation of the parametric curve that also handles discontinuities; the domain is split into subintervals that contain no discontinuities, which makes it a typical problem solvable by a recursive approach. The modified method is used to reconstruct and plot the function f(x).
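To make the adaptive-sampling idea concrete, here is a minimal recursive sketch (my own illustration, not the method from the cited paper): an interval is split whenever the chord between its endpoints poorly approximates the function at the midpoint, so high-curvature regions automatically receive more samples.

```python
import math

def adaptive_sample(f, a, b, tol=1e-3, depth=0, max_depth=20):
    """Return a list of (x, f(x)) samples of f on [a, b]."""
    m = 0.5 * (a + b)
    fa, fm, fb = f(a), f(m), f(b)
    chord_mid = 0.5 * (fa + fb)
    # Flatness test: if the chord already approximates f at the midpoint
    # (or we hit the recursion limit), one segment is good enough.
    if abs(fm - chord_mid) <= tol or depth >= max_depth:
        return [(a, fa), (b, fb)]
    left = adaptive_sample(f, a, m, tol, depth + 1, max_depth)
    right = adaptive_sample(f, m, b, tol, depth + 1, max_depth)
    return left[:-1] + right   # drop the duplicated midpoint

points = adaptive_sample(math.sin, 0.0, 2 * math.pi)
print(len(points), "samples")
```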
In machine learning, a lot of techniques require defining a metric between different data points. I want to know what are some popular metrics when the data are images.
An obvious way of measuring distance between images is to sum up the squares of pixel errors. But this is sensitive to simple transformations like translation. For example, even shifting the whole image by one pixel could result in a large distance.
What are some other distance measures that are more robust to translations, rotations, etc.?
The Wasserstein distance (earth mover's distance) and the Kullback-Leibler divergence are two that I have come across while studying the literature on Generative Adversarial Networks (GANs).
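As a small illustration of the earth mover's distance on images (not a GAN-specific recipe): one option is to compare grayscale intensity histograms with SciPy, which is insensitive to where in the frame the pixels sit. The function name and bin count below are my own choices.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def histogram_emd(img_a, img_b, bins=64):
    """EMD between the intensity histograms of two grayscale images (0-255)."""
    ha, edges = np.histogram(img_a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(img_b, bins=bins, range=(0, 255), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return wasserstein_distance(centers, centers, u_weights=ha, v_weights=hb)

a = np.random.default_rng(0).integers(0, 256, (64, 64))
b = np.roll(a, shift=5, axis=1)     # shifted copy: histogram is unchanged
print(histogram_emd(a, b))          # ~0, unlike a pixelwise squared error
```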
I was trying to make an application in Java with OpenCV that compares the difference between two images. After trying various approaches, I came across the Demons algorithm.
To me it seems to give the difference between images as some transformation at each location, but I couldn't understand it since the references I found were too complex for me.
Even if the Demons algorithm does not do what I need, I'm interested in learning it.
Can anyone explain simply what happens in the Demons algorithm and how to write simple code to apply it to two images?
I can give you an overview of general algorithms for deformable image registration; Demons is one of them.
There are three components to such an algorithm: a similarity metric, a transformation model, and an optimization algorithm.
A similarity metric is used to compute pixel-based or patch-based similarity. Common measures are SSD and normalized cross-correlation for mono-modal images, while information-theoretic measures like mutual information are used for multi-modal image registration.
In the case of deformable registration, a regular grid is generally superimposed over the image, and the grid is deformed by solving an optimization problem formulated so that the similarity metric and a smoothness penalty on the transformation are minimized. Once the grid displacements are found, the final transformation at the pixel level is computed using a B-spline interpolation of the grid, so that the transformation is smooth and continuous.
There are two general approaches to solving the optimization problem: some people use discrete optimization and solve it as an MRF optimization problem, while others use gradient descent; I think Demons uses gradient descent.
In MRF-based approaches, the unary cost is the cost of deforming each node in the grid, computed as the similarity between patches. The pairwise cost, which imposes smoothness of the grid, is generally a Potts or truncated-quadratic potential that ensures neighboring nodes in the grid have almost the same displacement. Once you have the unary and pairwise costs, you feed them to an MRF optimization algorithm and get the displacements at the grid level; then you use B-spline interpolation to compute pixel-level displacements. This process is repeated in a coarse-to-fine fashion over several scales, and the algorithm is also run several times at each scale (reducing the displacement at each node every time).
In gradient-descent-based methods, the problem is formulated with the similarity metric and the grid transformation computed over the image, and the gradient of the resulting energy function is derived. The energy function is minimized using iterative gradient descent; however, these approaches can get stuck in a local minimum and are quite slow.
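Since Demons falls in this family, here is a minimal NumPy sketch of one Thirion-style demons iteration, as I understand it (a simplification for illustration: the displacement field is updated from the intensity difference along the fixed-image gradient, then Gaussian-smoothed as the regularizer; the function name, sigma and iteration count are my own choices, not the original algorithm's values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def demons_step(fixed, moving, ux, uy, sigma=2.0, eps=1e-8):
    """One basic demons update. fixed, moving: 2-D float arrays;
    ux, uy: current displacement field (same shape as the images)."""
    # Warp the moving image with the current displacement field
    yy, xx = np.meshgrid(np.arange(fixed.shape[0]),
                         np.arange(fixed.shape[1]), indexing="ij")
    warped = map_coordinates(moving, [yy + uy, xx + ux], order=1, mode="nearest")

    # Demons force: intensity difference driven along the fixed-image gradient
    diff = warped - fixed
    gy, gx = np.gradient(fixed)
    denom = gx**2 + gy**2 + diff**2 + eps
    ux = ux - diff * gx / denom
    uy = uy - diff * gy / denom

    # Gaussian smoothing of the field acts as the regularizer
    return gaussian_filter(ux, sigma), gaussian_filter(uy, sigma)

# Tiny usage example on synthetic data
fixed = np.random.default_rng(0).random((64, 64))
moving = np.roll(fixed, 2, axis=1)
ux = np.zeros_like(fixed)
uy = np.zeros_like(fixed)
for _ in range(50):
    ux, uy = demons_step(fixed, moving, ux, uy)
```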
Some popular tools are DROP and Elastix; ITK also provides some tools.
If you want to know more about algorithms related to deformable image registration, I recommend you take a look at FAIR (guide book). FAIR is a toolbox for Matlab, so you will have examples to help you understand the theory.
http://www.cas.mcmaster.ca/~modersit/FAIR/
Then, if you want to see a Demons example specifically, here is another toolbox:
http://www.mathworks.es/matlabcentral/fileexchange/21451-multimodality-non-rigid-demon-algorithm-image-registration
So I'm looking to apply a clustering algorithm to the earthquake data provided by the USGS.
http://earthquake.usgs.gov/earthquakes/feed/
My main goal is to determine, based on an earthquake feed, the top 10 most dangerous places to be (either by the number of earthquakes or by the magnitude of the earthquakes a place experiences).
Are there any suggestions on how to do it? I'm looking at k-means, then taking, for each cluster, the weighted sum of the earthquake magnitudes to find the most dangerous clusters.
I'm also writing this in Ruby, for reference.
Thanks
K-means can't handle outliers in the data set very well.
Furthermore, it is designed around variance, but variance in latitude and longitude is not really meaningful. In fact, k-means cannot handle the ±180° longitude wrap-around. Instead, you will want to use the great-circle distance.
So try to use a density-based clustering algorithm that allows you to use distances such as the great-circle distance!
Read up on Wikipedia and a good book on cluster analysis.
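As an illustration of that suggestion, here is a sketch in Python with scikit-learn (rather than Ruby): DBSCAN with the haversine metric on (latitude, longitude) in radians, then ranking clusters by summed magnitude. The sample events, eps radius and min_samples are placeholder values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical input: one row per earthquake, (latitude, longitude, magnitude)
events = np.array([
    [38.32, 142.37, 9.0],
    [36.12, -120.43, 4.1],
    # ... more events parsed from the USGS feed
])

coords_rad = np.radians(events[:, :2])      # haversine expects radians
earth_radius_km = 6371.0
eps_km = 100.0                               # cluster radius, an assumption

db = DBSCAN(eps=eps_km / earth_radius_km,    # convert km to radians
            min_samples=5, metric="haversine").fit(coords_rad)

# Rank clusters by summed magnitude (ignoring noise points labelled -1)
labels = db.labels_
scores = {c: events[labels == c, 2].sum() for c in set(labels) if c != -1}
top10 = sorted(scores, key=scores.get, reverse=True)[:10]
print(top10)
```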
Given a linearly separable dataset, is it necessarily better to use a hard-margin SVM over a soft-margin SVM?
I would expect a soft-margin SVM to be better even when the training dataset is linearly separable. The reason is that in a hard-margin SVM, a single outlier can determine the boundary, which makes the classifier overly sensitive to noise in the data.
In the diagram below, a single red outlier essentially determines the boundary, which is the hallmark of overfitting.
To get a sense of what a soft-margin SVM is doing, it's better to look at it in the dual formulation, where you can see that it has the same margin-maximizing objective (the margin can be negative) as the hard-margin SVM, but with an additional constraint that each Lagrange multiplier associated with a support vector is bounded by C. Essentially this bounds the influence of any single point on the decision boundary; for a derivation, see Proposition 6.12 in Cristianini/Shawe-Taylor's "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods".
The result is that a soft-margin SVM can choose a decision boundary that has non-zero training error even if the dataset is linearly separable, and it is less likely to overfit.
Here's an example using libSVM on a synthetic problem. Circled points show support vectors. You can see that decreasing C causes the classifier to sacrifice linear separability in order to gain stability, in the sense that the influence of any single data point is now bounded by C.
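A rough reproduction of that experiment can be done with scikit-learn instead of libSVM (a sketch, not the original script; the toy data, the outlier position and the C values are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-3, 0], 0.5, (20, 2)),   # class -1 cloud
               rng.normal([+3, 0], 0.5, (20, 2)),   # class +1 cloud
               [[-1.0, 0.0]]])                       # single outlier of class +1
y = np.array([-1] * 20 + [1] * 20 + [1])

for C in (1000, 1, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: train error={1 - clf.score(X, y):.2f}, "
          f"#support vectors={len(clf.support_)}")
```

With large C the boundary is pulled toward the outlier (near hard-margin behavior); with small C the outlier may be misclassified but the boundary depends far less on it.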
Meaning of support vectors:
For a hard-margin SVM, support vectors are the points which are "on the margin". In the picture above, C=1000 is pretty close to a hard-margin SVM, and you can see that the circled points are the ones that touch the margin (the margin is almost 0 in that picture, so it's essentially the same as the separating hyperplane).
For a soft-margin SVM, it's easier to explain them in terms of the dual variables. Your support vector predictor in terms of the dual variables is the following function.
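In kernel form, that predictor is the standard dual expression (writing K for the kernel): $f(x) = \operatorname{sign}\big(\sum_i \alpha_i y_i \, K(x_i, x) + b\big)$.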
Here, the alphas and b are parameters found during the training procedure, the xi's and yi's are your training set, and x is the new data point. Support vectors are the data points from the training set that are included in the predictor, i.e., the ones with a non-zero alpha parameter.
In my opinion, a hard-margin SVM overfits to a particular dataset and thus cannot generalize. Even in a linearly separable dataset (as shown in the diagram above), outliers well within the boundaries can influence the margin. A soft-margin SVM has more versatility because we have control over choosing the support vectors by tweaking C.