How do you plot an ROC curve for a 1-nearest-neighbor classifier (k-NN)?

I want to evaluate my k-NN classifier with K = 1 against support vector machine classifiers, etc., but I'm not sure whether the way I am computing the ROC plot is correct. The classifier is constructed for a two-class problem (a positive and a negative class).
If I understand correctly, to compute the ROC for a k-NN with K = 20, we get the first point on the plot from the true positive and false positive rates of the test samples for which 1 or more of the 20 nearest neighbors belong to the positive class. For the second point we evaluate the true positive and false positive rates of the test samples for which 2 or more of the 20 nearest neighbors belong to the positive class. This is repeated until the threshold reaches 20 out of 20 nearest neighbors.
For the case where K = 1, does the ROC curve then simply have a single point on the plot? Is there a better way to compute the ROC for the 1-NN case? How can we fairly compare the performance of the 1-NN classifier against an SVM classifier? Can we only compare the classifiers at the single false positive rate of the 1-NN classifier?
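For what it's worth, here is a minimal Python sketch of the thresholding idea described above, assuming scikit-learn is available; the synthetic data and the distance-difference score mentioned at the end are illustrative assumptions, not the only option. With the default uniform weights, predict_proba returns exactly the fraction of positive neighbors being thresholded.

# Minimal sketch: ROC for a k-NN classifier via the fraction of positive neighbors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, auc

X, y = make_classification(n_samples=400, n_features=5, random_state=0)  # synthetic stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=20).fit(X_train, y_train)

# predict_proba gives the fraction of the 20 neighbours that are positive,
# i.e. exactly the quantity being thresholded in the description above.
scores = knn.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
print("k=20 AUC:", auc(fpr, tpr))

# With K = 1 the score is only ever 0 or 1, so the ROC collapses to a single
# interior operating point. One workaround (an assumption, not the only one)
# is a continuous score such as the difference between the distance to the
# nearest negative and the nearest positive training point.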

Related

How to combine Rational B-spline Surfaces?

How can Rational B-spline Surfaces be combined into one, or fewer, surfaces? How do parameters such as tolerance, u/v degree, and u/v span count influence the final result, if at all?
In general, there is no way to create a single rational B-spline surface that is the exact merge of the 4 input rational B-spline surfaces, so you will have to settle for an approximation. Consequently, there is no need for this approximating surface to be rational. The approximation schemes are typically divided into two categories:
1) Given the degree and number of spans in the U and V directions, try to find the "best fit" surface to the 4 surfaces. Typically, the max deviation between the output surface and the input surfaces is also computed so that users know how well the surface fits the input.
2) Given degree in U and V directions and a tolerance value, try to find the "best fit" surface to the 4 surfaces where the max deviation between the output and the input is smaller than the input tolerance value.
The 2nd approach normally uses the algorithm of the 1st approach and iterates over the number of spans in the U/V directions to determine the optimum number of spans. Therefore, it typically takes a lot longer than the 1st approach.
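As a rough illustration of scheme 2) using scheme 1) as the inner step, here is a Python sketch with SciPy's least-squares spline fitting; the sampled surface, tolerance, and span counts are made up, and LSQBivariateSpline fits a non-rational B-spline surface, consistent with the remark above that the approximating surface need not be rational.

import numpy as np
from scipy.interpolate import LSQBivariateSpline

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, 2000)
v = rng.uniform(0.0, 1.0, 2000)
z = np.sin(3 * u) * np.cos(2 * v)   # stand-in for points sampled from the 4 input surfaces

degree, tol = 3, 1e-3
for n_spans in range(2, 20):
    # Scheme 1): fixed degree and span count, least-squares "best fit" surface.
    knots = np.linspace(0.0, 1.0, n_spans + 1)[1:-1]   # interior knots only
    fit = LSQBivariateSpline(u, v, z, knots, knots, kx=degree, ky=degree)
    max_dev = np.max(np.abs(fit.ev(u, v) - z))         # max deviation reported to the user
    print(n_spans, "spans per direction, max deviation", max_dev)
    # Scheme 2): stop as soon as the fit is within the requested tolerance.
    if max_dev < tol:
        break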

How to determine if a pattern of distribution is different from a random/uniform distribution

Here is my case:
Let's say we have 50 polygons (rings of a semicircle; the figure is not reproduced here) and a point set distributed within these 50 polygons, so that for each polygon there is an associated point density. What I want to test is whether the distribution pattern of this data set (for example, the fluctuations in density across the 50 polygons) is a realization of spatial randomness.
The method I use is: in the uniformly random case, the number of points in each ring follows a binomial distribution, i.e. X ~ B(n, p), where n is the total number of points and p is the probability of a point falling inside a particular polygon (p = Area_polygon / Area_semicircle). So for each polygon I can calculate the expected number of points and, from that, the expected density. Then I can apply one-way ANOVA to compare two groups: the actual density group and the theoretical density group.
However, I found a problem: when calculating the density, I actually divide the expected number by the area. But considering the expected number
E = N(total number of points) * Area_polygon / total area,
the density is then
D = N(total number of points) / total area,
which means that the expected density is the same number for every polygon.
So in that case, is it still suitable to use one-way ANOVA to compare my actual density group to a group in which all the numbers are the same?
What if I use counts rather than densities? Or are there other, more suitable tests?
You may want to look up a method called "quadrat test". It is explained in the online help for the function quadrat.test in the R package spatstat and more extensively in the spatstat book. (Disclaimer: I'm a coauthor.)
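For readers without R, here is a rough Python sketch of the same idea: a quadrat-style chi-squared test compares the observed per-polygon counts against the counts expected under spatial randomness (proportional to polygon area), which also sidesteps the problem that all the expected densities are equal. The areas and counts below are made-up numbers.

import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
areas = rng.uniform(1.0, 5.0, size=50)       # hypothetical areas of the 50 polygons
p = areas / areas.sum()                       # P(a random point falls in polygon i)
observed = rng.multinomial(1000, p)           # stand-in for the real per-polygon counts

expected = 1000 * p                           # expected counts under spatial randomness
stat, pval = chisquare(observed, f_exp=expected)
print(stat, pval)   # a small p-value would suggest a departure from randomness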

Classification of K-nearest neighbours algorithm

The x and y axes have different scales in this scatter plot.
Assume the centre of each shape to be the datapoint.
Q: What will be the classification of a test point for a 9-nearest-neighbour classifier using this training set, using both features?
Q: On the scatter plot at the top of the page, in any order, name the classes of the three nearest neighbours of the bottom left unknown point, using both features to compute distance.
Here's my attempt:
1: A higher K, 9 in this case, means more voters in each prediction and hence more resilience to outliers. Larger values of K give smoother decision boundaries when deciding between Pet and Wild here, which means lower variance but increased bias.
2: Using the Pythagorean theorem, the distances of the three nearest points to the bottom left unknown point are:
Pet, Distance = 0.02
Pet, Distance = 2.20
Wild, Distance = 2.60
Therefore, the class is Pet.
Question 1 asks for a specific answer (Pet or Wild), which you have not provided. The statements you've made are generally true, but they don't actually answer the question. Notice that there are only 4 Pet points, and the rest are Wild. So no matter which 9 points are the nearest neighbors, at least 5 (a majority) will be Wild. Hence, a KNN classifier with K = 9 will always predict Wild using this data.
Question 2 looks mostly right. I don't have the exact coordinates of the points, but your numbers seem to be in the right ballpark, except you probably have a typo in the first distance. The classes are right, and the resulting prediction (which the question didn't explicitly ask for) is also right (assuming K = 3).
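Since the actual scatter plot is not reproduced here, the following Python sketch only illustrates the voting logic with made-up coordinates; the class counts (4 Pet, rest Wild) come from the answer above, everything else is an assumption.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (4, 2)),    # 4 hypothetical "Pet" points
               rng.normal(3.0, 1.0, (9, 2))])   # 9 hypothetical "Wild" points
y = np.array(["Pet"] * 4 + ["Wild"] * 9)
query = np.array([-1.0, -1.0])                  # stand-in for the bottom left unknown point

d = np.linalg.norm(X - query, axis=1)           # Euclidean (Pythagorean) distances
order = np.argsort(d)

for k in (3, 9):
    vote = Counter(y[order[:k]]).most_common(1)[0][0]
    print(k, vote)
# With K = 9 the vote must be "Wild": at most 4 of any 9 neighbours can be "Pet".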

KNN classifier algorithm not working for all cases

To efficiently find the n nearest neighbors of a point in d-dimensional space, I selected the dimension with the greatest scatter (i.e. the coordinate in which differences between points are largest). The whole range from the minimal to the maximal value in this dimension was split into k bins. Each bin contains the points whose coordinates (in this dimension) fall within the range of that bin. It was ensured that there are at least 2n points in each bin.
The algorithm for finding n nearest neighbors of point x is following:
Identify the bin kx in which point x lies (its projection, to be precise).
Compute the distances between x and all the points in bin kx.
Sort the computed distances in ascending order.
Select the first n distances. The points to which these distances were measured are returned as the n nearest neighbors of x.
This algorithm does not work in all cases. When can the algorithm fail to find the true nearest neighbors?
Can anyone propose a modification of the algorithm to ensure proper operation in all cases?
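Here is a small NumPy sketch of the procedure as described (synthetic data, hypothetical bin count); it deliberately searches only the query's own bin, which reproduces the failure discussed in the answers below.

import numpy as np

def build_bins(points, n_bins):
    dim = np.argmax(points.max(axis=0) - points.min(axis=0))   # dimension with greatest scatter
    edges = np.linspace(points[:, dim].min(), points[:, dim].max(), n_bins + 1)
    bin_of = np.clip(np.digitize(points[:, dim], edges) - 1, 0, n_bins - 1)
    return dim, edges, bin_of

def nn_in_own_bin(points, dim, edges, bin_of, x, n):
    b = np.clip(np.digitize(x[dim], edges) - 1, 0, len(edges) - 2)
    idx = np.where(bin_of == b)[0]                   # candidates: only x's own bin
    d = np.linalg.norm(points[idx] - x, axis=1)      # distances to candidates
    return idx[np.argsort(d)[:n]]                    # n closest candidates

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 10.0, (200, 3))
dim, edges, bin_of = build_bins(pts, n_bins=5)

wrong = 0
for i in range(len(pts)):
    approx = nn_in_own_bin(pts, dim, edges, bin_of, pts[i], n=5)
    exact = np.argsort(np.linalg.norm(pts - pts[i], axis=1))[:5]
    wrong += set(approx) != set(exact)
print(wrong, "of", len(pts), "queries miss at least one true neighbour")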
Where k-NN can fail:
If the data is a jumble of many different classes, then k-NN will struggle, because the k nearest neighbours it finds are effectively random points.
Outlier points:
Say you have two clusters of different classes. If the query is an outlier point, k-NN will still assign it one of the classes, even though the query point is far away from both clusters.
This fails because (any of) the n nearest neighbors of x could lie in a different bin than x.
What do you mean by "not working"? You do understand that what you are doing is only an approximate method.
Try normalising the data and then choosing the dimension, else scatter makes no sense.
The best vector for discrimination or for clustering may not be one of the original dimensions, but some combination of dimensions.
Use PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis) to identify a discriminative direction.
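A quick sketch of the PCA suggestion, assuming scikit-learn; the data is synthetic and is normalised before projecting onto the first principal component, which then serves as the binning coordinate instead of a single raw axis.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),   # two correlated axes
               base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])                # one independent axis
X = (X - X.mean(axis=0)) / X.std(axis=0)                  # normalise first
coord = PCA(n_components=1).fit_transform(X).ravel()      # binning coordinate
print(coord[:5])   # varies mostly along the x1+x2 direction, not a single raw axis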

Sampson error for five point essential matrix estimation

I used the 5-point method from Nister to calculate the essential matrix, and further improved outlier rejection using RANSAC with a Sampson error threshold. I randomly choose sets of 5 points, estimate the essential matrix, and evaluate the Sampson error for the vector of matches. Point correspondences whose Sampson error is below a threshold t (set to 0.01 in the example I'm using) are marked as inliers. The process is repeated for all essential matrices and we retain the one with the highest inlier count.
I have noticed that most of the values of d, the vector of Sampson errors, are too big: for example, the size of d is (1x1437), and if I do
g=find(abs(d)>0.01);
length(g)
then length(g) = 1425, which means that only 7 values are inliers with this threshold, which is not correct!
How should I set the threshold? How do I interpret the Sampson error values?
Help me please. Thank you
The Sampson distance is a first-order approximation of the geometric distance. It can be understood as follows:
Given a fundamental matrix F and a pair of corresponding points (x, x') such that x'Fx = e, what is the distance/error of this pair of correspondences? The geometric distance is defined as the minimum, over all correspondences (y, y') satisfying y'Fy = 0, of ||x - y||^2 + ||x' - y'||^2 (in other words, the distance to the closest correspondence pair to (x, x') that satisfies the F matrix exactly). It can be shown that the Sampson error is a first-order approximation of this minimum distance.
Intuitively, the Sampson error can be roughly thought of as the squared distance between a point x and the corresponding epipolar line x'F. In this context, a threshold of 0.01 is way too small (you will rarely find a fundamental matrix such that all correspondences are within 0.1 pixel accuracy). A suggested threshold would be somewhere between 1 and 10 (1~3 pixel error), depending on the size/resolution/quality of your image pairs.
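For reference, the usual first-order (Sampson) error of a correspondence (x, x') is (x'Fx)^2 / ((Fx)_1^2 + (Fx)_2^2 + (F^T x')_1^2 + (F^T x')_2^2). Below is a small NumPy sketch of this formula; the fundamental matrix and the two matches are made-up numbers for illustration only.

import numpy as np

def sampson_error(F, x1, x2):
    # x1, x2: (N, 2) arrays of matched pixel coordinates
    N = x1.shape[0]
    X1 = np.hstack([x1, np.ones((N, 1))])      # homogeneous coordinates
    X2 = np.hstack([x2, np.ones((N, 1))])
    Fx1 = X1 @ F.T                             # rows are F @ x1_i
    Ftx2 = X2 @ F                              # rows are F.T @ x2_i
    num = np.sum(X2 * Fx1, axis=1) ** 2        # (x2_i' F x1_i)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den                           # roughly a squared pixel distance

F = np.array([[0.0, -1e-5, 1e-3],              # made-up fundamental matrix
              [1e-5, 0.0, -2e-3],
              [-1e-3, 2e-3, 0.0]])
pts1 = np.array([[100.0, 200.0], [320.0, 240.0]])
pts2 = np.array([[105.0, 198.0], [322.0, 245.0]])

d = sampson_error(F, pts1, pts2)
inliers = d < 3.0 ** 2                         # pixel-scale threshold, not 0.01
print(d, inliers)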
0.01 is too small a threshold. As in the previous answer, 1 to 10 is better.
With the Sampson error, x and x' are both treated as noisy, i.e. neither is assumed to lie exactly on its epipolar line, so the error is computed with respect to both points.
If instead you fix x and use F and x to compute the epipolar line in the second image (the line on which x' should lie), you can compute the point-to-line distance from x' to that epipolar line. This assumes that point x is correct, i.e. exactly on its own line.
These two measures are different.
0.01 is not too small if you are dealing with normalized cameras, that is, if you have multiplied your pixel coordinates by the inverse of the intrinsic parameter matrix.
