Classification of K-nearest neighbours algorithm

The x and y axes have different scales in this scatter plot.
Assume the centre of each shape to be the datapoint.
Q: What will be the classification of a test point for a 9-nearest-neighbour classifier using this training set, using both features?
Q: On the scatter plot at the top of the page, name the classes (in any order) of the three nearest neighbours of the bottom-left unknown point, using both features to compute distance.
Here's my attempt:
1: A higher K, 9 in this case, means more voters in each prediction and hence more resilience to outliers. Larger values of K give smoother decision boundaries for deciding between Pet and Wild here, which means lower variance but increased bias.
2: Using the Pythagorean theorem, the distances of the three nearest points to the bottom-left unknown point are:
Pet, Distance = 0.02
Pet, Distance = 2.20
Wild, Distance = 2.60
Therefore, the class is Pet.

Question 1 asks for a specific answer (Pet or Wild), which you have not provided. The statements you've made are generally true, but they don't actually answer the question. Notice that there are only 4 Pet points, and the rest are Wild. So no matter which 9 points are the nearest neighbors, at least 5 (a majority) will be Wild. Hence, a KNN classifier with K = 9 will always predict Wild using this data.
Question 2 looks mostly right. I don't have the exact coordinates of the points, but your numbers seem to be in the right ballpark, except you probably have a typo in the first distance. The classes are right, and the resulting prediction (which the question didn't explicitly ask for) is also right (assuming K = 3).
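A tiny sketch of the majority-vote mechanics (with made-up coordinates and labels standing in for the plot's data, since the exact points aren't given):

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: 4 "Pet" points and 8 "Wild" points.
train_X = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8],
                    [5.0, 5.0], [5.5, 4.8], [6.0, 5.2], [4.8, 5.5],
                    [5.2, 6.0], [6.2, 4.5], [4.5, 6.2], [5.8, 5.8]])
train_y = np.array(["Pet"] * 4 + ["Wild"] * 8)

# With k = 9, at most 4 of the 9 neighbours can be "Pet", so "Wild" always wins,
# even for a query right on top of the Pet cluster.
print(knn_predict(train_X, train_y, np.array([1.2, 1.2]), k=9))  # -> Wild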

Related

Find the diameter of a set of n points in d-dimensional space

I am interested in finding the diameter of two point sets in 128 dimensions. The first has 10,000 points and the second 1,000,000. For that reason I would like to do something better than the naive approach, which takes O(n²) time. The algorithm should be able to handle any number of points and dimensions, but I am currently most interested in these two particular data sets.
I am much more interested in speed than accuracy, so, based on this, I would find the (approximate) bounding box of the point set by computing the min and max value per coordinate, in O(n*d) time. Then, if I find the diameter of this box, the problem is solved.
In the 3d case, I could find the diagonal of one face, since I know its two edges, and then apply the Pythagorean theorem with the edge perpendicular to that face. I am not sure about this, however, and in any case I can't see how to generalize it to d dimensions.
An interesting answer can be found here, but it seems to be specific to 3 dimensions, and I want a method for d dimensions.
Interesting paper: On computing the diameter of a point set in high dimensional Euclidean space. Link. However, implementing the algorithm seems too much for me at this stage.
The classic 2-approximation algorithm for this problem, with running time O(nd), is to choose an arbitrary point and then return the maximum distance to another point. The diameter is no smaller than this value and no larger than twice this value.
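In code, that 2-approximation is a single pass over the data (a minimal NumPy sketch; `points` is an n x d array):

import numpy as np

def diameter_2approx(points):
    """Return r = max distance from an arbitrary point; true diameter lies in [r, 2r].

    By the triangle inequality, for any y, z: d(y,z) <= d(y,x) + d(x,z) <= 2r.
    """
    x = points[0]                                   # arbitrary reference point
    r = np.max(np.linalg.norm(points - x, axis=1))  # O(n*d)
    return r

rng = np.random.default_rng(0)
pts = rng.standard_normal((10_000, 128))
r = diameter_2approx(pts)
print(f"diameter is between {r:.3f} and {2 * r:.3f}")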
I would like to add a comment, but not enough reputation for that...
I just want to warn other readers that the "bounding box" solution is very inaccurate. Take for example the Euclidean ball of radius one. This set has diameter two, but its bounding box is [-1, 1]^d, which has diameter twice the square root of d. For d = 128, this is already a very bad approximation.
For a crude estimate, I would stay with David Eisenstat's answer.
There is a precision-based algorithm which performs very well in any dimension, based on computing the extents of an axis-aligned bounding box.
The idea is that it is possible to find lower and upper bounds for the bounding-box length as a function of axis direction, since its partial derivatives are bounded and depend on the angle between the axes.
The bound on the local-maximum derivatives between two axes in 2d space can be computed as:
sin(a/2)*(1 + tan(a/2))
That means, for example, that for 90 degrees between axes the bound is 1.42 (sqrt(2)).
This reduces to a/2 as a -> 0, so the upper bound is proportional to the angle.
For the multidimensional case the formula varies slightly, but it is still easy to compute.
So the search for local maxima converges in logarithmic time.
The good news is that we can run the search for such local maxima in parallel.
Also, we can prune the search regions based on the best result achieved so far, as well as the points themselves that fall below the lower limit of the search in the worst region.
The worst case of the algorithm is where all of the points are placed on the surface of a sphere.
This can be further improved: when we detect a local search which operates on just a few points, we switch to brute force for that particular axis. It works fast because we only need the points subject to that particular local search, which can be determined as the points actually bounded by two opposite spherical cones of a particular angle sharing the same axis.
It's hard to figure out the big-O complexity, because it depends on the desired precision and the distribution of points (bad when most of the points are on a sphere's surface).
The algorithm I use is as follows:
1. Set the initial angle a = pi/2.
2. Take one axis for each dimension. The angle and the axes form the initial 'bucket'.
3. For each axis, compute the span on that axis by projecting all the points onto the axis and finding the min and max of the coordinates on the axis.
4. Compute the upper and lower bounds of the diameter of interest, based on the formula sin(a/2)*(1 + tan(a/2)) multiplied by an asymmetry coefficient computed from the lengths of the current axis projections.
5. For the next step, kill all of the points which fall under the lower bound in every dimension at once.
6. For each axis, if the number of points above the upper bound is less than some reasonable amount (determined experimentally), compute the result by brute force (O(N^2)) on the set of points in question, adjust the lower bound, and kill the axis for the next step.
7. For the next step, kill all of the axes which have all of their points under the lower bound.
8. If the precision is satisfactory ((upper bound - lower bound) < epsilon), return the upper bound as the result.
9. For each surviving axis, there is a virtual cone on that axis (actually two opposite cones) which covers some area of a virtual sphere enclosing a face of the cube. If I'm not mistaken, its angle would be a * sqrt(2). Set the new angle to a / sqrt(2). Create a whole bucket of new axes (2 * number of dimensions) so that the new cone areas cover the initial cone area. This is the hard part for me, as I don't have enough imagination for the n > 3-dimensional case.
10. Continue from step (3).
You can parallelize the procedure, synchronizing the limits computed so far for the points in steps (5) through (7).
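As a rough illustration, a minimal sketch of steps (3)-(4) for the initial 2-D bucket (coordinate axes, a = pi/2, and with the asymmetry coefficient omitted; these simplifications are mine, not the answer's):

import numpy as np

def initial_bounds_2d(points):
    """Diameter bounds from per-axis spans of a 2-D point set."""
    spans = points.max(axis=0) - points.min(axis=0)   # step 3: span on each axis
    lower = spans.max()          # some pair of points realizes at least this span
    a = np.pi / 2                # step 1: initial angle between the two axes
    factor = np.sin(a / 2) * (1 + np.tan(a / 2))      # = sqrt(2) for a = pi/2
    return lower, lower * factor # step 4: lower and upper bounds on the diameter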
I'm going to summarize the algorithm proposed by Timothy Shields.
1. Pick a random point x.
2. Pick the point y furthest from x.
3. If not done, let x = y and go to step 2.
The more times you repeat, the more accurate the result will be... ??
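A minimal sketch of that loop, keeping the best distance seen as a lower bound on the diameter (the iteration count is an arbitrary choice of mine):

import numpy as np

def iterated_furthest_point(points, iters=10, seed=0):
    """Repeatedly hop to the furthest point; returns a lower bound on the diameter."""
    rng = np.random.default_rng(seed)
    x = points[rng.integers(len(points))]   # step 1: random start
    best = 0.0
    for _ in range(iters):
        dists = np.linalg.norm(points - x, axis=1)
        j = int(np.argmax(dists))           # step 2: furthest point from x
        best = max(best, dists[j])
        x = points[j]                       # step 3: let x = y and repeat
    return best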
EDIT: actually this algorithm is not very good. Think about a 2D quadrilateral ABCD that is nearly a rectangle. There are two local maxima, the diagonals AC and BD, separated by a sizable valley. This algorithm will get stuck at one or the other, 50/50. If AC is slightly longer than BD, you'll get the wrong answer 50% of the time no matter how many times you iterate. Other near-regular polygons have the same issue, and in higher dimensions it is even worse.

finding saddle points in 3d heightmap

Given a 3d heightmap (from a laser scanner), how do I find the saddle points?
I.e. given something like this:
I am looking for all points where the curvature is positive in one direction and negative in the other.
(These directions need not be aligned with the X and Y axes.)
I know how to check whether the curvature in the X direction has the opposite sign from the curvature in the Y direction, but that does not cover all cases. (To make matters worse, the resolution in X differs from the resolution in Y.)
Ideally I am looking for an algorithm that can tolerate some amount of noise and only mark "significant" saddle points.
I've been exploring a similar problem for a computational topology class and have had some success with the method outlined below.
First you will need a comparison function that evaluates the height at two input points and returns < or > (never equal) for any input. One way to do this: if the points have equal height, use some position-based or random index to pick the greater point. You can think of this as adding an infinitesimal perturbation to the height.
Now, for each point, you will compare the height at all the surrounding neighbors (there will be 8 neighbors on a 2D rectangular grid). The lower link of a point is the set of all neighbors whose height is less than the point's.
If all the neighboring values are in the lower link, you are at a local maximum. If none of the points are in the lower link you are at a local minimum. Otherwise, if the lower link is a single connected set, you are at a regular point on a slope. But if the lower link is two unconnected sets, you are at a saddle.
In 2D you can construct a list of the 8 neighboring points in cyclic order around the point you are checking. Assign a value of +/-1 to each neighbor depending on your comparison function. Then step through that list (remembering to wrap around and compare the two endpoints) and count how many times the sign changes to determine the number of connected components in the lower link.
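A minimal sketch of that test for one interior grid point (the tie-break compares (height, row, column) tuples, which plays the role of the comparison function above):

import numpy as np

# 8 neighbors in cyclic (counterclockwise) order around a point
CYCLE = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def classify(h, i, j):
    """Return 'max', 'min', 'regular', or 'saddle' for interior point (i, j)."""
    def lower(ni, nj):  # is the neighbor strictly lower? break ties by index
        return (h[ni, nj], ni, nj) < (h[i, j], i, j)

    signs = [lower(i + di, j + dj) for di, dj in CYCLE]
    # count sign changes around the cycle, including the wrap-around pair
    changes = sum(signs[k] != signs[k - 1] for k in range(len(signs)))
    if changes == 0:
        return "max" if signs[0] else "min"
    if changes == 2:
        return "regular"   # lower link is one connected arc
    return "saddle"        # 4+ changes: lower link has >= 2 components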
Determining which saddles are "important" is a more difficult analysis. You may wish to look at this: http://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Gyulassy08.pdf for some guidance.
(From a guess at the maths rather than practical experience)
Fit a quadratic to the surface in a small patch around each candidate point, e.g. with least squares. The patch size is one way of controlling noise, and you might gain by weighting points by their distance from the candidate point. In matrix notation, you can represent the quadratic as x'Ax + b'x + c, where A is symmetric.
The quadratic will have zero gradient at x = -(A^-1)b/2. If this is not within the patch, discard it.
If A has both positive and negative eigenvalues, you have a saddle point at x. Since A is only 2x2 and so has at most two eigenvalues, you can ignore the case where it has a zero eigenvalue, in which case you couldn't have inverted it at the previous stage.
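A minimal sketch of that fit-and-eigenvalue test for one candidate (NumPy least squares; `patch` is an m x 3 array of (x, y, z) samples with coordinates relative to the candidate point, and the caller should still check that the returned critical point lies inside the patch):

import numpy as np

def is_saddle(patch):
    """Fit z ~ x'Ax + b'x + c on the patch; saddle iff A has mixed-sign eigenvalues."""
    x, y, z = patch[:, 0], patch[:, 1], patch[:, 2]
    # design matrix columns: x^2, xy, y^2, x, y, 1
    M = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    (a, bxy, c2, bx, by, _), *_ = np.linalg.lstsq(M, z, rcond=None)
    A = np.array([[a, bxy / 2], [bxy / 2, c2]])   # symmetric quadratic part
    b = np.array([bx, by])
    # zero-gradient point x = -(A^-1)b/2; assumes A is invertible
    # (the zero-eigenvalue case is discarded, as noted above)
    crit = -0.5 * np.linalg.solve(A, b)
    evals = np.linalg.eigvalsh(A)                 # sorted ascending
    return bool(evals[0] < 0 < evals[1]), crit    # mixed signs => saddle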

How to select points at a regular density

How do I select a subset of points at a regular density? More formally:
Given
a set A of irregularly spaced points,
a metric of distance dist (e.g., Euclidean distance),
and a target density d,
how can I select a smallest subset B that satisfies the following?
for every point x in A,
there exists a point y in B
which satisfies dist(x,y) <= d
My current best shot is to
start with A itself
pick out the closest (or just a particularly close) pair of points
randomly exclude one of them
repeat as long as the condition holds
and repeat the whole procedure, keeping the best result. But are there better ways?
I'm trying to do this with 280,000 18-D points, but my question is about general strategy, so I also wish to know how to do it with 2-D points. And I don't really need a guarantee of a smallest subset. Any useful method is welcome. Thank you.
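For concreteness, here is a literal (and slow) sketch of that top-down procedure; it stops at the first removal that breaks the condition, which is one reading of "repeat as long as the condition holds", and rerunning it with different seeds while keeping the smallest B is the "keep the best result" part:

import numpy as np

def condition_holds(A, B_idx, d):
    """Every point of A is within distance d of some point A[i], i in B_idx."""
    D = np.linalg.norm(A[:, None, :] - A[B_idx][None, :, :], axis=2)
    return bool((D.min(axis=1) <= d).all())

def top_down(A, d, seed=0):
    """Drop one point of the closest selected pair while the condition still holds."""
    rng = np.random.default_rng(seed)
    B = list(range(len(A)))
    while True:
        pts = A[B]
        D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)
        i, j = np.unravel_index(np.argmin(D), D.shape)   # closest pair
        victim = B[int(rng.choice([i, j]))]              # randomly exclude one
        trial = [b for b in B if b != victim]
        if not trial or not condition_holds(A, trial, d):
            return B                                     # can't shrink further
        B = trial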
bottom-up method
select a random point
select the unselected point y for which min(d(x,y) for x in selected) is largest
keep going!
I'll call this one bottom-up and the one I originally posted top-down. This is much faster in the beginning, so for sparse sampling it should be better?
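A minimal sketch of this bottom-up selection (often called farthest-point sampling), maintaining each point's distance to the nearest selected point incrementally:

import numpy as np

def farthest_point_sample(A, d):
    """Greedily add the point farthest from the current selection until
    every point is within distance d of some selected point."""
    selected = [0]                               # first point stands in for a random one
    dist = np.linalg.norm(A - A[0], axis=1)      # distance to nearest selected point
    while dist.max() > d:
        j = int(np.argmax(dist))                 # farthest (worst-covered) point
        selected.append(j)
        dist = np.minimum(dist, np.linalg.norm(A - A[j], axis=1))
    return selected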
performance measure
If guarantee of optimality is not required, I think these two indicators could be useful:
radius of coverage: max {y in unselected} min(d(x,y) for x in selected)
radius of economy: min {y in selected} min(d(x,y) for x in selected, x != y)
RC is the minimum allowed d, and there is no absolute inequality between these two, but RC <= RE is more desirable.
my little methods
For a little demonstration of that "performance measure," I generated 256 2-D points distributed uniformly or by standard normal distribution. Then I tried my top-down and bottom-up methods with them. And this is what I got:
RC is red, RE is blue. The X axis is the number of selected points. Did you think bottom-up could be as good? I thought so while watching the animation, but it seems top-down is significantly better (look at the sparse region). Nevertheless, it's not too horrible given that it's much faster.
Here I packed everything.
http://www.filehosting.org/file/details/352267/density_sampling.tar.gz
You can model your problem with a graph: treat points as nodes and connect two nodes with an edge if their distance is smaller than d. Now you need to find a minimum set of vertices such that they, together with their adjacent vertices, cover all nodes of the graph. This is the minimum vertex cover problem (NP-hard in general), but you can use the fast 2-approximation: repeatedly take both endpoints of some edge into the vertex cover, then remove them from the graph.
P.S.: of course you should also select the nodes which are fully disconnected from the graph. After removing these nodes (i.e., selecting them), your problem is vertex cover.
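A minimal sketch of that recipe (O(n²) edge construction; the result is feasible because every non-isolated point is an endpoint of some edge, hence within d of the cover):

import numpy as np
from itertools import combinations

def vertex_cover_selection(A, d):
    """2-approximate vertex cover of the 'closer than d' graph, plus isolated nodes."""
    n = len(A)
    edges = {(i, j) for i, j in combinations(range(n), 2)
             if np.linalg.norm(A[i] - A[j]) < d}
    isolated = set(range(n)) - {v for e in edges for v in e}
    cover = set(isolated)            # isolated points must select themselves
    while edges:
        i, j = next(iter(edges))     # take both endpoints of an arbitrary edge
        cover |= {i, j}
        edges = {e for e in edges if i not in e and j not in e}
    return sorted(cover)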
A genetic algorithm may probably produce good results here.
update:
I have been playing a little with this problem and these are my findings:
A simple method (call it random-selection) to obtain a set of points fulfilling the stated condition is as follows:
1. Start with B empty.
2. Select a random point x from A and place it in B.
3. Remove from A every point y such that dist(x, y) < d.
4. While A is not empty, go to 2.
A kd-tree can be used to perform the lookups in step 3 relatively fast.
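A minimal sketch of this random-selection with scipy's cKDTree doing those lookups (walking a random permutation stands in for repeatedly drawing a random remaining point):

import numpy as np
from scipy.spatial import cKDTree

def random_selection(A, d, seed=0):
    """Pick random points, deleting everything within distance d of each pick."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(A)
    alive = np.ones(len(A), dtype=bool)
    B = []
    order = rng.permutation(len(A))      # step 2: random pick order
    for i in order:
        if alive[i]:
            B.append(i)
            # step 3: remove every point y with dist(A[i], y) <= d
            alive[tree.query_ball_point(A[i], d)] = False
    return A[B]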
The experiments I have run in 2D show that the subsets generated are approximately half the size of the ones generated by your top-down approach.
Then I have used this random-selection algorithm to seed a genetic algorithm that resulted in a further 25% reduction in the size of the subsets.
For mutation, given a chromosome representing a subset B, I randomly choose a hyperball inside the minimal axis-aligned hyperbox that covers all the points in A. Then I remove from B all the points that are also inside the hyperball and use random-selection to complete it again.
For crossover I employ a similar approach, using a random hyperball to divide the mother and father chromosomes.
I have implemented everything in Perl using my wrapper for the GAUL library (GAUL can be obtained from here).
The script is here: https://github.com/salva/p5-AI-GAUL/blob/master/examples/point_density.pl
It accepts a list of n-dimensional points from stdin and generates a collection of pictures showing the best solution for every iteration of the genetic algorithm. The companion script https://github.com/salva/p5-AI-GAUL/blob/master/examples/point_gen.pl can be used to generate the random points with a uniform distribution.
Here is a proposal which assumes the Manhattan distance metric:
Divide up the entire space into a grid of granularity d. Formally: partition A so that points (x1,...,xn) and (y1,...,yn) are in the same partition exactly when (floor(x1/d),...,floor(xn/d))=(floor(y1/d),...,floor(yn/d)).
Pick one point (arbitrarily) from each grid space -- that is, choose a representative from each set in the partition created in step 1. Don't worry if some grid spaces are empty! Simply don't choose a representative for this space.
Actually, the implementation won't have to do any real work to do step one, and step two can be done in one pass through the points, using a hash of the partition identifier (the (floor(x1/d),...,floor(xn/d))) to check whether we have already chosen a representative for a particular grid space, so this can be very, very fast.
Some other distance metrics may be able to use an adapted approach. For example, the Euclidean metric could use d/sqrt(n)-size grids. In this case, you might want to add a post-processing step that tries to reduce the cover a bit (since the grids described above are no longer exactly radius-d balls -- the balls overlap neighboring grids a bit), but I'm not sure how that part would look.
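A minimal one-pass sketch of that grid bucketing (the `euclidean` flag switches to the d/sqrt(n) cell size mentioned above; the function names here are mine):

import numpy as np

def grid_representatives(A, d, euclidean=True):
    """Keep one representative point per grid cell; one pass, hash-based."""
    n_dims = A.shape[1]
    cell = d / np.sqrt(n_dims) if euclidean else d   # cell size per the metric
    reps = {}
    for p in A:
        key = tuple(np.floor(p / cell).astype(int))  # partition identifier
        reps.setdefault(key, p)                      # first point seen wins
    return list(reps.values())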
To be lazy, this can be cast as a set cover problem, which can be handled by a mixed-integer solver/optimizer. Here is a GNU MathProg model for the GLPK LP/MIP solver. Here C[i] denotes the set of points that can "satisfy" point i.
param N, integer, > 0;
set C{1..N};
var x{i in 1..N}, binary;
s.t. cover{i in 1..N}: sum{j in C[i]} x[j] >= 1;
minimize goal: sum{i in 1..N} x[i];
With 1000 normally distributed points, it didn't prove an optimal subset within 4 minutes, but it reported the true minimum, and the subset it selected had only one point more than that.

Algorithm to find point of minimum total distance from locations

I'm building an application based around finding a "convenient meeting point" given a set of locations.
Currently I'm defining "convenient" as "minimising the total travel distance". This is a different problem from finding the centroid as illustrated by the following example (using cartesian coordinates rather than latitude and longitude for convenience):
A is at (0,0)
B is at (0,0)
C is at (0,12)
The location of minimum total travel for these points is at (0,0) with total travel distance of 12; the centroid is at (0,4) with total travel distance of 16 (4 + 4 + 8).
If the location were confined to being at one of the points, the problem appears to become simpler, but this isn't a constraint I intend to have (unlike, for example, this otherwise similar question).
What I can't seem to do is come up with any sort of algorithm to solve this - suggestions welcomed please!
Here is a solution that finds the geographical midpoint and then iteratively explores nearby positions to adjust that towards the minimum total distance point.
http://www.geomidpoint.com/calculation.html
This question is also quite similar to
Minimum Sum of All Travel Times
Here is a wikipedia article on the general problem you're trying to solve:
http://en.wikipedia.org/wiki/Geometric_median
In a way what you appear to be looking for is the center of mass of a triangle with equal weights at the vertices. That would point to barycentric coordinates.
When going beyond a triangle there are solutions for generalized barycentric coordinates, and you could give priorities to persons by modifying the weights of the vertices. What that still would not account for is distances on a real map (you can't just travel in a straight line in any direction), but it may be a start?
One option is to define an objective (and gradient) function and use a generic optimization library, such as scipy.optimize. fmin_cg would be a good algorithm to try for your problem. Your objective would be the sum of distances as defined in the "Definition" section of the Geometric median Wikipedia page referenced by hatchet. The argument to your objective function is y.
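A minimal sketch along these lines (using scipy.optimize.minimize with Nelder-Mead instead of fmin_cg, to avoid writing out the gradient, which is singular at the data points themselves):

import numpy as np
from scipy.optimize import minimize

def geometric_median(points):
    """Minimize the sum of Euclidean distances from y to all points."""
    def total_distance(y):
        return np.linalg.norm(points - y, axis=1).sum()

    y0 = points.mean(axis=0)                 # centroid as the starting guess
    res = minimize(total_distance, y0, method="Nelder-Mead")
    return res.x

# The example from the question: A=B=(0,0), C=(0,12).
pts = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 12.0]])
print(geometric_median(pts))   # close to (0, 0), total distance ~ 12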

Algorithm for finding symmetries of a tree

I have n sectors, enumerated 0 to n-1 counterclockwise. The boundaries between these sectors are infinite branches (n of them).
The sectors live in the complex plane, and for n even, sectors 0 and n/2 are bisected by the real axis, and the sectors are evenly spaced.
These branches meet at certain points, called junctions. Each junction is adjacent to a subset of the sectors (at least 3 of them).
Specifying the junctions (in prefix order, let's say, starting from the junction adjacent to sectors 0 and 1) and the distances between the junctions uniquely describes the tree.
Now, given such a representation, how can I see if it is symmetric wrt the real axis?
For example, with n=6, the tree (0,1,5)(1,2,4,5)(2,3,4) has three junctions on the real line, so it is symmetric wrt the real axis.
If the distance between (0,1,5) and (1,2,4,5) is equal to the distance from (1,2,4,5) to (2,3,4), it is also symmetric wrt the imaginary axis.
The tree (0,1,5)(1,2,5)(2,4,5)(2,3,4) has 4 junctions, and it is never symmetric wrt either the imaginary or the real axis, but it has 180-degree rotational symmetry if the distances between the first two and between the last two junctions in the representation are equal.
Edit:
Here are all trees with 6 branches, distances 1.
http://www2.math.su.se/~per/files/allTrees.pdf
So, given the description/representation, I want an algorithm to decide whether it is symmetric wrt the real axis, the imaginary axis, and 180-degree rotation. The last example has 180-degree symmetry.
Edit 2:
This is actually for my research. I have posted the question at MathOverflow as well, but my days in competition programming tell me that this is more like an IOI task.
Code in Mathematica would be excellent, but Java, Python, or any other language readable by a human suffices.
(These symmetries correspond to special kinds of potentials in the Schroedinger equation, which have nice properties in quantum mechanics.)
Could you please define better what you mean by symmetry of the tree?
You first say that
"The sectors live in the complex plane, and for n even, sectors 0 and n/2 are bisected by the real axis, and the sectors are evenly spaced."
and that you want to find symmetry wrt the real axis, the imaginary axis, and 180-degree rotation.
I would then expect that the symmetries would be purely geometrical, but then you also say, in the comment to Justin's answer
"There is also not a canonical way to draw a tree, and my drawing algorithm does not respect all possible symmetries that a tree can have."
How can you look for geometrical symmetry if the position of the vertices of the tree cannot be uniquely defined on the plane? Furthermore in many of the plots you have given (N=6, even) sectors 0 and 3 are not bisected by the x axis (real axis), so I would deem your own drawings wrong.
Since you already have an algorithm to construct the point set for the tree, you only need to determine whether the point set has flip symmetry. Ideally your set is computed symbolically (and left in terms of sin(theta), cos(theta) for non-rational points), which should be fine since you seem to be using Mathematica.
You now want to know if your point set has a symmetry about some axis, so represent the flip/rotation transformation as a matrix A, giving {x'} = A{x}. Sort the after-image set {x'} (using the expressions, not the numeric values) and compare it to the original point set {x}. If there is not a 1-1 correspondence then you don't have a symmetry; otherwise you do.
I think there is a Mathematica function to find the unique expressions in a set (e.g. Union[beforeImage] == Union[afterImage]).
I have not had time to implement this, perhaps someone here might take it further:
First partition the junctions by quadrant; this should produce 4 trees: {Tpp, Tmp, Tmm, Tpm} (p for plus, m for minus). Now checking for symmetry looks like a directional breadth-first traversal.
It's been a while since I used Mathematica, so none of this will compile as-is:
CheckRealFlip[T_] := And[TraverseCompare[Tpp[T], Tpm[T]],
  TraverseCompare[Tmp[T], Tmm[T]]];
CheckImFlip[T_] := And[TraverseCompare[Tpp[T], Tmp[T]],
  TraverseCompare[Tpm[T], Tmm[T]]];
where TraverseCompare checks the structure of the tree using a breadth-first traversal along one tree and a reverse-order breadth-first traversal along the other tree (something like the following, but this won't work as written):
TraverseCompare[A_, B_] := Size[A] == Size[B] &&
  And @@ MapThread[TraverseCompare, {Children[A], Reverse[Children[B]]}];
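If you do settle for numeric coordinates, the earlier sort-and-compare idea is a few lines of Python (a sketch of my own, assuming an n x 2 array of junction coordinates; note it compares junction positions only):

import numpy as np

REAL_FLIP = np.array([[1.0, 0.0], [0.0, -1.0]])   # reflection about the real axis
IM_FLIP   = np.array([[-1.0, 0.0], [0.0, 1.0]])   # reflection about the imaginary axis
ROT_180   = np.array([[-1.0, 0.0], [0.0, -1.0]])  # 180-degree rotation

def maps_onto_itself(points, A, decimals=9):
    """True iff transform A permutes the point set: sort both sets and compare."""
    canon = lambda P: P[np.lexsort(np.round(P, decimals).T)]  # canonical row order
    return np.array_equal(np.round(canon(points), decimals),
                          np.round(canon(points @ A.T), decimals))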
