From what I understood, the classical KNN algorithm works like this (for discrete data):
Let x be the point you want to classify
Let dist(a,b) be the Euclidean distance between points a and b
Iterate through the training set points pᵢ, taking the distances dist(pᵢ,x)
Classify x as the most frequent class among the K points closest (according to dist) to x.
How would I introduce weights into this classic KNN? I read that more importance should be given to nearer points, and I read this, but couldn't understand how it would apply to discrete data.
For me, first of all, using argmax doesn't make any sense, and if the weight acts by increasing the distance, then it would only make the distance worse. Sorry if I'm talking nonsense.
Consider a simple example with three classes (red, green, blue) and the six nearest neighbors denoted by R, G, B. I'll make this one-dimensional to simplify visualization and arithmetic:
R B G x G R R
The points listed with distance are
class dist
R 3
B 2
G 1
G 1
R 2
R 3
Thus, if we're using unweighted nearest neighbours, the simple "voting" algorithm is 3-2-1 in favor of Red. However, with the weighted influences, we have ...
red_total   = 1/3^2 + 1/2^2 + 1/3^2 = 1/4 + 2/9 ~= 0.47
blue_total  = 1/2^2                 = 0.25
green_total = 1/1^2 + 1/1^2         = 2.00
... and x winds up as Green due to proximity.
That lower-case delta (δ) function is merely the classification function; in this simple example, it returns red | green | blue. In a more complex example, ... well, I'll leave that to later tutorials.
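Here is a minimal sketch of that inverse-square vote in Python, using the class/distance table above (the classify_weighted name and the neighbor list format are just for illustration):

from collections import defaultdict

def classify_weighted(neighbors):
    # neighbors: list of (class_label, distance) for the k nearest points
    totals = defaultdict(float)
    for label, dist in neighbors:
        totals[label] += 1.0 / dist ** 2   # inverse-square distance weight
    return max(totals, key=totals.get)     # class with the largest weighted vote

# The six neighbors from the example above:
neighbors = [('R', 3), ('B', 2), ('G', 1), ('G', 1), ('R', 2), ('R', 3)]
print(classify_weighted(neighbors))        # -> 'G'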
Okay, off the bat let me say I am not a fan of the link you provided: it uses image equations, and the notation in the images differs from the notation in the text.
So, leaving that aside, let's look at the regular k-NN algorithm. Regular k-NN is actually just a special case of weighted k-NN: you assign a weight of 1 to the k nearest neighbors and 0 to the rest.
Let wqj denote the weight associated with a point j relative to a point q.
Let yj be the class label associated with the data point j. For simplicity, let us assume we are classifying birds as either crows, hens or turkeys => discrete classes. So for all j, yj ∈ {crow, hen, turkey}.
A good choice of weight is the inverse of the distance, whatever that distance is: Euclidean, Mahalanobis, etc.
Given all this, the class label yq you would associate with the point q you are trying to predict would be the sum of the wqj · yj terms divided by the sum of all weights. You do not have to do the division if you normalize the weights first.
You would end up with an expression of the form: somevalue1 · crow + somevalue2 · hen + somevalue3 · turkey
One of these classes will have a higher somevalue. The class with the highest value is what you will predict for point q.
For the purpose of training, you can factor in the error any way you want. Since the classes are discrete, there are a limited number of simple ways you can adjust the weights to improve accuracy.
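A rough sketch of that weighted vote in Python (the predict_class name is mine; dist can be any distance function you choose, and in practice you would guard against a zero distance):

def predict_class(q, training_data, k, dist):
    # training_data: list of (point, label) pairs; dist: chosen distance function
    nearest = sorted(training_data, key=lambda pl: dist(q, pl[0]))[:k]
    weights = [1.0 / dist(q, p) for p, _ in nearest]        # inverse-distance weights
    total = sum(weights)
    scores = {}
    for w, (_, label) in zip(weights, nearest):
        scores[label] = scores.get(label, 0.0) + w / total  # normalized weight per class
    return max(scores, key=scores.get)                      # e.g. crow, hen or turkey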
I'm now trying to use Kendall's distance to improve sets of rankings based on the Borda count method.
I'm asked to follow a specific document's instructions. In the document it states that:
"The Kendall's distance counts the pairwise disagreements between items from two rankings as :
where
The Kendall's distance is normalized by its maximum value C2n. The less the Kendall’s distance is, the greater the similarity degree between the rankings is.
The Kendall's tau is another method for measuring the similarity degree between rankings, which is easy to confuse with the Kendall's distance.
The Kendall's tau is defined as:
τ(R1, R2) = 1 - 2 * K(R1, R2) / C2n
The Kendall's tau is defined based on the normalized Kendall's distance. Note that the greater the Kendall's tau is, the greater the similarity degree between the compared rankings is. In this paper, we use the Kendall's distance rather than the Kendall's tau."
My goal is to improve the following ranking by using Kendall's distance :
x1 x2 x3 x4
A1 4 1 3 2
A2 4 1 3 2
A3 4 3 2 1
A4 1 4 3 2
A5 1 2 4 3
In this ranking, the ith row represents the ranking obtained based on Ai, and each column represents the ranking position of the corresponding item in each ranking. (i.e. xn represents the items to be ranked, Ai represents the ones who rank the items.)
I don't understand the difference between the two measures despite the document's explanation. Also, what does the "(j,s), j != s" beneath the sigma symbol stand for? And finally, how do I implement Kendall's distance for the ranking provided above?
Distance and similarity are two related concepts, but for a distance, exact identity means distance 0, and as things get more different, the distance between them gets greater, with no very obvious fixed limit. A well-behaved distance will obey the rules for a metric - see https://en.wikipedia.org/wiki/Metric_(mathematics). For a similarity, exact identity means similarity 1, and the similarity decreases as things get more different, but usually never below 0. Kendall's tau seems to be a way of turning Kendall's distance into a similarity.
"(j,s), j != s" means consider all possibilities for j and s except those for which j = s.
You can compute Kendall's distance by simply summing over all possibilities for j not equal to s - but the time taken for this goes up with the square of the number of items. There are ways for which the time taken only goes up as n * log(n) where n is the number of items - for this and much other stuff on Kendall see https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
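A brute-force sketch of that O(n^2) sum in Python, assuming each ranking is given as a list of positions per item, as in the table above (kendall_distance is my own name):

from itertools import combinations

def kendall_distance(r1, r2, normalize=True):
    # r1, r2: lists of ranking positions, r[i] = position of item i
    n = len(r1)
    disagreements = 0
    for j, s in combinations(range(n), 2):        # all pairs (j, s) with j != s
        # the pair disagrees if the two rankings order items j and s differently
        if (r1[j] - r1[s]) * (r2[j] - r2[s]) < 0:
            disagreements += 1
    if normalize:
        return disagreements / (n * (n - 1) / 2)  # divide by the maximum C2n
    return disagreements

# Rows A3 and A5 from the question:
print(kendall_distance([4, 3, 2, 1], [1, 2, 4, 3]))  # -> 0.833..., 5 of the 6 item pairs disagree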
I am working on a genetic algorithm. Here is how it works :
Input : a list of 2D points
Input : the degree of the curve
Output : the equation of the curve that fits the points best (minimize the sum of vertical distances from the points' y values to the curve)
The algorithm finds good equations for simple straight lines and for 2-degree equations.
But for 4 points and 3 degree equations and more, it gets more complicated. I cannot find the right combination of parameters : sometimes I have to wait 5 minutes and the curve found is still very bad. I tried modifying many parameters, from population size to number of parents selected...
Are there well-known parameter combinations or results in GA programming that could help me?
Thank you ! :)
Based on what is given, you would need polynomial interpolation, in which the degree of the equation is the number of points minus 1:
n = (Number of points) - 1
Now having said that, let's assume you have 5 points that need to be fitted and I am going to define them in a variable:
var points = [[0,0], [2,3], [4,-1], [5,7], [6,9]]
Note that the array of points has been ordered by the x values, which you need to do as well.
Then the equation would be:
f(x) = a1*x^4 + a2*x^3 + a3*x^2 + a4*x + a5
Now, based on the definition (https://en.wikipedia.org/wiki/Polynomial_interpolation#Constructing_the_interpolation_polynomial), the coefficients are obtained by solving the corresponding system of linear equations; use the referenced page to come up with the coefficients.
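For example, a quick way to get those coefficients (a sketch using numpy's Vandermonde matrix rather than the hand construction on the referenced page):

import numpy as np

points = [[0, 0], [2, 3], [4, -1], [5, 7], [6, 9]]    # the five points from above
xs = np.array([p[0] for p in points], dtype=float)
ys = np.array([p[1] for p in points], dtype=float)

# Vandermonde matrix: row i is [xi^4, xi^3, xi^2, xi, 1]
V = np.vander(xs, N=len(points))
coeffs = np.linalg.solve(V, ys)   # a1..a5 such that f(x) = a1*x^4 + a2*x^3 + ... + a5
print(coeffs)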
It is not that complicated; for a polynomial interpolation of degree n you get the following equation:
p(x) = c0 + c1 * x + c2 * x^2 + ... + cn * x^n = y
This means we need n + 1 genes for the coefficients c0 to cn.
The fitness function is the sum of all squared vertical distances from the points to the curve; below is the formula for the squared distance. This way a smaller value is obviously better; if you don't want that, you can take the inverse (1 / sum of squared distances):
d_squared(xi, yi) = (yi - p(xi))^2
I think for faster convergence you could limit the mutation, e.g. when mutating, choose with 20% probability a new value between min and max (e.g. -1000 and 1000) and with 80% probability a random factor between 0.8 and 1.2 by which you multiply the old value.
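A rough sketch of those two pieces in Python (the GA loop itself, population size and selection are left out; the function names, mutation rate and value range are placeholders you would tune):

import random

def fitness(coeffs, points):
    # sum of squared vertical distances; smaller is better
    def p(x):
        return sum(c * x ** i for i, c in enumerate(coeffs))   # c0 + c1*x + ... + cn*x^n
    return sum((y - p(x)) ** 2 for x, y in points)

def mutate(coeffs, rate=0.1, lo=-1000.0, hi=1000.0):
    # mostly small multiplicative tweaks, occasionally a full reset of a gene
    out = list(coeffs)
    for i in range(len(out)):
        if random.random() < rate:
            if random.random() < 0.2:
                out[i] = random.uniform(lo, hi)            # 20%: brand-new random value
            else:
                out[i] *= random.uniform(0.8, 1.2)         # 80%: nudge the old value
    return out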
I'm trying to implement the Closest Pair algorithm with Manhattan distance. With Euclidean distance, it's working fine, but with Manhattan distance, it gives the wrong result. CLRS Exercise 33.4-3 asks us to replace Euclidean distance with Manhattan distance. They simply ask us to change one line, but it isn't clear what modification is needed in the code below.
lst = [(2, 2), (4, 2), (5, 3)]
min_dist = float("inf")
for i in range(len(lst)):
    for j in range(i + 1, len(lst)):
        dist = abs(lst[i][0] - lst[j][0]) + abs(lst[i][1] - lst[j][1])
        if dist < min_dist:
            min_dist = dist
            global minp1, minp2
            minp1 = lst[i]
            minp2 = lst[j]
I guess that the outcomes of the program with the two distances are simply different.
Indeed, with the Euclidean distance, the closest pair is (4,2)-(5,3), while with the Manhattan distance, both (2,2)-(4,2) and (4,2)-(5,3) are closest pairs. Given your program, you only pick up the first one in order of appearance, and the outcome is (2,2)-(4,2). If your program returned all closest pairs, you would have seen (4,2)-(5,3).
But generally speaking, there is no reason for the outcomes of the two programs to be the same. For example, in your example, change (5,3) to (5,3.1). To get a concrete idea of how different the two distances are, it may be useful to plot the "unit circle" for both norms; you will see that the Manhattan circle is more square than round.
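To see this concretely, here is a brute-force sketch where the distance function is the only thing that changes between the two versions, and all tied pairs are returned (the helper names are mine):

from math import hypot

def closest_pairs(points, dist):
    # return the minimum distance and every pair attaining it
    best, pairs = float("inf"), []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = dist(points[i], points[j])
            if d < best:
                best, pairs = d, [(points[i], points[j])]
            elif d == best:
                pairs.append((points[i], points[j]))
    return best, pairs

euclidean = lambda a, b: hypot(a[0] - b[0], a[1] - b[1])
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])

pts = [(2, 2), (4, 2), (5, 3)]
print(closest_pairs(pts, euclidean))   # one closest pair: (4,2)-(5,3)
print(closest_pairs(pts, manhattan))   # two tied pairs: (2,2)-(4,2) and (4,2)-(5,3)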
What is the simplest way to test if a point P is inside a convex hull formed by a set of points X?
I'd like an algorithm that works in a high-dimensional space (say, up to 40 dimensions) that doesn't explicitly compute the convex hull itself. Any ideas?
The problem can be solved by finding a feasible point of a Linear Program. If you're interested in the full details, as opposed to just plugging an LP into an existing solver, I'd recommend reading Chapter 11.4 in Boyd and Vandenberghe's excellent book on convex optimization.
Set A = (X[1] X[2] ... X[n]), that is, the first column of A is X[1], the second X[2], etc.
Solve the following LP problem,
minimize (over x): 1
s.t. Ax = P
x^T * [1] = 1
x[i] >= 0 \forall i
where
x^T is the transpose of x
[1] is the all-1 vector.
The problem has a solution iff the point is in the convex hull.
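A sketch of this feasibility test with scipy.optimize.linprog (the in_hull name is mine; scipy reports infeasibility through the result's success flag):

import numpy as np
from scipy.optimize import linprog

def in_hull(P, X):
    # X: (n, d) array of points, P: (d,) point; True iff P is in the convex hull of X
    n = X.shape[0]
    c = np.zeros(n)                           # constant objective: we only need feasibility
    A_eq = np.vstack([X.T, np.ones((1, n))])  # columns of X.T are the points, plus the sum-to-1 row
    b_eq = np.concatenate([P, [1.0]])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success

X = np.random.rand(20, 3)
print(in_hull(X.mean(axis=0), X))                # the centroid is always inside -> True
print(in_hull(np.array([10.0, 10.0, 10.0]), X))  # far outside -> False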
The point lies outside of the convex hull of the other points if and only if the directions of all the vectors from it to those other points span less than one half of a circle/sphere/hypersphere around it.
Here is a sketch for the situation of two points, a blue one inside the convex hull (green) and the red one outside:
For the red one, there exist bisections of the circle, such that the vectors from the point to the points on the convex hull intersect only one half of the circle.
For the blue point, it is not possible to find such a bisection.
You don't have to compute the convex hull itself, as that seems quite troublesome in multidimensional spaces. There's a well-known property of convex hulls:
Any vector (point) v inside the convex hull of points [v1, v2, .., vn] can be represented as sum(ki*vi), where 0 <= ki <= 1 and sum(ki) = 1. Correspondingly, no point outside of the convex hull has such a representation.
In m-dimensional space, this gives a system of m linear equations with n unknowns (plus the constraints on the ki).
edit
I'm not sure about the complexity of this new problem in the general case, but for m = 2 it seems linear. Perhaps somebody with more experience in this area will correct me.
I had the same problem with 16 dimensions. Since even qhull didn't work properly because too many faces had to be generated, I developed my own approach by testing whether a separating hyperplane can be found between the new point and the reference data (I call this "HyperHull" ;) ).
The problem of finding a separating hyperplane can be transformed into a convex quadratic programming problem (see: SVM). I did this in Python using cvxopt with fewer than 170 lines of code (including I/O). The algorithm works without modification in any dimension, even though the higher the dimension, the larger the number of points on the hull (see: On the convex hull of random points in a polytope). Since the hull isn't explicitly constructed but only tested for membership, the algorithm has very big advantages in higher dimensions compared to, e.g., quickhull.
This algorithm can 'naturally' be parallelized, and the speed-up should be roughly equal to the number of processors.
Though the original post was three years ago, perhaps this answer will still be of help. The Gilbert-Johnson-Keerthi (GJK) algorithm finds the shortest distance between two convex polytopes, each of which is defined as the convex hull of a set of generators---notably, the convex hull itself does not have to be calculated. In a special case, which is the case being asked about, one of the polytopes is just a point. Why not try using the GJK algorithm to calculate the distance between P and the convex hull of the points X? If that distance is 0, then P is inside X (or at least on its boundary). A GJK implementation in Octave/Matlab, called ClosestPointInConvexPolytopeGJK.m, along with supporting code, is available at http://www.99main.com/~centore/MunsellAndKubelkaMunkToolbox/MunsellAndKubelkaMunkToolbox.html . A simple description of the GJK algorithm is available in Sect. 2 of a paper, at http://www.99main.com/~centore/ColourSciencePapers/GJKinConstrainedLeastSquares.pdf . I've used the GJK algorithm for some very small sets X in 31-dimensional space, and had good results. How the performance of GJK compares to the linear programming methods that others are recommending is uncertain (although any comparisons would be interesting). The GJK method does avoid computing the convex hull, or expressing the hull in terms of linear inequalities, both of which might be time-consuming. Hope this answer helps.
Are you willing to accept a heuristic answer that should usually work but is not guaranteed to? If you are then you could try this random idea.
Let f(x) be the cube of the distance to P times the number of things in X, minus the sum of the cubes of the distance to all of the points in X. Start somewhere random, and use a hill climbing algorithm to maximize f(x) for x in a sphere that is very far away from P. Excepting degenerate cases, if P is not in the convex hull this should have a very good probability of finding the normal to a hyperplane which P is on one side of, and everything in X is on the other side of.
A write-up to test if a point is in a hull space, using scipy.optimize.minimize.
Based on user1071136's answer.
It does go a lot faster if you compute the convex hull, so I added a couple of lines for people who want to do that. I switched from graham scan (2D only) to the scipy qhull algorithm.
scipy.optimize.minimize documentation:
https://docs.scipy.org/doc/scipy/reference/optimize.nonlin.html
import numpy as np
import scipy.optimize
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull

def hull_test(P, X, use_hull=True, verbose=True, hull_tolerance=1e-5, return_hull=True):
    if use_hull:
        hull = ConvexHull(X)
        X = X[hull.vertices]
    n_points = len(X)

    def F(x, X, P):
        return np.linalg.norm(np.dot(x.T, X) - P)

    bnds = [[0, None]] * n_points  # coefficients for each point must be >= 0
    cons = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})  # sum of coefficients must equal 1
    x0 = np.ones((n_points, 1)) / n_points  # starting coefficients

    result = scipy.optimize.minimize(F, x0, args=(X, P), bounds=bnds, constraints=cons)

    if result.fun < hull_tolerance:
        hull_result = True
    else:
        hull_result = False

    if verbose:
        print('# boundary points:', n_points)
        print('x.T * X - P:', F(result.x, X, P))
        if hull_result:
            print('Point P is in the hull space of X')
        else:
            print('Point P is NOT in the hull space of X')

    if return_hull:
        return hull_result, X
    else:
        return hull_result
Test on some sample data:
n_dim = 3
n_points = 20
np.random.seed(0)
P = np.random.random(size=(1,n_dim))
X = np.random.random(size=(n_points,n_dim))
_, X_hull = hull_test(P, X, use_hull=True, hull_tolerance=1e-5, return_hull=True)
Output:
# boundary points: 14
x.T * X - P: 2.13984259782e-06
Point P is in the hull space of X
Visualize it:
rows = max(1, n_dim - 1)
cols = rows
plt.figure(figsize=(rows*3, cols*3))
for row in range(rows):
    for col in range(row, cols):
        col += 1
        plt.subplot(cols, rows, row*rows + col)
        plt.scatter(P[:, row], P[:, col], label='P', s=300)
        plt.scatter(X[:, row], X[:, col], label='X', alpha=0.5)
        plt.scatter(X_hull[:, row], X_hull[:, col], label='X_hull')
        plt.xlabel('x{}'.format(row))
        plt.ylabel('x{}'.format(col))
plt.tight_layout()
Picture a canvas that has a bunch of points randomly dispersed around it. Now pick one of those points. How would you find the closest 3 points to it such that if you drew a triangle connecting those points it would cover the chosen point?
Clarification: By "closest", I mean minimum sum of distances to the point.
This is mostly out of curiosity. I thought it would be a good way to estimate the "value" of a point if it is unknown but the surrounding points are known. With 3 surrounding points you could interpolate the value. I haven't heard of a problem like this before, and it doesn't seem trivial, so I thought it might be a fun exercise, even if it's not the best way to estimate something.
Your problem description is ambiguous. Which triangle are you after in this figure, the red one or the blue one?
The blue triangle is closer based on lexicographic comparison of the distances of the points, while the red triangle is closer based on the sum of the distances of the points.
Edit: you clarified it to make it clear that you want the sum of distances to be minimized (the red triangle).
So, how about this sketch algorithm?
1. Assume that the chosen point is at the origin (this makes the description of the algorithm easy).
2. Sort the points by distance from the origin: P(1) is closest, P(n) is farthest.
3. Start with i = 3, s = ∞.
4. For each triple of points P(a), P(b), P(i) with a < b < i, if the triangle contains the origin, let s = min(s, |P(a)| + |P(b)| + |P(i)|).
5. If s ≤ |P(1)| + |P(2)| + |P(i)|, stop.
6. If i = n, stop.
7. Otherwise, increment i and go back to step 4.
Obviously this is O(n³) in the worst case.
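The containment test used in step 4 can be done with cross-product signs; here is a small sketch (origin_in_triangle is my own name):

def cross(o, a, b):
    # z-component of the cross product (a - o) x (b - o)
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def origin_in_triangle(p1, p2, p3):
    # True if the origin (the chosen point) lies inside or on triangle p1-p2-p3
    o = (0.0, 0.0)
    d1, d2, d3 = cross(p1, p2, o), cross(p2, p3, o), cross(p3, p1, o)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)   # all signs agree (or are zero) -> inside or on the boundary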
Here's a sketch of another algorithm. Consider all pairs of points (A, B). For a third point to make a triangle containing the origin, it must lie in the grey shaded region in this figure:
By representing the points in polar coordinates (r, θ) and sorting them according to θ, it is straightforward to examine all these points and pick the closest one to the origin.
This is also O(n³) in the worst case, but a sensible order of visiting pairs (A, B) should yield an early exit in many problem instances.
Just a warning on the iterative method: you may find a triangle with 3 "near points" whose "length" is greater than that of another triangle obtained by adding a more distant point to the set. Sorry, I can't post this as a comment.
See the graph.
The red triangle has a perimeter near 4 R, while the black one has 3·Sqrt[3] ≈ 5.2 R.
Like #thejh suggests, sort your points by distance from the chosen point.
Starting with the first 3 points, look for a triangle covering the chosen point.
If no triangle is found, expand your range to include the next closest point, and try all combinations.
Once a triangle is found, you don't necessarily have the final answer. However, you have now limited the final set of points to check. The furthest possible point to check would be at a distance equal to the sum of the distances of the first triangle found. Any further than this, and the sum of the distances is guaranteed to exceed the first triangle that was found.
Increase your range of points to include the last point whose distance <= the sum of the distances of the first triangle found.
Now check all combinations, and the answer is the triangle found from this set with the minimal sum of distances.
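A sketch of that search in Python, reusing a point-in-triangle test like the one shown earlier (the names are mine, and it assumes the chosen point has already been translated to the origin):

from itertools import combinations
from math import hypot

def smallest_covering_triangle(points, contains_origin):
    # points: candidate points; contains_origin(a, b, c): does triangle a-b-c cover the origin?
    pts = sorted(points, key=lambda p: hypot(p[0], p[1]))    # closest first
    dist = [hypot(p[0], p[1]) for p in pts]
    best, best_tri = float("inf"), None
    for i in range(2, len(pts)):
        if dist[i] > best:          # no triangle using this or any farther point can beat the best sum
            break
        for a, b in combinations(range(i), 2):   # triples whose farthest point is pts[i]
            s = dist[a] + dist[b] + dist[i]
            if s < best and contains_origin(pts[a], pts[b], pts[i]):
                best, best_tri = s, (pts[a], pts[b], pts[i])
    return best_tri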
second shot
Subsolution (analytic geometry basics, skip if you are familiar with this): finding points in the opposite half-plane
Example: Let's have two points: A=[a,b]=[2,3] and B=[c,d]=[4,1]. Find vector u = A-B = (2-4,3-1) = (-2,2). This vector is parallel to AB line, so is the vector (-1,1). The equation for this line is defined by vector u and point in AB (i.e. A):
X = 2 -1*t
Y = 3 +1*t
Where t is any real number. Get rid of t:
t = 2 - X
Y = 3 + t = 3 + (2 - X) = 5 - X
X + Y - 5 = 0
Any point that satisfies this equation is on the line.
Now let's have another point to define the half-plane, i.e. C=[1,1], we get:
X + Y - 5 = 1 + 1 - 5 < 0
Any point for which the sign of the expression is the opposite lies in the other half-plane, i.e. the points with:
X + Y - 5 > 0
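The same half-plane test in code, as a direct translation of the example above (side_of_line is my own name):

def side_of_line(a, b, p):
    # sign of the line equation through a and b, evaluated at p;
    # points with the same sign are in the same half-plane
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

A, B, C = (2, 3), (4, 1), (1, 1)
# for A and B this is equivalent to X + Y - 5 up to a constant factor of 2
print(side_of_line(A, B, C))        # -6: C is in one half-plane
print(side_of_line(A, B, (5, 5)))   # 10: (5, 5) is in the opposite half-plane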
Solution: finding the minimum triangle that covers the point S
1. Find the closest point P as min(sqrt((Xp - Xs)^2 + (Yp - Ys)^2)).
2. Find a vector perpendicular to SP: u = (-Yp + Ys, Xp - Xs).
3. Find the two closest points A, B from the half-plane opposite to sigma = pP, where p = Su (see subsolution), such that A and B are on different sides of the line q = SP (see the final part of the subsolution).
4. Now we have a triangle ABP that covers S: calculate the sum of distances |SP| + |SA| + |SB|.
5. Find the second closest point to S and continue from step 1 with it as P. If the sum of distances is smaller than that of previous iterations, remember it. Stop if |SP| is greater than the smallest sum of distances found so far or no more points are available.
I hope this diagram makes it clear.
This is my first shot:
split the space into quadrants, with the picked point at the [0,0] coords
find the closest point from each quadrant (so you have 4 points)
any triangle from these points should be small enough (but not necessarily the smallest)
Take the closest N=3 points. Check whether the triangle covers the chosen point. If not, increment N by one and try out all combinations. Do that until something fits or nothing does.