What's the Difference Between Kendall's Distance and Kendall tau Distance? - algorithm

I'm now trying to use Kendall's distance to improve sets of rankings based on Borda counts method.
I'm asked to follow a specific document's instructions. In the document it states that :
"The Kendall's distance counts the pairwise disagreements between items from two rankings as :
where
The Kendall's distance is normalized by its maximum value C2n. The less the Kendall’s distance is, the greater the similarity degree between the rankings is.
The Kendall's tau is another method for measuring the similarity degree between rankings, which is easy to be confused with the Kendall's distance.
The Kendall's tau is defined as:
The Kendall's tau is defined based on the normalized Kendall's distance. Note that the greater the Kendall's tau is, the greater the similarity degree between the compared rankings is. In this paper, we use the Kendall's distance rather than the Kendall's tau."
My goal is to improve the following ranking by using Kendall's distance :
x1 x2 x3 x4
A1 4 1 3 2
A2 4 1 3 2
A3 4 3 2 1
A4 1 4 3 2
A5 1 2 4 3
In this ranking, the ith row represents the ranking obtained based on Ai, and each column represents the ranking position of the corresponding item in each ranking. (i.e. xn represents the items to be ranked, Ai represents the ones who rank the items.)
I don't understand the difference between the two distances despite the explanation in the doc. And what does the "(j,s), j != s" beneath the sigma symbol stand for? And finally, how do I implement Kendall's distance for the ranking provided above?

Distance and similarity are two related concepts, but for a distance, exact identity means distance 0, and as things get more different, the distance between them gets greater, with no very obvious fixed limit. A well-behaved distance will obey the rules for a metric - see https://en.wikipedia.org/wiki/Metric_(mathematics). For a similarity, exact identity means similarity 1, and similarity decreases as things get more different, but usually never decreases below 0. Kendall's tau seems to be a way of turning Kendall's distance into a similarity.
"(j,s), j != s" means consider all possibilities for j and s except those for which j = s.
You can compute Kendall's distance by simply summing over all possibilities for j not equal to s - but the time taken for this goes up with the square of the number of items. There are ways for which the time taken only goes up as n * log(n) where n is the number of items - for this and much other stuff on Kendall see https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
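To make the brute-force computation concrete, here is a minimal Python sketch for the rankings in the question. The function names are mine, and I iterate over unordered pairs, which matches the normalization by C(n,2) = n(n-1)/2:

from itertools import combinations

def kendall_distance(r1, r2):
    # r1[j] and r2[j] are the rank positions of item x_(j+1) in the two rankings
    disagreements = 0
    for j, s in combinations(range(len(r1)), 2):
        # the pair (j, s) disagrees if one ranking puts x_j before x_s
        # and the other puts x_s before x_j
        if (r1[j] - r1[s]) * (r2[j] - r2[s]) < 0:
            disagreements += 1
    return disagreements

def normalized_kendall_distance(r1, r2):
    n = len(r1)
    return kendall_distance(r1, r2) / (n * (n - 1) / 2)   # divide by C(n,2)

# the rankings from the question: row Ai gives the rank positions of x1..x4
rankings = {
    "A1": [4, 1, 3, 2],
    "A2": [4, 1, 3, 2],
    "A3": [4, 3, 2, 1],
    "A4": [1, 4, 3, 2],
    "A5": [1, 2, 4, 3],
}

print(normalized_kendall_distance(rankings["A1"], rankings["A2"]))  # 0.0 - identical rankings
print(normalized_kendall_distance(rankings["A1"], rankings["A3"]))  # 2 disagreements / 6 pairs ~= 0.33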

Related

Correct implementation of weighted K-Nearest Neighbors

From what I understood, the classical KNN algorithm works like this (for discrete data):
Let x be the point you want to classify
Let dist(a,b) be the Euclidean distance between points a and b
Iterate through the training set points pᵢ, taking the distances dist(pᵢ,x)
Classify x as the most frequent class between the K points closest (according to dist) to x.
How would I introduce weights on this classic KNN? I read that more importance should be given to nearer points, and I read this, but couldn't understand how this would apply to discrete data.
For me, first of all, using argmax doesn't make any sense, and if the weight acts by increasing the distance, then it would make the distance worse. Sorry if I'm talking nonsense.
Consider a simple example with three classifications (red, green, blue) and the six nearest neighbors denoted by R, G, B. I'll make this linear to simplify visualization and arithmetic:
R B G x G R R
The points listed with distance are
class dist
R 3
B 2
G 1
G 1
R 2
R 3
Thus, if we're using unweighted nearest neighbours, the simple "voting" algorithm is 3-2-1 in favor of Red. However, with the weighted influences, we have ...
red_total = 1/3^2 + 1/2^2 + 1/3^2 = 1/4 + 2/9 ~= .47
blue_total = 1/2^2 ..............................= .25
green_total = 1/1^2 + 1/1^2 ......................= 2.00
... and x winds up as Green due to proximity.
That lower-delta function is merely the classification function; in this simple example, it returns red | green | blue. In a more complex example, ... well, I'll leave that to later tutorials.
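As a quick check of the arithmetic above, here is a small Python sketch of the inverse-square-distance vote on the same (class, distance) pairs:

from collections import defaultdict

# the (class, distance) pairs of the six neighbours listed above
neighbours = [("R", 3), ("B", 2), ("G", 1), ("G", 1), ("R", 2), ("R", 3)]

votes = defaultdict(float)
for label, dist in neighbours:
    votes[label] += 1.0 / dist ** 2   # inverse-square-distance weighting

print(dict(votes))                 # {'R': ~0.47, 'B': 0.25, 'G': 2.0}
print(max(votes, key=votes.get))   # 'G' - Green wins on proximity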
Okay, off the bat let me say I am not a fan of the link you provided; it has image equations and follows different notation in the images and the text.
So leaving that aside, let's look at the regular k-NN algorithm. Regular k-NN is actually just a special case of weighted k-NN: you assign a weight of 1 to the k nearest neighbors and 0 to the rest.
Let Wqj denote the weight associated with a point j relative to a point q.
Let yj be the class label associated with the data point j. For simplicity, let us assume we are classifying birds as either crows, hens or turkeys => discrete classes. So for all j, yj ∈ {crow, turkey, hen}.
A good weight is the inverse of the distance, whatever the distance metric may be: Euclidean, Mahalanobis, etc.
Given all this, the class label yq you would associate with the point q you are trying to predict would be the sum of the wqj . yj terms divided by the sum of all weights. You do not have to do the division if you normalize the weights first.
You would end up with an expression as follows: somevalue1 . crow + somevalue2 . hen + somevalue3 . turkey
One of these classes will have a higher somevalue. The class with the highest value is what you will predict for point q.
For the purpose of training you can factor in the error any way you want. Since the classes are discrete, there are a limited number of simple ways you can adjust the weights to improve accuracy.
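Here is a minimal sketch of that scheme in Python, assuming the training data are simply (features, label) pairs already in memory; the function names and the toy bird data are mine, not from any particular library:

import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def weighted_knn(training, q, k):
    # predict a discrete label for q using inverse-distance-weighted voting
    nearest = sorted(training, key=lambda p: euclidean(p[0], q))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        d = euclidean(features, q)
        votes[label] += 1.0 / (d + 1e-9)   # small epsilon avoids division by zero
    # normalising by the sum of weights would not change the argmax
    return max(votes, key=votes.get)

# hypothetical bird data: (wingspan, weight) -> class
birds = [((1.0, 0.5), "crow"), ((0.8, 2.0), "hen"), ((1.3, 5.0), "turkey"),
         ((1.1, 0.6), "crow"), ((0.7, 1.8), "hen")]
print(weighted_knn(birds, (0.9, 1.9), k=3))  # likely 'hen'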

Genetic Algorithm : Find curve that fits points

I am working on a genetic algorithm. Here is how it works :
Input : a list of 2D points
Input : the degree of the curve
Output : the equation of the curve that fits the points best (trying to minimize the sum of vertical distances from the points' Y values to the curve)
The algorithm finds good equations for simple straight lines and for 2-degree equations.
But for 4 points and 3 degree equations and more, it gets more complicated. I cannot find the right combination of parameters : sometimes I have to wait 5 minutes and the curve found is still very bad. I tried modifying many parameters, from population size to number of parents selected...
Are there well-known parameter combinations or results in GA programming that can help me?
Thank you ! :)
Based on what is given, you would need polynomial interpolation, in which the degree of the equation is the number of points minus 1.
n = (Number of points) - 1
Now having said that, let's assume you have 5 points that need to be fitted and I am going to define them in a variable:
var points = [[0,0], [2,3], [4,-1], [5,7], [6,9]]
Please note that the array of points has been ordered by the x values, which you need to do.
Then the equation would be:
f(x) = a1*x^4 + a2*x^3 + a3*x^2 + a4*x + a5
Now, based on the definition (https://en.wikipedia.org/wiki/Polynomial_interpolation#Constructing_the_interpolation_polynomial), the coefficients are obtained by solving the linear system built from the Vandermonde matrix of the x values.
You need to use the referenced page to come up with the coefficients.
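For illustration, a short Python/numpy sketch of that computation on the points above (assuming numpy is available); it builds the Vandermonde system and solves it for a1..a5:

import numpy as np

points = [[0, 0], [2, 3], [4, -1], [5, 7], [6, 9]]  # the 5 points from above, ordered by x
xs = np.array([p[0] for p in points], dtype=float)
ys = np.array([p[1] for p in points], dtype=float)

# Vandermonde matrix: row i is [xi^4, xi^3, xi^2, xi, 1]
V = np.vander(xs, N=len(points))
coeffs = np.linalg.solve(V, ys)   # a1..a5 in f(x) = a1*x^4 + ... + a5

print(coeffs)
print(np.polyval(coeffs, xs))     # reproduces ys up to floating-point error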
It is not that complicated; for a polynomial of degree n you get the following equation:
p(x) = c0 + c1 * x + c2 * x^2 + ... + cn * x^n = y
This means we need n + 1 genes for the coefficients c0 to cn.
The fitness function is the sum of all squared distances from the points to the curve; below is the formula for the squared distance. This way a smaller value is obviously better; if you don't want that, you can take the inverse (1 / sum of squared distances):
d_squared(xi, yi) = (yi - p(xi))^2
I think for faster convergence you could limit the mutation, e.g. when mutating, choose a new value between min and max (e.g. -1000 and 1000) with 20% probability, and with 80% probability multiply the old value by a random factor between 0.8 and 1.2.
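A minimal sketch of that fitness function and mutation rule in Python; the bounds and probabilities are just the ones suggested above, and everything else (selection, crossover) is left out:

import random

def p(coeffs, x):
    # evaluate p(x) = c0 + c1*x + ... + cn*x^n for a chromosome of coefficients
    return sum(c * x ** i for i, c in enumerate(coeffs))

def fitness(coeffs, points):
    # sum of squared vertical distances; smaller is better
    return sum((y - p(coeffs, x)) ** 2 for x, y in points)

def mutate(coeffs, lo=-1000.0, hi=1000.0):
    # mutate one random gene: 20% fresh random value, 80% small multiplicative tweak
    child = list(coeffs)
    i = random.randrange(len(child))
    if random.random() < 0.2:
        child[i] = random.uniform(lo, hi)
    else:
        child[i] *= random.uniform(0.8, 1.2)
    return child

points = [(0, 0), (2, 3), (4, -1), (5, 7)]
chromosome = [random.uniform(-10, 10) for _ in range(4)]   # degree-3 curve: 4 genes c0..c3
print(fitness(chromosome, points), fitness(mutate(chromosome), points))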

Finding Closest Pair using Manhattan Distance

I'm trying to implement the Closest Pair algorithm with Manhattan distance. With Euclidean distance, it's working fine, but with Manhattan distance, it gives the wrong result. CLRS Exercise 33.4-3 asks us to replace Euclidean distance with Manhattan distance. They simply ask us to change one line, but it isn't clear what modification is needed in the code below.
lst = [(2, 2), (4, 2), (5, 3)]
min_dist = float("inf")
minp1 = minp2 = None
for i in range(len(lst)):
    for j in range(i + 1, len(lst)):
        # Manhattan (L1) distance between the two points
        dist = abs(lst[i][0] - lst[j][0]) + abs(lst[i][1] - lst[j][1])
        if dist < min_dist:
            min_dist = dist
            minp1 = lst[i]
            minp2 = lst[j]
I guess the point is that the outcomes of the program with the two distances are different.
Indeed, with the Euclidean distance the closest pair is (4,2)-(5,3), while with the Manhattan distance both (2,2)-(4,2) and (4,2)-(5,3) are closest pairs. Given your program, you only pick the first one in order of appearance, so the outcome is (2,2)-(4,2). If your program returned all closest pairs, you would have seen (4,2)-(5,3) as well.
But generally speaking, there is no reason for the outcomes of the two programs to be the same. For example, in your example, change (5,3) to (5,3.1). To get a concrete idea of how different the two distances are, it may be useful to plot the "unit circle" for both norms; you will see that the Manhattan circle is more square than round.
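For completeness, here is the question's loop modified to keep all pairs that tie for the minimum Manhattan distance:

lst = [(2, 2), (4, 2), (5, 3)]
min_dist = float("inf")
closest_pairs = []
for i in range(len(lst)):
    for j in range(i + 1, len(lst)):
        dist = abs(lst[i][0] - lst[j][0]) + abs(lst[i][1] - lst[j][1])
        if dist < min_dist:
            min_dist = dist
            closest_pairs = [(lst[i], lst[j])]   # new minimum: reset the list
        elif dist == min_dist:
            closest_pairs.append((lst[i], lst[j]))

print(min_dist, closest_pairs)   # 2 [((2, 2), (4, 2)), ((4, 2), (5, 3))]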

Calculate a Confidence measure for Image similarity

I am using Euclidean distance between Histograms of 2 images for calculating image similarity.
The Histogram is of 15 bins and is normalized with respect to the image size (Thus, sum of all bins = 1).
Now, for the user, the distance value is not of any use and I want to convert it to a more tangible value - such as a % Confidence measure.
So, if the distance is 0, the confidence is 100%, and if the distance is at its maximum, i.e. 1 (is this correct?), then the confidence is 0%.
However, the scaling is not linear because of the properties of the histogram and the distance metric i.e. distance = 0.5 doesn't equal a confidence measure of 50%.
Can someone suggest a scaling function to convert the distance into a confidence measure?
You could give more weight to the results where the distance is closer to 0 by using a nonlinear falloff. Something to the effect of the following might work, where d is the distance:
((2 - d) ^ 2 - 1) / 3
A distance of 0 would result in a confidence score of 1 (i.e. 100%), and a distance of 1 would result in a confidence of 0. You'd also get a confidence of ~0.417 at d = .5. You can weight the lower distances higher by increasing the exponent and the divisor. An exponent of 3 instead of 2 would mean you'd want to divide the whole thing by 7 instead of 3, and would pull a distance of .5 down to ~0.339.
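A tiny Python sketch of that mapping; I've generalised the exponent/divisor pairing to divisor = 2^k - 1, which reproduces the k = 2 and k = 3 cases above (that generalisation is my own extrapolation):

def confidence(d, k=2):
    # maps a distance in [0, 1] to a confidence in [0, 1]; larger k favours small distances
    return ((2 - d) ** k - 1) / (2 ** k - 1)

for d in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(d, round(confidence(d, 2), 3), round(confidence(d, 3), 3))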

Algorithm to find the closest 3 points that when triangulated cover another point

Picture a canvas that has a bunch of points randomly dispersed around it. Now pick one of those points. How would you find the closest 3 points to it such that if you drew a triangle connecting those points it would cover the chosen point?
Clarification: By "closest", I mean minimum sum of distances to the point.
This is mostly out of curiosity. I thought it would be a good way to estimate the "value" of a point if it is unknown, but the surrounding points are known. With 3 surrounding points you could extrapolate the value. I haven't heard of a problem like this before, doesn't seem very trivial so I thought it might be a fun exercise, even if it's not the best way to estimate something.
Your problem description is ambiguous. Which triangle are you after in this figure, the red one or the blue one?
The blue triangle is closer based on lexicographic comparison of the distances of the points, while the red triangle is closer based on the sum of the distances of the points.
Edit: you clarified it to make it clear that you want the sum of distances to be minimized (the red triangle).
So, how about this sketch algorithm?
1. Assume that the chosen point is at the origin (makes the description of the algorithm easy).
2. Sort the points by distance from the origin: P(1) is closest, P(n) is farthest.
3. Start with i = 3, s = ∞.
4. For each triple of points P(a), P(b), P(i) with a < b < i, if the triangle contains the origin, let s = min(s, |P(a)| + |P(b)| + |P(i)|).
5. If s ≤ |P(1)| + |P(2)| + |P(i)|, stop.
6. If i = n, stop.
7. Otherwise, increment i and go back to step 4.
Obviously this is O(n³) in the worst case.
Here's a sketch of another algorithm. Consider all pairs of points (A, B). For a third point to make a triangle containing the origin, it must lie in the grey shaded region in this figure:
By representing the points in polar coordinates (r, θ) and sorting them according to θ, it is straightforward to examine all these points and pick the closest one to the origin.
This is also O(n³) in the worst case, but a sensible order of visiting pairs (A, B) should yield an early exit in many problem instances.
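For reference, here is a brute-force Python sketch of the O(n³) baseline these sketches improve on; it tests containment with cross-product signs and minimises the sum of distances to the chosen point (the function names and sample points are mine):

import math
from itertools import combinations

def cross(o, a, b):
    # z-component of (a - o) x (b - o); its sign tells which side of line o->a point b is on
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def triangle_contains(a, b, c, p):
    # True if p lies inside (or on the boundary of) triangle abc
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)

def closest_covering_triangle(points, chosen):
    best, best_sum = None, float("inf")
    for a, b, c in combinations(points, 3):
        if triangle_contains(a, b, c, chosen):
            s = sum(math.dist(chosen, v) for v in (a, b, c))
            if s < best_sum:
                best, best_sum = (a, b, c), s
    return best, best_sum

pts = [(1, 1), (-2, 0.5), (0.5, -3), (4, 4), (-1, -1)]
print(closest_covering_triangle(pts, (0, 0)))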
Just a warning on the iterative method. You may find a triangle with 3 "near points" whose "length" is greater than that of another triangle obtained by adding a more distant point to the set. Sorry, can't post this as a comment.
See the graph.
The red triangle has perimeter near 4 R, while the black one has 3 Sqrt[3] R ≈ 5.2 R.
Like #thejh suggests, sort your points by distance from the chosen point.
Starting with the first 3 points, look for a triangle covering the chosen point.
If no triangle is found, expand your range to include the next closest point, and try all combinations.
Once a triangle is found, you don't necessarily have the final answer. However, you have now limited the final set of points to check. The furthest possible point to check would be at a distance equal to the sum of the distances of the first triangle found. Any further than this, and the sum of the distances is guaranteed to exceed the first triangle that was found.
Increase your range of points to include the last point whose distance <= the sum of the distances of the first triangle found.
Now check all combinations, and the answer is the triangle found from this set with the minimal sum of distances.
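Here is a rough Python sketch of that expand-then-bound procedure (the names are mine; the containment test is the usual cross-product sign test, working relative to the chosen point):

import math
from itertools import combinations

def contains_origin(a, b, c):
    # cross(p, q) gives the side of the origin relative to the directed edge p -> q;
    # the origin is inside (or on the boundary of) the triangle if the three edge
    # tests do not produce both a positive and a negative sign
    def cross(p, q):
        return p[0] * q[1] - p[1] * q[0]
    d1, d2, d3 = cross(a, b), cross(b, c), cross(c, a)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)

def best_covering_triangle(points, chosen):
    # work relative to the chosen point, sorted by distance from it
    rel = sorted(((x - chosen[0], y - chosen[1]) for x, y in points),
                 key=lambda p: math.hypot(*p))
    bound = None
    # grow the candidate prefix until some triangle covers the chosen point
    for k in range(3, len(rel) + 1):
        covering = [t for t in combinations(rel[:k], 3) if contains_origin(*t)]
        if covering:
            bound = min(sum(math.hypot(*p) for p in t) for t in covering)
            break
    if bound is None:
        return None   # no triangle covers the chosen point at all
    # only points at distance <= bound can belong to a better triangle
    candidates = [p for p in rel if math.hypot(*p) <= bound]
    best = min((t for t in combinations(candidates, 3) if contains_origin(*t)),
               key=lambda t: sum(math.hypot(*p) for p in t))
    return [(x + chosen[0], y + chosen[1]) for x, y in best]

pts = [(1, 1), (-2, 0.5), (0.5, -3), (4, 4), (-1, -1)]
print(best_covering_triangle(pts, (0, 0)))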
second shot
Subsolution (analytic geometry basics, skip if you are familiar with this): finding points of the opposite half-plane
Example: Let's have two points: A=[a,b]=[2,3] and B=[c,d]=[4,1]. Find vector u = A-B = (2-4,3-1) = (-2,2). This vector is parallel to AB line, so is the vector (-1,1). The equation for this line is defined by vector u and point in AB (i.e. A):
X = 2 -1*t
Y = 3 +1*t
Where t is any real number. Get rid of t:
t = 2 - X
Y = 3 + t = 3 + (2 - X) = 5 - X
X + Y - 5 = 0
Any point that satisfies this equation is on the line.
Now let's have another point to define the half-plane, i.e. C=[1,1], we get:
X + Y - 5 = 1 + 1 - 5 < 0
Any point that gives the opposite sign is in the other half-plane, i.e. the points with:
X + Y - 5 > 0
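A tiny Python sketch of that half-plane test, using the points from the example (the helper names are mine):

def line_through(a, b):
    # returns (p, q, r) such that the line through a and b is p*x + q*y + r = 0
    (ax, ay), (bx, by) = a, b
    return (ay - by, bx - ax, ax * by - ay * bx)

def side(line, point):
    # sign of p*x + q*y + r: zero on the line, opposite signs mean opposite half-planes
    p, q, r = line
    x, y = point
    return p * x + q * y + r

A, B, C = (2, 3), (4, 1), (1, 1)
L = line_through(A, B)   # 2x + 2y - 10 = 0, i.e. x + y - 5 = 0 up to a scale factor
print(side(L, C))        # -6 < 0: C is in the half-plane where x + y - 5 < 0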
Solution: finding the minimum triangle that covers the point S
1. Find the closest point P as min(sqrt( (Xp - Xs)^2 + (Yp - Ys)^2 )).
2. Find a vector perpendicular to SP: u = (-Yp+Ys, Xp-Xs).
3. Find the two closest points A, B from the half-plane opposite to sigma = pP, where p = Su (see subsolution), such that A and B are on different sides of the line q = SP (see the final part of the subsolution).
4. Now we have a triangle ABP that covers S: calculate the sum of distances |SP| + |SA| + |SB|.
5. Find the second closest point to S and continue from step 1. If the sum of distances is smaller than in the previous steps, remember it. Stop if |SP| is greater than the smallest sum of distances found so far or no more points are available.
I hope this diagram makes it clear.
This is my first shot:
split the space into quadrants with the picked point at the [0,0] coords
find the closest point from each quadrant (so you have 4 points)
any triangle from these points should be small enough (but not necessarily the smallest)
Take the closest N=3 points. Check whether the triangle fits. If not, increment N by one and try out all combinations. Do that until something fits or you run out of points.
