Calculate a Confidence measure for Image similarity

I am using Euclidean distance between Histograms of 2 images for calculating image similarity.
The Histogram is of 15 bins and is normalized with respect to the image size (Thus, sum of all bins = 1).
Now, for the user, the distance value is not of any use and I want to convert it to a more tangible value - such as a % Confidence measure.
So, if the distance is 0, the confidence is 100%, and if the distance is at its maximum, i.e. 1 (is this correct?), then the confidence is 0%.
However, the scaling is not linear because of the properties of the histogram and the distance metric i.e. distance = 0.5 doesn't equal a confidence measure of 50%.
Can someone suggest a scaling function to convert distance to a confidence measure?

You could give more weight to the results where the distance is closer to 0 with an inverse exponential. Something to the effect of the following might work, where d is the distance:
((2 - d) ^ 2 - 1) / 3
A distance of 0 would result in a confidence score of 1 (i.e. 100%), and a distance of 1 would result in a confidence of 0. A distance of 0.5 would give a confidence of ~0.417. You can weight the lower distances higher by increasing the exponent and the divisor: an exponent of 3 instead of 2 would mean dividing the whole thing by 7 instead of 3, and would pull a distance of 0.5 down to ~0.339.
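In code, a minimal sketch of this scaling (generalizing the divisor to 2^exponent - 1, which reproduces the /3 and /7 cases above; the function name is mine):

import math

def confidence(d, exponent=2):
    # Map a distance d in [0, 1] to a confidence in [0, 1],
    # giving more weight to distances near 0.
    return ((2 - d) ** exponent - 1) / (2 ** exponent - 1)

print(confidence(0.0))              # 1.0 (100%)
print(confidence(0.5))              # ~0.417
print(confidence(1.0))              # 0.0
print(confidence(0.5, exponent=3))  # ~0.339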

Related

What's the Difference Between Kendall's Distance and Kendall tau Distance?

I'm now trying to use Kendall's distance to improve sets of rankings based on the Borda count method.
I'm asked to follow a specific document's instructions. The document states:
"The Kendall's distance counts the pairwise disagreements between items from two rankings as :
where
The Kendall's distance is normalized by its maximum value C2n. The less the Kendall’s distance is, the greater the similarity degree between the rankings is.
The Kendall's tau is another method for measuring the similarity degree between rankings, which is easy to be confused with the Kendall's distance.
The Kendall's tau is defined as:
tau(r1, r2) = 1 - 2 * K(r1, r2) / C(n,2)
That is, the Kendall's tau is defined based on the normalized Kendall's distance. Note that the greater the Kendall's tau is, the greater the similarity degree between the compared rankings is. In this paper, we use the Kendall's distance rather than the Kendall's tau."
My goal is to improve the following ranking by using Kendall's distance:
x1 x2 x3 x4
A1 4 1 3 2
A2 4 1 3 2
A3 4 3 2 1
A4 1 4 3 2
A5 1 2 4 3
In this ranking, the ith row represents the ranking obtained based on Ai, and each column represents the ranking position of the corresponding item in each ranking. (i.e. xn represents the items to be ranked, Ai represents the ones who rank the items.)
I don't understand the difference between the two distances despite the document's explanation. Also, what does the "(j,s), j != s" beneath the sigma symbol stand for? And finally, how do I implement Kendall's distance on the ranking provided above?
Distance and similarity are two related concepts, but for a distance, exact identity means distance 0, and as things get more different, the distance between them grows, with no obvious fixed limit. A well-behaved distance obeys the rules for a metric - see https://en.wikipedia.org/wiki/Metric_(mathematics). For a similarity, exact identity means similarity 1, and the similarity decreases as things get more different, but usually never drops below 0. Kendall's tau seems to be a way of turning Kendall's distance into a similarity.
"(j,s), j != s" means consider all possibilities for j and s except those for which j = s.
You can compute Kendall's distance by simply summing over all possibilities for j not equal to s, but the time taken for this goes up with the square of the number of items. There are methods for which the time only goes up as n * log(n), where n is the number of items; for this and much other material on Kendall, see https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
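For example, a straightforward O(n^2) implementation in Python (my own sketch; r1[i] and r2[i] are the rank positions of item i, as in the table above):

from itertools import combinations

def kendall_distance(r1, r2):
    # Count the pairs (j, s) that r1 and r2 rank in opposite order.
    return sum(1 for j, s in combinations(range(len(r1)), 2)
               if (r1[j] - r1[s]) * (r2[j] - r2[s]) < 0)

A1 = [4, 1, 3, 2]   # rank positions of x1..x4 according to A1
A4 = [1, 4, 3, 2]   # rank positions of x1..x4 according to A4
n = len(A1)
max_pairs = n * (n - 1) // 2          # C(n,2), the maximum distance
d = kendall_distance(A1, A4)
print(d, d / max_pairs)               # raw and normalized distance
print(1 - 2 * d / max_pairs)          # the corresponding Kendall's tau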

Correct implementation of weighted K-Nearest Neighbors

From what I understood, the classical KNN algorithm works like this (for discrete data):
Let x be the point you want to classify
Let dist(a,b) be the Euclidean distance between points a and b
Iterate through the training set points pᵢ, taking the distances dist(pᵢ,x)
Classify x as the most frequent class among the K points closest (according to dist) to x.
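In code, my understanding of the classic version looks something like this (a sketch, assuming the training set is a list of (point, label) pairs):

import math
from collections import Counter

def knn_classify(x, training, k):
    # Majority vote among the k points nearest to x (Euclidean distance).
    nearest = sorted(training, key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]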
How would I introduce weights on this classic KNN? I read that more importance should be given to nearer points, and I read this, but couldn't understand how this would apply to discrete data.
For me, first of all, using argmax doesn't make any sense, and if the weight acts by increasing the distance, then it would make the distance worse. Sorry if I'm talking nonsense.
Consider a simple example with three classifications (red, green, blue) and the six nearest neighbors denoted by R, G, B. I'll make this linear to simplify the visualization and arithmetic:
R B G x G R R
The points listed with distance are
class dist
R 3
B 2
G 1
G 1
R 2
R 3
Thus, if we're using unweighted nearest neighbours, the simple "voting" algorithm is 3-2-1 in favor of Red. However, with the weighted influences, we have ...
red_total   = 1/3^2 + 1/2^2 + 1/3^2 = 2/9 + 1/4 ~= 0.47
blue_total  = 1/2^2                             =  0.25
green_total = 1/1^2 + 1/1^2                     =  2.00
... and x winds up as Green due to proximity.
That lowercase-delta function in the linked formula is merely the classification function; in this simple example, it returns red | green | blue. In a more complex example, ... well, I'll leave that to later tutorials.
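The same arithmetic in code (a sketch of the inverse-square weighting used above):

from collections import defaultdict

# (class, distance) pairs for the six nearest neighbors above
neighbors = [("R", 3), ("B", 2), ("G", 1), ("G", 1), ("R", 2), ("R", 3)]

totals = defaultdict(float)
for label, dist in neighbors:
    totals[label] += 1 / dist ** 2    # inverse-square weight

print(dict(totals))                   # {'R': ~0.47, 'B': 0.25, 'G': 2.0}
print(max(totals, key=totals.get))    # 'G' -- x winds up Green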
Okay, off the bat let me say I am not a fan of the link you provided: it uses image equations, and the notation in the images differs from the notation in the text.
So, leaving that aside, let's look at the regular k-NN algorithm. Regular k-NN is actually just a special case of weighted k-NN: you assign a weight of 1 to the k nearest neighbors and 0 to the rest.
Let Wqj denote the weight associated with a point j relative to a point q
Let yj be the class label associated with the data point j. For simplicity, let us assume we are classifying birds as either crows, hens or turkeys => discrete classes. So for all j, yj ∈ {crow, turkey, hen}
A good weight is the inverse of the distance, whatever the distance metric: Euclidean, Mahalanobis, etc.
Given all this, the class label yq you would associate with the point q you are trying to predict is the sum of the wqj . yj terms divided by the sum of all weights. You do not have to do the division if you normalize the weights first.
You would end up with an expression of the form somevalue1 . crow + somevalue2 . hen + somevalue3 . turkey
One of these classes will have a higher somevalue; the class with the highest value is what you will predict for point q.
For the purpose of training, you can factor in the error any way you want. Since the classes are discrete, there are a limited number of simple ways you can adjust the weights to improve accuracy.
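Putting this together, here is a sketch of weighted k-NN with inverse-distance weights (the bird labels and the toy training set are just for illustration):

import math
from collections import defaultdict

def weighted_knn(q, training, k):
    # Each of the k nearest neighbors votes for its class label
    # with weight 1/distance; the class with the largest total wins.
    nearest = sorted(training, key=lambda p: math.dist(p[0], q))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = math.dist(point, q)
        votes[label] += 1 / d if d > 0 else float("inf")  # exact match dominates
    return max(votes, key=votes.get)

training = [((0, 0), "crow"), ((1, 0), "crow"),
            ((5, 5), "hen"), ((6, 5), "turkey")]
print(weighted_knn((0.5, 0.2), training, k=3))  # 'crow'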

Genetic Algorithm : Find curve that fits points

I am working on a genetic algorithm. Here is how it works :
Input : a list of 2D points
Input : the degree of the curve
Output : the equation of the curve that fits the points best (trying to minimize the sum of vertical distances from the points' y values to the curve)
The algorithm finds good equations for simple straight lines and for 2-degree equations.
But for 4 points and 3 degree equations and more, it gets more complicated. I cannot find the right combination of parameters : sometimes I have to wait 5 minutes and the curve found is still very bad. I tried modifying many parameters, from population size to number of parents selected...
Are there well-known parameter combinations or results in GA programming that could help me?
Thank you! :)
Based on what is given, you would need polynomial interpolation, in which the degree of the polynomial is the number of points minus 1:
n = (Number of points) - 1
Now having said that, let's assume you have 5 points that need to be fitted and I am going to define them in a variable:
var points = [[0,0], [2,3], [4,-1], [5,7], [6,9]]
Note that the array of points has been ordered by the x values, which you need to do.
Then the equation would be:
f(x) = a1*x^4 + a2*x^3 + a3*x^2 + a4*x + a5
Now based on the definition (https://en.wikipedia.org/wiki/Polynomial_interpolation#Constructing_the_interpolation_polynomial), the coefficients are computed by substituting each point into the equation and solving the resulting linear system (a Vandermonde system); use the referenced page to work out the coefficients.
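For example, with NumPy you can set up and solve that system directly (a sketch using the five points above):

import numpy as np

points = [[0, 0], [2, 3], [4, -1], [5, 7], [6, 9]]
xs = np.array([p[0] for p in points], dtype=float)
ys = np.array([p[1] for p in points], dtype=float)

# One row [x^4, x^3, x^2, x, 1] per point (a Vandermonde matrix)
V = np.vander(xs, N=len(points))
coeffs = np.linalg.solve(V, ys)   # a1..a5 in f(x) = a1*x^4 + ... + a5
print(coeffs)
print(np.polyval(coeffs, xs))     # reproduces the original y values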
It is not that complicated: for polynomial interpolation of degree n you get the following equation:
p(x) = c0 + c1 * x + c2 * x^2 + ... + cn * x^n = y
This means we need n + 1 genes for the coefficients c0 to cn.
The fitness function is the sum of all squared distances from the points to the curve; the formula for one squared distance is below. This way a smaller value is obviously better; if you don't want that, you can take the inverse (1 / sum of squared distances):
d_squared(xi, yi) = (yi - p(xi))^2
I think for faster convergence you could limit the mutation, e.g. when mutating, choose a new value between min and max (e.g. -1000 and 1000) with 20% probability, and with 80% probability multiply the old value by a random factor between 0.8 and 1.2.
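A compact sketch of such a GA (mutation-only for brevity; the population size, rates and value ranges are arbitrary choices, and the points are reused from the other answer):

import random

points = [(0, 0), (2, 3), (4, -1), (5, 7), (6, 9)]
DEGREE = len(points) - 1          # n, so n + 1 genes c0..cn

def p(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

def fitness(coeffs):
    # Sum of squared vertical distances; smaller is better.
    return sum((y - p(coeffs, x)) ** 2 for x, y in points)

def mutate(coeffs, rate=0.1):
    out = list(coeffs)
    for i in range(len(out)):
        if random.random() < rate:
            if random.random() < 0.2:         # 20%: brand-new value
                out[i] = random.uniform(-1000, 1000)
            else:                             # 80%: small multiplicative tweak
                out[i] *= random.uniform(0.8, 1.2)
    return out

pop = [[random.uniform(-10, 10) for _ in range(DEGREE + 1)] for _ in range(100)]
for _ in range(2000):
    pop.sort(key=fitness)                     # smaller fitness first
    parents = pop[:20]                        # simple truncation selection
    pop = parents + [mutate(random.choice(parents)) for _ in range(80)]
print(pop[0], fitness(pop[0]))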

Logarithmic score decay based on time

I'm looking for an algorithm whose score gets logarithmically smaller over time. This is similar to this question, but the algorithm should have a nice curve instead of being linear. A time of 1 should have a score of 1, with the score diminishing as the time value increases, and ideally there would be a configurable value where the score crosses the X axis and becomes 0.
This function satisfies your criteria:
score(t) = -A log(t) + 1
where A > 0
The score crosses the X-axis at
T = exp(1/A)
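In code, with the zero-crossing T as the configurable value (so A = 1 / ln(T)):

import math

def score(t, zero_at=100.0):
    # Score 1 at t = 1, decreasing logarithmically to 0 at t = zero_at.
    A = 1 / math.log(zero_at)     # from T = exp(1/A)
    return 1 - A * math.log(t)

print(score(1))     # 1.0
print(score(10))    # 0.5 when zero_at = 100
print(score(100))   # 0.0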

How can I make a random selection from an inversely-weighted list?

Given a list of integers, e.g. 1, 2, 3, 4, I know how to select items based on their weight. The example items would have probabilities of 10%, 20%, 30%, and 40%, respectively.
Is there an equally simple method of selecting items based on the inverse of their weight? With this method, the example list would be equal to a weighted list of 1, 1/2, 1/3, 1/4 (48%, 24%, 16%, 12%), but I want to avoid the conversion and use of floating-point arithmetic. (Assume all of the integers are positive and non-zero.)
You could divide the numbers' least common multiple by each number and get integral proportions.
For [1, 2, 3, 4], this is 12. Your weights are 12/1=12, 12/2=6, 12/3=4, 12/4=3.
You could also multiply them all together and not bother with the LCM. The numbers will be larger but the proportions will be the same: 24/1=24, 24/2=12, 24/3=8, 24/4=6.
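A sketch of the whole selection in integer arithmetic (using math.lcm, available in Python 3.9+):

import math
import random
from functools import reduce

def inverse_weighted_choice(items):
    # Pick an item with probability proportional to 1/item.
    lcm = reduce(math.lcm, items)
    weights = [lcm // i for i in items]   # integral inverse proportions
    r = random.randrange(sum(weights))    # integer in [0, sum)
    for item, w in zip(items, weights):
        if r < w:
            return item
        r -= w

counts = {i: 0 for i in [1, 2, 3, 4]}
for _ in range(100_000):
    counts[inverse_weighted_choice([1, 2, 3, 4])] += 1
print(counts)   # roughly 48% / 24% / 16% / 12%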
First get the sum of the weights; call it S (e.g. 1 + 1/2 + 1/3 + 1/4 = 2.083). Then to find the probability of weight w_i, you divide w_i by S (e.g. 1/2.083 = 48%).
I don't think there's a nice, closed-form formula for this expression for general sequences of numbers.
The sums of the weights are harmonic numbers. For large n, the sum is approximately ln(n) + gamma, where gamma is the Euler–Mascheroni constant (~0.577), so for large n you could use this formula to approximate the sum.
EDIT: There are ways to reduce floating-point error. One is to calculate the sum from the smallest term up to the largest (i.e. 1/n + 1/(n-1) + ... + 1), which lets the intermediate calculations keep as many bits of precision as possible. Done this way, rounding should not be a problem.
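For illustration, comparing the smallest-terms-first sum with the ln(n) + gamma approximation:

import math

GAMMA = 0.5772156649015329    # Euler–Mascheroni constant

n = 1_000_000
h = sum(1 / k for k in range(n, 0, -1))   # smallest terms first
print(h, math.log(n) + GAMMA)             # agree to about six decimal places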
