Applying weights to KNN dimensions - elasticsearch

When doing KNN searches in ES/OS it seems to be recommended to normalize the data in the KNN vectors to prevent single dimensions from overpowering the final scoring.
In my current example I have a 3 dimensional vector where all values are normalized to values between 0 and 1
[0.2, 0.3, 0.2]
From the perspective of Euclidean distance based scoring this seems to give equal weight to all dimensions.
In my particular example I am using an l2 vector:
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "nmslib",
}
However, if I want to give more weight to one of my dimensions (say by a factor of 2), would it be acceptable to single out that dimension and normalize between 0-2 instead of the base range of 0-1?
Example:
[0.2, 0.3, 1.2] // Third dimension is now between 0-2
The distance computation for this term would now be (2 * (xi - yi))^2 and lead to bigger diffs compared to the rest. As a result the overall score would be more sensitive to differences in this particular term.
In OS the score is calculated as 1 / (1 + Distance Function) so the higher the value returned from the distance function, the lower the score will be.
Is there a method for deciding what the weighting range should be? Setting the range too high would likely make the dimension too dominant.
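For intuition, here is a small sketch (plain Python, not tied to ES/OS internals; the query vector below is made up) of how scaling the third dimension before indexing changes the L2 distance and the 1 / (1 + distance) style score described above:
import math

def weighted_l2(x, y, weights):
    # scaling dimension i by weights[i] multiplies its squared difference by weights[i]^2
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights)))

doc = [0.2, 0.3, 0.2]      # vector from the question, all dims in [0, 1]
query = [0.1, 0.3, 0.6]    # hypothetical query vector

print(weighted_l2(doc, query, [1, 1, 1]))            # unweighted distance
print(weighted_l2(doc, query, [1, 1, 2]))            # third dimension weighted by 2
print(1 / (1 + weighted_l2(doc, query, [1, 1, 2])))  # score as described in the question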

Related

Is it fine to have a threshold greater than 1 in roc_curve metrics?

Predicting the probability of class assignment for each chosen sample from the Train_features:
probs = classifier.predict_proba(Train_features)
Choosing the class for which the AUC has to be determined.
preds = probs[:,1]
Calculating false positive rate, true positive rate and the possible thresholds that can clearly separate TP and TN.
fpr, tpr, threshold = metrics.roc_curve(Train_labels, preds)
roc_auc = metrics.auc(fpr, tpr)
print(max(threshold))
Output : 1.97834
The previous answer did not really address your question of why the threshold is > 1, and in fact is misleading when it says the threshold does not have any interpretation.
The range of the threshold should technically be [0, 1] because it is the probability threshold. But scikit-learn prepends an extra value equal to the largest score plus 1 to the thresholds array (which is in decreasing order), so that the ROC curve starts at (0, 0) and the full range is covered. So if in your example max(threshold) = 1.97834, the very next number in the threshold array should be 0.97834.
See this sklearn github issue thread for an explanation. It's a little funny because somebody thought this is a bug, but it's just how the creators of sklearn decided to define threshold.
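A quick way to see this (the toy labels and scores below are made up for illustration) is to print the thresholds array directly; the extra value sits at the front because the array is in decreasing order:
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
# In older scikit-learn releases the first threshold is max(y_score) + 1 = 1.8;
# newer releases use np.inf instead, but the idea (an extra "predict nothing"
# threshold) is the same.
print(thresholds)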
Finally, because it is a probability threshold, it does have a very useful interpretation. The optimal cutoff is the threshold at which sensitivity + specificity is maximum. In scikit-learn this can be computed like so:
import numpy as np
from sklearn.metrics import roc_curve

fpr_p, tpr_p, thresh = roc_curve(true_labels, pred)
# maximize sensitivity + specificity, i.e. tpr + (1-fpr) or just tpr - fpr
th_optimal = thresh[np.argmax(tpr_p - fpr_p)]
The threshold value does not have any kind of interpretation; what really matters is the shape of the ROC curve. Your classifier performs well if there are thresholds (no matter their values) such that the generated ROC curve lies above the diagonal (better than random guessing). Your classifier achieves a perfect result (this happens rarely in practice) if the ROC curve passes through the point (0,1), and it has the worst possible result if the curve passes through the point (1,0). A good indicator of the performance of your classifier is the area under the ROC curve; this indicator is known as AUC and is bounded between 0 and 1, 0 for the worst performance and 1 for perfect performance.

How to compute the variances in Expectation Maximization with n dimensions?

I have been reviewing Expectation Maximization (EM) in research papers such as this one:
http://pdf.aminer.org/000/221/588/fuzzy_k_means_clustering_with_crisp_regions.pdf
I have some doubts that I have not been able to figure out. For example, what happens if we have many dimensions for each data point?
For example I have the following dataset with 6 datapoints and 4 dimensions:
D1 D2 D3 D4
5, 19, 72, 5
6, 18, 14, 1
7, 22, 29, 4
3, 22, 51, 1
2, 21, 89, 2
1, 12, 28, 1
Does it mean that for computing the expectation step I need to compute 4 standard deviations (one for each dimension)?
Do I also have to compute the variance for each cluster assuming k=3 (I do not know if it is necessary based on the formula from the paper...), or just the variances for each dimension (4 attributes)?
Usually, you use a Covariance matrix, which also includes variances.
But it really depends on your chosen model. The simplest model does not use variances at all.
A more complex model has a single variance value, the average variance over all dimensions.
Next, you can have a separate variance for each dimension independently; and last but not least a full covariance matrix. That is probably the most flexible GMM in popular use.
Depending on your implementation, there can be many more.
From R's mclust documentation:
univariate mixture
"E" = equal variance (one-dimensional)
"V" = variable variance (one-dimensional)
multivariate mixture
"EII" = spherical, equal volume
"VII" = spherical, unequal volume
"EEI" = diagonal, equal volume and shape
"VEI" = diagonal, varying volume, equal shape
"EVI" = diagonal, equal volume, varying shape
"VVI" = diagonal, varying volume and shape
"EEE" = ellipsoidal, equal volume, shape, and orientation
"EEV" = ellipsoidal, equal volume and equal shape
"VEV" = ellipsoidal, equal shape
"VVV" = ellipsoidal, varying volume, shape, and orientation
single component
"X" = univariate normal
"XII" = spherical multivariate normal
"XXI" = diagonal multivariate normal
"XXX" = elliposidal multivariate normal

'Classifying with k-Nearest Neighbors' for not-number parameters

I have fact data with a set of parameters and some value that corresponds to those parameters.
For example:
Street Color Shape Value
--------------------------------------
Versky Blue Ball 10
Soll Green Square 5
...
Now I need to create a function which gets a set of parameters [Holl, Red, Circle] and returns the predicted 'Value'.
If my parameters were numbers I could use the 'Classifying with k-Nearest Neighbors' algorithm, but they aren't.
Which machine-learning algorithm can I use to solve this task ?
Note that nearest neighbor finds the nearest neighbor according to some distance metric. While Euclidean or similar metrics are indeed widely used, any distance metric can be fine.
You can use a variation of Hamming distance:
Let x[i] be the i'th feature of vector x
Let the number of features be n
d(x,y) = Sum { (x[i] == y[i] ? 0 : 1) | i from 0 to n-1 }
The above is a distance metric which is basically a variation of Hamming distance where each feature has its own alphabet.
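As a concrete illustration (the rows, k and query below come from the question's made-up example), a k-nearest-neighbour prediction with this mismatch-counting distance could look like this:
data = [
    # (street, color, shape, value)
    ("Versky", "Blue", "Ball", 10),
    ("Soll", "Green", "Square", 5),
]

def hamming(features_a, features_b):
    # count how many categorical features differ
    return sum(a != b for a, b in zip(features_a, features_b))

def predict(query, k=1):
    # average the 'Value' of the k rows whose features are closest to the query
    neighbours = sorted(data, key=lambda row: hamming(row[:-1], query))[:k]
    return sum(row[-1] for row in neighbours) / k

print(predict(("Holl", "Red", "Circle")))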

2D coordinate normalization

I need to implement a function which normalizes coordinates. I define normalize as (please suggest a better term if I'm wrong):
Mapping entries of a data set from their natural range to values between 0 and 1.
Now this was easy in one dimension:
static List<float> Normalize(float[] nums)
{
    float max = Max(nums);
    float min = Min(nums);
    float delta = max - min;
    List<float> li = new List<float>();
    foreach (float i in nums)
    {
        li.Add((i - min) / delta);
    }
    return li;
}
I need a 2D version as well and that one has to keep the aspect ratio intact. But I'm having some trouble figuring out the math.
Although the code posted is in C# the answers need not to be.
Thanks in advance. :)
I am posting my response as an answer because I do not have enough points to make a comment.
My interpretation of the question: How do we normalize the coordinates of a set of points in 2 dimensional space?
A normalization operation involves a "shift and scale" operation. In the case of 1-dimensional space this is fairly easy and intuitive (as pointed out by @Mizipzor).
normalizedX=(originalX-minX)/(maxX-minX)
In this case we are first shifting the value by a distance of minX and then scaling it by the range, which is given by (maxX - minX). The shift operation ensures that the minimum moves to 0, and the scale operation squashes the distribution so that it has an upper limit of 1.
In the case of 2D, simply dividing by the largest value is not enough. Why?
Consider the simplified case with just 2 points, A = (5000, 8000) and B = (7000, 10000).
The maximum value of any dimension is the Y value of point B, and this is 10000.
Coordinates of normalized A => 5000/10000, 8000/10000, i.e. 0.5, 0.8
Coordinates of normalized B => 7000/10000, 10000/10000, i.e. 0.7, 1.0
The X and Y values are all within 0 and 1. However, the distribution of the normalized values is far from uniform. The minimum value is just 0.5. Ideally this should be closer to 0.
Preferred approach for normalizing 2d coordinates
To get a more even distribution we should do a "shift" operation around the minimum of all X values and minimum of all Y values. This could be done around the mean of X and mean of Y as well. Considering the above example,
the minimum of all X is 5000
the minimum of all Y is 8000
Step 1 - Shift operation
A=>(5000-5000,8000-8000), i.e (0,0)
B=>(7000-5000,10000-8000), i.e. (2000,2000)
Step 2 - Scale operation
To scale down the values we need some maximum. We could use the length of the diagonal AB, which is sqrt(2000^2 + 2000^2) ≈ 2828.
A => (0/2828, 0/2828), i.e. (0, 0)
B => (2000/2828, 2000/2828), i.e. approximately (0.71, 0.71)
What happens when there are more than 2 points?
The approach remains similar. We find the coordinates of the smallest bounding box which fits all the points.
We find the minimum value of X (MinX) and minimum value of Y (MinY) from all the points and do a shift operation. This changes the origin to the lower left corner of the bounding box.
We find the maximum value of X (MaxX) and maximum value of Y (MaxY) from all the points.
We calculate the length of the diagonal connecting (MinX,MinY) and (MaxX,MaxY) and use this value to do a scale operation.
length of diagonal=sqrt((maxX-minX)*(maxX-minX) + (maxY-minY)*(maxY-minY))
normalized X = (originalX - minX)/(length of diagonal)
normalized Y = (originalY - minY)/(length of diagonal)
How does this logic change if we have more than 2 dimensions?
The concept remains the same.
- We find the minimum value in each of the dimensions (X,Y,Z)
- We find the maximum value in each of the dimensions (X,Y,Z)
- Compute the length of the diagonal as a scaling factor
- Use the minimum values to shift the origin.
length of diagonal=sqrt((maxX-minX)*(maxX-minX)+(maxY-minY)*(maxY-minY)+(maxZ-minZ)*(maxZ-minZ))
normalized X = (originalX - minX)/(length of diagonal)
normalized Y = (originalY - minY)/(length of diagonal)
normalized Z = (originalZ - minZ)/(length of diagonal)
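A compact sketch (in Python, for illustration only) of the shift-and-scale operation just described: shift by the per-axis minimum, then scale every axis by the bounding-box diagonal so the aspect ratio is preserved. It works unchanged for 2 or more dimensions.
import math

def normalize_points(points):
    mins = [min(axis) for axis in zip(*points)]
    maxs = [max(axis) for axis in zip(*points)]
    # length of the diagonal of the bounding box, used as the common scale
    diagonal = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
    return [tuple((v - lo) / diagonal for v, lo in zip(p, mins)) for p in points]

print(normalize_points([(5000, 8000), (7000, 10000)]))
# [(0.0, 0.0), (0.7071..., 0.7071...)]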
It seems you want each vector (1D, 2D or ND) to have length <= 1.
If that's the only requirement, you can just divide each vector by the length of the longest one.
double max = maximum (|vector| for each vector in 'data');
foreach (Vector v : data) {
li.add(v / max);
}
That will make the longest vector in the result list have length 1.
But this won't be equivalent to your current code for the 1-dimensional case, as you can't find a minimum or maximum in a set of points on the plane. Thus, no delta.
Simple idea: Find out which dimension is bigger and normalize in this dimension. The second dimension can be computed by using the ratio. This way the ratio is kept and your values are between 0 and 1.

Averaging angles... Again

I want to calculate the average of a set of angles, which represents source bearing (0 to 360 deg) - (similar to wind-direction)
I know it has been discussed before (several times). The accepted answer was: compute unit vectors from the angles and take the angle of their average.
However this answer defines the average in a non-intuitive way. The average of 0, 0 and 90 will be atan( (sin(0)+sin(0)+sin(90)) / (cos(0)+cos(0)+cos(90)) ) = atan(1/2) = 26.56 deg.
I would expect the average of 0, 0 and 90 to be 30 degrees.
So I think it is fair to ask the question again: How would you calculate the average, so such examples will give the intuitive expected answer.
Edit 2014:
After asking this question, I've posted an article on CodeProject which offers a thorough analysis. The article examines the following reference problems:
Given the time-of-day [00:00-24:00) for each birth that occurred in the US in the year 2000 – Calculate the mean birth time-of-day
Given a multiset of direction measurements from a stationary transmitter to a stationary receiver, using a measurement technique with a wrapped normally distributed error – Estimate the direction.
Given a multiset of azimuth estimates between two points, made by "ordinary" humans (assumed to be subject to a wrapped truncated normally distributed error) – Estimate the direction.
[Note the OP's question (but not title) appears to have changed to a rather specialised question ("...the average of a SEQUENCE of angles where each successive addition does not differ from the running mean by more than a specified amount." ) - see #MaR comment and mine. My following answer addresses the OP's title and the bulk of the discussion and answers related to it.]
This is not a question of logic or intuition, but of definition. This has been discussed on SO before without any real consensus. Angles should be defined within a range (which might be -PI to +PI, or 0 to 2*PI, or might be -Inf to +Inf). The answers will be different in each case.
The word "angle" causes confusion as it means different things. The angle of view is an unsigned quantity (and is normally PI > theta > 0). In that case "normal" averages might be useful. The angle of rotation (e.g. total rotation of an ice skater) might or might not be signed and might include theta > 2*PI and theta < -2*PI.
What is defined here is angle = direction, which requires vectors. If you use the word "direction" instead of "angle" you will have captured the OP's (apparent original) intention and it will help to move away from scalar quantities.
Wikipedia shows the correct approach when angles are defined circularly such that
theta = theta+2*PI*N = theta-2*PI*N
The answer for the mean is NOT a scalar but a vector. The OP may not feel this is intuitive but it is the only useful correct approach. We cannot redefine the square root of -4 to be -2 because it's more intuitive - it has to be ±2i. Similarly the average of bearings -90 degrees and +90 degrees is a vector of zero length, not 0.0 degrees.
Wikipedia (http://en.wikipedia.org/wiki/Mean_of_circular_quantities) has a special section and states (The equations are LaTeX and can be seen rendered in Wikipedia):
Most of the usual means fail on circular quantities, like angles, daytimes, fractional parts of real numbers. For those quantities you need a mean of circular quantities.
Since the arithmetic mean is not effective for angles, the following method can be used to obtain both a mean value and a measure for the variance of the angles:
Convert all angles to corresponding points on the unit circle, e.g., α to (cos α, sin α). That is, convert polar coordinates to Cartesian coordinates. Then compute the arithmetic mean of these points. The resulting point will lie on the unit disk. Convert that point back to polar coordinates. The angle is a reasonable mean of the input angles. The resulting radius will be 1 if all angles are equal. If the angles are uniformly distributed on the circle, then the resulting radius will be 0, and there is no circular mean. In other words, the radius measures the concentration of the angles.
Given the angles \alpha_1,\dots,\alpha_n the mean is computed by
M_\alpha = \operatorname{atan2}\left(\frac{1}{n}\cdot\sum_{j=1}^n \sin\alpha_j,\ \frac{1}{n}\cdot\sum_{j=1}^n \cos\alpha_j\right)
using the atan2 variant of the arctangent function, or
M_\alpha = \arg\left(\frac{1}{n}\cdot\sum_{j=1}^n \exp(i\cdot\alpha_j)\right)
using complex numbers.
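For reference, here is a direct transcription of the formula above into code (Python, purely for illustration):
import math

def circular_mean(angles_deg):
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / len(angles_deg)
    return math.degrees(math.atan2(s, c)) % 360

print(circular_mean([0, 0, 90]))   # ~26.57, not the "intuitive" 30
print(circular_mean([30, 350]))    # ~10, correctly handling the wraparound at 0/360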
Note that in the OP's question an angle of 0 is purely arbitrary - there is nothing special about wind coming from 0 as opposed to 180 (except in this hemisphere it's colder on the bicycle). Try changing 0,0,90 to 289, 289, 379 and see how the simple arithmetic no longer works.
(There are some distributions where angles of 0 and PI have special significance but they are not in scope here).
Here are some intense previous discussions which mirror the current spread of views :-)
How do you calculate the average of a set of circular data?
http://forums.xkcd.com/viewtopic.php?f=17&t=22435
http://www.allegro.cc/forums/thread/595008
Thank you all for helping me see my problem more clearly.
I found what I was looking for.
It is called Mitsuta method.
The inputs and output are in the range [0..360).
This method is good for averaging data that was sampled using constant sampling intervals.
The method assumes that the difference between successive samples is less than 180 degrees (which means that if we won't sample fast enough, a 330 degrees change in the sampled signal would be incorrectly detected as a 30 degrees change in the other direction and will insert an error into the calculation). Nyquist–Shannon sampling theorem anybody ?
Here is a c++ code:
double AngAvrg(const vector<double>& Ang)
{
    vector<double>::const_iterator iter = Ang.begin();
    double fD    = *iter;   // "unwrapped" value of the current sample
    double fSigD = *iter;   // running sum of the unwrapped samples

    while (++iter != Ang.end())
    {
        double fDelta = *iter - fD;
        if      (fDelta < -180.) fD += fDelta + 360.;   // wrapped past 0
        else if (fDelta >  180.) fD += fDelta - 360.;   // wrapped past 360
        else                     fD += fDelta;
        fSigD += fD;
    }

    double fAvrg = fSigD / Ang.size();
    if (fAvrg >= 360.) return fAvrg - 360.;
    if (fAvrg <    0.) return fAvrg + 360.;
    return fAvrg;
}
It is explained on page 51 of Meteorological Monitoring Guidance for Regulatory Modeling Applications (PDF)(171 pp, 02-01-2000, 454-R-99-005)
Thank you MaR for sending the link as a comment.
If the sampled data is constant, but our sampling device has an inaccuracy with a Von Mises distribution, a unit-vectors calculation will be appropriate.
This is incorrect on every level.
Vectors add according to the rules of vector addition. The "intuitive, expected" answer might not be that intuitive.
Take the following example. If I have one unit vector (1, 0), with origin at (0,0) that points in the +x-direction and another (-1, 0) that also has its origin at (0,0) that points in the -x-direction, what should the "average" angle be?
If I simply add the angles and divide by two, I can argue that the "average" is either +90 or -90. Which one do you think it should be?
If I add the vectors according to the rules of vector addition (component by component), I get the following:
(1, 0) + (-1, 0) = (0, 0)
In polar coordinates, that's a vector with zero magnitude and angle zero.
So what should the "average" angle be? I've got three different answers here for a simple case.
I think the answer is that vectors don't obey the same intuition that numbers do, because they have both magnitude and direction. Maybe you should describe what problem you're solving a bit better.
Whatever solution you decide on, I'd advise you to base it on vectors. It'll always be correct that way.
What does it even mean to average source bearings? Start by answering that question, and you'll get closer to being able to define what you mean by the average of angles.
In my mind, an angle with tangent equal to 1/2 is the right answer. If I have a unit force pushing me in the direction of the vector (1, 0), another force pushing me in the direction of the vector (1, 0), and a third force pushing me in the direction of the vector (0, 1), then the resulting force (the sum of these forces) is the force pushing me in the direction of (2, 1). These are the vectors representing the bearings 0 degrees, 0 degrees and 90 degrees. The angle represented by the vector (2, 1) has tangent equal to 1/2.
Responding to your second edit:
Let's say that we are measuring wind direction. Our 3 measurements were 0, 0, and 90 degrees. Since all measurements are equally reliable, why shouldn't our best estimate of the wind direction be 30 degrees? Setting it to 26.56 degrees is a bias toward 0...
Okay, here's an issue. The unit vector with angle 0 doesn't have the same mathematical properties that the real number 0 has. Using the notation 0v to represent the vector with angle 0, note that
0v + 0v = 0v
is false but
0 + 0 = 0
is true for real numbers. So if 0v represents wind with unit speed and angle 0, then 0v + 0v is wind with double unit speed and angle 0. And then if we have a third wind vector (which I'll represent using the notation 90v) which has angle 90 and unit speed, then the wind that results from the sum of these vectors does have a bias because it's traveling at twice unit speed in the horizontal direction but only unit speed in the vertical direction.
In my opinion, this is about angles, not vectors. For that reason the average of 360 and 0 is truly 180.
The average of one turn and no turns should be half a turn.
Edit: Equivalent, but more robust algorithm (and simpler):
divide angles into 2 groups, [0-180) and [180-360)
numerically average both groups
average the 2 group averages with proper weighting
if wraparound occurred, correct by 180˚
This works because number averaging works "logically" if all the angles are in the same hemicircle. We then delay getting wraparound error until the very last step, where it is easily detected and corrected. I also threw in some code for handling opposite angle cases. If the averages are opposite we favor the hemisphere that had more angles in it, and in the case of equal angles in both hemispheres we return None because no average would make sense.
The new code:
def averageAngles2(angles):
    newAngles = [a % 360 for a in angles]
    smallAngles = []
    largeAngles = []
    # split the angles into 2 groups: [0-180) and [180-360)
    for angle in newAngles:
        if angle < 180:
            smallAngles.append(angle)
        else:
            largeAngles.append(angle)
    smallCount = len(smallAngles)
    largeCount = len(largeAngles)
    # averaging each of the groups will work with standard averages
    smallAverage = sum(smallAngles) / float(smallCount) if smallCount else 0
    largeAverage = sum(largeAngles) / float(largeCount) if largeCount else 0
    if smallCount == 0:
        return largeAverage
    if largeCount == 0:
        return smallAverage
    average = (smallAverage * smallCount + largeAverage * largeCount) / \
              float(smallCount + largeCount)
    if largeAverage < smallAverage + 180:
        # average will not hit wraparound
        return average
    elif largeAverage > smallAverage + 180:
        # average will hit wraparound, so will be off by 180 degrees
        return (average + 180) % 360
    else:
        # opposite angles: return whichever has more weight
        if smallCount > largeCount:
            return smallAverage
        elif smallCount < largeCount:
            return largeAverage
        else:
            return None
>>> averageAngles2([0, 0, 90])
30.0
>>> averageAngles2([30, 350])
10.0
>>> averageAngles2([0, 200])
280.0
Here's a slightly naive algorithm:
remove all opposite angles from the list
take a pair of angles
rotate them to the first and second quadrant and average them
rotate average angle back by same amount
for each remaining angle, average in same way, but with successively increasing weight to the composite angle
some python code (step 1 not implemented)
def averageAngles(angles):
    newAngles = [a % 360 for a in angles]
    average = 0
    weight = 0
    for ang in newAngles:
        theta = 0
        if 0 < ang - average <= 180:
            theta = 180 - ang
        else:
            theta = 180 - average
        r_ang = (ang + theta) % 360
        r_avg = (average + theta) % 360
        average = ((r_avg * weight + r_ang) / float(weight + 1) - theta) % 360
        weight += 1
    return average
Here's the answer I gave to this same question:
How do you calculate the average of a set of circular data?
It gives answers in line with what the OP says he wants, but attention should be paid to this:
"I would also like to stress that even though this is a true average of angles, unlike the vector solutions, that does not necessarily mean it is the solution you should be using; the average of the corresponding unit vectors may well be the value you actually should be using."
You are correct that the accepted answer of using traditional average is wrong.
An average of a set of points x_1 ... x_n in a metric space X is an element x in X that minimizes the sum of squared distances to each point (see Fréchet mean). If you try to find this minimum using simple calculus with regular real numbers, you will recover the standard "add up and divide by n" formula.
For an angle, our elements are actually points on the unit circle S1. Our metric isn't Euclidean distance, but arc length, which is proportional to angle.
So, the average angle is the one that minimizes the sum of squared angle differences to all the other angles. In other words,
if you have a function angleBetween(a, b), you want to find the angle a
such that the sum over i of angleBetween(a_i, a)^2 is minimized.
This is an optimization problem which can be solved using a numerical optimizer. Several of the answers here claim to provide simpler closed forms, or at least better approximations.
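One way that numerical approach could look (a sketch only; scipy is assumed to be available, and the three search windows are just a crude way to avoid local minima):
from scipy.optimize import minimize_scalar

def angle_between(a, b):
    # smallest absolute difference between two angles, in degrees
    d = abs(a - b) % 360
    return min(d, 360 - d)

def mean_angle(angles):
    # minimize the sum of squared angular differences (Frechet mean on the circle)
    cost = lambda a: sum(angle_between(a, x) ** 2 for x in angles)
    # search three 120-degree windows to reduce the chance of a local minimum
    results = [minimize_scalar(cost, bounds=(lo, lo + 120), method="bounded")
               for lo in (0, 120, 240)]
    return min(results, key=lambda r: r.fun).x % 360

print(mean_angle([0, 0, 90]))   # ~30, the answer the OP expects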
Statistics
As you point out in your article, you need to assume errors follow a Gaussian distribution to justify using least squares as the maximum likelihood estimator. So in this application, where is the error? Is the random error in the position of two things, and the angle is just the normal of the line between them? If so, that normal will not follow a Gaussian distribution, even if the error in point position does. Taking means of angles only really makes sense if the random error is observed in the angle itself.
You could do this: say you have a set of angles in an array angle; then to compute the average, first do angle[i] = angle[i] mod 360 for each element, and then perform a simple average over the array. So when you have 360, 10, 20, you are averaging 0, 10 and 20 - the results are intuitive.
What is wrong with taking the set of angles as real values and just computing the arithmetic average of those numbers? Then you would get the intuitive (0+0+90)/3 = 30 deg.
Edit: Thanks for useful comments and pointing out that angles may exceed 360. I believe the answer could be the normal arithmetic average reduced "modulo" 360: we sum all the values, divide by the number of angles and then subtract/add a multiple of 360 so that the result lies in the interval [0..360).
I think the problem stems from how you treat angles greater than 180 (and those greater than 360 as well). If you reduce the angles to a range of +180 to -180 before adding them to the total, you get something more reasonable:
int AverageOfAngles(int angles[], int count)
{
    int total = 0;
    for (int index = 0; index < count; index++)
    {
        int angle = angles[index] % 360;
        if (angle > 180) { angle -= 360; }
        total += angle;
    }
    return (int)((float)total / count);
}
Maybe you could represent angles as quaternions and take average of these quaternions and convert it back to angle.
I don't know if it gives you what you want, because quaternions are rotations rather than angles. I also don't know if it will give you anything different from the vector solution.
Quaternions in 2D simplify to complex numbers, so I guess it's just vectors, but maybe some interesting quaternion averaging algorithm like http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20070017872_2007014421.pdf, when simplified to 2D, will behave better than just the vector average.
Here you go! The reference is https://www.wxforum.net/index.php?topic=8660.0
import math
import numpy as np

def avgWind(directions):
    sinSum = 0
    cosSum = 0
    d2r = math.pi / 180  # degrees to radians
    r2d = 180 / math.pi  # radians to degrees
    for i in range(len(directions)):
        sinSum += math.sin(directions[i] * d2r)
        cosSum += math.cos(directions[i] * d2r)
    return (r2d * math.atan2(sinSum, cosSum) + 360) % 360

a = np.random.randint(low=0, high=360, size=6)
print(a)
print(avgWind(a))
