comparison of clustering algorithms performance in rapidminer - validation

I have applied different clustering algos like kmean, kmediod kmean-fast and expectation max clustering on my biomedical dataset using Rapidminer. now i want to check the performance of these algos that which algo gives better clustering results.
For that i have applied some operators like 'cluster density performance' and 'cluster distance performance' which gives me avg within cluster distance for each cluster and davis bouldin. but I am confused is it the right way to check clustering performance of each algo with these operators?
I am also interested in Silhouette method which i can apply for each algo to check performance but can't understand from where i can get b(i) and a(i) values from clustering algo output.

The most reliable way of evaluating clusterings is by looking at your data. If the clusters are of any use to you and make sense to a domain expert!
Never just rely on numbers.
For example, you can evaluate clusterings numerically by taking the within-cluster variance.
However, k-means optimizes exactly this value, so k-means will always come out best, and in fact this measure decreases with the number of k - but the results do not at all become more meaningful!
It is somewhat okay to use one coefficient such as Silhouette coefficient to compare results of the same algorithm this way. Since Silhouette coefficient is somewhat orthogonal to the variance minimization, it will make k-means stop at some point, when the result is a reasonable balance of the two objectives.
However, applying such a measure to different algorithms - which may have a different amount of correlation to the measure - is inherently unfair. Most likely, you will be overestimating one algorithm and underestimating the performance of another.
Another popular way is external evaluation, with labeled data. While this should be unbiased against the method -- unless the labels were generated by a similar method -- it has different issues: it will punish a solution that actually discovers new clusters!
All in all, evaluation of unsupervised methods is hard. Really hard. Best you can do, is see if the results prove useful in practise!

It's very good advice never to rely on the numbers.
All the numbers can do is help you focus on particular clusterings that are mathematically interesting. The Davies-Bouldin validity measure is nice because it will show a minimum when it thinks the clusters are most compact with respect to themselves and most separated with respect to others. If you plot a graph of the Davies-Bouldin measure as a function of k, the "best" clustering will show up as a minimum. Of course the data may not form into spherical clusters so this measure may not be appropriate but that's another story.
The Silhouette measure tends to a maximum when it identifies a clustering that is relatively better than another.
The cluster density and cluster distance measures often exhibit an "elbow" as they tend to zero. This elbow often coincides with an interesting clustering (I have to be honest and say I'm never really convinced by this elbow criterion approach).
If you were to plot different validity measures as a function of k and all of the measures gave an indication that a particular k was better then others that would be good reason to consider that value in more detail to see if you, as the domain expert for the data, agree.
If you're interested I have some examples here.

There are many ways to evaluate the performance of clustering models in machine learning. They are broadly divided into 3 categories-
1. Supervised techniques
2. Unsupervised techniques
3. Hybrid techniques
Supervised techniques are evaluated by comparing the value of evaluation metrics with some pre-defined ground rules and values.
For example- Jaccard similarity index, Rand Index, Purity etc.
Unsupervised techniques comprised of some evaluation metrics which cannot be compared with pre-defined values but they can be compared among different clustering models and thereby we can choose the best model.
For example - Silhouette measure, SSE
Hybrid techniques are nothing but combination of supervised and unsupervised methods.
Now, let’s have a look at the intuition behind these methods-
Silhouette measure
Silhouette measure is derived from 2 primary measures- Cohesion and Separation.
Cohesion is nothing but the compactness or tightness of the data points within a cluster.
There are basically 2 ways to compute the cohesion-
· Graph based cohesion
· Prototype based cohesion
Let’s consider that A is a cluster with 4 data points as shown in the figure-
Graph based cohesion computes the cohesion value by adding the distances (Euclidean or Manhattan) from each point to every other points.
Here,
Graph Cohesion(A) = Constant * ( Dis(1,2) + Dis(1,3) + Dis(1,4) + Dis(2,3) + Dis(2,4) + Dis(3,4) )
Where,
Constant = 1/ (2 * Average of all distances)
Prototype based cohesion is calculated by adding the distance of all data points from a commonly accepted point like centroid.
Here, let’s consider C as centroid in cluster A
Then,
Prototype Cohesion(A) = Constant * (Dis(1,C) +Dis(2,C) + Dis(3,C) + Dis(4,C))
Where,
Constant = 1/ (2 * Average of all distances)
Separation is the distance or magnitude of difference between the data points of 2 different clusters.
Here also, we have primarily 2 kinds of methods of computation of separation value.
1. Graph based separation
2. Prototype based separation
Graph based separation calculate the value by adding the distance between all points from Cluster 1 to each and every point in Cluster 2.
For example, If A and B are 2 clusters with 4 data points each then,
Graph based separation = Constant * ( Dis(A1,B1) + Dis(A1,B1) + Dis(A1,B2) + Dis(A1,B3) + Dis(A1,B4) + Dis(A2,B1) + Dis(A2,B2) + Dis(A2,B3) + Dis(A2,B4) + Dis(A3,B1) + Dis(A3,B2) + Dis(A3,B3) + Dis(A3,B4) + Dis(A4,B1) + Dis(A4,B2) + Dis(A4,B3) + Dis(A4,B4) )
Where,
Constant = 1/ number of clusters
Prototype based separation is calculated by finding the distance between the commonly accepted points of a 2 clusters like centroid.
Here, we can simply calculate the distance between the centroid of 2 cluster A and B i.e. Dis(C(A),C(B)) multiplied by a constant where constant = 1/ number of clusters.
Silhouette measure = (b-a)/max(b,a)
Where,
a = cohesion value
b = separation value
If Silhouette measure = -1 then it means clustering is very poor.
If Silhouette measure = 0 then it means clustering is good but some improvements are still possible.
If Silhouette measure = 1 then it means clustering is very good.
When we have multiple clustering algorithms, it is always recommended to choose the one with high Silhouette measure.
SSE ( Sum of squared errors)
SSE is calculated by adding Cohesion and Separation values.
SSE = Value ( Cohesion) + Value ( Separation).
When we have multiple clustering algorithms, it is always recommended to choose the one with low SSE.
Jaccard similarity index
Jaccard similarity index is measured using the labels in data points. If data points are not provided then we cannot measure this index.
The data points are divided into 4 categories-
True Negative (TN)= Data points with same class and different cluster
True Positive (TP) = Data points with same class and same cluster
False Negative (FN) = Data points with same class and different cluster
False Positive (FP) = Data points with different class and same cluster
Here,
Note - nC2 means number of combinations with 2 elements possible from a set containing n elements
nC2 = n*(n-1)/2
TP = 5C2 + 4C2 + 2C2 + 3C2 = 20
FN = 5C1 * 1C1 + 5C1 * 2C1 + 1C1 * 4C1 + 1C1 * 2C1 + 1C1 * 3C1 = 24
FP = 5C1 * 1C1 + 4C1 * 1C1 +4C1 * 1C1 + 1C1 * 1C1 +3C1 * 2C1 = 20
TN= 5C1 * 4C1 + 5C1 * 1C1 + 5C1 * 3C1 + 1C1 * 1C1 + 1C1 * 1C1 + 1C1 * 2C1 + 1C1 * 3C1 + 4C1 * 3C1 + 4C1 * 2C1 + 1C1 * 3C1 + 1C1 * 2C1 = 72
Jaccard similarity index = TP/ (TP + TN + FP +FN)
Here, Jaccard similarity index = 20 / (20+ 72 + 20 + 24) = 0.15
Rand Index
Rand Index is similar to Jaccard similarity index. Its formula is given by-
Rand Index = (TP + TN) / (TP + TN + FP +FN)
Here, Rand Index = (20 + 72) / (20+ 72 + 20 + 24) = 0.67
When Rand index is above 0.7, it can be considered as a good clustering.
Similarly, when Jaccard similarity index is above 0.5, it can be considered as good clustering.
Purity
This metrics also requires labels in the data. The formula is given by-
Purity = (Number of data points belonging to the label which is maximum in Cluster 1 + Number of data points belonging to the label which is maximum in Cluster 2 +.... + Number of data points belonging to the label which is maximum in Cluster n ) / Total number of data points.
For example, lets consider 3 clusters – A , B and C with labelled data points
Purity = (a + b + c) / n
Where,
a = Number of black circles in cluster A (Since black is the maximum in count)
b = Number of red circles in cluster B (Since red is the maximum in count)
c = Number of green circles in cluster C (Since green is the maximum in count)
n = Total number of data points
Here, Purity = (5 + 6 + 3) / (8 + 9 + 5) = 0.6
If purity is greater than 0.7 then it can be considered as a good clustering.
Original source - https://qr.ae/pNsxIX

Related

Understanding Support Vector Regression (SVR) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Improve this question
I'm working with SVR, and using this resource. Erverything is super clear, with epsilon intensive loss function (from figure). Prediction comes with tube, to cover most training sample, and generalize bounds, using support vectors.
Then we have this explanation. This can be described by introducing (non-negative) slack variables , to measure the deviation of training samples outside -insensitive zone. I understand this error, outside tube, but don't know, how we can use this in optimization. Could somebody explain this?
In local source. I'm trying to achieve very simple optimization solution, without libraries. This what I have for loss function.
import numpy as np
# Kernel func, linear by default
def hypothesis(x, weight, k=None):
k = k if k else lambda z : z
k_x = np.vectorize(k)(x)
return np.dot(k_x, np.transpose(weight))
.......
import math
def boundary_loss(x, y, weight, epsilon):
prediction = hypothesis(x, weight)
scatter = np.absolute(
np.transpose(y) - prediction)
bound = lambda z: z \
if z >= epsilon else 0
return np.sum(np.vectorize(bound)(scatter))
First, let's look at the objective function. The first term, 1/2 * w^2 (wish this site had LaTeX support but this will suffice) correlates with the margin of the SVM. The article you linked doesn't, in my opinion, explain this very well and calls this term describing "the model's complexity", but perhaps this is not the best way of explaining it. Minimizing this term maximizes the margin (while still representing the data well), which is the predominant goal of using SVM's doing regression.
Warning, Math Heavy Explanation: The reason this is the case is that when maximizing the margin, you want to find the "farthest" non-outlier points right on the margin and minimize its distance. Let this farthest point be x_n. We want to find its Euclidean distance d from the plane f(w, x) = 0, which I will rewrite as w^T * x + b = 0 (where w^T is just the transpose of the weights matrix so that we can multiply the two). To find the distance, let us first normalize the plane such that |w^T * x_n + b| = epsilon, which we can do WLOG as w is still able to form all possible planes of the form w^T * x + b= 0. Then, let's note that w is perpendicular to the plane. This is obvious if you have dealt a lot with planes (particularly in vector calculus), but can be proven by choosing two points on the plane x_1 and x_2, then noticing that w^T * x_1 + b = 0, and w^T * x_2 + b = 0. Subtracting the two equations we get w^T(x_1 - x_2) = 0. Since x_1 - x_2 is just any vector strictly on the plane, and its dot product with w is 0, then we know that w is perpendicular to the plane. Finally, to actually calculate the distance between x_n and the plane, we take the vector formed by x_n' and some point on the plane x' (The vectors would then be x_n - x', and projecting it onto the vector w. Doing this, we get d = |w * (x_n - x') / |w||, which we can rewrite as d = (1 / |w|) * | w^T * x_n - w^T x'|, and then add and subtract b to the inside to get d = (1 / |w|) * | w^T * x_n + b - w^T * x' - b|. Notice that w^T * x_n + b is epsilon (from our normalization above), and that w^T * x' + b is 0, as this is just a point on our plane. Thus, d = epsilon / |w|. Notice that maximizing this distance subject to our constraint of finding the x_n and having |w^T * x_n + b| = epsilon is a difficult optimization problem. What we can do is restructure this optimization problem as minimizing 1/2 * w^T * w subject to the first two constraints in the picture you attached, that is, |y_i - f(x_i, w)| <= epsilon. You may think that I have forgotten the slack variables, and this is true, but when just focusing on this term and ignoring the second term, we ignore the slack variables for now, I will bring them back later. The reason these two optimizations are equivalent is not obvious, but the underlying reason lies in discrimination boundaries, which you are free to read more about (it's a lot more math that frankly I don't think this answer needs more of). Then, note that minimizing 1/2 * w^T * w is the same as minimizing 1/2 * |w|^2, which is the desired result we were hoping for. End of the Heavy Math
Now, notice that we want to make the margin big, but not so big that includes noisy outliers like the one in the picture you provided.
Thus, we introduce a second term. To motivate the margin down to a reasonable size the slack variables are introduced, (I will call them p and p* because I don't want to type out "psi" every time). These slack variables will ignore everything in the margin, i.e. those are the points that do not harm the objective and the ones that are "correct" in terms of their regression status. However, the points outside the margin are outliers, they do not reflect well on the regression, so we penalize them simply for existing. The slack error function that is given there is relatively easy to understand, it just adds up the slack error of every point (p_i + p*_i) for i = 1,...,N, and then multiplies by a modulating constant C which determines the relative importance of the two terms. A low value of C means that we are okay with having outliers, so the margin will be thinned and more outliers will be produced. A high value of C indicates that we care a lot about not having slack, so the margin will be made bigger to accommodate these outliers at the expense of representing the overall data less well.
A few things to note about p and p*. First, note that they are both always >= 0. The constraint in your picture shows this, but it also intuitively makes sense as slack should always add to the error, so it is positive. Second, notice that if p > 0, then p* = 0 and vice versa as an outlier can only be on one side of the margin. Last, all points inside the margin will have p and p* be 0, since they are fine where they are and thus do not contribute to the loss.
Notice that with the introduction of the slack variables, if you have any outliers then you won't want the condition from the first term, that is, |w^T * x_n + b| = epsilon as the x_n would be this outlier, and your whole model would be screwed up. What we allow for, then, is to change the constraint to be |w^T * x_n + b| = epsilon + (p + p*). When translated to the new optimization's constraint, we get the full constraint from the picture you attached, that is, |y_i - f(x_i, w)| <= epsilon + p + p*. (I combined the two equations into one here, but you could rewrite them as the picture is and that would be the same thing).
Hopefully after covering all this up, the motivation for the objective function and the corresponding slack variables makes sense to you.
If I understand the question correctly, you also want code to calculate this objective/loss function, which I think isn't too bad. I have not tested this (yet), but I think this should be what you want.
# Function for calculating the error/loss for a SVM. I assume that:
# - 'x' is 2d array representing the vectors of the data points
# - 'y' is an array representing the values each vector actually gives
# - 'weights' is an array of weights that we tune for the regression
# - 'epsilon' is a scalar representing the breadth of our margin.
def optimization_objective(x, y, weights, epsilon):
# Calculates first term of objective (note that norm^2 = dot product)
margin_term = np.dot(weight, weight) / 2
# Now calculate second term of objective. First get the sum of slacks.
slack_sum = 0
for i in range(len(x)): # For each observation
# First find the absolute distance between expected and observed.
diff = abs(hypothesis(x[i]) - y[i])
# Now subtract epsilon
diff -= epsilon
# If diff is still more than 0, then it is an 'outlier' and will have slack.
slack = max(0, diff)
# Add it to the slack sum
slack_sum += slack
# Now we have the slack_sum, so then multiply by C (I picked this as 1 aribtrarily)
C = 1
slack_term = C * slack_sum
# Now, simply return the sum of the two terms, and we are done.
return margin_term + slack_term
I got this function working on my computer with small data, and you may have to change it a little to work with your data if, for example, the arrays are structured differently, but the idea is there. Also, I am not the most proficient with python, so this may not be the most efficient implementation, but my intent was to make it understandable.
Now, note that this just calculates the error/loss (whatever you want to call it). To actually minimize it requires going into Lagrangians and intense quadratic programming which is a much more daunting task. There are libraries available for doing this but if you want to do this library free as you are doing with this, I wish you good luck because doing that is not a walk in the park.
Finally, I would like to note that most of this information I got from notes I took in my ML class I took last year, and the professor (Dr. Abu-Mostafa) was a great help to have me learn the material. The lectures for this class are online (by the same prof), and the pertinent ones for this topic are here and here (although in my very biased opinion you should watch all the lectures, they were a great help). Leave a comment/question if you need anything cleared up or if you think I made a mistake somewhere. If you still don't understand, I can try to edit my answer to make more sense. Hope this helps!

Dataset for meaningless “Nearest Neighbor”?

In the paper "When Is 'Nearest Neighbor' Meaningful?" we read that, "We show that under certain broad conditions (in terms of data and query distributions, or workload), as dimensionality increases, the distance to the nearest
neighbor approaches the distance to the farthest neighbor. In other words, the contrast in distances to different data points becomes nonexistent. The conditions
we have identified in which this happens are much broader than the independent and identically distributed (IID) dimensions assumption that other work
assumes."
My question is how I should generate a dataset that resembles this effect? I have created three points each with 1000 dimensions with random numbers ranging from 0-255 for each dimension but points create different distances and do not reproduce what is mentioned above. It seems changing dimensions (e.g. 10 or 100 or 1000 dimensions) and ranges (e.g. [0,1]) do not change anything. I still get different distances which should not be any problem for e.g. clustering algorithms!
I hadn't heard of this before either, so I am little defensive, since I have seen that real and synthetic datasets in high dimensions really do not support the claim of the paper in question.
As a result, what I would suggest, as a first, dirty, clumsy and maybe not good first attempt is to generate a sphere in a dimension of your choice (I do it like like this) and then place a query at the center of the sphere.
In that case, every point lies in the same distance with the query point, thus the Nearest Neighbor has a distance equal to the Farthest Neighbor.
This, of course, is independent from the dimension, but it's what came at a thought after looking at the figures of the paper. It should be enough to get you stared, but surely, better datasets may be generated, if any.
Edit about:
distances for each point got bigger with more dimensions!!!!
this is expected, since the higher the dimensional space, the sparser the space is, thus the greater the distance is. Moreover, this is expected, if you think for example, the Euclidean distance, which gets grater as the dimensions grow.
I think the paper is right. First, your test: One problem with your test may be that you are using too few points. I used 10000 point and below are my results (evenly distributed point in [0.0 ... 1.0] in all dimensions). For DIM=2, min/max differ almost by a factor of 1000, for DIM=1000, they only differ by a factor of 1.6, for DIM=10000 by 1.248 . So I'd say these results confirm the paper's hypothesis.
DIM/N = 2 / 10000
min/avg/max= 1.0150906548224441E-5 / 0.019347838262624064 / 0.9993862941797146
DIM/N = 10 / 10000.0
min/avg/max= 0.011363500131326938 / 0.9806472676701363 / 1.628460468042207
DIM/N = 100 / 10000
min/avg/max= 0.7701271349716637 / 1.3380320375218808 / 2.1878136533925328
DIM/N = 1000 / 10000
min/avg/max= 2.581913326565635 / 3.2871335447262178 / 4.177669393187736
DIM/N = 10000 / 10000
min/avg/max= 8.704666143050158 / 9.70540814778645 / 10.85760200249862
DIM/N = 100000 / 1000 (N=1000!)
min/avg/max= 30.448610133282717 / 31.14936583713578 / 31.99082677476165
I guess the explanation is as follows: Lets take three randomly generated vectors, A, B and C. The total distance is based on the sum of the distances of each individual row of these vectors. The more dimensions the vectors have, the more the total sum of differences will approach a common average. In other word, it is highly unlikely that a vector C has in all elements a larger distance to A than another vector B has to A. With increasing dimensions, C and B will have increasingly similar distance to A (and to each other).
My test dataset was created as follow. The dataset is essentially a cube ranging from 0.0 to 1.0 in every dimension. The coordinates were created with uniform distribution in every dimension between 0.0 and 1.0. Example code (N=10000, DIM=[2..10000]):
public double[] generate(int N, int DIM) {
double[] data = new double[N*DIM];
for (int i = 0; i < N; i++) {
int pos = DIM*i;
for (int d = 0; d < DIM; d++) {
data[pos+d] = R.nextDouble();
}
}
return data;
}
Following the equation given at the bottom of the accepted answer here, we get:
d=2 -> 98460
d=10 -> 142.3
d=100 -> 1.84
d=1,000 -> 0.618
d=10,000 -> 0.247
d=100,000 -> 0.0506 (using N=1000)

Weighted random number (without predefined values!)

currently I'm needing a function which gives a weighted, random number.
It should chose a random number between two doubles/integers (for example 4 and 8) while the value in the middle (6) will occur on average, about twice as often than the limiter values 4 and 8.
If this were only about integers, I could predefine the values with variables and custom probabilities, but I need the function to give a double with at least 2 digits (meaning thousands of different numbers)!
The environment I use, is the "Game Maker" which provides all sorts of basic random-generators, but not weighted ones.
Could anyone possibly lead my in the right direction how to achieve this?
Thanks in advance!
The sum of two independent continuous uniform(0,1)'s, U1 and U2, has a continuous symmetrical triangle distribution between 0 and 2. The distribution has its peak at 1 and tapers to zero at either end. We can easily translate that to a range of (4,8) via scaling by 2 and adding 4, i.e., 4 + 2*(U1 + U2).
However, you don't want a height of zero at the endpoints, you want half the peak's height. In other words, you want a triangle sitting on a rectangular base (i.e., uniform), with height h at the endpoints and height 2h in the middle. That makes life easy, because the triangle must have a peak of height h above the rectangular base, and a triangle with height h has half the area of a rectangle with the same base and height h. It follows that 2/3 of your probability is in the base, 1/3 is in the triangle.
Combining the elements above leads to the following pseudocode algorithm. If rnd() is a function call that returns continuous uniform(0,1) random numbers:
define makeValue()
if rnd() <= 2/3 # Caution, may want to use 2.0/3.0 for many languages
return 4 + (4 * rnd())
else
return 4 + (2 * (rnd() + rnd()))
I cranked out a million values using that and plotted a histogram:
For the case someone needs this in Game Maker (or a different language ) as an universal function:
if random(1) <= argument0
return argument1 + ((argument2-argument1) * random(1))
else
return argument1 + (((argument2-argument1)/2) * (random(1) + random(1)))
Called as follows (similar to the standard random_range function):
val = weight_random_range(FACTOR, FROM, TO)
"FACTOR" determines how much of the whole probability figure is the "base" for constant probability. E.g. 2/3 for the figure above.
0 will provide a perfect triangle and 1 a rectangle (no weightning).

Algorithm to smooth a curve while keeping the area under it constant

Consider a discrete curve defined by the points (x1,y1), (x2,y2), (x3,y3), ... ,(xn,yn)
Define a constant SUM = y1+y2+y3+...+yn. Say we change the value of some k number of y points (increase or decrease) such that the total sum of these changed points is less than or equal to the constant SUM.
What would be the best possible manner to adjust the other y points given the following two conditions:
The total sum of the y points (y1'+y2'+...+yn') should remain constant ie, SUM.
The curve should retain as much of its original shape as possible.
A simple solution would be to define some delta as follows:
delta = (ym1' + ym2' + ym3' + ... + ymk') - (ym1 + ym2 + ym3 + ... + ymk')
and to distribute this delta over the rest of the points equally. Here ym1' is the value of the modified point after modification and ym1 is the value of the modified point before modification to give delta as the total difference in modification.
However this would not ensure a totally smoothed curve as area near changed points would appear ragged. Does a better solution/algorithm exist for the this problem?
I've used the following approach, though it is a bit OTT.
Consider adding d[i] to y[i], to get s[i], the smoothed value.
We seek to minimise
S = Sum{ 1<=i<N-1 | sqr( s[i+1]-2*s[i]+s[i-1] } + f*Sum{ 0<=i<N | sqr( d[i])}
The first term is a sum of the squares of (an approximate) second derivative of the curve, and the second term penalises moving away from the original. f is a (positive) constant. A little algebra recasts this as
S = sqr( ||A*d - b||)
where the matrix A has a nice structure, and indeed A'*A is penta-diagonal, which means that the normal equations (ie d = Inv(A'*A)*A'*b) can be solved efficiently. Note that d is computed directly, there is no need to initialise it.
Given the solution d to this problem we can compute the solution d^ to the same problem but with the constraint One'*d = 0 (where One is the vector of all ones) like this
d^ = d - (One'*d/Q) * e
e = Inv(A'*A)*One
Q = One'*e
What value to use for f? Well a simple approach is to try out this procedure on sample curves for various fs and pick a value that looks good. Another approach is to pick a estimate of smoothness, for example the rms of the second derivative, and then a value that should attain, and then search for an f that gives that value. As a general rule, the bigger f is the less smooth the smoothed curve will be.
Some motivation for all this. The aim is to find a 'smooth' curve 'close' to a given one. For this we need a measure of smoothness (the first term in S) and a measure of closeness (the second term. Why these measures? Well, each are easy to compute, and each are quadratic in the variables (the d[]); this will mean that the problem becomes an instance of linear least squares for which there are efficient algorithms available. Moreover each term in each sum depends on nearby values of the variables, which will in turn mean that the 'inverse covariance' (A'*A) will have a banded structure and so the least squares problem can be solved efficiently. Why introduce f? Well, if we didn't have f (or set it to 0) we could minimise S by setting d[i] = -y[i], getting a perfectly smooth curve s[] = 0, which has nothing to do with the y curve. On the other hand if f is gigantic, then to minimise s we should concentrate on the second term, and set d[i] = 0, and our 'smoothed' curve is just the original. So it's reasonable to suppose that as we vary f, the corresponding solutions will vary between being very smooth but far from y (small f) and being close to y but a bit rough (large f).
It's often said that the normal equations, whose use I advocate here, are a bad way to solve least squares problems, and this is generally true. However with 'nice' banded systems -- like the one here -- the loss of stability through using the normal equations is not so great, while the gain in speed is so great. I've used this approach to smooth curves with many thousands of points in a reasonable time.
To see what A is, consider the case where we had 4 points. Then our expression for S comes down to:
sqr( s[2] - 2*s[1] + s[0]) + sqr( s[3] - 2*s[2] + s[1]) + f*(d[0]*d[0] + .. + d[3]*d[3]).
If we substitute s[i] = y[i] + d[i] in this we get, for example,
s[2] - 2*s[1] + s[0] = d[2]-2*d[1]+d[0] + y[2]-2*y[1]+y[0]
and so we see that for this to be sqr( ||A*d-b||) we should take
A = ( 1 -2 1 0)
( 0 1 -2 1)
( f 0 0 0)
( 0 f 0 0)
( 0 0 f 0)
( 0 0 0 f)
and
b = ( -(y[2]-2*y[1]+y[0]))
( -(y[3]-2*y[2]+y[1]))
( 0 )
( 0 )
( 0 )
( 0 )
In an implementation, though, you probably wouldn't want to form A and b, as they are only going to be used to form the normal equation terms, A'*A and A'*b. It would be simpler to accumulate these directly.
This is a constrained optimization problem. The functional to be minimized is the integrated difference of the original curve and the modified curve. The constraints are the area under the curve and the new locations of the modified points. It is not easy to write such codes on your own. It is better to use some open source optimization codes, like this one: ool.
what about to keep the same dynamic range?
compute original min0,max0 y-values
smooth y-values
compute new min1,max1 y-values
linear interpolate all values to match original min max y
y=min1+(y-min1)*(max0-min0)/(max1-min1)
that is it
Not sure for the area but this should keep the shape much closer to original one. I got this Idea right now while reading your question and now I face similar problem so I try to code it and try right now anyway +1 for the getting me this Idea :)
You can adapt this and combine with the area
So before this compute the area and apply #1..#4 and after that compute new area. Then multiply all values by old_area/new_area ratio. If you have also negative values and not computing absolute area then you have to handle positive and negative areas separately and find multiplication ration to best fit original area for booth at once.
[edit1] some results for constant dynamic range
As you can see the shape is slightly shifting to the left. Each image is after applying few hundreds smooth operations. I am thinking of subdivision to local min max intervals to improve this ...
[edit2] have finished the filter for mine own purposes
void advanced_smooth(double *p,int n)
{
int i,j,i0,i1;
double a0,a1,b0,b1,dp,w0,w1;
double *p0,*p1,*w; int *q;
if (n<3) return;
p0=new double[n<<2]; if (p0==NULL) return;
p1=p0+n;
w =p1+n;
q =(int*)((double*)(w+n));
// compute original min,max
for (a0=p[0],i=0;i<n;i++) if (a0>p[i]) a0=p[i];
for (a1=p[0],i=0;i<n;i++) if (a1<p[i]) a1=p[i];
for (i=0;i<n;i++) p0[i]=p[i]; // store original values for range restoration
// compute local min max positions to p1[]
dp=0.01*(a1-a0); // min delta treshold
// compute first derivation
p1[0]=0.0; for (i=1;i<n;i++) p1[i]=p[i]-p[i-1];
for (i=1;i<n-1;i++) // eliminate glitches
if (p1[i]*p1[i-1]<0.0)
if (p1[i]*p1[i+1]<0.0)
if (fabs(p1[i])<=dp)
p1[i]=0.5*(p1[i-1]+p1[i+1]);
for (i0=1;i0;) // remove zeros from derivation
for (i0=0,i=0;i<n;i++)
if (fabs(p1[i])<dp)
{
if ((i> 0)&&(fabs(p1[i-1])>=dp)) { i0=1; p1[i]=p1[i-1]; }
else if ((i<n-1)&&(fabs(p1[i+1])>=dp)) { i0=1; p1[i]=p1[i+1]; }
}
// find local min,max to q[]
q[n-2]=0; q[n-1]=0; for (i=1;i<n-1;i++) if (p1[i]*p1[i-1]<0.0) q[i-1]=1; else q[i-1]=0;
for (i=0;i<n;i++) // set sign as +max,-min
if ((q[i])&&(p1[i]<-dp)) q[i]=-q[i]; // this shifts smooth curve to the left !!!
// compute weights
for (i0=0,i1=1;i1<n;i0=i1,i1++) // loop through all local min,max intervals
{
for (;(!q[i1])&&(i1<n-1);i1++); // <i0,i1>
b0=0.5*(p[i0]+p[i1]);
b1=fabs(p[i1]-p[i0]);
if (b1>=1e-6)
for (b1=0.35/b1,i=i0;i<=i1;i++) // compute weights bigger near local min max
w[i]=0.8+(fabs(p[i]-b0)*b1);
}
// smooth few times
for (j=0;j<5;j++)
{
for (i=0;i<n ;i++) p1[i]=p[i]; // store data to avoid shifting by using half filtered data
for (i=1;i<n-1;i++) // FIR smooth filter
{
w0=w[i];
w1=(1.0-w0)*0.5;
p[i]=(w1*p1[i-1])+(w0*p1[i])+(w1*p1[i+1]);
}
for (i=1;i<n-1;i++) // avoid local min,max shifting too much
{
if (q[i]>0) // local max
{
if (p[i]<p[i-1]) p[i]=p[i-1]; // can not be lower then neigbours
if (p[i]<p[i+1]) p[i]=p[i+1];
}
if (q[i]<0) // local min
{
if (p[i]>p[i-1]) p[i]=p[i-1]; // can not be higher then neigbours
if (p[i]>p[i+1]) p[i]=p[i+1];
}
}
}
for (i0=0,i1=1;i1<n;i0=i1,i1++) // loop through all local min,max intervals
{
for (;(!q[i1])&&(i1<n-1);i1++); // <i0,i1>
// restore original local min,max
a0=p0[i0]; b0=p[i0];
a1=p0[i1]; b1=p[i1];
if (a0>a1)
{
dp=a0; a0=a1; a1=dp;
dp=b0; b0=b1; b1=dp;
}
b1-=b0;
if (b1>=1e-6)
for (dp=(a1-a0)/b1,i=i0;i<=i1;i++)
p[i]=a0+((p[i]-b0)*dp);
}
delete[] p0;
}
so p[n] is the input/output data. There are few things that can be tweaked like:
weights computation (constants 0.8 and 0.35 means weights are <0.8,0.8+0.35/2>)
number of smooth passes (now 5 in the for loop)
the bigger the weight the less the filtering 1.0 means no change
The main Idea behind is:
find local extremes
compute weights for smoothing
so near local extremes are almost none change of the output
smooth
repair dynamic range per each interval between all local extremes
[Notes]
I did also try to restore the area but that is incompatible with mine task because it distorts the shape a lot. So if you really need the area then focus on that and not on the shape. The smoothing causes signal to shrink mostly so after area restoration the shape rise on magnitude.
Actual filter state has none markable side shifting of shape (which was the main goal for me). Some images for more bumpy signal (the original filter was extremly poor on this):
As you can see no visible signal shape shifting. The local extremes has tendency to create sharp spikes after very heavy smoothing but that was expected
Hope it helps ...

Multiliteration implementation with inaccurate distance data

I am trying to create an android smartphone application which uses Apples iBeacon technology to determine the current indoor location of itself. I already managed to get all available beacons and calculate the distance to them via the rssi signal.
Currently I face the problem, that I am not able to find any library or implementation of an algorithm, which calculates the estimated location in 2D by using 3 (or more) distances of fixed points with the condition, that these distances are not accurate (which means, that the three "trilateration-circles" do not intersect in one point).
I would be deeply grateful if anybody can post me a link or an implementation of that in any common programming language (Java, C++, Python, PHP, Javascript or whatever). I already read a lot on stackoverflow about that topic, but could not find any answer I were able to convert in code (only some mathematical approaches with matrices and inverting them, calculating with vectors or stuff like that).
EDIT
I thought about an own approach, which works quite well for me, but is not that efficient and scientific. I iterate over every meter (or like in my example 0.1 meter) of the location grid and calculate the possibility of that location to be the actual position of the handset by comparing the distance of that location to all beacons and the distance I calculate with the received rssi signal.
Code example:
public Location trilaterate(ArrayList<Beacon> beacons, double maxX, double maxY)
{
for (double x = 0; x <= maxX; x += .1)
{
for (double y = 0; y <= maxY; y += .1)
{
double currentLocationProbability = 0;
for (Beacon beacon : beacons)
{
// distance difference between calculated distance to beacon transmitter
// (rssi-calculated distance) and current location:
// |sqrt(dX^2 + dY^2) - distanceToTransmitter|
double distanceDifference = Math
.abs(Math.sqrt(Math.pow(beacon.getLocation().x - x, 2)
+ Math.pow(beacon.getLocation().y - y, 2))
- beacon.getCurrentDistanceToTransmitter());
// weight the distance difference with the beacon calculated rssi-distance. The
// smaller the calculated rssi-distance is, the more the distance difference
// will be weighted (it is assumed, that nearer beacons measure the distance
// more accurate)
distanceDifference /= Math.pow(beacon.getCurrentDistanceToTransmitter(), 0.9);
// sum up all weighted distance differences for every beacon in
// "currentLocationProbability"
currentLocationProbability += distanceDifference;
}
addToLocationMap(currentLocationProbability, x, y);
// the previous line is my approach, I create a Set of Locations with the 5 most probable locations in it to estimate the accuracy of the measurement afterwards. If that is not necessary, a simple variable assignment for the most probable location would do the job also
}
}
Location bestLocation = getLocationSet().first().location;
bestLocation.accuracy = calculateLocationAccuracy();
Log.w("TRILATERATION", "Location " + bestLocation + " best with accuracy "
+ bestLocation.accuracy);
return bestLocation;
}
Of course, the downside of that is, that I have on a 300m² floor 30.000 locations I had to iterate over and measure the distance to every single beacon I got a signal from (if that would be 5, I do 150.000 calculations only for determine a single location). That's a lot - so I will let the question open and hope for some further solutions or a good improvement of this existing solution in order to make it more efficient.
Of course it has not to be a Trilateration approach, like the original title of this question was, it is also good to have an algorithm which includes more than three beacons for the location determination (Multilateration).
If the current approach is fine except for being too slow, then you could speed it up by recursively subdividing the plane. This works sort of like finding nearest neighbors in a kd-tree. Suppose that we are given an axis-aligned box and wish to find the approximate best solution in the box. If the box is small enough, then return the center.
Otherwise, divide the box in half, either by x or by y depending on which side is longer. For both halves, compute a bound on the solution quality as follows. Since the objective function is additive, sum lower bounds for each beacon. The lower bound for a beacon is the distance of the circle to the box, times the scaling factor. Recursively find the best solution in the child with the lower lower bound. Examine the other child only if the best solution in the first child is worse than the other child's lower bound.
Most of the implementation work here is the box-to-circle distance computation. Since the box is axis-aligned, we can use interval arithmetic to determine the precise range of distances from box points to the circle center.
P.S.: Math.hypot is a nice function for computing 2D Euclidean distances.
Instead of taking confidence levels of individual beacons into account, I would instead try to assign an overall confidence level for your result after you make the best guess you can with the available data. I don't think the only available metric (perceived power) is a good indication of accuracy. With poor geometry or a misbehaving beacon, you could be trusting poor data highly. It might make better sense to come up with an overall confidence level based on how well the perceived distance to the beacons line up with the calculated point assuming you trust all beacons equally.
I wrote some Python below that comes up with a best guess based on the provided data in the 3-beacon case by calculating the two points of intersection of circles for the first two beacons and then choosing the point that best matches the third. It's meant to get started on the problem and is not a final solution. If beacons don't intersect, it slightly increases the radius of each up until they do meet or a threshold is met. Likewise, it makes sure the third beacon agrees within a settable threshold. For n-beacons, I would pick 3 or 4 of the strongest signals and use those. There are tons of optimizations that could be done and I think this is a trial-by-fire problem due to the unwieldy nature of beaconing.
import math
beacons = [[0.0,0.0,7.0],[0.0,10.0,7.0],[10.0,5.0,16.0]] # x, y, radius
def point_dist(x1,y1,x2,y2):
x = x2-x1
y = y2-y1
return math.sqrt((x*x)+(y*y))
# determines two points of intersection for two circles [x,y,radius]
# returns None if the circles do not intersect
def circle_intersection(beacon1,beacon2):
r1 = beacon1[2]
r2 = beacon2[2]
dist = point_dist(beacon1[0],beacon1[1],beacon2[0],beacon2[1])
heron_root = (dist+r1+r2)*(-dist+r1+r2)*(dist-r1+r2)*(dist+r1-r2)
if ( heron_root > 0 ):
heron = 0.25*math.sqrt(heron_root)
xbase = (0.5)*(beacon1[0]+beacon2[0]) + (0.5)*(beacon2[0]-beacon1[0])*(r1*r1-r2*r2)/(dist*dist)
xdiff = 2*(beacon2[1]-beacon1[1])*heron/(dist*dist)
ybase = (0.5)*(beacon1[1]+beacon2[1]) + (0.5)*(beacon2[1]-beacon1[1])*(r1*r1-r2*r2)/(dist*dist)
ydiff = 2*(beacon2[0]-beacon1[0])*heron/(dist*dist)
return (xbase+xdiff,ybase-ydiff),(xbase-xdiff,ybase+ydiff)
else:
# no intersection, need to pseudo-increase beacon power and try again
return None
# find the two points of intersection between beacon0 and beacon1
# will use beacon2 to determine the better of the two points
failing = True
power_increases = 0
while failing and power_increases < 10:
res = circle_intersection(beacons[0],beacons[1])
if ( res ):
intersection = res
else:
beacons[0][2] *= 1.001
beacons[1][2] *= 1.001
power_increases += 1
continue
failing = False
# make sure the best fit is within x% (10% of the total distance from the 3rd beacon in this case)
# otherwise the results are too far off
THRESHOLD = 0.1
if failing:
print 'Bad Beacon Data (Beacon0 & Beacon1 don\'t intersection after many "power increases")'
else:
# finding best point between beacon1 and beacon2
dist1 = point_dist(beacons[2][0],beacons[2][1],intersection[0][0],intersection[0][1])
dist2 = point_dist(beacons[2][0],beacons[2][1],intersection[1][0],intersection[1][1])
if ( math.fabs(dist1-beacons[2][2]) < math.fabs(dist2-beacons[2][2]) ):
best_point = intersection[0]
best_dist = dist1
else:
best_point = intersection[1]
best_dist = dist2
best_dist_diff = math.fabs(best_dist-beacons[2][2])
if best_dist_diff < THRESHOLD*best_dist:
print best_point
else:
print 'Bad Beacon Data (Beacon2 distance to best point not within threshold)'
If you want to trust closer beacons more, you may want to calculate the intersection points between the two closest beacons and then use the farther beacon to tie-break. Keep in mind that almost anything you do with "confidence levels" for the individual measurements will be a hack at best. Since you will always be working with very bad data, you will defintiely need to loosen up the power_increases limit and threshold percentage.
You have 3 points : A(xA,yA,zA), B(xB,yB,zB) and C(xC,yC,zC), which respectively are approximately at dA, dB and dC from you goal point G(xG,yG,zG).
Let's say cA, cB and cC are the confidence rate ( 0 < cX <= 1 ) of each point.
Basically, you might take something really close to 1, like {0.95,0.97,0.99}.
If you don't know, try different coefficient depending of distance avg. If distance is really big, you're likely to be not very confident about it.
Here is the way i'll do it :
var sum = (cA*dA) + (cB*dB) + (cC*dC);
dA = cA*dA/sum;
dB = cB*dB/sum;
dC = cC*dC/sum;
xG = (xA*dA) + (xB*dB) + (xC*dC);
yG = (yA*dA) + (yB*dB) + (yC*dC);
xG = (zA*dA) + (zB*dB) + (zC*dC);
Basic, and not really smart but will do the job for some simple tasks.
EDIT
You can take any confidence coef you want in [0,inf[, but IMHO, restraining at [0,1] is a good idea to keep a realistic result.

Resources