Spark Principal component analysis(PCA) expected results - apache-spark-mllib

I am working on a project where I have to do some k-means clustering with MLlib from Spark. The problem is that my data has 744 features. I did some research and found out that PCA is what I need. The best part is that Spark has PCA implemented, so I decided to use that.
double[][] array = new double[381][744];
int contor = 0;
for (Vector vectorData : parsedTrainingData.collect()) {
    array[contor] = vectorData.toArray();
    contor++; // increment after the assignment, otherwise row 0 stays empty and the last row goes out of bounds
}
LinkedList<Vector> rowsList = new LinkedList<>();
for (int i = 0; i < array.length; i++) {
Vector currentRow = Vectors.dense(array[i]);
rowsList.add(currentRow);
}
JavaRDD<Vector> rows = jsc.parallelize(rowsList);
// Create a RowMatrix from JavaRDD<Vector>.
RowMatrix mat = new RowMatrix(rows.rdd());
// Compute the top 3 principal components.
Tuple2<Matrix, Vector> pc = mat.computePrincipalComponentsAndExplainedVariance(*param*);
RowMatrix projected = mat.multiply(pc._1);
// $example off$
Vector[] collectPartitions = (Vector[]) projected.rows().collect();
System.out.println("Projected vector of principal component:");
for (Vector vector : collectPartitions) {
System.out.println("\t" + vector);
}
System.out.println("\n Explanend Variance:");
System.out.println(pc._2);
double sum = 0;
for (Double val : pc._2.toArray()) {
sum += val;
}
System.out.println("\n the sum is: " + (double) sum);
About the data I want to apply PCA to: the 744 features represent values (total seconds of active time) collected by sensors in a home, one value per sensor per hour, so it is something like 31 sensors * 24 h, in the format s(sensorNumber)(hour): s10, s11, ..., s123, s20, s21, ..., s223, ..., s3123.
From what I understand, one of the criteria for a reduction that does not lose too much information is that the sum of the explained variance should be greater than 0.9 (90%). After some tests I got these results:
*param*   sum of explained variance
100       0.91
150       0.97
200       0.98
250       0.99
350       1
So from what I understand, it should be safe to reduce my 744-feature vectors to 100-feature vectors. My problem is that these results look too good to be true. I searched for some examples for guidance, but I am still unsure whether what I did is correct. Are these results plausible?
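As a sanity check, the same criterion can be reproduced outside Spark with plain NumPy; the sketch below (data_matrix, target and the function name are just placeholders, not Spark API) computes the eigenvalues of the covariance matrix and reports the smallest k whose cumulative explained-variance fraction reaches the target. Note also that with only 381 rows the sample covariance has rank at most 380, so the cumulative sum necessarily reaches 1 well before k = 744, which is consistent with the table above.

import numpy as np

def smallest_k_for_variance(data_matrix, target=0.9):
    # center the data and compute the eigenvalues of its covariance matrix
    centered = data_matrix - data_matrix.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]  # descending
    cumulative = np.cumsum(eigvals / eigvals.sum())   # cumulative explained variance
    # smallest number of components whose cumulative fraction reaches the target
    return int(np.searchsorted(cumulative, target) + 1), cumulative

k, cumulative = smallest_k_for_variance(np.random.rand(381, 744))  # dummy data
print(k, cumulative[k - 1])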

Related

The sum of my page ranks converge at 0.9

I'm calculating the page ranks of a set of crawled domains, using a damping factor of 0.85. As mentioned in many PageRank papers, the sum of the page ranks should converge to 1. But regardless of how many iterations I run, it seems to converge at 0.90xxx. If I lower the damping factor to 0.5, I obviously move closer to 1.
Is it bad that the page rank sum converges at 0.90, and what would this generally imply?
Yes, it is bad, since it indicates a bug in your implementation. PageRank gives as a result a probability space, and it must sum to 1 as a basic sanity check.
My guess is that you did not handle 'sinks' - nodes that have no outgoing links.
Common ways to handle sinks are:
For a sink vi, regard all edges (vi,vj) as existing, except for vi=vj
Remove sinks from the graph completely (and repeat until convergence)
Link them back to all nodes that link to them (if vi is a sink, then for every edge (vj,vi), add (vi,vj) as well).
Consider the following toy example: 2 pages, A,B. A links to B, B links to nothing. The resulting matrix is:
W=
0 1
0 0
Now, using d=0.85, you get the following equations:
v = 0.85* W'v + 0.15*[1/2,1/2]
v1 = 0.85* (0*v1+0*v2) + 0.15*1/2 = 0.15*1/2 = 0.075
v2 = 0.85*(1*v1 + 0*v2) + 0.15/2 = 0.85*v1 + 0.075 = 0.06375 + 0.075 = 0.13875
And the sum is not 1.
However, if you handle the sinks, in one of the suggested approach (let's examine approach (1)), you will get:
W =
0 1
1 0
You will now get the set of equations:
v = 0.85* W'v + 0.15*[1/2,1/2]
v1 = 0.85* (0*v1+1*v2) + 0.15*1/2 = 0.85v2 + 0.075
v2 = 0.85*(1*v1 + 0*v2) + 0.15/2 = 0.85*v1 + 0.075   (divide by 0.85)  ->  (1/0.85)*v2 = v1 + 0.075/0.85
-> (add the two equations)
(1/0.85)*v2 + v1 = 0.85*v2 + 0.075 + v1 + 0.075/0.85
-> (approximately)
0.326*v2 = 0.163
v2 = 0.5
As you can see, by using this method we got a probability space and now, as expected, the page ranks of all nodes sum to 1.
This became the algorithm:
// data structures
private HashMap<String, Double> pageRanks;
private HashMap<String, Double> oldRanks;
private HashMap<String, Integer> numberOutlinks;
private HashMap<String, HashMap<String, Integer>> inlinks;
private HashSet<String> domainsWithNoOutlinks;
private double N;
// data parsing occluded
public void startAlgorithm() {
int maxIterations = 20;
int itr = 0;
double d = 0.85;
double dp = 0;
double dpp = (1 - d) / N;
// initialize pagerank
for (String s : oldRanks.keySet()) {
oldRanks.put(s, 1.0 / N);
}
System.out.println("Starting page rank iterations..");
while (maxIterations >= itr) {
System.out.println("Iteration: " + itr);
dp = 0;
// teleport probability
for (String domain : domainsWithNoOutlinks) {
dp = dp + d * oldRanks.get(domain) / N;
}
for (String domain : oldRanks.keySet()) {
pageRanks.put(domain, dp + dpp);
for (String inlink : inlinks.get(domain).keySet()) { // for every inlink of domain
pageRanks.put(domain, pageRanks.get(domain) + inlinks.get(domain).get(inlink) * d * oldRanks.get(inlink) / numberOutlinks.get(inlink));
}
}
// update pageranks with new values
for (String domain : pageRanks.keySet()) {
oldRanks.put(domain, pageRanks.get(domain));
}
itr++;
}
}
Where this line is the important one:
pageRanks.put(domain, pageRanks.get(domain) + inlinks.get(domain).get(inlink) * d * oldRanks.get(inlink) / numberOutlinks.get(inlink));
inlinks.get(domain).get(inlink) returns how many times that inlink references the current domain, and we divide the inlink's rank by the number of outgoing links that inlink has (numberOutlinks.get(inlink)). The inlinks.get(domain).get(inlink) factor is what I missed in my algorithm, which is why the sum didn't converge to 1.
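To make the sink handling concrete, here is a minimal Python sketch of the same fixed-point iteration (my own toy graph structure: a dict mapping each node to the set of nodes it links to). Sinks hand their rank to every node, which is what the dp term does in the Java code above, so the ranks keep summing to 1:

def pagerank(graph, d=0.85, iterations=20):
    nodes = list(graph)
    n = len(nodes)
    ranks = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # rank mass currently held by sinks, spread uniformly over all nodes
        sink_mass = sum(ranks[v] for v in nodes if not graph[v])
        new_ranks = {}
        for v in nodes:
            incoming = sum(ranks[u] / len(graph[u]) for u in nodes if v in graph[u])
            new_ranks[v] = (1 - d) / n + d * (incoming + sink_mass / n)
        ranks = new_ranks
    return ranks

print(pagerank({'A': {'B'}, 'B': set()}))  # the toy example above: ranks sum to 1.0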
Read more: http://www.ccs.northeastern.edu/home/daikeshi/notes/PageRank.pdf

How can I find the center of a cluster of data points?

Let's say I plotted the position of a helicopter every day for the past year and came up with the following map:
Any human looking at this would be able to tell me that this helicopter is based out of Chicago.
How can I find the same result in code?
I'm looking for something like this:
$geoCodeArray = array([GET=http://pastebin.com/grVsbgL9]);
function findHome($geoCodeArray) {
// magic
return $geoCode;
}
Ultimately generating something like this:
UPDATE: Sample Dataset
Here's a map with a sample dataset: http://batchgeo.com/map/c3676fe29985f00e1605cd4f86920179
Here's a pastebin of 150 geocodes: http://pastebin.com/grVsbgL9
The above contains 150 geocodes. The first 50 are in a few clusters close to Chicago. The remaining are scattered throughout the country, including some small clusters in New York, Los Angeles, and San Francisco.
I have about a million (seriously) datasets like this that I'll need to iterate through and identify the most likely "home". Your help is greatly appreciated.
UPDATE 2: Airplane switched to Helicopter
The airplane concept was drawing too much attention toward physical airports. The coordinates can be anywhere in the world, not just airports. Let's assume it's a super helicopter not bound by physics, fuel, or anything else. It can land where it wants. ;)
The following solution works even if the points are scattered all over the Earth, by converting latitude and longitude to Cartesian coordinates. It does a kind of KDE (kernel density estimation), but in a first pass the sum of kernels is evaluated only at the data points. The kernel should be chosen to fit the problem. In the code below it is what I could jokingly/presumptuously call a Trossian, i.e., 2 - d²/h² for d ≤ h and h²/d² for d > h (where d is the Euclidean distance and h is the "bandwidth" $global_kernel_radius), but it could also be a Gaussian (e^(-d²/2h²)), an Epanechnikov kernel (1 - d²/h² for d < h, 0 otherwise), or another kernel. An optional second pass refines the search locally, either by summing an independent kernel on a local grid, or by calculating the centroid, in both cases in a surrounding defined by $local_grid_radius.
In essence, each point sums all the points it has around (including itself), weighing them more if they are closer (by the bell curve), and also weighing them by the optional weight array $w_arr. The winner is the point with the maximum sum. Once the winner has been found, the "home" we are looking for can be found by repeating the same process locally around the winner (using another bell curve), or it can be estimated to be the "center of mass" of all points within a given radius from the winner, where the radius can be zero.
The algorithm must be adapted to the problem by choosing the appropriate kernels, by choosing how to refine the search locally, and by tuning the parameters. For the example dataset, the Trossian kernel for the first pass and the Epanechnikov kernel for the second pass, with all 3 radii set to 30 mi and a grid step of 1 mi could be a good starting point, but only if the two sub-clusters of Chicago should be seen as one big cluster. Otherwise smaller radii must be chosen.
function find_home($lat_arr, $lng_arr, $global_kernel_radius,
$local_kernel_radius,
$local_grid_radius, // 0 for no 2nd pass
$local_grid_step, // 0 for centroid
$units='mi',
$w_arr=null)
{
// for lat,lng <-> x,y,z see http://en.wikipedia.org/wiki/Geodetic_datum
// for K and h see http://en.wikipedia.org/wiki/Kernel_density_estimation
switch (strtolower($units)) {
/* */case 'nm' :
/*or*/case 'nmi': $m_divisor = 1852;
break;case 'mi': $m_divisor = 1609.344;
break;case 'km': $m_divisor = 1000;
break;case 'm': $m_divisor = 1;
break;default: return false;
}
$a = 6378137 / $m_divisor; // Earth semi-major axis (WGS84)
$e2 = 6.69437999014E-3; // First eccentricity squared (WGS84)
$lat_lng_count = count($lat_arr);
if ( !$w_arr) {
$w_arr = array_fill(0, $lat_lng_count, 1.0);
}
$x_arr = array();
$y_arr = array();
$z_arr = array();
$rad = M_PI / 180;
$one_e2 = 1 - $e2;
for ($i = 0; $i < $lat_lng_count; $i++) {
$lat = $lat_arr[$i];
$lng = $lng_arr[$i];
$sin_lat = sin($lat * $rad);
$sin_lng = sin($lng * $rad);
$cos_lat = cos($lat * $rad);
$cos_lng = cos($lng * $rad);
// height = 0 (!)
$N = $a / sqrt(1 - $e2 * $sin_lat * $sin_lat);
$x_arr[$i] = $N * $cos_lat * $cos_lng;
$y_arr[$i] = $N * $cos_lat * $sin_lng;
$z_arr[$i] = $N * $one_e2 * $sin_lat;
}
$h = $global_kernel_radius;
$h2 = $h * $h;
$max_K_sum = -1;
$max_K_sum_i = -1;
for ($i = 0; $i < $lat_lng_count; $i++) {
$xi = $x_arr[$i];
$yi = $y_arr[$i];
$zi = $z_arr[$i];
$K_sum = 0;
for ($j = 0; $j < $lat_lng_count; $j++) {
$dx = $xi - $x_arr[$j];
$dy = $yi - $y_arr[$j];
$dz = $zi - $z_arr[$j];
$d2 = $dx * $dx + $dy * $dy + $dz * $dz;
$K_sum += $w_arr[$j] * ($d2 <= $h2 ? (2 - $d2 / $h2) : $h2 / $d2); // Trossian ;-)
// $K_sum += $w_arr[$j] * exp(-0.5 * $d2 / $h2); // Gaussian
}
if ($max_K_sum < $K_sum) {
$max_K_sum = $K_sum;
$max_K_sum_i = $i;
}
}
$winner_x = $x_arr [$max_K_sum_i];
$winner_y = $y_arr [$max_K_sum_i];
$winner_z = $z_arr [$max_K_sum_i];
$winner_lat = $lat_arr[$max_K_sum_i];
$winner_lng = $lng_arr[$max_K_sum_i];
$sin_winner_lat = sin($winner_lat * $rad);
$cos_winner_lat = cos($winner_lat * $rad);
$sin_winner_lng = sin($winner_lng * $rad);
$cos_winner_lng = cos($winner_lng * $rad);
$east_x = -$local_grid_step * $sin_winner_lng;
$east_y = $local_grid_step * $cos_winner_lng;
$east_z = 0;
$north_x = -$local_grid_step * $sin_winner_lat * $cos_winner_lng;
$north_y = -$local_grid_step * $sin_winner_lat * $sin_winner_lng;
$north_z = $local_grid_step * $cos_winner_lat;
if ($local_grid_radius > 0 && $local_grid_step > 0) {
$r = intval($local_grid_radius / $local_grid_step);
$r2 = $r * $r;
$h = $local_kernel_radius;
$h2 = $h * $h;
$max_L_sum = -1;
$max_L_sum_i = -1;
$max_L_sum_j = -1;
for ($i = -$r; $i <= $r; $i++) {
$winner_east_x = $winner_x + $i * $east_x;
$winner_east_y = $winner_y + $i * $east_y;
$winner_east_z = $winner_z + $i * $east_z;
$j_max = intval(sqrt($r2 - $i * $i));
for ($j = -$j_max; $j <= $j_max; $j++) {
$x = $winner_east_x + $j * $north_x;
$y = $winner_east_y + $j * $north_y;
$z = $winner_east_z + $j * $north_z;
$L_sum = 0;
for ($k = 0; $k < $lat_lng_count; $k++) {
$dx = $x - $x_arr[$k];
$dy = $y - $y_arr[$k];
$dz = $z - $z_arr[$k];
$d2 = $dx * $dx + $dy * $dy + $dz * $dz;
if ($d2 < $h2) {
$L_sum += $w_arr[$k] * ($h2 - $d2); // Epanechnikov
}
}
if ($max_L_sum < $L_sum) {
$max_L_sum = $L_sum;
$max_L_sum_i = $i;
$max_L_sum_j = $j;
}
}
}
$x = $winner_x + $max_L_sum_i * $east_x + $max_L_sum_j * $north_x;
$y = $winner_y + $max_L_sum_i * $east_y + $max_L_sum_j * $north_y;
$z = $winner_z + $max_L_sum_i * $east_z + $max_L_sum_j * $north_z;
} else if ($local_grid_radius > 0) {
$r = $local_grid_radius;
$r2 = $r * $r;
$wx_sum = 0;
$wy_sum = 0;
$wz_sum = 0;
$w_sum = 0;
for ($k = 0; $k < $lat_lng_count; $k++) {
$xk = $x_arr[$k];
$yk = $y_arr[$k];
$zk = $z_arr[$k];
$dx = $winner_x - $xk;
$dy = $winner_y - $yk;
$dz = $winner_z - $zk;
$d2 = $dx * $dx + $dy * $dy + $dz * $dz;
if ($d2 <= $r2) {
$wk = $w_arr[$k];
$wx_sum += $wk * $xk;
$wy_sum += $wk * $yk;
$wz_sum += $wk * $zk;
$w_sum += $wk;
}
}
$x = $wx_sum / $w_sum;
$y = $wy_sum / $w_sum;
$z = $wz_sum / $w_sum;
$max_L_sum_i = false;
$max_L_sum_j = false;
} else {
return array($winner_lat, $winner_lng, $max_K_sum_i, false, false);
}
$deg = 180 / M_PI;
$a2 = $a * $a;
$e4 = $e2 * $e2;
$p = sqrt($x * $x + $y * $y);
$zeta = (1 - $e2) * $z * $z / $a2;
$rho = ($p * $p / $a2 + $zeta - $e4) / 6;
$rho3 = $rho * $rho * $rho;
$s = $e4 * $zeta * $p * $p / (4 * $a2);
$t = pow($s + $rho3 + sqrt($s * ($s + 2 * $rho3)), 1 / 3);
$u = $rho + $t + $rho * $rho / $t;
$v = sqrt($u * $u + $e4 * $zeta);
$w = $e2 * ($u + $v - $zeta) / (2 * $v);
$k = 1 + $e2 * (sqrt($u + $v + $w * $w) + $w) / ($u + $v);
$lat = atan($k * $z / $p) * $deg;
$lng = atan2($y, $x) * $deg;
return array($lat, $lng, $max_K_sum_i, $max_L_sum_i, $max_L_sum_j);
}
The fact that distances are Euclidean and not great-circle should have negligible effects for the task at hand. Calculating great-circle distances would be much more cumbersome, and would cause only the weight of very far points to be significantly lower - but these points already have a very low weight. In principle, the same effect could be achieved by a different kernel. Kernels that have a complete cut-off beyond some distance, like the Epanechnikov kernel, don't have this problem at all (in practice).
The conversion between lat,lng and x,y,z for the WGS84 datum is given exactly (although without guarantee of numerical stability) more as a reference than because of a true need. If the height is to be taken into account, or if a faster back-conversion is needed, please refer to the Wikipedia article.
The Epanechnikov kernel, besides being "more local" than the Gaussian and Trossian kernels, has the advantage of being the fastest for the second loop, which is O(ng), where g is the number of points of the local grid, and can also be employed in the first loop, which is O(n²), if n is big.
This can be solved by finding a jeopardy surface. See Rossmo's Formula.
This is the predator problem. Given a set of geographically-located carcasses, where is the lair of the predator? Rossmo's formula solves this problem.
Find the point with the largest density estimate.
Should be pretty much straightforward. Use a kernel radius that roughly covers a large airport in diameter. A 2D Gaussian or Epanechnikov kernel should be fine.
http://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation
This is similar to computing a Heat Map: http://en.wikipedia.org/wiki/Heat_map
and then finding the brightest spot there. Except it computes the brightness right away.
For fun I read a 1% sample of the Geocoordinates of DBpedia (i.e. Wikipedia) into ELKI, projected it into 3D space and enabled the density estimation overlay (hidden in the visualizer's scatterplot menu). You can see there is a hotspot on Europe, and to a lesser extent in the US. The hotspot in Europe is Poland, I believe. Last I checked, someone apparently had created a Wikipedia article with Geocoordinates for pretty much any town in Poland. The ELKI visualizer, unfortunately, doesn't allow you to zoom in, rotate, or reduce the kernel bandwidth to visually find the densest point. But it's straightforward to implement yourself; you probably also don't need to go into 3D space, but can just use latitudes and longitudes.
Kernel Density Estimation should be available in tons of applications. The one in R is probably much more powerful. I just recently discovered this heatmap in ELKI, so I knew how to quickly access it. See e.g. http://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html for a related R function.
On your data, in R, try for example:
library(KernSmooth)
smoothScatter(data, nbin=512, bandwidth=c(.25,.25))
this should show a strong preference for Chicago.
library(KernSmooth)
dens=bkde2D(data, gridsize=c(512, 512), bandwidth=c(.25,.25))
contour(dens$x1, dens$x2, dens$fhat)
maxpos = which(dens$fhat == max(dens$fhat), arr.ind=TRUE)
c(dens$x1[maxpos[1]], dens$x2[maxpos[2]])
yields [1] 42.14697 -88.09508, which is less than 10 miles from Chicago airport.
To get better coordinates try:
rerunning on a 20x20 miles area around the estimated coordinates
a non-binned KDE in that area
better bandwidth selection with dpik
higher grid resolution
In astrophysics we use the so-called "half mass radius". Given a distribution and its center, the half mass radius is the minimum radius of a circle that contains half of the points of your distribution.
This quantity is a characteristic length of a distribution of points.
If you want the home of the helicopter to be where the points are maximally concentrated, then it is the point that has the minimum half mass radius!
My algorithm is as follows: for each point you compute this half mass radius, centering the distribution on the current point. The "home" of the helicopter will be the point with the minimum half mass radius.
I've implemented it and the computed center is 42.149994 -88.133698 (which is in Chicago).
I've also used 0.2 of the total mass instead of the 0.5 (half) usually used in astrophysics.
This is my algorithm (in Python) that finds the home of the helicopter:
import math
import numpy
def inside(points,center,radius):
    ids=(((points[:,0]-center[0])**2.+(points[:,1]-center[1])**2.)<=radius**2.)
    return points[ids]
points = numpy.loadtxt(open('points.txt'),comments='#')
npoints=len(points)
deltar=0.1
idcenter=None
halfrmin=None
for i in xrange(0,npoints):
    center=points[i]
    radius=0.
    stayHere=True
    while stayHere:
        radius=radius+deltar
        ninside=len(inside(points,center,radius))
        #print 'point',i,'r',radius,'in',ninside,'center',center
        if(ninside>=npoints*0.2):
            if(halfrmin==None or radius<halfrmin):
                halfrmin=radius
                idcenter=i
                print 'point',i,halfrmin,idcenter,points[idcenter]
            stayHere=False
#print halfrmin,idcenter
print points[idcenter]
You can use DBSCAN for that task.
DBSCAN is a density based clustering with a notion of noise. You need two parameters:
First the number of points a cluster should have at minimum "minpoints".
And second a neighbourhood parameter called "epsilon" that sets a distance threshold to the surrounding points that should be included in your cluster.
The whole algorithm works like this:
Start with an arbitrary point in your set that hasn't been visited yet
Retrieve all points from its epsilon neighbourhood and mark them as visited
If you have found enough points in this neighbourhood (> minpoints), start a new cluster and assign those points to it. Now recurse into step 2 again for every point in this cluster.
If you haven't, declare this point as noise
Go on like this until you've visited all points
It is really simple to implement and there are lots of frameworks that support this algorithm already. To find the mean of your cluster, you can simply take the mean of all the assigned points from its neighbourhood.
However, unlike the method that @TylerDurden proposes, this needs parameterization - you need to find hand-tuned parameters that fit your problem.
In your case, you can try to set minpoints to 10% of your total points if the plane is likely to spend 10% of the tracked time at an airport. The density parameter epsilon depends on the resolution of your geographic sensor and the distance metric you use - I would suggest the haversine distance for geographic data.
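For reference, a minimal sketch of this approach with scikit-learn's DBSCAN (which supports the haversine metric on coordinates given in radians); the eps, min_samples and toy coordinates below are illustrative assumptions, not tuned values:

import numpy as np
from sklearn.cluster import DBSCAN

coords = np.array([[41.89, -87.67], [41.88, -87.63], [42.05, -88.00],
                   [40.78, -73.96]])             # toy lat/lng points
kms_per_radian = 6371.0088
db = DBSCAN(eps=25.0 / kms_per_radian,           # 25 km neighbourhood, in radians
            min_samples=2,                       # ~10% of the points in a real run
            metric='haversine').fit(np.radians(coords))
labels = db.labels_                              # -1 marks noise
densest = max(set(labels) - {-1}, key=lambda l: (labels == l).sum())
print(coords[labels == densest].mean(axis=0))    # mean of the densest cluster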
How about dividing the map into many zones and then finding the center of the planes in the zone with the most planes? The algorithm will be something like this:
set Zones[40]
foreach Plane in Planes
Zones[GetZone(Plane.position)].Add(Plane)
set MaxZone = Zones[0]
foreach Zone in Zones
if MaxZone.Length() < Zone.Length()
MaxZone = Zone
set Center
foreach Plane in MaxZone
Center.X += Plane.X
Center.Y += Plane.Y
Center.X /= MaxZone.Length
Center.Y /= MaxZone.Length
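A rough Python rendering of this pseudocode, assuming the points are (lat, lng) pairs; the cell size is an arbitrary choice:

from collections import defaultdict

def busiest_zone_center(points, cell_size=0.5):
    zones = defaultdict(list)
    for lat, lng in points:
        # GetZone: the integer grid cell the point falls into
        zones[(int(lat // cell_size), int(lng // cell_size))].append((lat, lng))
    max_zone = max(zones.values(), key=len)          # zone with the most points
    n = len(max_zone)
    return (sum(p[0] for p in max_zone) / n,         # centre of that zone's points
            sum(p[1] for p in max_zone) / n)

print(busiest_zone_center([(41.89, -87.67), (41.88, -87.63), (40.78, -73.96)]))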
All I have on this machine is an old compiler so I made an ASCII version of this. It "draws" (in ASCII) a map - dots are points, X is where the real source is, G is where the guessed source is. If the two overlap, only X is shown.
Examples (DIFFICULTY 1.5 and 3 respectively):
The points are generated by picking a random point as the source, then randomly distributing points, making them more likely to be closer to the source.
DIFFICULTY is a floating point constant that regulates the initial point generation - how much more likely the points are to be closer to the source - if it is 1 or less, the program should be able to guess the exact source, or very close. At 2.5, it should still be pretty decent. At 4+, it will start to guess worse, but I think it still guesses better than a human would.
It could be optimized by using binary search over X, then Y - this would make the guess worse, but would be much, much faster. Or by starting with larger blocks, then splitting the best block further (or the best block and the 8 surrounding it). For a higher resolution system, one of these would be necessary. This is quite a naive approach, though, but it seems to work well in an 80x24 system. :D
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#define Y 24
#define X 80
#define DIFFICULTY 1 // Try different values...
static int point[Y][X];
double dist(int x1, int y1, int x2, int y2)
{
return sqrt((y1 - y2)*(y1 - y2) + (x1 - x2)*(x1 - x2));
}
main()
{
srand(time(0));
int y = rand()%Y;
int x = rand()%X;
// Generate points
for (int i = 0; i < Y; i++)
{
for (int j = 0; j < X; j++)
{
double u = DIFFICULTY * pow(dist(x, y, j, i), 1.0 / DIFFICULTY);
if ((int)u == 0)
u = 1;
point[i][j] = !(rand()%(int)u);
}
}
// Find best source
int maxX = -1;
int maxY = -1;
double maxScore = -1;
for (int cy = 0; cy < Y; cy++)
{
for (int cx = 0; cx < X; cx++)
{
double score = 0;
for (int i = 0; i < Y; i++)
{
for (int j = 0; j < X; j++)
{
if (point[i][j] == 1)
{
double d = dist(cx, cy, j, i);
if (d == 0)
d = 0.5;
score += 1000 / d;
}
}
}
if (score > maxScore || maxScore == -1)
{
maxScore = score;
maxX = cx;
maxY = cy;
}
}
}
// Print out results
for (int i = 0; i < Y; i++)
{
for (int j = 0; j < X; j++)
{
if (i == y && j == x)
printf("X");
else if (i == maxY && j == maxX)
printf("G");
else if (point[i][j] == 0)
printf(" ");
else if (point[i][j] == 1)
printf(".");
}
}
printf("Distance from real source: %f", dist(maxX, maxY, x, y));
scanf("%d", 0);
}
Virtual Earth has a very good explanation of how you can do it relatively quickly. They have also provided code examples. Please have a look at http://soulsolutions.com.au/Articles/ClusteringVirtualEarthPart1.aspx
A simple mixture model seems to work pretty well for this problem.
In general, to get a point that minimizes the distance to all other points in a dataset, you can just take the mean. In this case, you want to find a point that minimizes the distance from a subset of concentrated points. If you postulate that a point can either come from the concentrated set of points of interest or from a diffuse set of background points, then this gives a mixture model.
I have included some Python code below. The concentrated area is modeled by a high-precision normal distribution and the background points are modeled by either a low-precision normal distribution or a uniform distribution over a bounding box on the dataset (there is a line of code that can be commented out to switch between these options). Also, mixture models can be somewhat unstable, so running the EM algorithm a few times with random initial conditions and choosing the run with the highest log-likelihood gives better results.
If you are actually looking at airplanes, then adding some sort of time dependent dynamics will probably improve your ability to infer the home base immensely.
I would also be wary of Rossmo's formula because it includes some pretty strong assumptions about crime distributions.
#the dataset
sdata='''41.892694,-87.670898
42.056048,-88.000488
41.941744,-88.000488
42.072361,-88.209229
42.091933,-87.982635
42.149994,-88.133698
42.171371,-88.286133
42.23241,-88.305359
42.196811,-88.099365
42.189689,-88.188629
42.17646,-88.173523
42.180531,-88.209229
42.18168,-88.187943
42.185496,-88.166656
42.170485,-88.150864
42.150634,-88.140564
42.156743,-88.123741
42.118555,-88.105545
42.121356,-88.112755
42.115499,-88.102112
42.119319,-88.112411
42.118046,-88.110695
42.117791,-88.109322
42.182189,-88.182449
42.194145,-88.183823
42.189057,-88.196182
42.186513,-88.200645
42.180917,-88.197899
42.178881,-88.192062
41.881656,-87.6297
41.875521,-87.6297
41.87872,-87.636566
41.872073,-87.62661
41.868239,-87.634506
41.86875,-87.624893
41.883065,-87.62352
41.881021,-87.619743
41.879998,-87.620087
41.8915,-87.633476
41.875163,-87.620773
41.879125,-87.62558
41.862763,-87.608757
41.858672,-87.607899
41.865192,-87.615795
41.87005,-87.62043
42.073061,-87.973022
42.317241,-88.187256
42.272546,-88.088379
42.244086,-87.890625
42.044512,-88.28064
39.754977,-86.154785
39.754977,-89.648437
41.043369,-85.12207
43.050074,-89.406738
43.082179,-87.912598
42.7281,-84.572754
39.974226,-83.056641
38.888093,-77.01416
39.923692,-75.168457
40.794318,-73.959961
40.877439,-73.146973
40.611086,-73.740234
40.627764,-73.234863
41.784881,-71.367187
42.371988,-70.993652
35.224587,-80.793457
36.753465,-76.069336
39.263361,-76.530762
25.737127,-80.222168
26.644083,-81.958008
30.50223,-87.275391
29.436309,-98.525391
30.217839,-97.844238
29.742023,-95.361328
31.500409,-97.163086
32.691688,-96.877441
32.691688,-97.404785
35.095754,-106.655273
33.425138,-112.104492
32.873244,-117.114258
33.973545,-118.256836
33.681497,-117.905273
33.622982,-117.734985
33.741828,-118.092041
33.64585,-117.861328
33.700707,-118.015137
33.801189,-118.251343
33.513132,-117.740479
32.777244,-117.235107
32.707939,-117.158203
32.703317,-117.268066
32.610821,-117.075806
34.419726,-119.701538
37.750358,-122.431641
37.50673,-122.387695
37.174817,-121.904297
37.157307,-122.321777
37.271492,-122.033386
37.435238,-122.217407
37.687794,-122.415161
37.542025,-122.299805
37.609506,-122.398682
37.544203,-122.0224
37.422151,-122.13501
37.395971,-122.080078
45.485651,-122.739258
47.719463,-122.255859
47.303913,-122.607422
45.176713,-122.167969
39.566,-104.985352
39.124201,-94.614258
35.454518,-97.426758
38.473482,-90.175781
45.021612,-93.251953
42.417881,-83.056641
41.371141,-81.782227
33.791132,-84.331055
30.252543,-90.439453
37.421248,-122.174835
37.47794,-122.181702
37.510628,-122.254486
37.56943,-122.346497
37.593373,-122.384949
37.620571,-122.489319
36.984249,-122.03064
36.553017,-121.893311
36.654442,-121.772461
36.482381,-121.876831
36.15042,-121.651611
36.274518,-121.838379
37.817717,-119.569702
39.31657,-120.140991
38.933041,-119.992676
39.13785,-119.778442
39.108019,-120.239868
38.586082,-121.503296
38.723354,-121.289062
37.878444,-119.437866
37.782994,-119.470825
37.973771,-119.685059
39.001377,-120.17395
40.709076,-73.948975
40.846346,-73.861084
40.780452,-73.959961
40.778829,-73.958931
40.78372,-73.966012
40.783688,-73.965325
40.783692,-73.965615
40.783675,-73.965741
40.783835,-73.965873
'''
import StringIO
import numpy as np
import re
import matplotlib.pyplot as plt
def lp(l):
    return map(lambda m: float(m.group()),re.finditer('[^, \n]+',l))
data=np.array(map(lp,StringIO.StringIO(sdata)))
xmn=np.min(data[:,0])
xmx=np.max(data[:,0])
ymn=np.min(data[:,1])
ymx=np.max(data[:,1])
# area of the point set bounding box
area=(xmx-xmn)*(ymx-ymn)
M_ITER=100 #maximum number of iterations
THRESH=1e-10 # stopping threshold
def em(x):
    print '\nSTART EM'
    mlst=[]
    mu0=np.mean( data , 0 ) # the sample mean of the data - use this as the mean of the low-precision gaussian
    # the mean of the high-precision Gaussian - this is what we are looking for
    mu=np.random.rand( 2 )*np.array([xmx-xmn,ymx-ymn])+np.array([xmn,ymn])
    lam_lo=.001 # precision of the low-precision Gaussian
    lam_hi=.1 # precision of the high-precision Gaussian
    prz=np.random.rand( 1 ) # probability of choosing the high-precision Gaussian mixture component
    for i in xrange(M_ITER):
        mlst.append(mu[:])
        l_hi=np.log(prz)+np.log(lam_hi)-.5*lam_hi*np.sum((x-mu)**2,1)
        #low-precision normal background distribution
        l_lo=np.log(1.0-prz)+np.log(lam_lo)-.5*lam_lo*np.sum((x-mu0)**2,1)
        #uncomment for the uniform background distribution
        #l_lo=np.log(1.0-prz)-np.log(area)
        #expectation step
        zs=1.0/(1.0+np.exp(l_lo-l_hi))
        #compute bound on the likelihood
        lh=np.sum(zs*l_hi+(1.0-zs)*l_lo)
        print i,lh
        #maximization step
        mu=np.sum(zs[:,None]*x,0)/np.sum(zs) #mean
        lam_hi=np.sum(zs)/np.sum(zs*.5*np.sum((x-mu)**2,1)) #precision
        prz=1.0/(1.0+np.sum(1.0-zs)/np.sum(zs)) #mixture component probability
        try:
            if np.abs((lh-old_lh)/lh)<THRESH:
                break
        except:
            pass
        old_lh=lh
    mlst.append(mu[:])
    return lh,lam_hi,mlst
if __name__=='__main__':
    #repeat the EM algorithm a number of times and get the run with the best log likelihood
    mx_prm=em(data)
    for i in xrange(4):
        prm=em(data)
        if prm[0]>mx_prm[0]:
            mx_prm=prm
        print prm[0]
    print mx_prm[0]
    lh,lam_hi,mlst=mx_prm
    mu=mlst[-1]
    print 'best loglikelihood:', lh
    #print 'final precision value:', lam_hi
    print 'point of interest:', mu
    plt.plot(data[:,0],data[:,1],'.b')
    for m in mlst:
        plt.plot(m[0],m[1],'xr')
    plt.show()
You can easily adapt Rossmo's formula, quoted by Tyler Durden, to your case with a few simple notes:
The formula :
This formula gives something close to a probability of presence of the base of operations for a predator or a serial killer. In your case, it could give the probability of the base being at a certain point. I'll explain later how to use it. You can write it this way:
Proba(base at point A) = Sum{over all spots} ( Phi/(dist^f) + (1-Phi)*B^(g-f)/(2*B-dist)^g )
Using Euclidean distance
You want a Euclidean distance and not the Manhattan one, because an airplane or helicopter is not bound to roads/streets. So using Euclidean distance is the correct way if you are tracking an airplane and not a serial killer. "dist" in the formula is therefore the Euclidean distance between the spot you are testing and the spot considered.
Taking a reasonable variable B
Variable B was used to represent the rule "a reasonably smart killer will not kill his neighbor". In your case the same rule applies, because no one uses an airplane/roflcopter to get to the next street corner. We can suppose that the minimal journey is, for example, 10 km, or anything reasonable when applied to your case.
Exponential factor f
Factor f is used to add a weight to the distance. For example, if all the spots are in a small area you could want a big factor f, because the probability of the airport/base/HQ decreases fast if all your data points are in the same sector. g works in a similar way; it lets you choose the size of the "base is unlikely to be just next to the spot" area.
Factor Phi:
Again, this factor has to be determined using your knowledge of the problem. It lets you weigh the two effects, "the base is close to the spots" and "I won't use the plane to travel 5 m", against each other. If, for example, you think the second one is almost irrelevant, you can set Phi to 0.95 (0 < Phi < 1); if both are interesting, Phi will be around 0.5.
How to implement it as something useful:
First you want to divide your map into little squares: meshing the map (just like invisal did); the smaller the squares, the more accurate the result (in general). Then use the formula to find the most probable location. In fact, the mesh is just an array of all possible locations. (If you want to be accurate, you can increase the number of possible spots, but it will require more computational time, and PHP is not well known for its amazing speed.)
Algorithm :
//define all the factors you need(B , f , g , phi)
for(i=0..mesh_size) // computing the probability of presence for each square of the mesh
{
P(i)=0;
geocode squarePosition;//GeoCode of the square's center
for(j=0..geocodearray_size)//sum on all the known spots
{
dist=Distance(geocodearray[j],squarePosition);//small function returning distance between two geocodes
P(i)+=(Phi/pow(dist,f))+(1-Phi)*pow(B,g-f)/pow(2*B-dist,g);
}
}
return geocode corresponding to max(P(i))
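A rough Python version of the same loop; B, f, g, Phi, the grid of cell centres and the guard on the (2*B-dist)^g term (which is only well-defined for dist < 2*B) are my own assumptions to tune:

import math

def euclid(p, q):
    # Euclidean distance, as argued above (aircraft are not bound to streets)
    return math.hypot(p[0] - q[0], p[1] - q[1])

def most_probable_cell(spots, grid_cells, B=0.1, f=1.2, g=1.2, phi=0.5):
    best_cell, best_score = None, float('-inf')
    for cell in grid_cells:                      # cell = centre of one mesh square
        score = 0.0
        for spot in spots:
            d = max(euclid(cell, spot), 1e-9)    # avoid division by zero
            score += phi / d ** f
            if d < 2 * B:                        # buffer-zone term, only where defined
                score += (1 - phi) * B ** (g - f) / (2 * B - d) ** g
        if score > best_score:
            best_cell, best_score = cell, score
    return best_cell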
Hope that it will help you
First I would like to express my fondness for your way of illustrating and explaining the problem.
If I were in your shoes, I would go for a density-based algorithm like DBSCAN.
After clustering the areas and removing the noise points, a few areas (choices) will remain. Then I'd take the cluster with the highest density of points, calculate the average point, and find the nearest real point to it. Done, found the place! :)
Regards,
Why not something like this:
For each point, calculate its distance from all other points and sum the total.
The point with the smallest sum is your center.
Maybe sum isn't the best metric to use. Possibly the point with the most "small distances"?
Sum over the distances. Take the point with the smallest summed distance.
function () {
for i in points P:
S[i] = 0
for j in points P:
S[i] += distance(P[i], P[j])
return min(S);
}
You can take a minimum spanning tree and remove the longest edges. The smaller trees give you the centroid to look up. The algorithm name is single-link k-clustering. There is a post here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering.
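A sketch of that idea with SciPy; the edge-length cutoff and the toy points are illustrative assumptions, and plain lat/lng degrees are treated as Euclidean coordinates for simplicity:

import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_home(points, cutoff=1.0):                # points: N x 2 array of lat/lng
    mst = minimum_spanning_tree(distance_matrix(points, points)).toarray()
    mst[mst > cutoff] = 0                        # remove the longest edges
    n_comp, labels = connected_components(mst, directed=False)
    biggest = np.bincount(labels).argmax()       # largest remaining tree
    return points[labels == biggest].mean(axis=0)

pts = np.array([[41.89, -87.67], [41.88, -87.63], [42.05, -88.0], [40.78, -73.96]])
print(mst_home(pts))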

Grouping relatively close values using LINQ or loop

I guess this would be more maths than C#. I've got an array of float values, where most values belong to one of the few tightly packed ranges. Here's an example (Lower Limit=0,Upper Limit=612):
3.4,5.0,6.1,
144.0,144.14,145.0,147.0,
273.77,275.19,279.0,
399.4,399.91,401.45,
533.26,537.0,538.9
This is a single array of 16 values; I've just separated them to show those "groups". What I need to do is to somehow group them, either using LINQ, or a manual loop, or whatever, so that those close values fall into a single group.
A simple math operation like dividing by 10 (or 100) won't work, because 399 would fall in a different group than 401 (4th group in the above example). Another approach would be to create a histogram of some kind, but I'm looking for something simple here. Any help would be greatly appreciated.
Just another idea: clustering using GroupBy with a custom comparer.
var numbers = new float[]
{
3.4f, 5.0f, 6.1f, 144.0f, 144.14f, 145.0f,
147.0f, 273.77f, 275.19f, 279.0f, 399.4f, 399.91f, 401.45f,
49, 50, 51,
533.26f, 537.0f, 538.9f
};
foreach (var group in numbers.GroupBy(i => i, new ClosenessComparer(4f)))
Console.WriteLine(string.Join(", ", group));
And the custom ClosenessComparer:
public class ClosenessComparer : IEqualityComparer<float>
{
private readonly float delta;
public ClosenessComparer(float delta)
{
this.delta = delta;
}
public bool Equals(float x, float y)
{
return Math.Abs((x + y)/ 2f - y) < delta;
}
public int GetHashCode(float obj)
{
return 0;
}
}
And the output:
1: 3,4 5 6,1
2: 144 144,14 145 147
3: 273,77 275,19 279
4: 399,4 399,91 401,45
6: 49 50 51
5: 533,26 537 538,9
Here's a method that groups elements if they're within a certain delta (4 by default) of the previous value:
IEnumerable<IEnumerable<double>> GetClusters(IEnumerable<double> data,
double delta = 4.0)
{
var cluster = new List<double>();
foreach (var item in data.OrderBy(x=>x))
{
if (cluster.Count > 0 && item > cluster[cluster.Count - 1] + delta)
{
yield return cluster;
cluster = new List<double>();
}
cluster.Add(item);
}
if (cluster.Count > 0)
yield return cluster;
}
You can tweak the algorithm by changing what you use for cluster[cluster.Count - 1] + delta. For example, you might use
cluster[0] + delta - delta from first element in the cluster
cluster.Average() + delta - delta from the mean of the cluster so far
cluster[cluster.Count / 2] + delta - delta from the median of the cluster so far

How do you calculate the average of a set of circular data?

I want to calculate the average of a set of circular data. For example, I might have several samples from the reading of a compass. The problem of course is how to deal with the wraparound. The same algorithm might be useful for a clockface.
The actual question is more complicated - what do statistics mean on a sphere or in an algebraic space which "wraps around", e.g. the additive group mod n. The answer may not be unique, e.g. the average of 359 degrees and 1 degree could be 0 degrees or 180, but statistically 0 looks better.
This is a real programming problem for me and I'm trying to make it not look like just a Math problem.
Compute unit vectors from the angles and take the angle of their average.
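A direct reading of this answer as a few lines of Python (degrees assumed, purely as illustration):

import math

def mean_angle(degrees):
    x = sum(math.cos(math.radians(d)) for d in degrees)
    y = sum(math.sin(math.radians(d)) for d in degrees)
    return math.degrees(math.atan2(y, x)) % 360

print(mean_angle([350, 20]))   # ~5, the direction "between" 350 and 20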
This question is examined in detail in the book "Statistics On Spheres", Geoffrey S. Watson, University of Arkansas Lecture Notes in the Mathematical Sciences, 1983, John Wiley & Sons, Inc., as mentioned at http://catless.ncl.ac.uk/Risks/7.44.html#subj4 by Bruce Karsh.
A good way to estimate an average angle, A, from a set of angle measurements a[i], 0 <= i < N, is:

                 sum_{i=1..N} sin(a[i])
A = arctangent  ------------------------
                 sum_{i=1..N} cos(a[i])
The method given by starblue is computationally equivalent, but his reasons are clearer, it is probably programmatically more efficient, and it also works well in the zero case, so kudos to him.
The subject is now explored in more detail on Wikipedia, and with other uses, like fractional parts.
I see the problem - for example, if you have a 45° angle and a 315° angle, the "natural" average would be 180°, but the value you want is actually 0°.
I think Starblue is onto something. Just calculate the (x, y) cartesian coordinates for each angle, and add those resulting vectors together. The angular offset of the final vector should be your required result.
x = y = 0
foreach angle {
x += cos(angle)
y += sin(angle)
}
average_angle = atan2(y, x)
I'm ignoring for now that a compass heading starts at north, and goes clockwise, whereas "normal" cartesian coordinates start with zero along the X axis, and then go anti-clockwise. The maths should work out the same way regardless.
FOR THE SPECIAL CASE OF TWO ANGLES:
The answer ( (a + b) mod 360 ) / 2 is WRONG. For angles 350 and 2, the closest point is 356, not 176.
The unit vector and trig solutions may be too expensive.
What I've got from a little tinkering is:
diff = ( ( a - b + 180 + 360 ) mod 360 ) - 180
angle = (360 + b + ( diff / 2 ) ) mod 360
0, 180 -> 90 (two answers for this: this equation takes the clockwise answer from a)
180, 0 -> 270 (see above)
180, 1 -> 90.5
1, 180 -> 90.5
20, 350 -> 5
350, 20 -> 5 (all following examples reverse properly too)
10, 20 -> 15
350, 2 -> 356
359, 0 -> 359.5
180, 180 -> 180
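The same two formulas, checked against the examples above in Python (a and b in degrees; this is just a verification sketch):

def average_pair(a, b):
    diff = ((a - b + 180 + 360) % 360) - 180
    return (360 + b + diff / 2) % 360

for a, b in [(0, 180), (20, 350), (350, 2), (10, 20), (359, 0)]:
    print(a, b, average_pair(a, b))   # 90.0, 5.0, 356.0, 15.0, 359.5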
ackb is right that these vector-based solutions cannot be considered true averages of angles; they are only an average of the unit vector counterparts. However, ackb's suggested solution does not appear to be mathematically sound.
The following is a solution that is mathematically derived from the goal of minimising (angle[i] - avgAngle)^2 (where the difference is corrected if necessary), which makes it a true arithmetic mean of the angles.
First, we need to look at exactly which cases the difference between angles is different from the difference between their normal number counterparts. Consider angles x and y: if y >= x - 180 and y <= x + 180, then we can use the difference (x-y) directly. Otherwise, if the first condition is not met, then we must use (y+360) in the calculation instead of y. Correspondingly, if the second condition is not met, then we must use (y-360) instead of y. Since the equation of the curve we are minimising only changes at the points where these inequalities change from true to false or vice versa, we can separate the full [0,360) range into a set of segments, separated by these points. Then, we only need to find the minimum of each of these segments, and then the smallest of these segment minima, which gives the average.
Here's an image demonstrating where the problems occur in calculating angle differences. If x lies in the gray area then there will be a problem.
To minimise a variable, depending on the curve, we can take the derivative of what we want to minimise and then we find the turning point (which is where the derivative = 0).
Here we will apply the idea of minimise the squared difference to derive the common arithmetic mean formula: sum(a[i])/n. The curve y = sum((a[i]-x)^2) can be minimised in this way:
y = sum((a[i]-x)^2)
= sum(a[i]^2 - 2*a[i]*x + x^2)
= sum(a[i]^2) - 2*x*sum(a[i]) + n*x^2
dy/dx = -2*sum(a[i]) + 2*n*x
for dy/dx = 0:
-2*sum(a[i]) + 2*n*x = 0
-> n*x = sum(a[i])
-> x = sum(a[i])/n
Now applying it to curves with our adjusted differences:
b = subset of a where the correct (angular) difference a[i]-x
c = subset of a where the correct (angular) difference (a[i]-360)-x
cn = size of c
d = subset of a where the correct (angular) difference (a[i]+360)-x
dn = size of d
y = sum((b[i]-x)^2) + sum(((c[i]-360)-x)^2) + sum(((d[i]+360)-x)^2)
= sum(b[i]^2 - 2*b[i]*x + x^2)
+ sum((c[i]-360)^2 - 2*(c[i]-360)*x + x^2)
+ sum((d[i]+360)^2 - 2*(d[i]+360)*x + x^2)
= sum(b[i]^2) - 2*x*sum(b[i])
+ sum((c[i]-360)^2) - 2*x*(sum(c[i]) - 360*cn)
+ sum((d[i]+360)^2) - 2*x*(sum(d[i]) + 360*dn)
+ n*x^2
= sum(b[i]^2) + sum((c[i]-360)^2) + sum((d[i]+360)^2)
- 2*x*(sum(b[i]) + sum(c[i]) + sum(d[i]))
- 2*x*(360*dn - 360*cn)
+ n*x^2
= sum(b[i]^2) + sum((c[i]-360)^2) + sum((d[i]+360)^2)
- 2*x*sum(a[i])
- 2*x*360*(dn - cn)
+ n*x^2
dy/dx = 2*n*x - 2*sum(a[i]) - 2*360*(dn - cn)
for dy/dx = 0:
2*n*x - 2*sum(a[i]) - 2*360*(dn - cn) = 0
n*x = sum(a[i]) + 360*(dn - cn)
x = (sum(a[i]) + 360*(dn - cn))/n
This alone is not quite enough to get the minimum. It works for normal values, whose domain is unbounded, so the result will definitely lie within the set's range and is therefore valid; here, however, we need the minimum within a range (defined by the segment). If the minimum is less than our segment's lower bound then the minimum of that segment must be at the lower bound (because quadratic curves only have 1 turning point), and if the minimum is greater than our segment's upper bound then the segment's minimum is at the upper bound. After we have the minimum for each segment, we simply find the one that has the lowest value for what we're minimising (sum((b[i]-x)^2) + sum(((c[i]-360)-x)^2) + sum(((d[i]+360)-x)^2)).
Here is an image of the curve, which shows how it changes at the points where x=(a[i]+180)%360. The data set in question is {65,92,230,320,250}.
Here is an implementation of the algorithm in Java, including some optimisations; its complexity is O(n log n). It can be reduced to O(n) if you replace the comparison-based sort with a non-comparison-based sort, such as radix sort.
static double varnc(double _mean, int _n, double _sumX, double _sumSqrX)
{
return _mean*(_n*_mean - 2*_sumX) + _sumSqrX;
}
//with lower correction
static double varlc(double _mean, int _n, double _sumX, double _sumSqrX, int _nc, double _sumC)
{
return _mean*(_n*_mean - 2*_sumX) + _sumSqrX
+ 2*360*_sumC + _nc*(-2*360*_mean + 360*360);
}
//with upper correction
static double varuc(double _mean, int _n, double _sumX, double _sumSqrX, int _nc, double _sumC)
{
return _mean*(_n*_mean - 2*_sumX) + _sumSqrX
- 2*360*_sumC + _nc*(2*360*_mean + 360*360);
}
static double[] averageAngles(double[] _angles)
{
double sumAngles;
double sumSqrAngles;
double[] lowerAngles;
double[] upperAngles;
{
List<Double> lowerAngles_ = new LinkedList<Double>();
List<Double> upperAngles_ = new LinkedList<Double>();
sumAngles = 0;
sumSqrAngles = 0;
for(double angle : _angles)
{
sumAngles += angle;
sumSqrAngles += angle*angle;
if(angle < 180)
lowerAngles_.add(angle);
else if(angle > 180)
upperAngles_.add(angle);
}
Collections.sort(lowerAngles_);
Collections.sort(upperAngles_,Collections.reverseOrder());
lowerAngles = new double[lowerAngles_.size()];
Iterator<Double> lowerAnglesIter = lowerAngles_.iterator();
for(int i = 0; i < lowerAngles_.size(); i++)
lowerAngles[i] = lowerAnglesIter.next();
upperAngles = new double[upperAngles_.size()];
Iterator<Double> upperAnglesIter = upperAngles_.iterator();
for(int i = 0; i < upperAngles_.size(); i++)
upperAngles[i] = upperAnglesIter.next();
}
List<Double> averageAngles = new LinkedList<Double>();
averageAngles.add(180d);
double variance = varnc(180,_angles.length,sumAngles,sumSqrAngles);
double lowerBound = 180;
double sumLC = 0;
for(int i = 0; i < lowerAngles.length; i++)
{
//get average for a segment based on minimum
double testAverageAngle = (sumAngles + 360*i)/_angles.length;
//minimum is outside segment range (therefore not directly relevant)
//since it is greater than lowerAngles[i], the minimum for the segment
//must lie on the boundary lowerAngles[i]
if(testAverageAngle > lowerAngles[i]+180)
testAverageAngle = lowerAngles[i];
if(testAverageAngle > lowerBound)
{
double testVariance = varlc(testAverageAngle,_angles.length,sumAngles,sumSqrAngles,i,sumLC);
if(testVariance < variance)
{
averageAngles.clear();
averageAngles.add(testAverageAngle);
variance = testVariance;
}
else if(testVariance == variance)
averageAngles.add(testAverageAngle);
}
lowerBound = lowerAngles[i];
sumLC += lowerAngles[i];
}
//Test last segment
{
//get average for a segment based on minimum
double testAverageAngle = (sumAngles + 360*lowerAngles.length)/_angles.length;
//minimum is inside segment range
//we will test average 0 (360) later
if(testAverageAngle < 360 && testAverageAngle > lowerBound)
{
double testVariance = varlc(testAverageAngle,_angles.length,sumAngles,sumSqrAngles,lowerAngles.length,sumLC);
if(testVariance < variance)
{
averageAngles.clear();
averageAngles.add(testAverageAngle);
variance = testVariance;
}
else if(testVariance == variance)
averageAngles.add(testAverageAngle);
}
}
double upperBound = 180;
double sumUC = 0;
for(int i = 0; i < upperAngles.length; i++)
{
//get average for a segment based on minimum
double testAverageAngle = (sumAngles - 360*i)/_angles.length;
//minimum is outside segment range (therefore not directly relevant)
//since it is greater than lowerAngles[i], the minimum for the segment
//must lie on the boundary lowerAngles[i]
if(testAverageAngle < upperAngles[i]-180)
testAverageAngle = upperAngles[i];
if(testAverageAngle < upperBound)
{
double testVariance = varuc(testAverageAngle,_angles.length,sumAngles,sumSqrAngles,i,sumUC);
if(testVariance < variance)
{
averageAngles.clear();
averageAngles.add(testAverageAngle);
variance = testVariance;
}
else if(testVariance == variance)
averageAngles.add(testAverageAngle);
}
upperBound = upperAngles[i];
sumUC += upperBound;
}
//Test last segment
{
//get average for a segment based on minimum
double testAverageAngle = (sumAngles - 360*upperAngles.length)/_angles.length;
//minimum is inside segment range
//we test average 0 (360) now
if(testAverageAngle < 0)
testAverageAngle = 0;
if(testAverageAngle < upperBound)
{
double testVariance = varuc(testAverageAngle,_angles.length,sumAngles,sumSqrAngles,upperAngles.length,sumUC);
if(testVariance < variance)
{
averageAngles.clear();
averageAngles.add(testAverageAngle);
variance = testVariance;
}
else if(testVariance == variance)
averageAngles.add(testAverageAngle);
}
}
double[] averageAngles_ = new double[averageAngles.size()];
Iterator<Double> averageAnglesIter = averageAngles.iterator();
for(int i = 0; i < averageAngles_.length; i++)
averageAngles_[i] = averageAnglesIter.next();
return averageAngles_;
}
The arithmetic mean of a set of angles may not agree with your intuitive idea of what the average should be. For example, the arithmetic mean of the set {179,179,0,181,181} is 216 (and 144). The answer you immediately think of is probably 180, however it is well known that the arithmetic mean is heavily affected by edge values. You should also remember that angles are not vectors, as appealing as that may seem when dealing with angles sometimes.
This algorithm does of course also apply to all quantities that obey modular arithmetic (with minimal adjustment), such as the time of day.
I would also like to stress that even though this is a true average of angles, unlike the vector solutions, that does not necessarily mean it is the solution you should be using; the average of the corresponding unit vectors may well be the value you actually should be using.
You have to define average more accurately. For the specific case of two angles, I can think of two different scenarios:
The "true" average, i.e. (a + b) / 2 % 360.
The angle that points "between" the two others while staying in the same semicircle, e.g. for 355 and 5, this would be 0, not 180. To do this, you need to check if the difference between the two angles is larger than 180 or not. If so, increment the smaller angle by 360 before using the above formula.
I don't see how the second alternative can be generalized for the case of more than two angles, though.
I'd like to share a method I used with a microcontroller which did not have floating point or trigonometry capabilities. I still needed to "average" 10 raw bearing readings in order to smooth out variations.
Check whether the first bearing is in the range 270-360 or 0-90 degrees (the northern two quadrants)
If it is, rotate this and all subsequent readings by 180 degrees, keeping all values in the range 0 <= bearing < 360. Otherwise take the readings as they come.
Once 10 readings have been taken calculate the numerical average assuming that there has been no wraparound
If the 180 degree rotation had been in effect then rotate the calculated average by 180 degrees to get back to a "true" bearing.
It's not ideal; it can break. I got away with it in this case because the device only rotates very slowly. I'll put it out there in case anyone else finds themselves working under similar restrictions.
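A quick sketch of that trick in Python (the function name and toy input are mine, mirroring the steps above):

def average_bearings_no_trig(bearings):              # e.g. 10 raw readings, 0-359
    rotate = bearings[0] >= 270 or bearings[0] < 90  # first reading in a northern quadrant
    if rotate:
        bearings = [(b + 180) % 360 for b in bearings]
    avg = sum(bearings) / len(bearings)              # plain numerical average
    return (avg + 180) % 360 if rotate else avg

print(average_bearings_no_trig([359, 1, 2, 358]))    # 0.0 rather than 180.0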
Like all averages, the answer depends upon the choice of metric. For a given metric M, the average of some angles a_k in [-pi,pi] for k in [1,N] is that angle a_M which minimizes the sum of squared distances d^2_M(a_M,a_k). For a weighted mean, one simply includes in the sum the weights w_k (such that sum_k w_k = 1). That is,
a_M = arg min_x sum_k w_k d^2_M(x,a_k)
Two common choices of metric are the Frobenius and the Riemann metrics. For the Frobenius metric, a direct formula exists that corresponds to the usual notion of average bearing in circular statistics. See "Means and Averaging in the Group of Rotations", Maher Moakher, SIAM Journal on Matrix Analysis and Applications, Volume 24, Issue 1, 2002, for details.
http://link.aip.org/link/?SJMAEL/24/1/1
Here's a function for GNU Octave 3.2.4 that does the computation:
function ma=meanangleoct(a,w,hp,ntype)
% ma=meanangleoct(a,w,hp,ntype) returns the average of angles a
% given weights w and half-period hp using norm type ntype
% Ref: "Means and Averaging in the Group of Rotations",
% Maher Moakher, SIAM Journal on Matrix Analysis and Applications,
% Volume 24, Issue 1, 2002.
if (nargin<1) | (nargin>4), help meanangleoct, return, end
if isempty(a), error('no measurement angles'), end
la=length(a); sa=size(a);
if prod(sa)~=la, error('a must be a vector'); end
if (nargin<4) || isempty(ntype), ntype='F'; end
if ~sum(ntype==['F' 'R']), error('ntype must be F or R'), end
if (nargin<3) || isempty(hp), hp=pi; end
if (nargin<2) || isempty(w), w=1/la+0*a; end
lw=length(w); sw=size(w);
if prod(sw)~=lw, error('w must be a vector'); end
if lw~=la, error('length of w must equal length of a'), end
if sum(w)~=1, warning('resumming weights to unity'), w=w/sum(w); end
a=a(:); % make column vector
w=w(:); % make column vector
a=mod(a+hp,2*hp)-hp; % reduce to central period
a=a/hp*pi; % scale to half period pi
z=exp(i*a); % U(1) elements
% % NOTA BENE:
% % fminbnd can get hung up near the boundaries.
% % If that happens, shift the input angles a
% % forward by one half period, then shift the
% % resulting mean ma back by one half period.
% X=fminbnd(@meritfcn,-pi,pi,[],z,w,ntype);
% % seems to work better
x0=imag(log(sum(w.*z)));
X=fminbnd(@meritfcn,x0-pi,x0+pi,[],z,w,ntype);
% X=real(X); % truncate some roundoff
X=mod(X+pi,2*pi)-pi; % reduce to central period
ma=X*hp/pi; % scale to half period hp
return
%%%%%%
function d2=meritfcn(x,z,w,ntype)
x=exp(i*x);
if ntype=='F'
y=x-z;
else % ntype=='R'
y=log(x'*z);
end
d2=y'*diag(w)*y;
return
%%%%%%
% % test script
% %
% % NOTA BENE: meanangleoct(a,[],[],'R') will equal mean(a)
% % when all abs(a-b) < pi/2 for some value b
% %
% na=3, a=sort(mod(randn(1,na)+1,2)-1)*pi;
% da=diff([a a(1)+2*pi]); [mda,ndx]=min(da);
% a=circshift(a,[0 2-ndx]) % so that diff(a(2:3)) is smallest
% A=exp(i*a), B1=expm(a(1)*[0 -1; 1 0]),
% B2=expm(a(2)*[0 -1; 1 0]), B3=expm(a(3)*[0 -1; 1 0]),
% masimpl=[angle(mean(exp(i*a))) mean(a)]
% Bsum=B1+B2+B3; BmeanF=Bsum/sqrt(det(Bsum));
% % this expression for BmeanR should be correct for ordering of a above
% BmeanR=B1*(B1'*B2*(B2'*B3)^(1/2))^(2/3);
% mamtrx=real([[0 1]*logm(BmeanF)*[1 0]' [0 1]*logm(BmeanR)*[1 0]'])
% manorm=[meanangleoct(a,[],[],'F') meanangleoct(a,[],[],'R')]
% polar(a,1+0*a,'b*'), axis square, hold on
% polar(manorm(1),1,'rs'), polar(manorm(2),1,'gd'), hold off
% Meanangleoct Version 1.0
% Copyright (C) 2011 Alphawave Research, robjohnson@alphawaveresearch.com
% Released under GNU GPLv3 -- see file COPYING for more info.
%
% Meanangle is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, either version 3 of the License, or (at
% your option) any later version.
%
% Meanangle is distributed in the hope that it will be useful, but
% WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with this program. If not, see `http://www.gnu.org/licenses/'.
In python, with angles between [-180, 180)
def add_angles(a, b):
    return (a + b + 180) % 360 - 180

def average_angles(a, b):
    return add_angles(a, add_angles(-a, b)/2)
Details:
For the average of two angles there are two averages 180° apart, but we may want the closer average.
Visually, the average of the blue (b) and green (a) yields the teal point:
Angles 'wrap around' (e.g. 355 + 10 = 5), but standard arithmetic will ignore this branch point.
However, if angle b is opposite to the branch point, then (a + b)/2 gives the closest average: the teal point.
For any two angles, we can rotate the problem so one of the angles is opposite to the branch point, perform standard averaging, then rotate back.
Here is the full solution:
(The input is an array of bearings in degrees (0-360).)
public static int getAvarageBearing(int[] arr)
{
double sunSin = 0;
double sunCos = 0;
int counter = 0;
for (double bearing : arr)
{
bearing *= Math.PI/180;
sunSin += Math.sin(bearing);
sunCos += Math.cos(bearing);
counter++;
}
int avBearing = INVALID_ANGLE_VALUE;
if (counter > 0)
{
double bearingInRad = Math.atan2(sunSin/counter, sunCos/counter);
avBearing = (int) (bearingInRad*180f/Math.PI);
if (avBearing<0)
avBearing += 360;
}
return avBearing;
}
In English:
Make a second data set with all angles shifted by 180.
Take the variance of both data sets.
Take the average of the data set with the smallest variance.
If this average is from the shifted set then shift the answer again by 180.
In python:
A #numpy NX1 array of angles
if np.var(A) < np.var((A-180)%360):
    average = np.average(A)
else:
    average = (np.average((A-180)%360)+180)%360
If anyone is looking for a JavaScript solution to this, I've translated the example given in the wikipedia page Mean of circular quantities (which was also referred to in Nick's answer) into JavaScript/NodeJS code, with help from the mathjs library.
If your angles are in degrees:
const maths = require('mathjs');
getAverageDegrees = (array) => {
let arrayLength = array.length;
let sinTotal = 0;
let cosTotal = 0;
for (let i = 0; i < arrayLength; i++) {
sinTotal += maths.sin(array[i] * (maths.pi / 180));
cosTotal += maths.cos(array[i] * (maths.pi / 180));
}
let averageDirection = maths.atan(sinTotal / cosTotal) * (180 / maths.pi);
if (cosTotal < 0) {
averageDirection += 180;
} else if (sinTotal < 0) {
averageDirection += 360;
}
return averageDirection;
}
This solution worked really well for me in order to find the average direction from a set of compass directions. I've tested this on a large range of directional data (0-360 degrees) and it seems very robust.
Alternatively, if your angles are in radians:
const maths = require('mathjs');
getAverageRadians = (array) => {
let arrayLength = array.length;
let sinTotal = 0;
let cosTotal = 0;
for (let i = 0; i < arrayLength; i++) {
sinTotal += maths.sin(array[i]);
cosTotal += maths.cos(array[i]);
}
let averageDirection = maths.atan(sinTotal / cosTotal);
if (cosTotal < 0) {
averageDirection += maths.pi; // quadrant correction must be in radians here, not degrees
} else if (sinTotal < 0) {
averageDirection += 2 * maths.pi;
}
return averageDirection;
}
Hopefully these solutions are helpful to someone facing a similar programming challenge to me.
I would go the vector way using complex numbers. My example is in Python, which has built-in complex numbers:
import cmath  # complex math

def average_angle(list_of_angles):
    # make a new list of vectors
    vectors = [cmath.rect(1, angle)  # length 1 for each vector
               for angle in list_of_angles]
    vector_sum = sum(vectors)
    # no need to average, we don't care for the modulus
    return cmath.phase(vector_sum)
Note that Python does not need to build a temporary new list of vectors, all of the above can be done in one step; I just chose this way to approximate pseudo-code applicable to other languages too.
Here's a complete C++ solution:
#include <vector>
#include <cmath>
double dAngleAvg(const std::vector<double>& angles) {
auto avgSin = double{ 0.0 };
auto avgCos = double{ 0.0 };
static const auto conv = double{ 0.01745329251994 }; // PI / 180
static const auto i_conv = double{ 57.2957795130823 }; // 180 / PI
for (const auto& theta : angles) {
avgSin += sin(theta*conv);
avgCos += cos(theta*conv);
}
avgSin /= (double)angles.size();
avgCos /= (double)angles.size();
auto ret = double{ 90.0 - atan2(avgCos, avgSin) * i_conv };
if (ret<0.0) ret += 360.0;
return fmod(ret, 360.0);
}
It takes the angles in the form of a vector of doubles, and returns the average simply as a double. The angles must be in degrees, and of course the average is in degrees as well.
Based on Alnitak's answer, I've written a Java method for calculating the average of multiple angles:
If your angles are in radians:
public static double averageAngleRadians(double... angles) {
double x = 0;
double y = 0;
for (double a : angles) {
x += Math.cos(a);
y += Math.sin(a);
}
return Math.atan2(y, x);
}
If your angles are in degrees:
public static double averageAngleDegrees(double... angles) {
double x = 0;
double y = 0;
for (double a : angles) {
x += Math.cos(Math.toRadians(a));
y += Math.sin(Math.toRadians(a));
}
return Math.toDegrees(Math.atan2(y, x));
}
Here's an idea: build the average iteratively by always calculating the average of the angles that are closest together, keeping a weight.
Another idea: find the largest gap between the given angles. Find the point that bisects it, and then pick the opposite point on the circle as the reference zero to calculate the average from.
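A rough Python sketch of the second (largest-gap) idea, in degrees; the function name and the exact choice of cut point are my own reading of the suggestion:
def mean_via_largest_gap(degrees):
    # Sort the angles and find the largest gap between neighbours on the circle.
    a = sorted(d % 360 for d in degrees)
    n = len(a)
    gaps = [(a[(i + 1) % n] - a[i]) % 360 for i in range(n)]
    k = max(range(n), key=gaps.__getitem__)
    # Cut the circle at the bisector of the largest gap, so the point opposite
    # that bisector acts as the reference zero for ordinary averaging.
    cut = (a[k] + gaps[k] / 2) % 360
    shifted = [(d - cut) % 360 for d in a]
    return (sum(shifted) / n + cut) % 360
For instance, mean_via_largest_gap([350, 10]) returns 0 and mean_via_largest_gap([10, 20, 30]) returns 20.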
Let's represent these angles with points on the circumference of the circle.
Can we assume that all these points fall on the same half of the circle? (Otherwise, there is no obvious way to define the "average angle". Think of two points on the diameter, e.g. 0 deg and 180 deg --- is the average 90 deg or 270 deg? What happens when we have 3 or more evenly spread out points?)
With this assumption, we pick an arbitrary point on that semicircle as the "origin", and measure the given set of angles with respect to this origin (call this the "relative angle"). Note that the relative angle has an absolute value strictly less than 180 deg. Finally, take the mean of these relative angles to get the desired average angle (relative to our origin of course).
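A minimal Python sketch of that relative-angle averaging, in degrees, under the stated semicircle assumption (using the first angle as the arbitrary origin is my own choice):
def mean_on_semicircle(degrees):
    # Measure every angle relative to an arbitrary origin on the semicircle,
    # folding each relative angle into [-180, 180).
    origin = degrees[0]
    relative = [((d - origin + 180) % 360) - 180 for d in degrees]
    return (origin + sum(relative) / len(relative)) % 360
For example, mean_on_semicircle([350, 10]) gives 0.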
There's no single "right answer". I recommend reading the book,
K. V. Mardia and P. E. Jupp, "Directional Statistics", (Wiley, 1999),
for a thorough analysis.
(Just want to share my viewpoint from Estimation Theory or Statistical Inference)
Nimble's attempt is to get the MMSE^ estimate of a set of angles, but it is only one of many choices for finding an "averaged" direction; one can also find an MMAE^ estimate, or some other estimate, to be the "averaged" direction. It depends on the metric you use to quantify the error of a direction, or, more generally in estimation theory, on the definition of the cost function.
^ MMSE/MMAE corresponds to minimum mean squared/absolute error.
ackb said "The average angle phi_avg should have the property that sum_i|phi_avg-phi_i|^2 becomes minimal...they average something, but not angles"
---- you quantify errors in the mean-squared sense, which is one of the most common ways, but not the only way. The answer favored by most people here (i.e., sum the unit vectors and take the angle of the result) is actually one of the reasonable solutions. It is (provably) the ML estimator that serves as the "averaged" direction we want, if the directions of the vectors are modeled by a von Mises distribution. This distribution is not fancy; it is just a 2D Gaussian conditioned to lie on the unit circle. See Eqn. (2.179) in Bishop's book "Pattern Recognition and Machine Learning". Again, it is by no means the only or best way to represent the "average" direction; however, it is quite a reasonable one that has both good theoretical justification and a simple implementation.
Nimble said "ackb is right that these vector based solutions cannot be considered true averages of angles, they are only an average of the unit vector counterparts"
---- this is not true. The "unit vector counterparts" carry exactly the directional information of the angles. An angle is a quantity that ignores the length of a vector, and a unit vector simply adds the extra (irrelevant) information that the length is 1. You could define your "unit" vector to be of length 2; it would not really matter.
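As a purely illustrative check of that claim (my own example, not part of the answer): for any fixed concentration kappa, the von Mises log-likelihood is, up to constants, kappa * sum(cos(theta_i - mu)), so a brute-force maximisation over mu should land on the vector-sum angle.
import numpy as np

angles = np.radians([350, 10, 30])  # an arbitrary example data set

# Closed-form ML mean direction under a von Mises model:
mu_hat = np.arctan2(np.sin(angles).sum(), np.cos(angles).sum())

# Brute-force maximisation of sum(cos(theta_i - mu)) over a dense grid of mu.
grid = np.linspace(-np.pi, np.pi, 100001)
loglik = np.cos(angles[:, None] - grid[None, :]).sum(axis=0)
print(np.degrees(mu_hat), np.degrees(grid[loglik.argmax()]))  # both approximately 10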
You can see a solution and a little explanation in the following link, for ANY programming language:
https://rosettacode.org/wiki/Averages/Mean_angle
For instance, C++ solution:
#include<math.h>
#include<stdio.h>
double
meanAngle (double *angles, int size)
{
double y_part = 0, x_part = 0;
int i;
for (i = 0; i < size; i++)
{
x_part += cos (angles[i] * M_PI / 180);
y_part += sin (angles[i] * M_PI / 180);
}
return atan2 (y_part / size, x_part / size) * 180 / M_PI;
}
int
main ()
{
double angleSet1[] = { 350, 10 };
double angleSet2[] = { 90, 180, 270, 360};
double angleSet3[] = { 10, 20, 30};
printf ("\nMean Angle for 1st set : %lf degrees", meanAngle (angleSet1, 2));
printf ("\nMean Angle for 2nd set : %lf degrees", meanAngle (angleSet2, 4));
printf ("\nMean Angle for 3rd set : %lf degrees\n", meanAngle (angleSet3, 3));
return 0;
}
Output:
Mean Angle for 1st set : -0.000000 degrees
Mean Angle for 2nd set : -90.000000 degrees
Mean Angle for 3rd set : 20.000000 degrees
Or Matlab solution:
function u = mean_angle(phi)
u = angle(mean(exp(i*pi*phi/180)))*180/pi;
end
mean_angle([350, 10])
ans = -2.7452e-14
mean_angle([90, 180, 270, 360])
ans = -90
mean_angle([10, 20, 30])
ans = 20.000
Here is a completely arithmetic solution using moving averages and taking care to normalize values. It is fast and delivers correct answers if all angles are on one side of the circle (within 180° of each other).
It is mathematically equivalent to adding the offset which shifts the values into the range (0, 180), calculating the mean, and then subtracting the offset.
The comments describe the range a specific value can take on at any given time.
// angles have to be in the range [0, 360) and within 180° of each other.
// n >= 1
// returns the circular average of the angles in the range [0, 360).
double meanAngle(double* angles, int n)
{
double average = angles[0];
for (int i = 1; i<n; i++)
{
// average: (0, 360)
double diff = angles[i]-average;
// diff: (-540, 540)
if (diff < -180)
diff += 360;
else if (diff >= 180)
diff -= 360;
// diff: (-180, 180)
average += diff/(i+1);
// average: (-180, 540)
if (average < 0)
average += 360;
else if (average >= 360)
average -= 360;
// average: (0, 360)
}
return average;
}
Well, I'm hugely late to the party but thought I'd add my 2 cents' worth, as I couldn't really find any definitive answer. In the end I implemented the following Java version of the Mitsuta method which, I hope, provides a simple and robust solution. Particularly as the standard deviation provides both a measure of dispersion and, if sd == 90, an indication that the input angles result in an ambiguous mean.
EDIT: Actually, I realised that my original implementation can be simplified even further; in fact it is worryingly simple considering all the discussion and trigonometry going on in the other answers.
/**
* The Mitsuta method
*
* @param angles Angles from 0 - 360
* @return double array containing
* 0 - mean
* 1 - sd: a measure of angular dispersion, in the range [0..360], similar to standard deviation.
* Note if sd == 90 then the mean can also be its inverse, i.e. 360 == 0, 300 == 60.
*/
public static double[] getAngleStatsMitsuta(double... angles) {
double sum = 0;
double sumsq = 0;
for (double angle : angles) {
if (angle >= 180) {
angle -= 360;
}
sum += angle;
sumsq += angle * angle;
}
double mean = sum / angles.length;
return new double[]{mean <= 0 ? 360 + mean: mean, Math.sqrt(sumsq / angles.length - (mean * mean))};
}
... and for all you (Java) geeks out there, you can use the above approach to get the mean angle in one line.
double mean = Arrays.stream(angles).map(angle -> angle < 180 ? angle : (angle - 360)).sum() / angles.length;
Alnitak has the right solution. Nick Fortescue's solution is functionally the same.
For the special case of where
( sum(x_component) = 0.0 && sum(y_component) = 0.0 ) // e.g. 2 angles of 10. and 190. degrees ea.
use 0.0 degrees as the sum
Computationally, you have to test for this case since atan2(0., 0.) is mathematically undefined (and, depending on the implementation, may return 0 or raise a domain error).
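A minimal Python sketch of that guard (the tolerance value is my own choice; the 0.0 fallback follows the suggestion above):
import math

def mean_angle_deg(degrees, eps=1e-12):
    x = sum(math.cos(math.radians(d)) for d in degrees)
    y = sum(math.sin(math.radians(d)) for d in degrees)
    # Both components (near) zero, e.g. 10 and 190 degrees: the mean
    # direction is undefined, so fall back to 0.0 instead of calling atan2.
    if abs(x) < eps and abs(y) < eps:
        return 0.0
    return math.degrees(math.atan2(y, x)) % 360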
The average angle phi_avg should have the property that sum_i |phi_avg - phi_i|^2 becomes minimal, where the difference has to be taken in [-Pi, Pi) (because it might be shorter to go the other way around!). This is easily achieved by normalizing all input values to [0, 2*Pi), keeping a running average phi_run, and normalizing each difference phi_i - phi_run to [-Pi, Pi) (by adding or subtracting 2*Pi). Most suggestions above do something else that does not have that minimal property, i.e., they average something, but not angles.
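A rough Python sketch of that running-average scheme, in radians (the incremental update is my own interpretation of "keeping a running average phi_run"):
import math

TWO_PI = 2 * math.pi

def running_mean_angle(angles):
    # Normalise inputs to [0, 2*pi), keep a running average, and fold each
    # difference phi_i - phi_run into [-pi, pi) before updating the average.
    avg = angles[0] % TWO_PI
    for i, a in enumerate(angles[1:], start=2):
        diff = ((a % TWO_PI) - avg + math.pi) % TWO_PI - math.pi
        avg = (avg + diff / i) % TWO_PI
    return avg
This is essentially the same incremental idea as the arithmetic moving-average C function shown earlier.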
I solved the problem with the help of the answer from @David_Hanak.
As he states:
The angle that points "between" the two others while staying in the same semicircle, e.g. for 355 and 5, this would be 0, not 180. To do this, you need to check if the difference between the two angles is larger than 180 or not. If so, increment the smaller angle by 360 before using the above formula.
So what I did was calculate the average of all the angles, then increase every angle that is less than this average by 360, and finally recalculate the average by summing them all and dividing by their count.
float angleY = 0f;
int count = eulerAngles.Count;
for (byte i = 0; i < count; i++)
angleY += eulerAngles[i].y;
float averageAngle = angleY / count;
angleY = 0f;
for (byte i = 0; i < count; i++)
{
float angle = eulerAngles[i].y;
if (angle < averageAngle)
angle += 360f;
angleY += angle;
}
angleY = angleY / count;
Works perfectly.
Python function:
from math import sin, cos, atan2, pi
import numpy as np

def meanangle(angles, weights=0, setting='degrees'):
    '''computes the mean angle'''
    if weights == 0:
        weights = np.ones(len(angles))
    sumsin = 0
    sumcos = 0
    if setting == 'degrees':
        angles = np.array(angles) * pi / 180
    for i in range(len(angles)):
        sumsin += weights[i] / sum(weights) * sin(angles[i])
        sumcos += weights[i] / sum(weights) * cos(angles[i])
    average = atan2(sumsin, sumcos)
    if setting == 'degrees':
        average = average * 180 / pi
    return average
You can use this function in Matlab:
function retVal=DegreeAngleMean(x)
len=length(x);
sum1=0;
sum2=0;
count1=0;
count2=0;
for i=1:len
if x(i)<180
sum1=sum1+x(i);
count1=count1+1;
else
sum2=sum2+x(i);
count2=count2+1;
end
end
if (count1>0)
k1=sum1/count1;
end
if (count2>0)
k2=sum2/count2;
end
if count1>0 && count2>0
if(k2-k1 >= 180)
retVal = ((sum1+sum2)-count2*360)/len;
else
retVal = (sum1+sum2)/len;
end
elseif count1>0
retVal = k1;
else
retVal = k2;
end
While starblue's answer gives the angle of the average unit vector, it is possible to extend the concept of the arithmetic mean to angles if you accept that there may be more than one answer in the range of 0 to 2*pi (or 0° to 360°). For example, the average of 0° and 180° may be either 90° or 270°.
The arithmetic mean has the property of being the single value with the minimum sum of squared distances to the input values. The distance along the unit circle between two unit vectors can be easily calculated as the inverse cosine of their dot product. If we choose a unit vector by minimizing the sum of the squared inverse cosine of the dot product of our vector and each input unit vector then we have an equivalent average. Again, keep in mind that there may be two or more minimums in exceptional cases.
This concept could be extended to any number of dimensions, since the distance along the unit sphere can be calculated in the exact same way as the distance along the unit circle--the inverse cosine of the dot product of two unit vectors.
For circles we could solve for this average in a number of ways, but I propose the following O(n^2) algorithm (angles are in radians, and I avoid calculating the unit vectors):
var bestAverage = -1
double minimumSquareDistance
for each a1 in input
var sumA = 0;
for each a2 in input
var a = (a2 - a1) mod (2*pi) + a1
sumA += a
end for
var averageHere = sumA / input.count
var sumSqDistHere = 0
for each a2 in input
var dist = (a2 - averageHere + pi) mod (2*pi) - pi // keep within range of -pi to pi
sumSqDistHere += dist * dist
end for
if (bestAverage < 0 OR sumSqDistHere < minimumSquareDistance) // for exceptional cases, sumSqDistHere may be equal to minimumSquareDistance at least once. In these cases we will only find one of the averages
minimumSquareDistance = sumSqDistHere
bestAverage = averageHere
end if
end for
return bestAverage
If all the angles are within 180° of each other, then we could use a simpler O(n)+O(sort) algorithm (again using radians and avoiding use of unit vectors):
sort(input)
var largestGapEnd = input[0]
var largestGapSize = (input[0] - input[input.count-1]) mod (2*pi)
for (int i = 1; i < input.count; ++i)
var gapSize = (input[i] - input[i - 1]) mod (2*pi)
if (gapSize > largestGapSize)
largestGapSize = gapSize
largestGapEnd = input[i]
end if
end for
double sum = 0
for each angle in input
var a2 = (angle - largestGapEnd) mod (2*pi) + largestGapEnd
sum += a2
end for
return sum / input.count
To use degrees, simply replace pi with 180. If you plan to use more dimensions then you will most likely have to use an iterative method to solve for the average.
The problem is extremely simple.
1. Make sure all angles are between -180 and 180 degrees.
2. a. Add all non-negative angles, take their average, and count how many there are.
2. b. Add all negative angles, take their average, and count how many there are.
3. Take the difference pos_average minus neg_average.
If the difference is greater than 180, change the difference to 360 minus the difference; otherwise just change the sign of the difference. Note that before this adjustment the difference is always non-negative.
The Average_Angle equals pos_average plus the difference times the "weight": the negative count divided by the sum of the negative and positive counts (a short sketch follows below).
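A hedged Python sketch of this recipe as I read it (the handling of an empty positive or negative group is my own addition; names are illustrative):
def simple_mean_angle(angles):
    # angles are assumed to be in [-180, 180)
    pos = [a for a in angles if a >= 0]
    neg = [a for a in angles if a < 0]
    if not neg:
        return sum(pos) / len(pos)
    if not pos:
        return sum(neg) / len(neg)
    pos_avg = sum(pos) / len(pos)
    neg_avg = sum(neg) / len(neg)
    diff = pos_avg - neg_avg                     # always in (0, 360)
    diff = 360 - diff if diff > 180 else -diff   # step 3 above
    return pos_avg + diff * len(neg) / (len(neg) + len(pos))
For example, simple_mean_angle([-10, 10]) gives 0 and simple_mean_angle([170, -170]) gives 180.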
Here is some Java code to average angles; I think it's reasonably robust.
public static double getAverageAngle(List<Double> angles)
{
// r = right (0 to 180 degrees)
// l = left (180 to 360 degrees)
double rTotal = 0;
double lTotal = 0;
double rCtr = 0;
double lCtr = 0;
for (Double angle : angles)
{
double norm = normalize(angle);
if (norm >= 180)
{
lTotal += norm;
lCtr++;
} else
{
rTotal += norm;
rCtr++;
}
}
double rAvg = rTotal / Math.max(rCtr, 1.0);
double lAvg = lTotal / Math.max(lCtr, 1.0);
if (rAvg > lAvg + 180)
{
lAvg += 360;
}
if (lAvg > rAvg + 180)
{
rAvg += 360;
}
double rPortion = rAvg * (rCtr / (rCtr + lCtr));
double lPortion = lAvg * (lCtr / (lCtr + rCtr));
return normalize(rPortion + lPortion);
}
public static double normalize(double angle)
{
double result = angle;
if (angle >= 360)
{
result = angle % 360;
}
if (angle < 0)
{
result = 360 + (angle % 360);
}
return result;
}

Algorithm to order 'tag line' campaigns based on resulting sales

I want to be able to introduce new 'tag lines' into a database that are shown 'randomly' to users. (These tag lines are shown as an introduction as animated text.)
Based upon the number of sales that result from those taglines I'd like the good ones to trickle to the top, but still show the others less frequently.
I could come up with a basic algorithm quite easily, but I want something that's a little more 'statistically accurate'.
I don't really know where to start. It's been a while since I've done anything more than basic statistics. My model would need to be sensitive to tolerances, but obviously it doesn't need to be worthy of a PhD.
Edit: I am currently tracking a 'conversion rate' - i.e. hits per order. This value would probably be best calculated as a cumulative 'all time' conversion rate to be fed into the algorithm.
Looking at your problem, I would modify the requirements a bit -
1) The most popular one should be shown most often.
2) Taglines should "age", so one that got a lot of votes (purchases) in the past but none recently should be shown less often.
3) Brand new taglines should be shown more often during their first days.
If you agree on those, then an algorithm could be something like:
START:
x = random(1, 3);
if x = 3 goto NEW else goto NORMAL
NEW:
TagVec = Taglines.filterYounger(5 days); // I'm taking a LOT of liberties with the pseudocode...
x = random(1, TagVec.Length);
return tagVec[x-1]; // 0 indexed vectors even in made up language,
NORMAL:
// Similar to EBGREEN above
sum = 0;
ForEach(TagLine in TagLines) {
sum += TagLine.noOfPurchases;
}
x = random(1, sum);
ForEach(TagLine in TagLines) {
x -= TagLine.noOfPurchases;
if (x <= 0) return TagLine; // Find the TagLine that represents our random number
}
Now, as a setup I would give every new tagline 10 purchases, to avoid one single purchase causing a really big skew.
For the aging process, I would count a purchase older than a week as 0.8 purchases per week of age. So 1 week old gives 0.8 points, 2 weeks give 0.8*0.8 = 0.64, and so forth...
You would have to play around with the initial-purchases parameter (10 in my example), the aging speed (1 week here), and the aging factor (0.8 here) to find something that suits you.
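A small Python sketch of that scoring, using the parameters above (10 initial purchases, 0.8 decay per week); whether the initial 10 purchases themselves age is not specified, so here they do not:
def tagline_score(purchase_ages_in_weeks, initial_purchases=10, decay=0.8):
    # Each real purchase contributes decay ** age_in_weeks points, so a
    # 1-week-old purchase is worth 0.8 and a 2-week-old one 0.64.
    return initial_purchases + sum(decay ** age for age in purchase_ages_in_weeks)

# Example: purchases aged 0, 1 and 2 weeks -> 10 + 1 + 0.8 + 0.64 = 12.44
print(tagline_score([0, 1, 2]))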
I would suggest randomly choosing with a weighting factor based on previous sales. So let's say you had this:
tag1 = 1 sale
tag2 = 0 sales
tag3 = 1 sale
tag4 = 2 sales
tag5 = 3 sales
A simple weighting formula would be 1 + number of sales, so this would be the probability of selecting each tag:
tag1 = 2/12 = 16.7%
tag2 = 1/12 = 8.3%
tag3 = 2/12 = 16.7%
tag4 = 3/12 = 25%
tag5 = 4/12 = 33.3%
You could easily change the weighting formula to get just the distribution that you want.
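A short Python sketch of drawing a tag with those 1 + sales weights (the data layout here is mine; random.choices is from the standard library):
import random

sales = {"tag1": 1, "tag2": 0, "tag3": 1, "tag4": 2, "tag5": 3}

def pick_tag(sales_by_tag):
    tags = list(sales_by_tag)
    weights = [1 + s for s in sales_by_tag.values()]  # e.g. tag5 -> 4/12, about 33.3%
    return random.choices(tags, weights=weights, k=1)[0]

print(pick_tag(sales))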
You have to come up with a weighting formula based on sales.
I don't think there's any such thing as a "statistically accurate" formula here - it's all based on your preference.
No one can say "this is the correct weighting and the other weighting is wrong" because there isn't a final outcome you are attempting to model - this isn't like trying to weigh responses to a poll about an upcoming election (where you are trying to model results to represent something that will happen in the future).
Here's an example in JavaScript. Note that I'm not suggesting running this client-side...
Also, there is a lot of optimization that can be done.
Note: createMemberInNormalDistribution() is implemented here Converting a Uniform Distribution to a Normal Distribution
/*
* an example set of taglines
* hits are sales
* views are times its been shown
*/
var taglines = [
{"tag":"tagline 1","hits":1,"views":234},
{"tag":"tagline 2","hits":5,"views":566},
{"tag":"tagline 3","hits":3,"views":421},
{"tag":"tagline 4","hits":1,"views":120},
{"tag":"tagline 5","hits":7,"views":200}
];
/*set up our stat model for the tags*/
var TagModel = function(set){
var hits, views, sumOfDiff, sumOfSqDiff;
hits = views = sumOfDiff = sumOfSqDiff = 0;
/*find average*/
for (n in set){
hits += set[n].hits;
views += set[n].views;
}
this.avg = hits/views;
/*find standard deviation and variance*/
for (n in set){
var diff =((set[n].hits/set[n].views)-this.avg);
sumOfDiff += diff;
sumOfSqDiff += diff*diff;
}
this.variance = sumOfSqDiff/set.length; // variance of the per-tag conversion rates
this.std_dev = Math.sqrt(sumOfSqDiff/set.length);
/* return tag to use; fChooser determines likelihood of tag */
this.getTag = function(fChooser){
var m = this;
set.sort(function(a,b){
return fChooser((a.hits/a.views),(b.hits/b.views), m);
});
return set[0];
};
};
var config = {
"uniformDistribution":function(a,b,model){
return Math.random()*b-Math.random()*a;
},
"normalDistribution":function(a,b,model){
var a1 = createMemberInNormalDistribution(model.avg,model.std_dev)* a;
var b1 = createMemberInNormalDistribution(model.avg,model.std_dev)* b;
return b1-a1;
},
//say weight = 10^n... higher n is the more even the distribution will be.
"weight": .5,
"weightedDistribution":function(a,b,model){
var a1 = createMemberInNormalDistribution(model.avg,model.std_dev*config.weight)* a;
var b1 = createMemberInNormalDistribution(model.avg,model.std_dev*config.weight)* b;
return b1-a1;
}
}
var model = new TagModel(taglines);
//to use
model.getTag(config.uniformDistribution).tag;
//running 10000 times: ({'tagline 4':836, 'tagline 5':7608, 'tagline 1':100, 'tagline 2':924, 'tagline 3':532})
model.getTag(config.normalDistribution).tag;
//running 10000 times: ({'tagline 4':1775, 'tagline 5':3471, 'tagline 1':1273, 'tagline 2':1857, 'tagline 3':1624})
model.getTag(config.weightedDistribution).tag;
//running 10000 times: ({'tagline 4':1514, 'tagline 5':5045, 'tagline 1':577, 'tagline 2':1627, 'tagline 3':1237})
config.weight = 2;
model.getTag(config.weightedDistribution).tag;
//running 10000 times: {'tagline 4':1941, 'tagline 5':2715, 'tagline 1':1559, 'tagline 2':1957, 'tagline 3':1828})
