Principal Component Analysis - Dimensionality Reduction - statsmodels

When we talk about PCA, we say that we use it to reduce the dimensionality of the data. Suppose I have 2-D data and use PCA to reduce its dimensionality to 1-D.
Now:
The first component is chosen in such a way that it captures the maximum variance. What does it mean that the 1st component has maximum variance?
Also, if we take 3-D data and reduce its dimensionality to 2-D, will the 1st component be built with maximum variance along the x-axis or the y-axis?

PCA works by first centering the data at the origin (subtracting the mean from each data point), and then rotating it to be in line with the axes (diagonalizing the covariance matrix into a “variance” matrix). The components are then sorted so that the diagonal of the variance matrix is in descending order, which translates to the first component having the largest variance, the second having the next largest variance, etc. Later, you squish your original data by zeroing out the less important components (projecting onto the principal components), and then undoing the aforementioned transformations.
To answer your questions:
The first component having the max variance means that its corresponding entry in the variance matrix is the largest one.
I suppose it depends on what you call your axes.
Source: Probability and Statistics for Computer Science by David Forsyth.
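For concreteness, here is a minimal NumPy sketch of the pipeline described above (center, diagonalize, sort by variance, project, reconstruct). The function name pca_compress and the sample data are purely illustrative:

```python
import numpy as np

def pca_compress(X, k):
    """Project n x d data onto its top-k principal components, then reconstruct."""
    mean = X.mean(axis=0)
    Xc = X - mean                           # center the data at the origin
    cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # diagonalize; eigenvalues ascending
    order = np.argsort(eigvals)[::-1]       # sort components by descending variance
    W = eigvecs[:, order[:k]]               # keep the k largest-variance directions
    Z = Xc @ W                              # rotate / project ("squish")
    return Z @ W.T + mean                   # undo the transformations

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
X_1d = pca_compress(X, k=1)                 # 2-D data squeezed through 1-D
```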

Related

Find clusters in 3D point data using a massively parallel algorithm

I have a large number of points in 3D space (x,y,z) represented as an array of 3 float structs. I also have access to a strong graphics card with CUDA capability. I want the following:
Divide the points in the array into clusters so that every point within a cluster has a maximum euclidean distance of X to at least one other point within the cluster.
Example in 2D: (figure omitted)
The "brute force" way of doing this is of course to calculate the distance between every point and every other point, to see if any of the distances is below the threshold X, and if so mark those points as belonging to the same cluster. This is an O(n²) algorithm.
This can of course be done in parallel in CUDA with n² threads, but is there a better way?
The algorithm can be reduced to O(n) by using binning:
impose a 3D grid with spacing X, that is, a 3D lattice (each cell of the lattice is a cubic bin);
assign each point in space to the corresponding bin (the bin that geometrically contains that point);
every time you need to evaluate the distances from one point, use only the points in that point's own bin and in the 26 neighbouring bins (3x3x3 = 27).
The points in the other bins are farther away than X, so you don't need to evaluate those distances at all.
In this way, assuming a constant density of points, you only have to compute the distance for a constant number of pairs per point, i.e. O(n) in total.
Assigning the points to the bins is O(n) as well.
If the points are not uniformly distributed, the bins can be made smaller (and you must consider more than 26 neighbours when evaluating the distances) and may end up sparse.
This is a typical trick used in molecular dynamics, ray tracing, meshing, and so on. I know the term binning from molecular dynamics simulation; the name can change (link-cell; kd-trees use the same principle too, if in a more articulated form), but the algorithm remains the same!
And, good news, the algorithm is well suited for parallel implementation.
refs:
https://en.wikipedia.org/wiki/Cell_lists
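The question asks about CUDA, but a serial Python sketch may help make the binning structure concrete (NumPy assumed; neighbor_pairs is an illustrative name). Clusters are then the connected components of the resulting pair graph:

```python
import numpy as np
from collections import defaultdict

def neighbor_pairs(points, X):
    """All index pairs (i, j) with distance <= X, using cubic bins of side X."""
    bins = defaultdict(list)
    for i, p in enumerate(points):                     # O(n) bin assignment
        bins[tuple((p // X).astype(int))].append(i)
    pairs = []
    for (bx, by, bz), idxs in bins.items():
        for dx in (-1, 0, 1):                          # own bin + 26 neighbours
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in bins.get((bx + dx, by + dy, bz + dz), ()):
                        for i in idxs:
                            if i < j and np.linalg.norm(points[i] - points[j]) <= X:
                                pairs.append((i, j))
    return pairs  # points in farther bins are > X apart and are never tested
```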

To make a distance matrix or to repeatedly calculate distance

I'm working on K-medoids algorithm implementation. It is a clustering algorithm and one of its steps includes finding the most representative point in a cluster.
So, here's the thing
I have a certain number of clusters
Each cluster contains a certain number of points
I need to find the point in each cluster that results in the least error if it is picked as the cluster representative
Distance from each point to all the other in the cluster needs to be calculated
This distance calculation could be simple as Euclidean or more complex like DTW (Dynamic Time Warping) between two signals
There are two approaches: one is to calculate a distance matrix that stores the values between all points in the dataset, and the other is to calculate distances during clustering, which means that distances between some points will be calculated repeatedly.
On one hand, to build the distance matrix you must calculate the distances between all points in the whole dataset, and some of the calculated values will never be used.
On the other hand, if you don't build the distance matrix, you will repeat some calculations over a certain number of iterations.
Which is the better approach?
I'm also considering MapReduce implementation, so opinions from that angle are also welcome.
Thanks
A third approach could be a combination of both: lazily evaluating the distance matrix. Initialize the matrix with default values (unrealistic ones, like negative distances), and when you need the distance between two points, take it from the matrix if the value is already present there.
Otherwise, calculate it and store it in the matrix.
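A minimal Python sketch of this lazy scheme (the class name and the negative sentinel are illustrative; NumPy assumed):

```python
import numpy as np

class LazyDistanceMatrix:
    """Pairwise distances computed on first request, then reused."""
    def __init__(self, points, dist_fn):
        self.points = points
        self.dist_fn = dist_fn
        n = len(points)
        self.cache = np.full((n, n), -1.0)   # -1.0 marks "not yet computed"

    def dist(self, i, j):
        if self.cache[i, j] < 0.0:           # the extra branch mentioned below
            d = self.dist_fn(self.points[i], self.points[j])
            self.cache[i, j] = self.cache[j, i] = d   # distances are symmetric
        return self.cache[i, j]
```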
This approach trades fewer calculations (it is optimal in performing the lowest possible number of pair calculations) for more branches in the code and a few extra instructions. However, thanks to branch predictors, I assume this overhead will not be that dramatic.
I predict it will perform better when the distance calculation is relatively expensive.
Another optimization could be to dynamically switch to a plain matrix implementation (calculating the remaining part of the matrix eagerly) once the number of already-calculated entries exceeds a certain threshold. This can be achieved quite nicely in OOP languages by switching the implementation of the interface when the threshold is met.
Which implementation is actually better will rely heavily on the cost of the distance function and on the data you are clustering, since some data sets will need the same distances more often than others.
I suggest doing a benchmark, and using statistical tools to evaluate which method is actually better.

Algorithm: 2D transformation, find outlying pairs of points and omit

I am looking for the following type of algorithm:
There are n matched pairs of points in 2D. How can I identify outlying pairs of points with respect to an affine / Helmert transformation and omit them from the transformation key? We do not know the exact number of such outlying pairs.
I cannot use the Trimmed Least Squares method because it rests on the basic assumption that a known percentage k of the pairs is correct. But we have no such information about the sample and do not know k... In a given sample all of the pairs could be correct, or vice versa.
Which types of algorithms are suitable for this problem?
Use RANSAC:
Repeat the following steps a fixed number of times:
Randomly select as many pairs as are necessary to compute the transformation parameters.
Compute the parameters.
Compute the subset of pairs that have small projection error (the 'consensus set').
If the consensus set is large enough, compute a projection for it (e.g. with Least Squares).
Compute the consensus set's projection error.
Remember the model if it is the best you found so far.
You have to experiment to find good values for
"a fixed number of times"
"small projection error"
"consensus set is large enough".
The simplest approach is to compute your transformation based on all points, compute the residual for each point, and remove the points with high residuals until you reach an acceptable transformation or hit the minimum acceptable number of input points. The residual for any given point is the join distance between the forward-transformed value of the point and the intended target point.
Note that the residuals of an affine transformation and a Helmert (conformal) transformation will be very different, as these transformations do different things. The non-uniform scale of the affine gives it more 'stretch' and will hence lead to smaller residuals.

What's the best method to compare an original trajectory with two compressed trajectories

Suppose you have a GPS trajectory, i.e. a series of spatio-temporal coordinates, where every coordinate is an (x, y, t) triple: x is longitude, y is latitude and t is the time stamp.
Suppose each trajectory is identified by 1000 (x, y) points; a compressed trajectory is a trajectory with fewer points than the original, for instance 300 points. A compression algorithm (Douglas-Peucker, Bellman, etc.) decides which points will be in the compressed trajectory and which points will be discarded.
Each algorithm makes its own choice. Better algorithms choose the points not only by spatial characteristics (x, y) but using spatio-temporal characteristics (x, y, t).
Now I need a way to compare two compressed trajectories against the original to understand which compression algorithm better reduces a spatio-temporal trajectory (the temporal component is really important).
I've thought of the DTW algorithm to check trajectory similarity, but that probably doesn't take the temporal component into account. What algorithm can I use to make this comparison?
Which compression algorithm is best depends to a large extent on what you are trying to achieve with it, and on other external variables. Typically, we're going to identify and remove spikes, and then remove redundant data. For example:
Known minimum and maximum velocity, acceleration, and ability to turn will let you remove spikes. We look at the join distance between a pair of points divided by the elapsed time:
velocity = sqrt((xb - xa)^2 + (yb - ya)^2) / (tb - ta)
We can eliminate points where the distance couldn't have been travelled in the elapsed time given the speed constraint. We can do the same with acceleration constraints, and with change-of-direction constraints for a given velocity. These constraints change depending on whether the GPS receiver is static, hand-held, in a car, in an aeroplane, etc.
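As a sketch (hypothetical helper, assuming trajectories are lists of (x, y, t) tuples with strictly increasing t), the velocity constraint might be applied like this:

```python
import math

def remove_speed_spikes(traj, v_max):
    """Drop points unreachable from the last kept point without exceeding v_max."""
    kept = [traj[0]]
    for x, y, t in traj[1:]:
        xa, ya, ta = kept[-1]
        v = math.hypot(x - xa, y - ya) / (t - ta)   # the velocity formula above
        if v <= v_max:
            kept.append((x, y, t))
    return kept
```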
We can remove redundant points using a moving window over three points: an interpolated (x, y, t) for the middle point is compared with the observed point, and the observed point is removed if it lies within a specified distance-plus-time tolerance of the interpolated point. We can also curve-fit the data and consider the distance to the curve rather than using a moving 3-point window.
The compression may also have differing goals based on the constraints given, e.g. to simply reduce the data size by removing redundant observations and spikes, or to smooth the data as well.
For the former, after checking for spikes based on the defined constraints, we simply check the 3D distance of each point to the polyline connecting the compressed points. This is achieved by finding the pair of points before and after the point that was removed, interpolating a position on the line connecting those points based on the observed time, and comparing the interpolated position with the observed position. The number of points removed will increase as we allow this distance tolerance to increase.
For the latter we also have to consider how well the smoothed result models the data, the weights imposed by the constraints, and the design shape / curve parameters.
Hope this makes some sense.
Maybe you could use the mean square distance between the trajectories over time.
Probably simply looking at the distance at times 1 s, 2 s, ... will be enough, but you can also do it more precisely between time stamps by integrating (x1(t) - x2(t))^2 + (y1(t) - y2(t))^2. Note that between two time stamps both trajectories are straight lines.
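A small sketch of this idea (names are illustrative; trajectories assumed to be lists of (x, y, t) tuples with increasing t, compared at common sample times):

```python
import numpy as np

def interp_xy(traj, t):
    """Position on a piecewise-linear trajectory at time t."""
    xs, ys, ts = (np.array(c, dtype=float) for c in zip(*traj))
    return np.interp(t, ts, xs), np.interp(t, ts, ys)

def mean_square_distance(traj_a, traj_b, times):
    """Mean squared distance between two trajectories at the sample times."""
    err = 0.0
    for t in times:
        xa, ya = interp_xy(traj_a, t)
        xb, yb = interp_xy(traj_b, t)
        err += (xa - xb) ** 2 + (ya - yb) ** 2
    return err / len(times)
```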
I've found what I need to compute the spatio-temporal error.
As written in the paper "Compression and Mining of GPS Trace Data: New Techniques and Applications" by Lawson, Ravi & Hwang:
Synchronized Euclidean distance (sed) measures the distance between two points at identical time stamps. In Figure 1, five time steps (t1 through t5) are shown. The simplified line (which can be thought of as the compressed representation of the trace) is comprised of only two points (P't1 and P't5); thereby, it does not include points P't2, P't3 and P't4. To quantify the error introduced by these missing points, distance is measured at the identical time steps. Since three points were removed between P't1 and P't5, the line is divided into four equal sized line segments using the three points P't2, P't3 and P't4 for the purposes of measuring the error. The total error is measured as the sum of the distance between all points at the synchronized time instants, as shown below. (In the following expression, n represents the total number of points considered.)
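The paper's expression is not reproduced in the quote, but based on the description (summing distances at the synchronized time instants), a sketch of SED might look like this, again assuming (x, y, t) tuples with increasing t:

```python
import math
import numpy as np

def synchronized_euclidean_distance(original, compressed):
    """Sum of distances between each original point and the position the
    compressed trace implies at that point's (synchronized) time stamp."""
    xs, ys, ts = (np.array(c, dtype=float) for c in zip(*compressed))
    total = 0.0
    for x, y, t in original:
        xi = np.interp(t, ts, xs)   # interpolate on the simplified line at time t
        yi = np.interp(t, ts, ys)
        total += math.hypot(x - xi, y - yi)
    return total
```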

Histogram peak identification and Gaussian fitting with minimal accumulated height difference in C++

I already asked a similar question some time ago in the following thread: previous thread. Until now I unfortunately couldn't entirely solve that issue and have only worked around it. Since it is difficult to include all the new information in the previous thread, I am posting a refined and extended question with distinct context here and linking it to the old thread.
I am currently implementing an algorithm from a paper which extracts certain regions of a 3D data set by dynamically identifying value ranges in the data set's histogram.
In a simplified way the method can be described as follows:
Find the highest peak in the histogram
Fit a gaussian to the peak
Using the value range defined by the Gaussian's mean (µ) ± deviation (σ), certain regions of the histogram are identified, and the voxels (= 3D pixels) of these regions are removed from the original histogram.
As a result of the previous step, a new highest peak should be revealed, based on which steps 1-3 can be repeated. The steps are repeated until the data set's histogram is empty.
My questions relate to steps 1 and 2 of the above description, which are described as follows in the paper: "The highest peak is identified and a Gaussian curve is fitted to its shape. The Gaussian is described by its midpoint µ, height h and deviation σ. The fitting process minimizes the accumulated height difference between the histogram and the middle part of the Gaussian. The error summation range is µ ± σ."
In the following I will ask my questions and add my reflections on them:
How should I identify those bins of the total histogram that describe the highest peak? To identify its apex, I simply run through the histogram and store the index of the bin with the highest frequency. But how far should the extent of the peak reach to the left and right of the highest bin? At the moment I simply walk to the left and right of the highest bin for as long as the next bin is smaller than the previous one. However, this usually gives a very small range, since creases (mini peaks) occur in the histogram. I have already thought about smoothing the histogram, but I would have to do that after each iteration, since the subtraction of voxels (step 3 in the description above) can cause the histogram to contain creases again. I am also worried that the repeated smoothing would distort the results.
Therefore I would like to ask whether there is an efficient way to detect the extent of a peak that is better than my current approach. There were suggestions about mixture models and deconvolution in the previous thread. However, are these methods really reasonable if the shape of the histogram constantly changes after each iteration?
How can I fit a Gaussian curve to the identified peak so that the accumulated height difference between the histogram and the middle part of the Gaussian is minimized?
Following question one from the previous thread, I fitted the curve to a given range of histogram bins by computing their mean and deviation (I hope this is correct?!). But how do I minimize the accumulated height difference between the histogram and the middle part of the Gaussian from this point?
Thank you for your help!
Regards Marc
Add histogram values to the left and right until the goodness of the fit begins to decrease.
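The question targets C++, but a Python sketch using SciPy's curve_fit (function names illustrative) shows one way to grow the window until the goodness of fit degrades, per the answer above:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, h, mu, sigma):
    return h * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def fit_highest_peak(counts):
    """Expand a window around the highest bin, refitting a Gaussian each step;
    stop when the per-bin fit error starts to increase."""
    counts = np.asarray(counts, dtype=float)
    apex = int(np.argmax(counts))
    lo = hi = apex
    best, best_err = None, np.inf
    while lo > 0 or hi < len(counts) - 1:
        lo, hi = max(lo - 1, 0), min(hi + 1, len(counts) - 1)
        x = np.arange(lo, hi + 1, dtype=float)
        y = counts[lo:hi + 1]
        try:
            p0 = (counts[apex], float(apex), max((hi - lo) / 4.0, 1.0))
            params, _ = curve_fit(gauss, x, y, p0=p0)
        except RuntimeError:        # fit did not converge; widen and retry
            continue
        err = np.mean((gauss(x, *params) - y) ** 2)
        if err > best_err:          # goodness of fit began to decrease
            break
        best, best_err = params, err
    return best                     # (h, mu, sigma), or None if nothing fitted
```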
