How can I find the intersection of 2 sets of noisy data? - algorithm

I'm currently writing a script that is supposed to remove redundant data points from my graph. My data includes overlaps from adjacent data sets and I only want the data that is generally higher.
(Imagine two Gaussians with an x offset that overlap slightly. I'm only interested in the higher values in the overlap region, so that my final graph doesn't get all noisy when I combine the data in order to make a single spectrum.)
Here are my problems:
1) The x values aren't the same between the two data sets, so I can't just say "at x, take max y value". They're close together, but not equal.
2) The distances between x values aren't equal.
3) The data is noisy, so there can be multiple points where the data sets intersect. And while Gaussian A is generally higher after the intersection than Gaussian B, the noise means Gaussian B might still have SOME values which are higher. Meaning I can't just say "always take the highest values in this x area", because then I'd wildly combine the noise of both data sets.
4) I have n overlaps of this type, so I need an efficient algorithm and all I can come up with is somewhere at O(n^3), which would be something like "for each overlap, store data sets into two arrays and for each combination of data points (x0,y0) and (x1,y1) cycle through until you find the lowest combination of abs(x1-x0) AND abs(y1-y0)"
As I'm not a programmer, I'm completely lost. I also wasn't able to find an algorithm for this problem anywhere - most algorithms assume that the entries in the arrays I'm comparing are equal integers, but I'm working with almost-equal floats.
I'm using IDL, but I'd also be grateful for a general algorithm or at least a tip what I could try. Thanks!

One way to do this is to fit Gaussians to your data and then take the maximum, treating each data point as if it were equal to the fitted Gaussian at that point.
This can be done as follows:
Fit a Gaussian G1 to dataset X1 and a Gaussian G2 to dataset X2, where the mean of G1 is less than the mean of G2.
Then find their intersection point with some arithmetic.
Then, for all values of x less than the intersection take X1, and for all values of x greater than the intersection take X2.
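The "some arithmetic" step can be sketched in Python as follows. This is only an illustration under stated assumptions: the two fits are already available as (amplitude, mean, sigma) tuples from whatever fitting routine you use, and the names gaussian_crossing and merge are made up for this example.

```python
import math

def gaussian_crossing(g1, g2):
    """Return the x between the two means where the fitted Gaussians are equal.

    g1 and g2 are hypothetical (amplitude, mean, sigma) tuples, mean of g1 < mean of g2.
    """
    (a1, m1, s1), (a2, m2, s2) = g1, g2
    p1, p2 = 1 / (2 * s1 * s1), 1 / (2 * s2 * s2)
    # Taking logs of A1*exp(-p1*(x-m1)^2) = A2*exp(-p2*(x-m2)^2)
    # leaves a quadratic a*x^2 + b*x + c = 0.
    a = p2 - p1
    b = 2 * (p1 * m1 - p2 * m2)
    c = math.log(a1 / a2) - p1 * m1 * m1 + p2 * m2 * m2
    if abs(a) < 1e-12:
        roots = [-c / b]          # equal widths: the equation is linear
    else:
        disc = b * b - 4 * a * c
        if disc < 0:
            raise ValueError("the fitted Gaussians never cross")
        root = math.sqrt(disc)
        roots = [(-b - root) / (2 * a), (-b + root) / (2 * a)]
    between = [x for x in roots if m1 <= x <= m2]
    if not between:
        raise ValueError("no crossing between the two means")
    return between[0]

def merge(data1, data2, x_cross):
    """Keep (x, y) points of data1 left of the crossing and of data2 right of it."""
    return ([p for p in data1 if p[0] < x_cross]
            + [p for p in data2 if p[0] >= x_cross])
```

Because the cut is made at a single x value taken from the smooth fits, noise in either data set cannot flip the choice back and forth inside the overlap region.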


Algorithm: find minimum space spanning points defined only by their separations

I have a collection of points in some N-dimensional space, where all I know is the distances between them. Let's say it's an unordered collection of structs like the following:
struct {
    int first; // Just some identifier that uniquely specifies a point
    int second; // No importance to which point is first or second
    float separation; // The distance between the first and second points -- always positive
};
Of course the algorithm doesn't have to be C code. I just wrote the struct in this style to make the problem clear. It rather upsets me that the struct spoils the symmetry between the two end-points, but fixing this just makes things more complicated.
Let's say that the separations are defined by the Pythagorean distance between them, and the space is Euclidean. Let's also specify that the separations are internally consistent. For example, given separations AB, BC and AC, we know that AB + BC >= AC.
I want an algorithm that finds the minimal dimensional space that can contain all the points. Within this algorithm, we can assume that separations that deviate from that defined by the space by less than some specified tolerance can be ignored.
Does anyone know an algorithm that does this? So far, I've only been able to think up non-polynomial algorithms. Can anybody improve on that, or at least make something that is clean and extensible?
Why is this interesting? In Physics there are some low-level theories such as String Theory or Loop Quantum Gravity that do not obviously predict our three dimensional world. This algorithm could be part of a project to find how a 3d world can be emergent.
Thank you everybody who posted ideas here. I now have an answer to my own question. It's not great, in that it runs in O(n^3) time, but at least it's polynomial. Roughly, it works like this:
1) Represent the problem as a symmetric matrix with zero diagonal -- representing the distances between any two points. This is equivalent to the representation using structs, but much easier to work with.
2) Assume the ordering of the points implied by the matrix (first column/row = first point) is sensible. (It may be worth pivoting to find a better ordering, but that is todo.)
3) Now create a rectangular coordinate system to fit the points, starting with the first point, which WLOG we take to be the origin.
4) Second point defines the x axis
5) For each subsequent point, we calculate its coordinates one at a time, starting with the x axis. We know the distance from the origin and the distance from point 2. This allows us to calculate the x coordinate, as we end up with two simultaneous equations x^2 + y^2 + ... = s1^2 and (x - x2)^2 + y^2 + ... = s2^2, which allows us to calculate x easily from x2, the x coordinate of point 2, and the distances from points 1 and 2, s1 and s2.
6) Each new coordinate can be calculated easily, because the matrix of coordinates calculated so far is triangular -- there is only one unknown each time.
7) The last coordinate for each point is on a new axis -- a dimension that has not yet been used. Calculate its coordinate using Pythagoras on the distance from the origin, as we know all the other coordinates.
8) It is possible that the coordinate on the new axis will come out imaginary -- a general set of distances cannot always be represented by a coordinate system of any number of dimensions -- at least not with real numbers. If this is the case, I error.
9) Keep going in this way for each new point, building up a vector of coordinate vectors for each point. In general, this is triangular, but there may be cases where the final coordinate we calculate is near enough to zero that we consider the point's position to be represented by the existing dimensions. I store the coordinates anyway, but keep the number of dimensions the same as the previous point. I also skip these points, as they are not needed for calculating further points (see step 10).
10) Finally, we have represented all points such that the distances are consistent.
11) As a final check, I validate that the distances match for all points, including those skipped in step 9.
12) The number of dimensions needed is the number used for the last point.
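For readers who want to experiment, here is a rough Python sketch of the procedure described above. It is not the Haskell implementation mentioned below; the function name embed, the tolerance values, and the choice to raise ValueError are all assumptions made for this illustration.

```python
import math

def embed(dist, tol=1e-9, check_tol=1e-6):
    """Return (coords, dim) for a symmetric distance matrix, or raise ValueError."""
    n = len(dist)
    basis, basis_idx = [], []   # points that opened a new axis, and their indices
    coords = [[]]               # point 0 is the origin
    for p in range(1, n):
        s0sq = dist[0][p] ** 2
        x = []
        for j, b in zip(basis_idx, basis):
            # |p|^2 = s0^2 and |p - b|^2 = sj^2 give p.b = (s0^2 + |b|^2 - sj^2) / 2.
            dot = (s0sq + sum(c * c for c in b) - dist[j][p] ** 2) / 2
            known = sum(xi * bi for xi, bi in zip(x, b))
            x.append((dot - known) / b[-1])   # triangular: one unknown at a time
        rest = s0sq - sum(c * c for c in x)   # squared coordinate on a new axis
        if rest < -tol:
            raise ValueError("coordinate would be imaginary: not embeddable")
        if rest > tol:                        # the point needs a new dimension
            x.append(math.sqrt(rest))
            basis.append(list(x))
            basis_idx.append(p)
        coords.append(x)
    dim = len(basis)
    padded = [c + [0.0] * (dim - len(c)) for c in coords]
    # Final check: every pairwise distance must be reproduced.
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((u - v) ** 2 for u, v in zip(padded[i], padded[j])))
            if abs(d - dist[i][j]) > check_tol:
                raise ValueError("distances are inconsistent")
    return padded, dim
```

Each point costs at most one pass over the axis-defining points, each of which is a dot product, which is where the cubic running time comes from.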
If anyone is interested in an implementation of this (in Haskell), it is on my GitHub page at

Finding all points in certain radius of another point

I am making a simple game and stumbled upon this problem. Assume several points in 2D space. What I want is to make points close to each other interact in some way.
Let me throw a picture here for better understanding of the problem:
Now, the problem isn't about computing the distance. I know how to do that.
At first I had around 10 points and I could simply check every combination, but as you can already assume, this becomes extremely inefficient as the number of points increases. What if I had a million points in total, but all of them were very distant from each other?
I'm trying to find a suitable data structure or a way to look at this problem, so that each point only has to consider its surroundings and not the whole space. Are there any known algorithms for this? I don't exactly know how to name this problem so I can google exactly what I want.
If you don't know of such a known algorithm, all ideas are very welcome.
This is a range searching problem. More specifically - the 2-d circular range reporting problem.
Quoting from "Solving Query-Retrieval Problems by Compacting Voronoi Diagrams" [Aggarwal, Hansen, Leighton, 1990]:
Input: A set P of n points in the Euclidean plane E²
Query: Find all points of P contained in a disk in E² with radius r centered at q.
The best results were obtained in "Optimal Halfspace Range Reporting in Three Dimensions" [Afshani, Chan, 2009]. Their method requires an O(n)-space data structure that supports queries in O(log n + k) worst-case time. The structure can be preprocessed by a randomized algorithm that runs in O(n log n) expected time. (n is the number of input points, and k is the number of output points.)
The CGAL library supports circular range search queries. See here.
You're still going to have to iterate through every point, but there are two optimizations you can perform:
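1) You can eliminate obvious non-matches by checking whether the x offset and the y offset from the centre point are each smaller than the radius (like Brent already mentioned in another answer); a point that fails either check cannot be inside the circle.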
2) Instead of calculating the distance, you can calculate the square of the distance and compare it to the square of the allowed radius. This saves you from performing expensive square root calculations.
This is probably the best performance you're gonna get.
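A small Python sketch of these two optimizations, assuming points are (x, y) tuples; the function name within_radius is made up for this example.

```python
def within_radius(points, center, radius):
    """Linear scan with a cheap box rejection and a squared-distance test."""
    cx, cy = center
    r2 = radius * radius
    hits = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # Optimization 1: reject points outside the bounding box of the circle.
        if abs(dx) > radius or abs(dy) > radius:
            continue
        # Optimization 2: compare squared distances, so no square root is needed.
        if dx * dx + dy * dy <= r2:
            hits.append((x, y))
    return hits
```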
This looks like a nearest neighbor problem. You should be using the kd tree for storing the points.
Space partitioning is what you want.
If you could get those points to be sorted by x and y values, then you could quickly pick out those points (binary search?) which are within a box of the central point: x +- r, y +- r. Once you have that subset of points, then you can use the distance formula to see if they are within the radius.
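One possible reading of this idea in Python, assuming the points are kept as a list of (x, y) tuples sorted by x (and then y). Only the x range is found by binary search here; the y range and the exact distance are checked on the resulting slice. The name neighbours_sorted is hypothetical.

```python
import bisect

def neighbours_sorted(sorted_pts, center, r):
    """sorted_pts must be sorted; returns the points within r of center."""
    cx, cy = center
    # Binary search for the slice with cx - r <= x <= cx + r.
    lo = bisect.bisect_left(sorted_pts, (cx - r, float("-inf")))
    hi = bisect.bisect_right(sorted_pts, (cx + r, float("inf")))
    return [(x, y) for x, y in sorted_pts[lo:hi]
            if abs(y - cy) <= r and (x - cx) ** 2 + (y - cy) ** 2 <= r * r]
```

The slice can still be large if many points share a similar x but very different y values, which is the case the grid and tree approaches handle better.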
I assume you have a minimum and maximum X and Y coordinate? If so how about this.
Call our radius R, Xmax-Xmin X, and Ymax-Ymin Y.
Have a 2D matrix of [X/R, Y/R] of doubly linked lists. Put each dot structure on the correct linked list.
To find dots you need to interact with, you only need check your cell plus your 8 neighbors.
Example: if X and Y are 100 each, and R is 1, then put a dot at 43.2, 77.1 in cell [43,77]. You'll check cells [42,76] [43,76] [44,76] [42,77] [43,77] [44,77] [42,78] [43,78] [44,78] for matches. Note that not all cells in your own box will match (for instance 43.9,77.9 is in the same list but more than 1 unit distant), and you'll always need to check all 8 neighbors.
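The cell-plus-8-neighbours lookup can be sketched in Python like this. Note the swap: a dictionary keyed by cell index with plain lists stands in for the 2D matrix of linked lists, and the names build_grid and neighbours are invented for the example. The sketch assumes the cell size is at least the radius.

```python
from collections import defaultdict

def build_grid(points, cell):
    """Bucket (x, y) points by integer cell index."""
    grid = defaultdict(list)
    for p in points:
        grid[(int(p[0] // cell), int(p[1] // cell))].append(p)
    return grid

def neighbours(grid, cell, p, radius):
    """Points within radius of p, looking only at p's cell and its 8 neighbours."""
    cx, cy = int(p[0] // cell), int(p[1] // cell)
    r2 = radius * radius
    found = []
    for i in (cx - 1, cx, cx + 1):
        for j in (cy - 1, cy, cy + 1):
            for q in grid.get((i, j), ()):
                # Skip p itself (by identity), then do the exact distance test.
                if q is not p and (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2 <= r2:
                    found.append(q)
    return found
```

With the numbers from the example, a dot at (43.2, 77.1) shares cell [43,77] with (43.9, 77.9) but does not match it, while (44.1, 77.0) in the neighbouring cell [44,77] does match.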
As dots move (it sounds like they'd move?) you'd simply unlink them (fast and easy with a doubly linked list) and relink them in their new location. Moving any dot is O(1). Moving them all is O(n).
If that array size gives too many cells, you can make bigger cells with the same algo and probably same code; just be prepared for fewer candidate dots to actually be close enough. For instance if R=1 and the map is a million times R by a million times R, you wouldn't be able to make a 2D array that big. Better perhaps to have each cell be 1000 units wide? As long as density was low, the same code as before would probably work: check each dot only against other dots in this cell plus the neighboring 8 cells. Just be prepared for more candidates failing to be within R.
If some cells will have a lot of dots, then instead of each cell having a linked list, perhaps the cell should have a red-black tree indexed by X coordinate? Even in the same cell the vast majority of other cell members will be too far away, so just traverse the tree from X-R to X+R. Rather than loop over all dots and go diving into each one's tree, perhaps you could instead iterate through the tree looking for X coords within R and if/when you find them calculate the distance. As you traverse one cell's tree from low to high X, you need only check the tree of the neighboring cell to the left while in the first R entries.
You could also go to cells smaller than R. You'd have fewer candidates that fail to be close enough. For instance with R/2, you'd check 25 linked lists instead of 9, but have on average (if randomly distributed) 25/36ths as many dots to check. That might be a minor gain.

How does one decide the final clusters when using the mean shift algorithm?

I am reading a bit about the mean shift clustering algorithm, and this is what I've got so far. For each point in your data set: select all points within a certain distance of it (including the original point), calculate the mean of all these points, and repeat until these means stabilize.
What I'm confused about is how one goes from here to deciding what the final clusters are, and under what conditions these means merge. Also, does the distance used to select the points fluctuate through the iterations or does it remain constant?
Thanks in advance
The mean shift cluster finding is a simple iterative process which is actually guaranteed to converge. The iteration starts from a starting point x, and the iteration steps are (note that x may have several components, as the algorithm will work in higher dimensions, as well):
calculate the weighted mean position x' of all points around x - maybe the simplest form is to calculate the average of positions of all points within d distance from x, but the gaussian function is also commonly used and mathematically beneficial.
set x <- x'
repeat until the difference between x and x' is very small
This can be used in cluster analysis by starting with different values of x. The final values will end up at different cluster centers. The number of clusters cannot be known in advance (other than that it is <= the number of points).
The upper level algorithm is:
go through a selection of starting values
for each value, calculate the convergence value as shown above
if the value is not already in the list of convergence values, add it to the list (allow some reasonable tolerance for numerical imprecision)
And then you have the list of clusters. The only difficult thing is finding a reasonable selection of starting values. It is easy with one or two dimensions, but with higher dimensionalities exhaustive searches are not quite possible.
All starting points that end up at the same mode (point of convergence) belong to the same cluster.
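The two levels described above might look like this in Python. This is a minimal sketch, assuming the simplest (flat) window of fixed radius, using every data point as a starting value, and merging modes that agree within a tolerance; the names shift_to_mode, mean_shift, bandwidth and merge_tol are assumptions for the illustration.

```python
import math

def shift_to_mode(start, points, bandwidth, tol=1e-6, max_iter=500):
    """Repeat x <- mean of the points within bandwidth of x until x stops moving."""
    x = start
    for _ in range(max_iter):
        near = [p for p in points if math.dist(p, x) <= bandwidth]
        if not near:
            return x
        new_x = tuple(sum(c) / len(near) for c in zip(*near))
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x

def mean_shift(points, bandwidth, merge_tol=1e-3):
    """Return (modes, labels): labels[i] is the index of the mode point i reached."""
    modes, labels = [], []
    for p in points:
        m = shift_to_mode(p, points, bandwidth)
        for k, known in enumerate(modes):
            if math.dist(m, known) <= merge_tol:   # same mode, same cluster
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels
```

For example, two tight groups of points around (0, 0) and (5, 5) with bandwidth 1 produce two modes, and each point is labelled with the group it started in.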
It may be of interest that if you are doing this on a 2D image, it should be sufficient to calculate the gradient (i.e. the first iteration) for each pixel. This is a fast operation with common convolution techniques, and then it is relatively easy to group the pixels into clusters.

How to query all points which lie on a line

Suppose I have a set of points,
then I define line L. How do I obtain b, d, and f?
Can this be solved using kd-tree (with slight modification)?
How my program works:
Define a set of points
L is defined later, it has nothing to do with point set
My only idea right now:
Get middle point m of line L.
Based on point m, get all points within a radius of length(L)/2 using the KD-Tree
For every point, test whether it lies on line L
Perhaps I'll add a collinearity threshold in case some points lie only slightly off the query line.
The running time of my approach will depend on the length of L: the longer the line, the bigger the query, and the more points need to be checked.
You can have logarithmic-time look-up. My algorithm achieves that at the cost of a giant memory usage (up to cubic in the number of points):
If you know the direction of the line in advance, you can achieve logarithmic-time lookup quite easily: let a*x + b*y = c be the equation of the line, then a / b describes the direction, and c describes the line position. For each a, b (except [0, 0]) and point, c is unique. Then sort the points according to their value of c into an index; when you get the line, do a search in this index.
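As an illustration of the known-direction case, here is a small Python sketch. It assumes points are (x, y) tuples and uses an epsilon window on c for floating-point input; the class name DirectionIndex and its method are invented for this example.

```python
import bisect

class DirectionIndex:
    """Points sorted by c = a*x + b*y for one fixed direction (a, b)."""

    def __init__(self, points, a, b):
        self.a, self.b = a, b
        self.entries = sorted((a * x + b * y, (x, y)) for x, y in points)
        self.keys = [c for c, _ in self.entries]

    def on_line(self, c, eps=1e-9):
        """All points with a*x + b*y within eps of c, found by binary search."""
        lo = bisect.bisect_left(self.keys, c - eps)
        hi = bisect.bisect_right(self.keys, c + eps)
        return [p for _, p in self.entries[lo:hi]]
```

For instance, with a = 1 and b = -1 the index groups points by x - y, so querying c = 0 returns every point on the diagonal y = x.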
If all your lines are axis-aligned, it takes two indexes, one for x, one for y. If you use four indexes, you can look up lines at 45° as well. You don't need to get the direction exact; if you know the bounding region for all the points, you can search every point in a strip parallel to the indexed direction that spans the query line within the bounding region.
The above paragraphs define "direction" as the ratio a / b. This yields infinite ratios, however. A better definition defines "direction" as a pair (a, b) where at least one of a, b is non-zero and two pairs (a1, b1), (a2, b2) define the same direction iff a1 * b2 == b1 * a2. Then { (a / b, 1) for b nonzero, (1, 0) for b zero} is one particular way of describing the space of directions. Then we can choose (1, 0) as the "direction at infinity", then order all other directions by their first component.
Be aware of floating point inaccuracies. Rational arithmetic is recommended. If you choose floating point arithmetic, be sure to use epsilon comparison when checking point-line incidence.
Algorithm 1: Just choose some value n, prepare n indexes, then choose one at query time. Unfortunately, the downside is obvious: the lookup is still a range sweep and thus linear, and the expected speedup drops as the direction gets further away from an indexed one. It also doesn't provide anything useful if the bounding region is much bigger than the region where most of the points are (you could search extremal points separately from the dense region, however).
The theoretical lookup speed is still linear.
In order to achieve logarithmic lookup this way, we need an index for every possible direction. Unfortunately, we can't have infinitely many indexes. Fortunately, similar directions still produce similar indexes - indexes that differ in only few swaps. If the directions are similar enough, they will produce identical indexes. Thus, we can use the same index for an entire range of directions. Namely, only directions such that two different points lie on the same line can cause a change of index.
Algorithm 2 achieves the logarithmic lookup time at the cost of a huge index:
When preparing:
For each pair of points (A, B), determine the direction from A to B. Collect the directions into an ordered set, calling the set the set of significant directions.
Turn this set into a list and add the "direction at infinity" to both ends.
For each pair of consecutive significant directions, choose an arbitrary direction within that range and prepare an index of all points for that direction. Collect the indexes into a list. Do not store any specific values of key in this index, only references to points.
Prepare an index over these indexes, where the direction is the key.
When looking up points by a line:
determine the line direction.
look up the right point index in the index of indexes. If the line direction falls at the boundary between two ranges, choose one arbitrarily. If not, you are guaranteed to find at most one point on the line.
Since there are only O(n^2) significant directions, there are O(n^2) ranges in this index. The lookup will take O(log n) time to find the right one.
look up the points in the index for this range, using the position with respect to the line direction as the key. This lookup will take O(log n) time.
Slight improvement can be obtained because the first and the last index are identical if the "direction at infinity" is not among the significant directions. Further improvements can be performed depending on what indexes are used. An array of indexes into an array of points is very compact, but if a binary search tree (such as a red-black tree or an AVL tree) is used for the index of points, you can do further improvements by merging subtrees identical by value to be identical by reference.
If the points are uniformly distributed, you could divide the plane in a Sqrt(n) x Sqrt(n) grid. Every gridcell contains 1 point on average, which is a constant.
Every line intersects at most 2 * Sqrt(n) grid cells [right? Proof needed :)]. Inspecting those cells takes O(Sqrt(n)) time, because each cell contains a constant number of points on average (of course this does not hold if the points have some bias).
Compute the bounding box of all of your points
Divide that bounding box in a uniform grid of x by y cells
Store each of your point in the cell it belongs to
Now for each line you want to test, all you have to do is find the cells it intersects, and test the points in those cells with "distance to line = 0".
Of course, it's only efficient if you are going to test many lines against a given set of points.
You can try the following:
For each point, find the distance from the point to the line.
Even simpler: for each point, substitute its coordinates into the line equation; if the equation holds (meaning 0 = 0), the point is on the line.
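A minimal Python sketch of the distance test, assuming the line is given by two distinct points a and b and that a small tolerance eps replaces the exact "0 = 0" comparison for floating-point coordinates; the function name is hypothetical.

```python
import math

def point_on_line(p, a, b, eps=1e-9):
    """True if p lies within eps of the infinite line through a and b."""
    # The cross product of (b - a) and (p - a) is twice the triangle area;
    # dividing by |b - a| gives the perpendicular distance from p to the line.
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / math.hypot(b[0] - a[0], b[1] - a[1]) <= eps
```

Raising eps turns this into the "slightly off the line" threshold mentioned in the question.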
If you have many points, there is another way.
If you can sort the points, create two sorted lists:
1 sorted by x value
2 sorted by y value
Let's say that your line starts at (x1,y1) and ends at (x2,y2).
It's easy to filter out all the points whose x value is not within [x1,x2] or whose y value is not within [y1,y2].
If no points remain, there are no points on this line.
Now split the line in two, so you have two lines, and run the same process again on each; you can see where this is going.
Once you have a small enough number of points (for you to choose), let's say 10, check whether they are on the line in the usual way.
This also enables you to get as near to the line as you need, and to skip places where there are no relevant points.
If you have enough memory, it is possible to use a Hough-transform-like approach.
Fill an r-theta array with lists of matching points (not counters). Then for every line find its r-theta equation, and check the points from the list at the given r-theta coordinates.
The subtle part is how to choose the array resolution.

Equidistant points across a cube

I need to initialize some three dimensional points, and I want them to be equally spaced throughout a cube. Are there any creative ways to do this?
I am using an iterative Expectation Maximization algorithm and I want my initial vectors to "span" the space evenly.
For example, suppose I have eight points that I want to space equally in a cube sized 1x1x1. I would want the points at the corners of a cube with a side length of 0.333, centered within the larger cube.
A 2D example is below. Notice that the red points are equidistant from each other and the edges. I want the same for 3D.
In cases where the number of points does not have an integer cube root, I am fine with leaving some "gaps" in the arrangement.
Currently I am taking the cube root of the number of points and using that to calculate the number of points and the desired distance between them. Then I iterate through the points and increment the X, Y and Z coordinates (staggered so that Y doesn't increment until X loops back to 0, same for Z with regard for Y).
If there's an easy way to do this in MATLAB, I'd gladly use it.
The sampling strategy you are proposing is known as a Sukharev grid, which is the optimal low dispersion sampling strategy. In cases where the number of samples is not n^3, the selection of which points to omit from the grid is unimportant from a sampling standpoint.
In practice, it's possible to use low discrepancy (quasi-random) sampling techniques to achieve very good results in three dimensions, You might want to look at using Halton and Hammersley sequences.
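For reference, a Halton sequence is short to write by hand. The sketch below is in Python rather than MATLAB and assumes the common choice of bases 2, 3 and 5 for the three axes; the function names are made up for the example.

```python
def radical_inverse(i, base):
    """Mirror the base-`base` digits of i around the radix point."""
    result, f = 0.0, 1.0 / base
    while i > 0:
        result += (i % base) * f
        i //= base
        f /= base
    return result

def halton_3d(count):
    """First `count` Halton points in the unit cube, using bases 2, 3 and 5."""
    return [tuple(radical_inverse(i, b) for b in (2, 3, 5))
            for i in range(1, count + 1)]
```

Unlike a grid, this works for any number of points, and adding more points later keeps the earlier ones in place.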
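You'll have to define the problem in more detail for the cases where the number of points isn't a perfect cube. However, for the cases where the number of points is a cube, you can use:
l = linspace(0, 1, n+2); % assumed definition of l: n+2 evenly spaced values on [0, 1]
x = l(2:n+1); y = x; z = x; % drop the two boundary values
[X, Y, Z] = meshgrid(x, y, z);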
Then for each position in the matrices, the coordinates of that point are given by the corresponding elements of X, Y, and Z. If you want the points listed in a single matrix, such that each row represents a point, with the three columns for x, y, and z coordinates, then you can say:
points(:,1) = reshape(X, [], 1);
points(:,2) = reshape(Y, [], 1);
points(:,3) = reshape(Z, [], 1);
You now have a list of n^3 points on a grid throughout the unit cube, excluding the boundaries. As others have suggested, you can probably randomly remove some of the points if you want fewer points. This would be easy to do, by using randi([1 n^3], a, 1) to generate a indices of points to remove (the lower bound is 1 because MATLAB indices start at 1). (Don't forget to check for duplicates in the matrix returned by randi(), otherwise you might not delete enough points.)
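The same interior grid can be sketched outside MATLAB as well. The Python version below assumes n points per axis placed at i/(n+1), so that neighbouring points and the cube faces are all the same distance apart; the name grid_points is invented for the example.

```python
import itertools

def grid_points(n):
    """n^3 points in the unit cube, equally spaced from each other and the faces."""
    axis = [(i + 1) / (n + 1) for i in range(n)]
    return list(itertools.product(axis, repeat=3))
```

With n = 2 this gives the eight corners of a cube of side 1/3 centred in the unit cube, matching the example in the question.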
This looks related to sphere packing.
Choose the points randomly within the cube, and then compute vectors to the nearest neighbor or wall. Then, extend the endpoints of the smallest vector by exponentially decaying step size. If you do this iteratively, the points should converge to the optimal solution. This even works if the number of points is not cubic.
A good random generator could give a usable first approximation, maybe with a later filter to reposition (again randomly) the worst offenders.
