I have a data set from an RFID antenna giving the 2D x,y locations of two people moving. One person is carrying 3 RFID tags while the other is carrying 4. Both are moving along the y axis as below; the red and cyan lines are the paths the two people are walking.
The location map on an x,y scale looks like below.
Ideally, the Orange, Yellow, Blue and Gray lines (RFID x,y data points) should lie on a horizontal line above zero, while the Green, Dark blue and Sky blue lines below should lie on a horizontal line below zero.
Question
Although the lines are not straight, a visual pattern emerges that groups the above-zero lines together and the below-zero lines together. My question is: what algorithm/method can be used to compare such patterns and cluster them? (So ideally the answer should put the upper 4 lines in one cluster and the lower 3 lines in another.)
It's difficult to always treat this as linear movement, as people can walk in non-linear ways, so best-fit lines would not work. Any suggestion or shedding of light is gratefully appreciated.
You want to look at methods for clustering time series (1d) or trajectories (2d).
The approach for both is pretty much the same. First, you find a suitable distance metric (dissimilarity measure); second, you decide on a suitable clustering algorithm.
Possible Distance Metrics
Here are some example distances you could use, with some short arguments for each:
Euclidean Distance: Very basic
Dynamic Time Warping (DTW): Can account for shifts
Longest Common Subsequence (LCSS): Accounts for shifts and can handle outliers
Edit Distance with Real Penalty (ERP): Accounts for shifts and can handle outliers
More details can be found, for example, in this paper. An implementation of the distances can be found here.
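To make the DTW entry above concrete, here is a minimal sketch of the distance for 1d series; it is just a short dynamic program, and libraries like the one linked above provide optimized versions with windowing constraints:

```python
# Minimal Dynamic Time Warping distance between two 1-D sequences.
# Illustration only: O(n*m) time and memory, no window constraint.

def dtw(a, b):
    """Return the DTW distance between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match both
    return cost[n][m]
```

Note that `dtw([1, 2, 3], [1, 2, 2, 3])` is zero: the repeated `2` is absorbed by the warping, which is exactly the shift-tolerance mentioned above.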
Possible Cluster Algorithms
K-Means
DBSCAN
Single Linkage clustering
You can usually combine any distance metric with any distance-based clustering algorithm.
Opinion
Looking at your data, I would try DTW as a distance metric. If you expect two clusters, then K-Means with k=2 should work. Otherwise, you could try Single Linkage clustering, which will give you something similar to the image below.
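As a rough sketch of that combination (the toy tracks below are stand-ins for your 7 tag traces, not your real data): build the pairwise DTW matrix, then greedily merge the two closest clusters until two remain, which is single-linkage agglomeration in its simplest form.

```python
# Sketch: pairwise DTW matrix + greedy single-linkage merging into k groups.
# In practice scipy.cluster.hierarchy.linkage on a condensed distance
# matrix does the same job far more efficiently.

def dtw(a, b):
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i-1][j], cost[i][j-1], cost[i-1][j-1])
    return cost[n][m]

def single_linkage(series, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(series))]
    dist = {(i, j): dtw(series[i], series[j])
            for i in range(len(series)) for j in range(i + 1, len(series))}
    while len(clusters) > k:
        # single-linkage distance = minimum pairwise member distance
        a, b = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: min(dist[tuple(sorted((i, j)))]
                               for i in clusters[ab[0]]
                               for j in clusters[ab[1]]))
        clusters[a].extend(clusters.pop(b))
    return clusters

# Four y-tracks above zero, three below (toy stand-ins for the tag traces)
tracks = [[1, 1.2, 0.9], [1.1, 1, 1.3], [0.8, 1.1, 1], [1.2, 0.9, 1.1],
          [-1, -1.1, -0.9], [-0.9, -1, -1.2], [-1.1, -0.8, -1]]
groups = single_linkage(tracks, k=2)
```

On this toy input the two groups recovered are exactly the above-zero and below-zero tracks.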
Related
Given a "density" scalar field in the plane, how can I divide the plane into nice (low moment of inertia) regions so that each region contains a similar amount of "mass"?
That's not the best description of what my actual problem is, but it's the most concise phrasing I could think of.
I have a large map of a fictional world for use in a game. I have a pretty good idea of approximately how far one could walk in a day from any given point on this map, and this varies greatly based on the terrain etc. I would like to represent this information by dividing the map into regions, so that one day of walking could take you from any region to any of its neighboring regions. It doesn't have to be perfect, but it should be significantly better than simply dividing the map into a hexagonal grid (which is what many games do).
I had the idea that I could create a gray-scale image with the same dimensions as the map, where each pixel's color value represents how quickly one can travel through the corresponding place on the map. Well-maintained roads would be encoded as white pixels, and insurmountable cliffs would be encoded as black, or something like that.
My question is this: does anyone have an idea of how to use such a gray-scale image (the "density" scalar field) to generate my "grid" from the previous paragraph (regions of similar "mass")?
I've thought about using the gray-scale image as a discrete probability distribution, from which I can generate a bunch of coordinates, and then use some sort of clustering algorithm to create the regions, but a) the clustering algorithms would have to create clusters of a similar size, I think, for that idea to work, which I don't think they usually do, and b) I barely have any idea if any of that even makes sense, as I'm way out of my comfort zone here.
Sorry if this doesn't belong here; my idea has always been to solve it programmatically somehow, so this seemed the most sensible place to ask.
UPDATE: Just thought I'd share the results I've gotten so far, trying out the second approach suggested by @samgak - recursively subdividing regions into boxes of similar mass, finding the center of mass of each region, and creating a Voronoi diagram from those.
I'll keep tweaking, and maybe try to find a way to make it less grid-like (like in the upper right corner), but this worked way better than I expected!
Building upon @samgak's solution, if you don't want the grid-like structure, you can just add a small random perturbation to your centers. Below you can see, for example, the difference I obtain:
without perturbation
adding some random perturbation
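A minimal sketch of that perturbation step, assuming the centers are plain (x, y) tuples; the jitter scale is a map-dependent choice you would tune by eye:

```python
# Sketch: jitter the box centers slightly before building the Voronoi
# diagram, which breaks up the grid-like look of the regions.
import random

def perturb(centers, jitter=2.0, seed=0):
    """Offset each (x, y) center by a uniform random amount in [-jitter, jitter]."""
    rng = random.Random(seed)  # fixed seed keeps the map reproducible
    return [(x + rng.uniform(-jitter, jitter),
             y + rng.uniform(-jitter, jitter)) for x, y in centers]
```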
A couple of rough ideas:
You might be able to repurpose a color-quantization algorithm, which partitions color-space into regions with roughly the same number of pixels in them. You would have to do some kind of funny mapping where the darker the pixel in your map, the greater the number of pixels of a color corresponding to that pixel's location you create in a temporary image. Then you quantize that image into x number of colors and use their color values as co-ordinates for the centers of the regions in your map, and you could then create a voronoi diagram from these points to define your region boundaries.
Another approach (which is similar to how some color quantization algorithms work under the hood anyway) could be to recursively subdivide regions of your map into axis-aligned boxes by taking each rectangular region and choosing the optimal splitting line (x or y) and position to create 2 smaller rectangles of similar "mass". You would end up with a power of 2 count of rectangular regions, and you could get rid of the blockiness by taking the centre of mass of each rectangle (not simply the center of the bounding box) and creating a voronoi diagram from all the centre-points. This isn't guaranteed to create regions of exactly equal mass, but they should be roughly equal. The algorithm could be improved by allowing recursive splitting along lines of arbitrary orientation (or maybe a finite number of 8, 16, 32 etc possible orientations) but of course that makes it more complicated.
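A sketch of that recursive subdivision, assuming the map is a 2-D list of travel-cost weights (the "mass" per pixel) and only axis-aligned cuts; each rectangle is split at the cut that best balances the mass, then both halves are recursed:

```python
# Sketch: recursively split rectangles of a weight grid into 2^depth
# boxes of roughly equal "mass" (sum of pixel weights).

def mass(grid, x0, y0, x1, y1):
    """Total weight inside the half-open box [x0, x1) x [y0, y1)."""
    return sum(grid[y][x] for y in range(y0, y1) for x in range(x0, x1))

def split(grid, x0, y0, x1, y1, depth):
    if depth == 0:
        return [(x0, y0, x1, y1)]
    total = mass(grid, x0, y0, x1, y1)
    best = None
    # try every vertical and horizontal cut, keep the most balanced one
    for x in range(x0 + 1, x1):
        left = mass(grid, x0, y0, x, y1)
        cand = (abs(total - 2 * left), 'x', x)
        best = min(best, cand) if best else cand
    for y in range(y0 + 1, y1):
        top = mass(grid, x0, y0, x1, y)
        cand = (abs(total - 2 * top), 'y', y)
        best = min(best, cand) if best else cand
    if best is None:          # region too thin to split further
        return [(x0, y0, x1, y1)]
    _, axis, pos = best
    if axis == 'x':
        return (split(grid, x0, y0, pos, y1, depth - 1) +
                split(grid, pos, y0, x1, y1, depth - 1))
    return (split(grid, x0, y0, x1, pos, depth - 1) +
            split(grid, x0, pos, x1, y1, depth - 1))
```

From there, take the centre of mass of each box (not the box centre) and feed those points into a Voronoi construction, as described above.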
I am trying to extract horizontal lines from a set of 2D points generated from the photo of the model of a human torso:
The points "mostly" form horizontal(ish) lines in a more or less regular way, but with possible gaps/missing-points:
There can be regions where the lines deform a bit:
And regions with background noise:
Of course, I would need to tune things so I exclude those defective parts. What I am looking for with this question is a suggested algorithm to find lines where they are well-behaved, filling occasional gaps, avoiding noise, and terminating the lines properly upon some discontinuity condition.
I believe there could be some optimizing or voting "flood fill" variant that would score line candidates and yield only well-formed lines, but I am not experienced with this and cannot figure anything out by myself.
This dataset is in a gist here. It is important to note that the X coordinates are integers, so points are aligned vertically; the Y coordinates, though, are decimal numbers.
I would start by finding the nearest neighbor of every dot, then the second nearest neighbor on the other side (meaning: considering only the dots in the half-plane opposite the first neighbor).
If the distance to the second neighbor exceeds twice the distance to the first, ignore it.
Just doing that, I bet that you will reconstruct a great deal of the curves, with gaps left unfilled.
By estimating the local curvature along the curve (e.g. by computing the circumscribed circle of three dots, taking every other dot), you can discard noisy portions.
Then to fill the gaps, you can detect the curve endpoints and look for the nearest endpoint in an angle around the extrapolated direction.
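The first two steps above could be sketched like this (a plain O(n²) neighbor search for clarity; for your dataset size a k-d tree would be the practical choice):

```python
# Sketch: for each dot, find its nearest neighbor, then the nearest dot
# in the opposite half-plane, discarding the latter if it is more than
# twice as far as the first.
import math

def link_neighbors(points):
    links = []
    for i, p in enumerate(points):
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        d1, j1 = min(others)
        # direction to the first neighbor; the second must lie on the
        # other side, i.e. have a negative dot product with it
        vx, vy = points[j1][0] - p[0], points[j1][1] - p[1]
        opposite = [(d, j) for d, j in others
                    if (points[j][0] - p[0]) * vx + (points[j][1] - p[1]) * vy < 0]
        pair = [j1]
        if opposite:
            d2, j2 = min(opposite)
            if d2 <= 2 * d1:   # the "twice the distance" rejection rule
                pair.append(j2)
        links.append(pair)
    return links
```

Endpoints of a curve naturally come out with a single link, which is what the gap-filling step above would then try to extend.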
First step in the processing:
These are integral curves of the vector field representing the direction pattern.
So maybe start by finding, for each point, the slope vector (the predominant direction) by taking points from the neighborhood and fitting a line with least squares or performing a PCA. Increasing the neighborhood radius should let you deal with the data irregularities, picking up a larger-scale slope trend instead of local noise.
If you decide to do this, could you post the slope field you find here, so that instead of points we can see some tangents?
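A sketch of that local-PCA slope estimate; the neighborhood radius is a tuning assumption, and numpy's SVD of the centered neighborhood gives the PCA axes directly:

```python
# Sketch: estimate the predominant direction at each point by running
# PCA (via SVD) on its neighborhood; the first principal component is
# the local tangent direction (sign is arbitrary).
import numpy as np

def local_tangents(points, radius=2.0):
    pts = np.asarray(points, dtype=float)
    tangents = []
    for p in pts:
        nbrs = pts[np.linalg.norm(pts - p, axis=1) <= radius]
        centered = nbrs - nbrs.mean(axis=0)
        # right singular vectors of the centered neighborhood = PCA axes
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        tangents.append(vt[0])
    return np.array(tangents)
```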
I have to convert a given matrix of pixels (coefficients are in a range from 0 to 255, since the matrix corresponds to a black and white image) into two lists. Both of them may be composed of lists, one containing the abscissas of the points, the other the ordinates.
As you can notice in the included picture, the first case corresponds to a single curve, whereas the other two involve multiple curves crossing each other. The algorithm should be able to distinguish the two or three curves (in the last two examples), so that in the two main lists, a given sublist corresponds to a given curve.
I have absolutely no idea of what to start from...
One last thing: I'm seeking ideas on how to program this algorithm, which is why I didn't specify any particular programming language (if code helps the explanation, feel free to use any language).
Thanks in advance >^.^<
Check out the Hough transform. It is a simple voting algorithm that allows finding simple geometric shapes in images. One complication could be that your lines are not strictly straight, but it would give you equations for the lines it does find. Since your case is a little nonstandard, I'd try to understand the algorithm itself and write my own implementation.
In my first implementation (centering a circle on a square in a long-focal-depth image I took), I started with a very simple Python example I found online, rewrote it for my purposes, and later moved to C# for speed, since I needed more parameters (a higher-dimensional search space) than you need for this simple case.
In your case I would start with the simple assumption of a straight line. Then the Hough transform will give 1, 2 and 3 maxima respectively for your three cases.
The idea of the Hough transform is well described on wikipedia.
Here is just the gist of the idea:
For a straight line, think of giving each black pixel the right to vote for 180 possible lines that could go through it (one for each angle, in single-degree steps), then plotting the votes as a histogram over a 2D space, where one dimension is the angle of the line and the other is the distance from the origin (using the Hesse normal form of the line for practical reasons, rather than the common y = mx + b), and the z-dimension is the number of votes. The actual line formed by the black pixels will get more votes than any other possible line, so you are simply looking for the maximum-vote location in the transformation space (in Python/numpy that would be argmax).
If there are two lines, you will find two clear maxima, the higher one corresponding to the longer or thicker line (more votes). You can then start playing with grayscale in your image, giving non-integer votes to pixels. You can also play with the resolution of the angle, depending on the content of your problem.
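A bare-bones version of the voting scheme described above might look like this: angle resolution of 1 degree, r rounded to integer pixels, and the accumulator shifted so negative r values fit. The `r_max` bound is an assumption about the image size.

```python
# Sketch of a straight-line Hough transform over a set of points, using
# the Hesse normal form r = x*cos(theta) + y*sin(theta). Each point votes
# for every angle; peaks in the accumulator are the detected lines.
import numpy as np

def hough_lines(points, r_max=100, n_theta=180):
    thetas = np.deg2rad(np.arange(n_theta))
    # accumulator: rows = quantized r (shifted so negatives fit), cols = angle
    acc = np.zeros((2 * r_max, n_theta), dtype=int)
    for x, y in points:
        r = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[r + r_max, np.arange(n_theta)] += 1
    # strongest line = the cell with the most votes
    r_idx, t_idx = np.unravel_index(np.argmax(acc), acc.shape)
    return r_idx - r_max, np.rad2deg(thetas[t_idx]), acc

# A horizontal line y = 5 peaks near theta = 90 degrees, r = 5
r, theta_deg, acc = hough_lines([(x, 5) for x in range(20)])
```

For the multi-curve cases, instead of a single argmax you would look for the 2 or 3 strongest well-separated maxima in `acc`.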
I have polygons that define the contours of counties in the UK. These shapes are very detailed (10k to 20k points each), which makes the related computations (is point X in polygon P?) quite computationally expensive.
Thus, I would like to "subsample" my polygons to obtain a similar shape with fewer points. What are the different techniques for doing so?
The trivial one would be to take every Nth point (thus subsampling by a factor of N), but this feels too "crude". I would rather do some averaging of points, or something of that flavor. Any pointers?
Two solutions spring to mind:
1) since the map of the UK is reasonably squarish, you could choose to render a bitmap with the counties. Assign each a specific colour, and then render the borders with a 1 or 2 pixel thick black line. This means you'll only have to perform the expensive interior/exterior calculation if a sample happens to lie on the border. The larger the bitmap, the less often this will happen.
2) simplify the county outlines. You can use the Ramer–Douglas–Peucker algorithm to recursively simplify the boundaries. Just make sure you cache the results. You may also have to solve this not for entire county boundaries but for shared boundaries only, to ensure no gaps. This might be quite tricky.
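A short sketch of Ramer–Douglas–Peucker for reference: find the point farthest from the chord between the endpoints; if it is beyond a tolerance `eps`, split there and recurse, otherwise drop all intermediate points. `eps` controls how aggressively the outline shrinks.

```python
# Sketch of Ramer-Douglas-Peucker polyline simplification.
import math

def rdp(points, eps):
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1.0  # guard against coincident endpoints
    # perpendicular distance of each interior point from the chord
    dists = [abs(dy * (x - x1) - dx * (y - y1)) / norm for x, y in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[i - 1] > eps:
        # keep the farthest point and simplify both halves around it
        return rdp(points[:i + 1], eps)[:-1] + rdp(points[i:], eps)
    return [points[0], points[-1]]
```

Nearly-collinear runs collapse to their endpoints, while sharp features (beyond `eps`) survive, which is exactly why it beats taking every Nth point.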
Here you can find a project dealing exactly with your issues. Although it works primarily with an area "filled" by points, you can set it to work with a "perimeter" type definition as yours.
It uses a k-nearest neighbors approach for calculating the region.
Samples:
Here you can request a copy of the paper.
Seemingly they planned to offer an online service for requesting calculations, but I didn't test it, and it probably isn't running.
HTH!
Polygon triangulation should help here. You'll still have to check many polygons, but these are triangles now, so they are easier to check and you can use some optimizations to determine only a small subset of polygons to check for a given region or point.
As it seems you have all the algorithms you need for polygons, not only for triangles, you can also merge several triangles after triangulation if they are too small or if the triangle count gets too high.
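The per-triangle test that makes this cheap is the standard same-side check; a sketch: a point is inside a triangle when it lies on the same side of all three edges, which the signs of three cross products determine.

```python
# Sketch: point-in-triangle via the sign of three 2-D cross products.

def point_in_triangle(p, a, b, c):
    def cross(o, u, v):
        # z-component of (u - o) x (v - o)
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    # all non-negative or all non-positive => inside (or on an edge),
    # regardless of the triangle's winding order
    return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)
```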
I have a collection of 2D coordinate sets (on the scale of a 100K-500K points in each set) and I am looking for the most efficient way to measure the similarity of 1 set to the other. I know of the usuals: Cosine, Jaccard/Tanimoto, etc. However I am hoping for some suggestions on any fast/efficient ones to measure similarity, especially ones that can cluster by similarity.
Edit 1: The image shows what I need to do. I need to cluster all the reds, blues and greens by their shape/orientation, etc.
It seems that the first step of any solution is going to be to find the centroid, or other reference point, of each shape, so that they can be compared regardless of absolute position.
One algorithm that comes to mind would be to start at the point nearest the centroid and walk to its nearest neighbors. Compare the offsets of those neighbors (from the centroid) between the sets being compared. Keep walking to the next-nearest neighbors of the centroid, or the nearest not-already-compared neighbors of the ones previously compared, and keep track of the aggregate difference (perhaps RMS?) between the two shapes. Also, at each step of this process calculate the rotational offset that would bring the two shapes into closest alignment [and whether mirroring affects it as well?]. When you are finished you will have three values for every pair of sets, including their direct similarity, their relative rotational offset (mostly only useful if they are close matches after rotation), and their similarity after rotation.
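The centroid-normalization step and a crude aggregate score could be sketched like this (naive nearest-point matching rather than the full walk described above, and no rotation search; both are simplifications):

```python
# Sketch: translate each point set so its centroid is at the origin
# (position-invariance), then score similarity as the RMS of each
# point's distance to its nearest counterpart in the other set.
import math

def center(points):
    """Shift a point set so its centroid sits at the origin."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(x - cx, y - cy) for x, y in points]

def rms_similarity(a, b):
    a, b = center(a), center(b)
    # naive matching: each point in a to its nearest point in b
    sq = [min((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for q in b) for p in a]
    return math.sqrt(sum(sq) / len(sq))
```

Two identical shapes at different absolute positions score 0; the rotational offset mentioned above would be found by repeating this over candidate rotations.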
Try the K-means algorithm. It dynamically calculates the centroid of each cluster, computes the distance to all the points, and associates each with the nearest cluster.
Since your clustering is based on a nearness-to-shape metric, perhaps you need some form of connected component labeling. UNION-FIND can give you a fast basic set primitive.
For union-only, start with every point in its own set, and merge sets when they meet some criterion of nearness, influenced by local collinearity since that seems important to you. Then keep merging until you pass some over-threshold condition for how costly a merge is. If you treat it like line-growing (only joining things at their ends), some data structures become simpler. Are all your clusters open lines and curves? No closed curves, like circles?
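A sketch of the UNION-FIND primitive with a purely distance-based merge criterion (no collinearity weighting yet; that would go into the `if` condition):

```python
# Sketch: union-find with path halving, merging points closer than eps.
# The resulting labels are connected components of the nearness graph.
import math

def connected_components(points, eps):
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees flat
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # O(n^2) merge pass; a spatial index would cut this down in practice
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                union(i, j)
    return [find(i) for i in range(len(points))]
```

Replacing the plain distance test with a score that also rewards agreement with the local line direction gives the collinearity-favoring merge described above.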
The crossing lines are trickier to get right: you either have to find some way to merge then split, or you set your merge criteria to favor collinearity very strongly and hope you luck out on the crossing lines.