Reduce the data points on a graph? - algorithm

I have a huge graph, thousands of data points.
Plotting the graph as is will create a mess with too many lines.
Question: what's the best way to reduce the data points?
Example: say my graph is 1000 data points and I need to bring this to 100.
I tried:
a) take each group of 10 data points and replace it with a single point at their average. This method produced terrible results and the graph looked like something else entirely.
b) take the 1st of every 10 data points. This was better than (a), but the graph still looked different.
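For reference, the two approaches (a) and (b) in code form, assuming the data is a simple list of y-values:

def downsample_average(data, factor=10):
    """(a) Replace each group of `factor` points with their average."""
    return [sum(data[i:i + factor]) / len(data[i:i + factor])
            for i in range(0, len(data), factor)]

def downsample_first(data, factor=10):
    """(b) Keep only the first point of each group of `factor` points."""
    return data[::factor]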

There is the Douglas-Peucker algorithm for simplifying curves: it removes points while preserving the overall shape of the curve.
(Note that the remaining points will be distributed somewhat unevenly.)
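For illustration, a minimal sketch of Douglas-Peucker on a list of (x, y) points; epsilon controls how aggressively points are dropped (it is not a direct target count, so you may need to tune it to land near 100 points):

import math

def douglas_peucker(points, epsilon):
    """Recursively simplify a polyline, keeping points that deviate more than epsilon."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    # Find the point farthest from the line through the two endpoints.
    max_dist, max_index = 0.0, 0
    for i in range(1, len(points) - 1):
        x0, y0 = points[i]
        num = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        den = math.hypot(x2 - x1, y2 - y1) or 1e-12
        dist = num / den
        if dist > max_dist:
            max_dist, max_index = dist, i
    if max_dist <= epsilon:
        return [points[0], points[-1]]        # everything in between is close to the line
    left = douglas_peucker(points[:max_index + 1], epsilon)
    right = douglas_peucker(points[max_index:], epsilon)
    return left[:-1] + right                  # avoid duplicating the split point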

Related

Applying K-means clustering on Z-score Normalized Data

I've been working to understand how to apply k-means clustering to a small set of data for a list of companies.
The mean and standard deviation are given so that I can determine the normalized data.
For example, I have the following:
From my understanding of k-means clustering, I have to randomly pick the initial centroids, where k = 3, and keep adjusting the centroid locations until no more movement occurs, that is, until the cluster assignments stay the same from one iteration to the next.
I am having difficulty applying these procedures to my data set. I've watched and searched for many examples of how to accomplish this, step by step, but I haven't found anything that allows me to understand it.
Basically what I am supposed to do is show a scatter plot at each adjustment to the centroids.
I believe I have to calculate the distance between two data items using the Euclidean distance formula, but does that mean the distance between the z-score sales and the z-score fuel cost, or what? This is why I am lost, even after reading through about a dozen PowerPoints and watching multiple videos.
This seems to be the best example I've come across, but even then, I'm still a bit lost due to my example being slightly different than the one introduced: http://www.indiana.edu/~dll/Q530/Q530_kk.pdf
The most progress I've made was coming across a variety of data mining software, such as WEKA, Orange, various Excel add-ons such as XLMiner, etc. However, they seem to provide the end result, not the procedures required to get there.
Any help is appreciated. If more information is needed, please let me know.
Thank you.
Edit: I've found some more solutions and thought I should add them in case anyone runs into the same issues.
1) I calculated the Euclidean distance using this Excel formula mentioned on this video: http://www.lynda.com/Excel-tutorials/Calculating-distance-centroid/165438/175003-4.html
This is what the formula looks like: =SQRT((B28-$B$52)^2+(C28-$C$52)^2) keeping in mind that each cell represents where your data is contained.
In this case my cells are listed in the image here: http://i.imgur.com/W44km64.png
This has given me the following table: http://i.imgur.com/miTiVj5.png
You are right on with the process. Personally, I'd view your data as 2D, just the (x, y) pairs of Sales and Fuel Cost, though you could use all 4 attributes and just have 4D points instead.
Step 1: Either pick random centers (3 of them c_1, c_2, c_3), or split up your data into 3 random clusters. If you randomly split the data into 3 clusters, you then compute the mean of all the points in each cluster. Those 3 means become the three centers. (Here by mean, I mean the average of each coordinate... think of them as vectors and average the vector.)
Step 2: Each center represents one of the three clusters. For each point, compute the distance to each center (this could be Euclidean distance, or any other distance metric). Each point is moved into the cluster whose center is the closest. I.e. if point i is closest to center j, then regardless of which cluster point i was in, it moves to cluster j. Keep track of whether or not any point moves to a new cluster. This is used as a part of your stopping condition in Step 3.
Step 3: After all the points have moved to the cluster nearest them, recompute the centers by averaging together all the points in each cluster. Then, go back to 2 and repeat until no points change which cluster they are in.
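To make those three steps concrete, here is a minimal k-means sketch on 2D points (z-scored Sales and Fuel Cost, as suggested above); k = 3 and the random initialisation are taken from the question:

import random
import math

def kmeans(points, k=3, max_iter=100):
    """points: list of (x, y) tuples, e.g. (z-score sales, z-score fuel cost)."""
    centers = random.sample(points, k)                      # Step 1: pick random centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                    # Step 2: assign to nearest center
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        new_centers = [                                     # Step 3: recompute centers
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                          # stop when nothing moves
            break
        centers = new_centers
    return centers, clusters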

3D clustering Algorithm

Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points which have the largest number of neighbors within a given distance R. Another condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform; it is very common for certain regions of the space to contain a lot of points.
Goal:
To find an algorithm that can scale well to many processors and has a small memory requirement.
Thoughts:
Normal spatial decomposition is not sufficient for this kind of problem because of the non-uniform distribution. An irregular spatial decomposition that evenly divides the number of points per region may help. I would really appreciate it if someone could shed some light on how to solve this problem.
Use an Octree. For 3D data with a limited value domain, it scales very well to huge data sets.
Many of the aforementioned methods such as locality sensitive hashing are approximate versions designed for much higher dimensionality where you can't split sensibly anymore.
Splitting at each level into 8 bins (2^d for d = 3) works very well. Since you can stop subdividing when a cell contains too few points, and build a deeper tree where there are a lot of points, this should fit your requirements quite well.
For more details, see Wikipedia:
https://en.wikipedia.org/wiki/Octree
Alternatively, you could try to build an R-tree. But the R-tree tries to stay balanced, which makes it harder to find the densest areas. For your particular task, the corresponding "drawback" of the Octree (its unbalanced depth) is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found in approximately the same time. However, you are only interested in the dense areas, which will be found on the longest paths in the Octree without even having to look at the actual points yet!
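For illustration, a minimal counts-only octree sketch along these lines; the capacity, the depth limit, and the fact that points inserted before a cell splits stay counted only at that cell are all simplifying assumptions:

class OctreeNode:
    def __init__(self, center, half_size, capacity=1000, depth=0, max_depth=12):
        self.center, self.half_size = center, half_size
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.count = 0
        self.children = None                      # lazily created list of 8 children

    def insert(self, p):
        self.count += 1                           # count covers the whole subtree
        if self.children is None:
            if self.count > self.capacity and self.depth < self.max_depth:
                self._split()                     # note: earlier points stay counted here only
            else:
                return
        self._child_for(p).insert(p)

    def _split(self):
        cx, cy, cz = self.center
        h = self.half_size / 2
        self.children = [OctreeNode((cx + dx * h, cy + dy * h, cz + dz * h), h,
                                    self.capacity, self.depth + 1, self.max_depth)
                         for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)]

    def _child_for(self, p):
        cx, cy, cz = self.center
        i = (4 if p[0] >= cx else 0) + (2 if p[1] >= cy else 0) + (1 if p[2] >= cz else 0)
        return self.children[i]

After inserting all points, the deepest, highest-count leaves mark the dense regions without ever storing the points themselves.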
I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.
It's worth investigating locality-sensitive hashing. Dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that, for a given set of points divided into buckets, the points satisfying your criteria will be found in the fullest buckets.
Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).
If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.
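As a very rough sketch of that idea (not a full LSH implementation), you can hash each point to a cubic cell of side R and count cell occupancy; the fullest cells and their neighbours become the candidate dense regions:

from collections import Counter

def bucket_counts(points, r):
    """Hash 3D points into cubic cells of side r and count occupancy per cell."""
    counts = Counter()
    for x, y, z in points:
        counts[(int(x // r), int(y // r), int(z // r))] += 1
    return counts

# Candidate dense regions: the fullest cells (also check their 26 neighbours,
# since a dense neighbourhood of radius R can straddle cell boundaries).
# top_cells = bucket_counts(points, R).most_common(50)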
Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.
Here are some possible parts of a solution. There are various choices at each stage, which will depend on Ncluster, on how fast the data changes, and on what you want to do with the means. Three steps: quantize, box, K-means.
1) Quantize: reduce the input XYZ coordinates to say 8 bits each, by taking 2^8 percentiles of X, Y, Z separately. This will speed up the whole flow without much loss of detail. You could sort all 1G points, or just a random 1M, to get 8-bit x0 < x1 < ... x256, y0 < y1 < ... y256, z0 < z1 < ... z256 with 2^(30-8) points in each range. To map float X -> 8-bit x, an unrolled binary search is fast; see Bentley, Pearls p. 95.
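A sketch of the quantize step using NumPy (assumed available); percentile edges are computed from a random sample, as suggested above, and reused for all points:

import numpy as np

def make_quantizer(sample, nbit=8):
    """Return a function mapping float values to nbit codes via percentile bin edges."""
    edges = np.percentile(sample, np.linspace(0, 100, 2**nbit + 1)[1:-1])
    return lambda values: np.searchsorted(edges, values).astype(np.uint8)

# Quantize each axis separately from a random sample of ~1M points, e.g.:
# qx = make_quantizer(x_sample); x_codes = qx(all_x)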
Added: Kd trees split any point cloud into different-sized boxes, each with ~ Leafsize points; much better than splitting X, Y, Z as above. But afaik you'd have to roll your own Kd tree code to split only the first say 16M boxes, and keep counts only, not the points.
2) Box: count the number of points in each 3D box [xj .. xj+1, yj .. yj+1, zj .. zj+1]. The average box will have 2^(30-3*8) points; the distribution will depend on how clumpy the data is. If some boxes are too big or get too many points, you could a) split them into 8, or b) track the centre of the points in each box; otherwise just take box midpoints.
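A sketch of the counting pass, assuming the quantized codes from step 1; it also tracks the running centre of each box, as option (b) suggests:

from collections import defaultdict

def box_stats(codes_x, codes_y, codes_z, xs, ys, zs):
    """Count points per (qx, qy, qz) box and track the running centroid of each box."""
    counts = defaultdict(int)
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for qx, qy, qz, x, y, z in zip(codes_x, codes_y, codes_z, xs, ys, zs):
        key = (qx, qy, qz)
        counts[key] += 1
        s = sums[key]
        s[0] += x; s[1] += y; s[2] += z
    centres = {k: (s[0] / counts[k], s[1] / counts[k], s[2] / counts[k])
               for k, s in sums.items()}
    return counts, centres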
3) K-means clustering on the 2^(3*8) box centres. (Google parallel "k means" -> 121k hits.) This depends strongly on K aka Ncluster, and also on your radius R. A rough approach would be to grow a heap of the say 27*Ncluster boxes with the most points, then take the biggest ones subject to your radius constraint. (I like to start with a minimum spanning tree, then remove the K-1 longest links to get K clusters.)
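A sketch of that rough approach, reusing the per-box counts and centres from the previous step; the greedy pass over the heap output is just one simple way to apply the radius constraint:

import heapq
import math

def top_boxes(counts, centres, ncluster, radius):
    """Greedily pick the fullest boxes whose centres are pairwise farther than radius apart."""
    candidates = heapq.nlargest(27 * ncluster, counts.items(), key=lambda kv: kv[1])
    chosen = []
    for key, cnt in candidates:
        c = centres[key]
        if all(math.dist(c, centres[k]) > radius for k, _ in chosen):
            chosen.append((key, cnt))
        if len(chosen) == ncluster:
            break
    return chosen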
See also Color quantization. I'd make Nbit, here 8, a parameter from the beginning. What is your Ncluster?
Added: if your points are moving in time, see collision-detection-of-huge-number-of-circles on SO.
I would also suggest using an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (i.e. every 3D box).
After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, the OctoMap is very easy to adapt.
Here you can see how it was extended to model the point distribution using a planar model.
Just an idea: create a graph with the given points and an edge between two points whenever their distance is < R.
Creating this kind of graph is similar to spatial decomposition. Your questions can then be answered with local search in the graph: the first answer is the set of vertices with maximum degree, the second is a maximal independent (mutually unconnected) set among those max-degree vertices.
I think both the graph creation and the search can be made parallel. This approach can have a large memory requirement; splitting the domain and working with graphs for smaller volumes can reduce the memory needed.
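At a small scale the idea might look like the sketch below; the brute-force neighbour counting stands in for whatever spatial decomposition you would use to build the graph at a billion points:

import math

def top_n_separated(points, n, r):
    """Count neighbours within r for each point, then greedily pick the highest-degree
    points that are pairwise farther than r apart."""
    deg = [sum(1 for q in points if 0 < math.dist(p, q) <= r) for p in points]
    order = sorted(range(len(points)), key=lambda i: deg[i], reverse=True)
    chosen = []
    for i in order:
        if all(math.dist(points[i], points[j]) > r for j in chosen):
            chosen.append(i)
        if len(chosen) == n:
            break
    return chosen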

Space partitioning algorithm

I have a set of points which are contained within a rectangle. I'd like to split the rectangle into subrectangles based on point density (specifying either the number of subrectangles or the desired density, whichever is easiest).
The partitioning doesn't have to be exact (almost any approximation better than a regular grid would do), but the algorithm has to cope with a large number of points - approx. 200 million. The desired number of subrectangles, however, is substantially lower (around 1000).
Does anyone know any algorithm which may help me with this particular task?
Just to understand the problem.
The following is crude and performs badly, but I want to know if the result is what you want:
Assumption> The number of rectangles is even
Assumption> The point distribution is markedly 2D (no big accumulation in one line)
Procedure>
Bisect n/2 times in each axis, looping from one end to the other of each previously determined rectangle, counting "passed" points and storing the number of passed points at each iteration. Once counted, bisect the rectangle at the position selected by the counts from each loop.
Is that what you want to achieve?
I think I'd start with the following, which is close to what #belisarius already proposed. If you have any additional requirements, such as preferring 'nearly square' rectangles to 'long and thin' ones you'll need to modify this naive approach. I'll assume, for the sake of simplicity, that the points are approximately randomly distributed.
1. Split your initial rectangle in 2 with a line parallel to the short side of the rectangle and running exactly through the mid-point.
2. Count the number of points in both half-rectangles. If they are equal (enough) then go to step 4. Otherwise, go to step 3.
3. Based on the distribution of points between the half-rectangles, move the line to even things up again. So if, perchance, the first cut split the points 1/3, 2/3, move the line half-way into the heavy half of the rectangle. Go to step 2. (Be careful not to get trapped here, moving the line in ever decreasing steps first in one direction, then the other.)
4. Now, pass each of the half-rectangles in to a recursive call to this function, at step 1.
I hope that outlines the proposal well enough. It has limitations: it will produce a number of rectangles equal to some power of 2, so adjust it if that's not good enough. I've phrased it recursively, but it's ideal for parallelisation. Each split creates two tasks, each of which splits a rectangle and creates two more tasks.
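If it helps, here's a minimal sketch of that proposal; rather than nudging the cut line iteratively, it cuts directly at the median coordinate along the longer side, which gives the same balanced split in one step (the recursion depth controls the power-of-two number of sub-rectangles):

def split_rect(points, rect, depth):
    """points: list of (x, y); rect: (xmin, ymin, xmax, ymax).
    Returns a list of (sub_rect, points_in_it) pairs, up to 2**depth of them."""
    if depth == 0 or len(points) < 2:
        return [(rect, points)]
    xmin, ymin, xmax, ymax = rect
    axis = 0 if (xmax - xmin) >= (ymax - ymin) else 1      # cut across the longer side
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    cut = pts[mid][axis]                                   # median coordinate -> equal point counts
    if axis == 0:
        left_rect, right_rect = (xmin, ymin, cut, ymax), (cut, ymin, xmax, ymax)
    else:
        left_rect, right_rect = (xmin, ymin, xmax, cut), (xmin, cut, xmax, ymax)
    return (split_rect(pts[:mid], left_rect, depth - 1) +
            split_rect(pts[mid:], right_rect, depth - 1))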
If you don't like that approach, perhaps you could start with a regular grid with some multiple (10 - 100 perhaps) of the number of rectangles you want. Count the number of points in each of these tiny rectangles. Then start gluing the tiny rectangles together until the less-tiny rectangle contains (approximately) the right number of points. Or, if it satisfies your requirements well enough, you could use this as a discretisation method and integrate it with my first approach, but only place the cutting lines along the boundaries of the tiny rectangles. This would probably be much quicker as you'd only have to count the points in each tiny rectangle once.
I haven't really thought about the running time of either of these; I have a preference for the former approach 'cos I do a fair amount of parallel programming and have oodles of processors.
You're after a standard Kd-tree or binary space partitioning tree, I think. (You can look it up on Wikipedia.)
Since you have very many points, you may wish to only approximately partition the first few levels. In this case, you should take a random sample of your 200M points--maybe 200k of them--and split the full data set at the midpoint of the subsample (along whichever axis is longer). If you actually choose the points at random, the probability that you'll miss a huge cluster of points that need to be subdivided will be approximately zero.
Now you have two problems of about 100M points each. Divide each along the longer axis. Repeat until you stop taking subsamples and split along the whole data set. After ten breadth-first iterations you'll be done.
If you have a different problem--you must provide tick marks along the X and Y axis and fill in a grid along those as best you can, rather than having the irregular decomposition of a Kd-tree--take your subsample of points and find the 0/32, 1/32, ..., 32/32 percentiles along each axis. Draw your grid lines there, then fill the resulting 1024-element grid with your points.
R-tree
Good question.
I think the area you need to investigate is "computational geometry" and the "k-partitioning" problem. There's a link that might help get you started here
You might find that the problem itself is NP-hard which means a good approximation algorithm is the best you're going to get.
Would K-means clustering or a Voronoi diagram be a good fit for the problem you are trying to solve?
That looks like cluster analysis.
Would a QuadTree work?
A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are most often used to partition a two dimensional space by recursively subdividing it into four quadrants or regions. The regions may be square or rectangular, or may have arbitrary shapes. This data structure was named a quadtree by Raphael Finkel and J.L. Bentley in 1974. A similar partitioning is also known as a Q-tree. All forms of Quadtrees share some common features:
They decompose space into adaptable cells
Each cell (or bucket) has a maximum capacity. When maximum capacity is reached, the bucket splits
The tree directory follows the spatial decomposition of the Quadtree

Appropriate similarity metrics for multiple sets of 2D coordinates

I have a collection of 2D coordinate sets (on the scale of 100K-500K points in each set) and I am looking for the most efficient way to measure the similarity of one set to another. I know of the usual ones: Cosine, Jaccard/Tanimoto, etc. However, I am hoping for some suggestions of fast/efficient metrics, especially ones that can be used to cluster by similarity.
Edit 1: The image shows what I need to do. I need to cluster all the reds, blues and greens by their shape/orientation, etc.
(image: http://img402.imageshack.us/img402/8121/curves.png)
It seems that the first step of any solution is going to be to find the centroid, or other reference point, of each shape, so that they can be compared regardless of absolute position.
One algorithm that comes to mind would be to start at the point nearest the centroid and walk to its nearest neighbors. Compare the offsets of those neighbors (from the centroid) between the sets being compared. Keep walking to the next-nearest neighbors of the centroid, or the nearest not-already-compared neighbors of the ones previously compared, and keep track of the aggregate difference (perhaps RMS?) between the two shapes. Also, at each step of this process calculate the rotational offset that would bring the two shapes into closest alignment [and whether mirroring affects it as well?]. When you are finished you will have three values for every pair of sets, including their direct similarity, their relative rotational offset (mostly only useful if they are close matches after rotation), and their similarity after rotation.
Try the K-means algorithm. It dynamically calculates the centroid of each cluster, computes the distance from each point to every centroid, and assigns each point to the nearest cluster.
Since your clustering is based on a nearness-to-shape metric, perhaps you need some form of connected component labeling. UNION-FIND can give you a fast basic set primitive.
For union-only, start every point in a different set, and merge them if they meet some criterion of nearness, influenced by local colinearity since that seems important to you. Then keep merging until you pass some over-threshold condition for how difficult your merge is. If you treat it like line-growing (only join things at their ends) then some data structures become simpler. Are all your clusters open lines and curves? No closed curves, like circles?
The crossing lines are trickier to get right: you either have to find some way to merge and then split, or you set your merge criteria to so strongly favor colinearity that you get lucky on the crossing lines.
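A minimal union-find sketch that could serve as the merge primitive; candidate_pairs and close_and_colinear are hypothetical placeholders for your nearness/colinearity criterion:

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]   # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[ri] = rj

# Merge points that satisfy your nearness/colinearity criterion:
# uf = UnionFind(len(points))
# for i, j in candidate_pairs(points):                  # hypothetical pair generator
#     if close_and_colinear(points[i], points[j]):      # hypothetical criterion
#         uf.union(i, j)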

Comparison of Lat, Long Coordinates

I have a list of more than 15 thousand latitude and longitude coordinates. Given any X,Y coordinates, what is the fastest way to find the closest coordinates on the list?
I did this once for a web site. I.e. find the dealer within 50 miles of your zip code. I used the great circle calculation to find the coordinates that were 50 miles north, 50 miles east, 50 miles south, and 50 miles west. That gave me a min and max lat and a min and max long. From there then I did a database query:
select *
from dealers
where latitude >= minlat
and latitude <= maxlat
and longitude >= minlong
and longitude <= maxlong
Since some of those results will still be more than 50 miles away, I used the great-circle formula once more on that small list of coordinates. Then I printed out the list along with the distance from the target.
Of course, if you wanted to search for points near the international date line or the poles, then this won't work. But it works great for searches inside North America!
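For reference, a sketch of the great-circle (haversine) distance used for that final filtering step; the 3959-mile Earth radius is an approximation:

import math

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in miles between two lat/long points (degrees)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3959 * math.asin(math.sqrt(a))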
You will want to use a geometric construction called a Voronoi diagram. This divides up the plane into a number of areas, one for each point, that encompass all the points that are closest to each of your given points.
The code for the exact algorithms to create the Voronoi diagram and arrange the data structure lookups is too large to fit in this little edit box. :)
#Linor: That's essentially what you would do after creating a Voronoi diagram. But instead of making a rectangular grid, you can choose dividing lines that closely match the lines of the Voronoi diagram (this way you will get fewer areas that cross dividing lines). If you recursively divide your Voronoi diagram in half along the best dividing line for each subdiagram, you can then do a tree search for each point you want to look up. This requires a bit of work up front but saves time later. Each lookup would be on the order of log N where N is the number of points. 16 comparisons is a lot better than 15,000!
The general concept you're describing is nearest-neighbour search, and there is a whole raft of techniques for solving these types of queries, either exactly or approximately. The basic idea is to use a spatial partitioning technique to reduce the complexity from O(n) per query to (approximately) O(log n) per query.
KD-trees, and variants of KD-trees, seem to work very well, but quad-trees will also work. The quality of these searches depends on whether your set of 15,000 data points is static (i.e. you're not adding a lot of data points to the reference set). Mount and Arya's work on the Approximate Nearest Neighbour library is both easy to use and understand, even without a good grounding in the math. It also gives you some flexibility in the types and tolerances of your queries.
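If SciPy is available, a k-d tree over 15,000 points takes only a few lines; the random coordinates below are a placeholder for your (projected) x/y data:

import numpy as np
from scipy.spatial import cKDTree

coords = np.random.rand(15000, 2)          # placeholder for your 15K projected x, y coordinates
tree = cKDTree(coords)                     # build once, reuse for every query
dist, idx = tree.query([0.5, 0.5])         # nearest stored point to the query point
print(coords[idx], dist)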
It rather depends how many times you want to do it, and what resources are available. If you're doing the test once, then the O(log N) techniques are good. If you're doing it a thousand times on a server, constructing a bitmap lookup table would be faster, either giving the result directly or serving as the first stage of a search. 2GB of bitmap can map the whole world's lat/long to a 32-bit value at 0.011-degree pixels (1.2 km at the equator), and should fit into memory. If you're only covering a single country, or can exclude the poles, you can have a smaller map or higher resolution. For 15,000 points you probably need a much smaller map - I first sized it up as a first step for lat/long-to-postcode searches, which need higher resolution. Depending on requirements, you use the mapped value to point at the result directly, or at a short list of candidates (which allows a smaller map, but requires more subsequent processing - you're not in O(1) lookup territory any more).
You didn't specify what you meant by fastest. If you want to get the answer quickly without writing any code, I would give the gpsbabel radius filter a go.
Based on your clarifications, I would use a geometrical data structure such as a KD-tree or an R-tree. MySQL has a SPATIAL data type which does this. Other languages/frameworks/databases have libraries to support this. Basically, such a data structure embeds the points in a tree of rectangles, and searches the tree using a radius. This should be fast enough, and I believe is simpler than building a Voronoi diagram. I guess there is some threshold above which you would prefer the added performance of a Voronoi diagram so you will be ready to pay the added complexity.
This can be solved in several ways. I would first approach this problem by generating a Delaunay network connecting closest points to each other. This can be accomplished with the v.delaunay command in the open source GIS application GRASS. You could complete the problem in GRASS using one of the many network analysis modules in GRASS. Alternatively, you could use the free spatial RDBMS PostGIS to do the distance queries. The PostGIS spatial queries are considerably more powerful than those in MySQL, as they are not constrained to BBOX operations. For example:
SELECT network_id, ST_Length(geometry) from spatial_table where ST_Length(geometry) < 10;
Since you are using Longitude and Latitude, you probably want to use the Spheroid-Distance functions. With a spatial index, PostGIS scales very well for large datasets.
Even if you create a Voronoi diagram, that still means you need to compare your x, y coordinates to all 15 thousand created areas. To make that easier, the first thing that popped into my mind was to create some sort of grid over the possible values, so that you can easily place an x/y coordinate into one of the boxes in the grid; if the same is done for the list of areas, you can quickly shrink the set of candidates for comparison (because the grid is rectangular, it's possible for an area to fall into multiple grid cells).
Premature optimization is the root of all evil.
15K coordinates aren't that much. Why not iterate over the 15K coordinates and see if that's really a performance problem? You could save a lot of work and maybe it never gets too slow to even notice.
How large an area are these coordinates spread out over? What latitude are they at? How much accuracy do you require? If they're fairly close together, you can probably ignore the fact that the earth is round and just treat this as a Cartesian plane rather than messing about with spherical geometry and great-circle distances. Of course, as you get further from the equator, degrees of longitude get smaller compared to degrees of latitude, so some sort of scaling factor may be appropriate.
Start with a fairly simple distance formula and a brute force search and see how long that's going to take and if the results are accurate enough before you get fancy.
Thanks everyone for the answers.
#Tom, #Chris Upchurch: The coordinates are fairly close to each other, and they are in a relatively small area of about 800 sq km. I guess I can assume the surface to be flat. I need to process the requests over and over again, and the response should be fast enough for a good web experience.
A grid is very simple, and very fast. It's basically just a 2D array of lists. Each array entry represents the points that fall inside a grid cell. Very easy to set the grid up:
for each point p
get cell that contains p
add point to that cell's list
and it's very easy to look things up:
given a query point p
get cell that contains p
check points in that cell (and its 8 neighbors), against query point p
Alejo
Just to be contrarian: do you mean close in distance or in (driving) time? In an urban area I'd gladly drive 5 miles (5 min) on the highway rather than 4 miles (20 min stop-and-go) in another direction.
Thus if it's a 'closest' metric you need, I'd look into GIS databases with travel time metrics.
