Efficient data structure for quality threshold clustering algorithm - data-structures

I'm trying to implement the quality threshold clustering algorithm. The outline of it (taken from here) is listed below:
Initialize the threshold distance allowed for clusters and the minimum cluster size
Build a candidate cluster for each data point by including the closest point, the next closest, and so on, until the distance of the cluster surpasses the threshold
Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration
Repeat with the reduced set of points until no more cluster can be formed having the minimum cluster size
I've been reading up on some nearest neighbor search algorithms and space partitioning data structures, as they seem to be the kind of thing I need, but I cannot determine which one to use or if I'm supposed to be looking at something else.
I want to implement the data structure myself for educational purposes, and I need one that can successively return the nearest points for some point. However, since I don't know the number of times I need to query (i.e. until the threshold is exceeded), I can't use k-nearest neighbor algorithms. I've been looking mostly at quadtrees and k-d trees.
Additionally, since the algorithm constantly builds new candidate clusters, it would be interesting to use a modified data structure that uses cached information to speed up subsequent queries (but also taking point removal into account).

This algorithm sounds like a predecessor of DBSCAN (Wikipedia), which is known to work very well with R*-Tree indexes (Wikipedia). But of course, kd-trees are also an option. The main difference between these two is that R*-trees are meant for database use - they support online insertions and deletions very well, and are block oriented - while kd-trees are more of an in-memory data structure based on binary splits. R*-trees perform rebalancing, while kd-trees will slowly become unbalanced and will need to be rebuilt.
I find nearest neighbor search in R*-trees much more understandable than in k-d-trees, because you have the bounding rectangles are very intuitive.
DBSCAN also "removes" points from further consideration, but simply by marking them as already assigned. That way you don't need to update the index; and it's sufficient to bulk-load it once in the beginning. You should be able to do this for QT, too. So unless I'm mistaken, you can get the QT clustering efficiently by running DBSCAN with epsilon set to the QT clustering and minPts=2 (although one would prefer higher values in proper DBSCAN).
There are a number of DBSCAN implementations around. The one in Weka is exceptionally crappy, so stay away from it. The fpc implementation in R is okay, but could still be a lot faster. ELKI seems to be the only one with full index support, and the speed difference is massive. Their Benchmark shows a 12x speed gain by using an index on this data set, allowing them to cluster in 50 seconds instead of 603 (without index). Weka took incredible 37917 seconds, R fpc 4339 there. That aligns with my experiences, Weka has the reputation of being quite slow, and R only kicks ass at vectorized operations, once the R interpreter has to work, it is significantly slower than anything native. But it is a good example about how different the same algorithm can perform when it is implemented by different people. I would have expected this to be 2x-5x, but apparently the differences can easily be 50x from one programmer implementing the same algorithm to another.

Related

Spatial partition data structure that is better suited for a placement system than a quadtree

I want to know if there is spatial partition data structure that is better suited for a placement system than a quadtree. By better suited I mean for the data structure to have a O(logn) time complexity or less when search querying it and using less memory. I want to know what data structure can organize my data in such a way that querying it is faster than a quadtree. Its all 2D and its all rectangles which should never overlap. I currently have a quadtree done and it works great and its fast, I am just curious to know if there is a data structure that uses less resources and its faster than a quadtree for this case.
The fastest is probably brute forcing it on a GPU.
Also, it is really worth trying out different implementations, I found performance differences between implementations to be absolutely wild.
Another tip: measure performance with realistic data (potentially multiple scenarios), data and usage characteristics can have enormous influence on index performance.
Some of these characteristics are (you already mentioned "rectangle data" and "2D"):
How big is your dataset
How much overlap do you have between rectangles?
Do you need to update data often?
Do you have a large variance between small and large rectangles?
Do you have dense cluster of rectangles?
How large is the are you cover?
Are your coordinates integers or floats?
Is it okay if the execution time of operations varies or should it be consistent?
Can you pre-load data? Do you need to update the index?
Quadtrees can be a good initial choice. However they have some problems, e.g.:
They can get very deep (and inefficient) with dense clusters
They don't work very well when there is a lot of overlap between rectangles
Update operations may take longer if nodes are merged or split.
Another popular choice are R-Trees (I found R-star-Trees to be the best). Some properties:
Balanced (good for predictable search time but bad because update times can be very unpredictable due to rebalancing)
Quite complex to implement.
R-Trees can also be preloaded (takes longer but allows queries to be faster), this is called STR-Tree (Sort-tile-recurse-R-Tree)
It may be worth looking at the PH-Tree (disclaimer: self advertisement):
Similar to a quadtree but depth is limited to the bit-width of the data (usually 32 or 64 (bits)).
No rebalancing. Merging or splitting is guaranteed to move only one entry (=cheap)
Prefers integer coordinates but works reasonably well with floating point data as well.
Implementations can be quite space efficient (they don't need to store all bit of coordinates). However, not all implementations support that. Also, the effect varies and is strongest with integer coordinates.
I made some measurements here. The measurements include a 2D dataset where I store line segments from OpenStreetMap as boxes, the relevant diagrams are labeled with "OSM-R" (R for rectangles).
Fig. 3a shows timings for inserting a given amount of data into a tree
Fig. 9a shows memory usage
Fig. 15a shows query times for queries that return on average 1000 entries
Fig. 17a shows how query performance changes when varying the query window size (on an index with 1M entries)
Fig. 41a shows average times for updating an index with 1M entries
PH/PHM is the PH-Tree, PHM has coordinates converted to integer before storing them
RSZ/RSS are two different R-Tree implementations
STR is an STR-Tree
Q(T)Z is a quadtree
In case you are using Java, have a look at my spatial index collection.
Similar collections exist for other programming languages.

Algorithm for 2D nearest-neighbour queries with dynamic points

I am trying to find a fast algorithm for finding the (approximate, if need be) nearest neighbours of a given point in a two-dimensional space where points are frequently removed from the dataset and new points are added.
(Relatedly, there are two variants of this problem that interest me: one in which points can be thought of as being added and removed randomly and another in which all the points are in constant motion.)
Some thoughts:
kd-trees offer good performance, but are only suitable for static point sets
R*-trees seem to offer good performance for a variety of dimensions, but the generality of their design (arbitrary dimensions, general content geometries) suggests the possibility that a more specific algorithm might offer performance advantages
Algorithms with existing implementations are preferable (though this is not necessary)
What's a good choice here?
I agree with (almost) everything that #gsamaras said, just to add a few things:
In my experience (using large dataset with >= 500,000 points), kNN-performance of KD-Trees is worse than pretty much any other spatial index by a factor of 10 to 100. I tested them (2 KD-trees and various other indexes) on a large OpenStreetMap dataset. In the following diagram, the KD-Trees are called KDL and KDS, the 2D dataset is called OSM-P (left diagram): The diagram is taken from this document, see bullet points below for more information.
This research describes an indexing method for moving objects, in case you keep (re-)inserting the same points in slightly different positions.
Quadtrees are not too bad either, they can be very fast in 2D, with excellent kNN performance for datasets < 1,000,000 entries.
If you are looking for Java implementations, have a look at my index library. In has implementations of quadtrees, R-star-tree, ph-tree, and others, all with a common API that also supports kNN. The library was written for the TinSpin, which is a framework for testing multidimensional indexes. Some results can be found enter link description here (it doesn't really describe the test data, but 'OSM-P' results are based on OpenStreetMap data with up to 50,000,000 2D points.
Depending on your scenario, you may also want to consider PH-Trees. They appear to be slower for kNN-queries than R-Trees in low dimensionality (though still faster than KD-Trees), but they are faster for removal and updates than RTrees. If you have a lot of removal/insertion, this may be a better choice (see the TinSpin results, Figures 2 and 46). C++ versions are available here and here.
Check the Bkd-Tree, which is:
an I/O-efficient dynamic data structure based on the kd-tree. [..] the Bkd-tree maintains its high space utilization and excellent
query and update performance regardless of the number of updates performed on it.
However this data structure is multi dimensional, and not specialized to lower dimensions (like the kd-tree).
Play with it in bkdtree.
Dynamic Quadtrees can also be a candidate, with O(logn) query time and O(Q(n)) insertion/deletion time, where Q(n) is the time
to perform a query in the data structure used. Note that this data structure is specialized for 2D. For 3D however, we have octrees, and in a similar way the structure can be generalized for higher dimensions.
An implentation is QuadTree.
R*-tree is another choice, but I agree with you on the generality. A r-star-tree implementations exists too.
A Cover tree could be considered as well, but I am not sure if it fits your description. Read more here,and check the implementation on CoverTree.
Kd-tree should still be considered, since it's performance is remarkable on 2 dimensions, and its insertion complexity is logarithic in size.
nanoflann and CGAL are jsut two implementations of it, where the first requires no install and the second does, but may be more performant.
In any case, I would try more than one approach and benchmark (since all of them have implementations and these data structures are usually affected by the nature of your data).

Quadtree performance issue

The problem I have is that a game I work on uses a quadtree for fast proximity detection, used for range checks when weapons are firing. I'm using the classic "4 wide" quadtree, meaning that I subdivide when I attempt to add a 5th child node to an already full parent node.
Initially the set of available targets was fairly evenly spread out, so the quadtree worked very well. Due to design changes, we now get clusters of large numbers of enemies in a relatively small space, leading to performance problems because the quadtree becomes significantly unbalanced.
There are two possible solutions that occur to me, either modify the quadtree to handle this, or switch to an alternative representation.
The only other representation I'm familiar with is a spatial hash, and not very familiar at that. From what I understand, this risks suffering the same problem since the cluster would wind up in a relatively small number of hash buckets. From what I know of it, a BSP is a possible solution that will deal with the uneven distribution better than a quadtree or spatial hash.
No fair, I know, there are actually three questions now.
Are there any modifications I can make to the quadtree, e.g. increasing the "width" of nodes that would help deal with this?
Is it even worth my time to consider a spatial hash?
Is a BSP, or some other data structure a better bet to deal with the uneven distribution?
I usually use quadtree with at least 10 entries per node, but you'll have to try it out.
I have no experience with spatial hashing.
Other structures you could look into are:
KD-Trees: They are quite simple to implement and are also good for neighbour search, but get slower with string clustering. They are a bit slow to update and may get imbalanced.
R*Tree: More complex, very good with neighbour search, but even slower to update than KD-Trees. They won't get imbalanced because of automatic rebalancing. Rebalancing is mostly fast, but in extreme cases it can occasionally slow thing down further.
PH-Tree: Quite complex to implement. Good neighbour search. Has very good update speed (like quadtree), maximum depth is limited by the bit width of your coordinates (usually 32 or 64bit), so they can't really get imbalanced. Scales very well with large datasets (1 million and more), I have little experience with small datasets.
If you are using Java, I have Apache licensed versions available here (R*Tree) and here (PH-Tree).

Which data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?

Here's my scenario. Consider a set of events that happen at various places and times - as an example, consider someone high above recording the lightning strikes in a city during a storm. For my purpose, lightnings are instantaneous and can only hit certain locations (such as high buildings). Also imagine each lightning strike has a unique id so one can reference the strike later. There are about 100,000 such locations in this city (as you guess, this is an analogy as my current employer is sensitive about the actual problem).
For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of the clusters of more than 1 event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What is being considered as 'short' could be predefined for a given clustering attempt. That is, I can set it to, say, 3 minutes, than run the algorithm; later try with 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine a 'strength' of clustering and recommend that for a given input, the most compact clustering is achieved by using a particular value for 'short', but this is not required initially.
For phase 2, I'd like to take into consideration the amplitude of the strike (i.e., a real number) and look for clusters that are both within a short time and with similar amplitudes.
I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified apriori. I'm not asking for someone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would be useful in order to save some time. Specifically, what clustering algorithms are appropriate for when the number of clusters is unknown.
Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them per location. So each location has its own time-series of events that can thus be analyzed independently.
Some technical details:
- as the dataset is not that large, it can fit all in memory.
- parallel processing is a nice to have, but not essential. I only have a 4-core machine and MapReduce and Hadoop would be too much.
- the language I'm mostly familiar with is Java. I haven't yet used R and the learning curve for it would probably be too much for what time I was given. I'll have a look at it anyway in my spare time.
- for the time being, using tools to run the analysis is ok, I don't have to produce just code. I'm mentioning this because probably Weka will be suggested.
- visualization would be useful. As the dataset is large enough so it doesn't fit in memory, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI, it's just a nice capability to use for checking the results produced with a tool.
Thank you. Questions that I found useful are: How to find center of clusters of numbers? statistics problem?, Clustering Algorithm for Paper Boys, Java Clustering Library, How to cluster objects (without coordinates), Algorithm for detecting "clusters" of dots
I would suggest you to look into Mean Shift Clustering. The basic idea behind mean shift clustering is to take the data and perform a kernel density estimation, then find the modes in the density estimate, the regions of convergence of data points towards modes defines the clusters.
The nice thing about mean shift clustering is that the number of clusters do not have to be specified ahead of time.
I have not used Weka, so I am not sure if it has mean shift clustering. However if you are using MATLAB, here is a toolbox (KDE toolbox) to do it. Hope that helps.
Couldn't you just use hierarchical clustering with the difference in times of strikes as part of the distance metric?
It is too late, but still I would add it:
In R, there is a package fpc and it has a method pamk() which provides you the clusters. Using pamk(), you do not need to mention the number of clusters intially. It calculates itself the number of clusters in the input data.

Is a kd-tree suitable for 4D space-time data (x,y,z,time)?

I want to use a data structure for sorting space-time data (x,y,z,time).
Currently a processing algorithm searches a set of 4D (x,y,z,time) points, given a spherical (3d) spacial radius and a linear (1d) time radius, marking for each point, which other points are within those radii. The reason is that after processing, I can ask any 4D point for all of its neighbours in O(1) time.
However in some common configurations of space and time radii, the first run of the algorithm takes about 12 hours. Believe it or not, that's actually fast compared to what exists in our industry. Nevertheless, I want to help speed up the initial runs and so I want to know: Is a kd-tree suitable for 4D space-time data?
Note that I am not looking for implementations of nearest-neighbour search or k-nearest-neighbours search.
More Info:
An example dataset has 450,000 4D points.
Some datasets are time-dense so ordering by time certainly saves processing, but still leads to many distance checks.
Time is represented by Excel-style dates, with typical ranges between 30,000-39,000 (approximate). The space ranges are sometimes higher values, sometimes lower values, but the range between each space co-ordinate is similar to time (e.g. maxX-minX ~ maxT-minT).
Even more info:
I thought I'd add some more slightly irrelevant data in case anybody has dealt with a similar dataset.
Basically I'm working with data that represents space-time events that are recorded and corroborated by multiple sensors. Error is involved, so only events that meet an error threshold are included.
The time span of these datasets ranges between 5-20 years of data.
For the really old data (>8 years old), the events were often very spacially dense for two reasons: 1) there were relatively few sensors available back then, and 2) the sensors were placed close together so that nearby events could be properly corroborated with low error. Further events could be recorded, but they had too high an error
For the newer data (<8 years old), the events are often very time dense, for the inverse reasons: 1) there are usually many sensors available, and 2) the sensors are placed at regular intervals over a larger distance.
As a result, the datasets cannot typically be said to be only time-dense or only spacially dense (except in the case of datasets that contain only new data).
Conclusion
I clearly should be asking more questions on this site.
I will be testing out several solutions over the next while which will include the 4d kd-tree, a 3d kd-tree followed by time distance check (suggested by Drew Hall), and the current algorithm I have. Also, I have been suggested another data structure called TSP (time space partitioning) tree, which uses an octree for space and a bsp on each node for time, so I may test that as well.
Assuming I remember, I'll be sure to post some profiling benchmarks on different time/space radii configurations.
Thanks all
To expand a little bit on my comments to an answer above:
According to the literature, kd-trees require data with Euclidean coordinates. They are probably not strictly necessary, but they're certainly sufficient: guaranteeing that all coordinates are Euclidean ensures that the normal rules of space apply, and makes it possible to easily partition points by their location and build up the tree structure.
Time is a little bit strange. Under the rules of special relativity, you use a Minkowski metric, not a standard Euclidean metric, when you're working with time coordinates. This causes all kinds of problems (most severe among them destroying the meaning of "simultaneity"), and generally makes people afraid of time coordinates. That fear is not well-founded, though, because unless you know you're working on physics, your time coordinate almost certainly actually will be Euclidean in practice.
What does it mean for a coordinate to be Euclidean? It should be independent of all the other coordinates. Saying time is a Euclidean coordinate means that you can answer the question "Are these two points close together in time?" by looking only at their time coordinates, and ignoring any extra information. It's easy to see why not having that property might break a scheme that partitions points by the values of their coordinates; if two points can have radically different time coordinates but still be considered "close in time", then a tree which sorts them by time coordinate is not going to work very well.
An example of a Euclidean time coordinate would be any time specified in a single, consistent time zone (like UTC times). If you have two clocks, one in New York and one in Tokyo, you know that if you have two measurements labelled "12:00 UTC" then they were taken at the same time. But if the measurements are taken in local time, so one says "12:00 New York time" and one is "12:00 Tokyo time", you have to use extra information about the locations and time zones of the cities to figure out how much time elapsed between the two measurements.
So as long as your time coordinate is consistently measured and sane, it will be Euclidean, and that means it will work just fine in a kd-tree or similar data structure.
If you stored an index to your points sorted in the time dimension, couldn't you first perform an initial pruning in the 1-d time dimension, thus reducing the number of distance calculations? (Or is that an oversimplfication?)
You haven't really given enough information to answer this.
But sure, in general kd-trees are perfectly suitable for 4 (or 5 or 6 or...) dimensional data --- if the spatial (or in your case space/time-ial) distribution lends itself to kd-tree decomposition. In other words, it depends (sound familiar?).
kd-trees are just one method of spatial decomposition that lend themselves to certain localized searches. As you go to higher dimensions, the curse of dimensionality problem rears it's head, of course, but 4d isn't too bad (you probably want at least a several hundred points though).
In order to know if this will work for you, you have to analyse some other criteria. Is approximate NN search good enough (this can help a lot). Is tree balancing likely to be expensive? etc.
If your data is relatively time-dense (and relatively space-sparse), it might work best to use a 3d kd-tree on the spatial dimensions, then simply reject the points that are outside the time window of interest. That would get around your mixed space/time metric problem, at the expense of a slightly more complex point struct.

Resources