I have a list of tweets with their geo locations.
They are going to be displayed as a heatmap image transparently placed over a Google Map.
The trick is to find groups of locations residing next to each other and display
them as a single heatmap circle/figure of a certain heat/color, based on cluster size.
Is there a library ready to group locations on a map into clusters?
Or should I rather decide on my own clustering parameters and build a custom algorithm?
I don't know if there is a 'library ready to group locations on a map into clusters'; maybe there is, maybe there isn't. In any case, I don't recommend building your own clustering algorithm, since there are already plenty of libraries that implement this.
#recursive sent you a link with PHP code for k-means (one clustering algorithm). There is also a large Java library with other techniques (Java-ML), including k-means, hierarchical clustering, k-means++ (to select the centroids), etc.
Finally, I'd like to point out that clustering is an unsupervised algorithm: it will give you a set of clusters with data inside them, but at first glance you don't know how the algorithm clustered your data. It may cluster by location as you want, but it may also cluster by some other characteristic you don't need, so it's all about playing with the algorithm's parameters and tuning your solution.
I'm interested in the final solution you find to this problem :) Maybe you can share it in a comment when you finish this project!
K-means clustering is a technique often used for such problems.
The basic idea is this:
Given an initial set of k means m1, …, mk, the algorithm proceeds by alternating between two steps:
Assignment step: assign each observation to the cluster with the closest mean.
Update step: calculate the new means as the centroids of the observations in each cluster.
Here is some sample code in PHP.
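If Java is more your thing, here is a minimal sketch of the two steps above over 2D points (a toy illustration with naive random initialization, not a production implementation; Java-ML mentioned above does this properly):

    import java.util.Random;

    public class KMeansSketch {
        // Returns the cluster index assigned to each 2D point.
        static int[] cluster(double[][] pts, int k, int iterations) {
            Random rnd = new Random(42);
            double[][] means = new double[k][];
            for (int c = 0; c < k; c++)            // naive init: random data points as means
                means[c] = pts[rnd.nextInt(pts.length)].clone();
            int[] assign = new int[pts.length];
            for (int it = 0; it < iterations; it++) {
                // Assignment step: each observation goes to the closest mean.
                for (int p = 0; p < pts.length; p++) {
                    double best = Double.POSITIVE_INFINITY;
                    for (int c = 0; c < k; c++) {
                        double dx = pts[p][0] - means[c][0];
                        double dy = pts[p][1] - means[c][1];
                        double d = dx * dx + dy * dy;
                        if (d < best) { best = d; assign[p] = c; }
                    }
                }
                // Update step: each mean becomes the centroid of its cluster.
                double[][] sum = new double[k][2];
                int[] count = new int[k];
                for (int p = 0; p < pts.length; p++) {
                    sum[assign[p]][0] += pts[p][0];
                    sum[assign[p]][1] += pts[p][1];
                    count[assign[p]]++;
                }
                for (int c = 0; c < k; c++)
                    if (count[c] > 0) {
                        means[c][0] = sum[c][0] / count[c];
                        means[c][1] = sum[c][1] / count[c];
                    }
            }
            return assign;
        }
    }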
heatmap.js is an HTML5 library for rendering heatmaps, and has a sample for doing it on top of the Google Maps API. It's pretty robust, but only works in browsers that support canvas:
The heatmap.js library is currently supported in Firefox 3.6+, Chrome
10, Safari 5, Opera 11 and IE 9+.
You can try my PHP class Hilbert curve at phpclasses.org. It's a monster (space-filling) curve that reduces 2D complexity to 1D complexity. I use a quadkey to address a coordinate, and it has 21 zoom levels like Google Maps.
This isn't really a clustering problem. Heat maps don't work by creating clusters. Instead, they convolve the data with a Gaussian kernel. If you're not familiar with image processing, think of it as using a normal or Gaussian "stamp" and stamping it over each point. Since the overlapping stamps add up on top of each other, areas of high density will have higher values.
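A rough sketch of that stamping idea, assuming the points have already been projected to pixel coordinates (the kernel radius and sigma here are arbitrary placeholders you would tune):

    public class HeatStamp {
        // Accumulate a Gaussian "stamp" at each point; overlaps add up,
        // so dense areas end up with higher values.
        static double[][] heat(int width, int height, int[][] points,
                               int radius, double sigma) {
            double[][] grid = new double[height][width];
            for (int[] p : points) {               // p = {x, y} in pixels
                for (int dy = -radius; dy <= radius; dy++) {
                    for (int dx = -radius; dx <= radius; dx++) {
                        int x = p[0] + dx, y = p[1] + dy;
                        if (x < 0 || y < 0 || x >= width || y >= height) continue;
                        grid[y][x] += Math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma));
                    }
                }
            }
            return grid;  // map values to a color ramp to render the heatmap
        }
    }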
One simple alternative for heatmaps is to just round the lat/long to some number of decimals and group by that.
See this explanation about lat/long decimal accuracy.
1 decimal - 11km
2 decimals - 1.1km
3 decimals - 110m
etc.
For a low zoom level heatmap with lots of data, rounding to 1 or 2 decimals and grouping the results by that should do the trick.
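A sketch of that rounding-and-grouping approach in Java (the names here are illustrative; the decimals parameter plays the role of a zoom-dependent cell size):

    import java.util.HashMap;
    import java.util.Map;

    public class LatLngBuckets {
        // Count points per rounded lat/long cell; the count is the cell's "heat".
        static Map<String, Integer> bucket(double[][] latLngs, int decimals) {
            double scale = Math.pow(10, decimals);
            Map<String, Integer> counts = new HashMap<>();
            for (double[] p : latLngs) {
                double lat = Math.round(p[0] * scale) / scale;
                double lng = Math.round(p[1] * scale) / scale;
                counts.merge(lat + "," + lng, 1, Integer::sum);
            }
            return counts;
        }
    }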
I'm trying to use t-SNE to arrange images based on their visual similarity, similar to this cool example for emojis (source):
but the output of t-SNE is just a "point cloud", while my goal is to display the images in a regular, near-square, dense grid. So I need to somehow convert the output of t-SNE to (x, y) locations on a grid.
So far, I've followed the suggestion in this great blog post: I formulated it as a linear assignment problem to find the best embedding into a regular grid. I'm pleased with the results, for example:
My problem is that the "snapping to grid" stage turns out to be a huge bottleneck, and I need my method to scale well to a large number of images (10K). To solve the linear assignment problem I'm using a Java implementation of the Jonker-Volgenant algorithm, whose time complexity is O(n^3). So while t-SNE is O(n log n) and can scale well up to 10K images, the part that aligns to a regular grid can only deal with up to 2K images.
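To make the formulation concrete, the cost matrix I hand to the solver looks roughly like this (a simplified sketch, not my actual code; it assumes the t-SNE output has been normalized to [0, 1]^2 and that there are exactly side*side grid cells):

    public class GridCost {
        // cost[i][j] = squared distance from t-SNE point i to grid cell j.
        // A linear assignment solver (e.g. Jonker-Volgenant) over this matrix
        // yields the one-to-one snapping of points to grid cells.
        static double[][] costMatrix(double[][] tsne, int side) {
            // assumes side >= 2 and tsne.length <= side * side
            double[][] cost = new double[tsne.length][side * side];
            for (int i = 0; i < tsne.length; i++) {
                for (int j = 0; j < side * side; j++) {
                    double gx = (j % side) / (double) (side - 1);  // cell in [0,1]^2
                    double gy = (j / side) / (double) (side - 1);
                    double dx = tsne[i][0] - gx, dy = tsne[i][1] - gy;
                    cost[i][j] = dx * dx + dy * dy;
                }
            }
            return cost;  // the n*n entries alone already cost O(n^2) time and memory
        }
    }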
Potential solutions, as I see it:
Randomly sample 2K images out of the total 10K.
Divide the 10K images into 5 groups and create 5 maps. This is problematic because of a "chicken and egg" problem: how do I do the division well?
Trade accuracy for performance: solve the linear assignment problem approximately in near-linear time. I want to try this, but I couldn't find any existing implementations to use.
Implement the "snap to grid" part in a different, more efficient way.
I'm working with Java, but solutions in C++ are also good. I'm guessing I'm not the first to try this. Any suggestions? Thoughts?
Thanks!
In image processing, each of the following methods can be used to get the orientation of a blob region:
Using second order central moments
Using PCA to find the axis
Using distance transform to get the skeleton and axis
Other techniques, like fitting the contour of the region with an ellipse.
When should I consider using a specific method? How do they compare, in terms of accuracy and performance?
I'll give you a vague general answer, and I'm sure others will give you more details. This issue comes up all the time in image processing: there are N ways to solve my problem, so which one should I use? The answer is: start with the simplest one that you understand best. For most people, that's probably 1 or 2 in your example. In most cases they will be nearly identical and sufficient. If for some reason the techniques don't work on your data, you have now learned for yourself a case where the techniques fail, and you need to start exploring other techniques. This is where the hard work of being an image processing practitioner comes in. There are no silver bullets; there's a grab bag of techniques that work in specific contexts, which you have to learn and figure out. When you learn this for yourself, you will become godlike among your peers.
For this specific example, if your data is roughly ellipsoidal, all these techniques will give similar results. As your data moves away from ellipsoidal (say, spider-like), the PCA / second-order moments / contour approaches will start to give poor results. The skeleton approaches become more robust, but mapping a complex skeleton to a single axis/orientation can become a very difficult problem, and may require more a priori knowledge about the blob.
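To make method 1 concrete, here is a sketch of orientation from second-order central moments on a binary mask, using the standard formula theta = 0.5 * atan2(2*mu11, mu20 - mu02):

    public class BlobOrientation {
        // Orientation (radians) of a binary blob via second-order central moments.
        static double orientation(boolean[][] mask) {
            double area = 0, sx = 0, sy = 0;
            for (int y = 0; y < mask.length; y++)
                for (int x = 0; x < mask[0].length; x++)
                    if (mask[y][x]) { area++; sx += x; sy += y; }
            double cx = sx / area, cy = sy / area;       // centroid
            double mu20 = 0, mu02 = 0, mu11 = 0;
            for (int y = 0; y < mask.length; y++)
                for (int x = 0; x < mask[0].length; x++)
                    if (mask[y][x]) {
                        double dx = x - cx, dy = y - cy;
                        mu20 += dx * dx; mu02 += dy * dy; mu11 += dx * dy;
                    }
            return 0.5 * Math.atan2(2 * mu11, mu20 - mu02);
        }
    }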
I have a set of data I have generated that consists of extracted mass values (well, m/z, but that's not so important) and a time. I extract the data from the file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related, based on either similarity in mass alone, or similarity in mass and time.
An example of data that should be grouped together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? Sadly, I am not familiar with clustering algorithms.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering, and also look into DBSCAN if you really don't want to specify the number of clusters in advance. You will need to define a distance metric, and that step is where you determine which feature or combination of features you will be clustering on.
Why don't you just set a threshold?
If successive values (by time) do not differ by at least ±0.1 (in m/z), they are grouped together. Alternatively, use a relative threshold: differing by less than ±0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
For the simple one-dimensional case, k-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. One way to select a good K is to plot K vs. residual variance and pick the K that "dramatically" reduces variance (the elbow). Another strategy is to use an information criterion (e.g. the Bayesian Information Criterion).
You can extend k-means to multi-dimensional data easily, but you should beware of the scaling of the individual dimensions. E.g., among the items (1 kg, 1 km) and (2 kg, 2 km), the nearest point to (1.7 kg, 1.4 km) is (2 kg, 2 km) at these scales. But once you start expressing the second dimension in meters, the opposite may be true.
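One simple guard against that is to standardize each dimension to zero mean and unit variance before clustering (a sketch):

    public class Standardize {
        // In-place z-score each column so that no dimension dominates the
        // distance purely because of its units.
        static void standardize(double[][] data) {
            int n = data.length, dims = data[0].length;
            for (int d = 0; d < dims; d++) {
                double mean = 0, var = 0;
                for (double[] row : data) mean += row[d];
                mean /= n;
                for (double[] row : data) var += (row[d] - mean) * (row[d] - mean);
                double sd = Math.sqrt(var / n);
                for (double[] row : data) row[d] = (sd == 0) ? 0 : (row[d] - mean) / sd;
            }
        }
    }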
Here's my scenario. Consider a set of events that happen at various places and times; as an example, consider someone high above recording the lightning strikes in a city during a storm. For my purpose, lightning strikes are instantaneous and can only hit certain locations (such as high buildings). Also imagine each lightning strike has a unique id, so one can reference the strike later. There are about 100,000 such locations in this city (as you may guess, this is an analogy, as my current employer is sensitive about the actual problem).
For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of clusters of more than one event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What counts as 'short' could be predefined for a given clustering attempt; that is, I can set it to, say, 3 minutes, then run the algorithm, and later try 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine the 'strength' of a clustering and recommend, for a given input, the value of 'short' that achieves the most compact clustering, but this is not required initially.
For phase 2, I'd like to take into consideration the amplitude of the strike (i.e., a real number) and look for clusters that are both within a short time and of similar amplitude.
I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified a priori. I'm not asking for someone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would be useful in order to save some time; specifically, which clustering algorithms are appropriate when the number of clusters is unknown.
Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them per location. So each location has its own time-series of events that can thus be analyzed independently.
Some technical details:
- as the dataset is not that large, it can fit all in memory.
- parallel processing is a nice-to-have, but not essential. I only have a 4-core machine, and MapReduce and Hadoop would be too much.
- the language I'm mostly familiar with is Java. I haven't used R yet, and its learning curve would probably be too steep for the time I was given. I'll have a look at it anyway in my spare time.
- for the time being, using tools to run the analysis is OK; I don't have to produce code. I'm mentioning this because Weka will probably be suggested.
- visualization would be useful. As the dataset is large, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI; it's just a nice capability to use for checking the results produced with a tool.
Thank you. Questions that I found useful are: How to find center of clusters of numbers? statistics problem?, Clustering Algorithm for Paper Boys, Java Clustering Library, How to cluster objects (without coordinates), Algorithm for detecting "clusters" of dots
I would suggest you look into mean shift clustering. The basic idea behind mean shift clustering is to take the data and perform a kernel density estimation, then find the modes of the density estimate; the regions of convergence of data points towards modes define the clusters.
The nice thing about mean shift clustering is that the number of clusters does not have to be specified ahead of time.
I have not used Weka, so I am not sure if it has mean shift clustering. However, if you are using MATLAB, here is a toolbox (KDE toolbox) to do it. Hope that helps.
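Since your per-location data reduces to a 1D time series, here is a minimal 1D mean-shift sketch in Java (Gaussian kernel; the bandwidth plays the role of your 'short' window, and this is illustrative rather than an optimized implementation):

    public class MeanShift1D {
        // Move each point towards the weighted average of its neighbours under
        // a Gaussian kernel; points converging to the same mode form a cluster.
        static double[] shift(double[] data, double bandwidth, int iterations) {
            double[] modes = data.clone();
            for (int it = 0; it < iterations; it++) {
                for (int i = 0; i < modes.length; i++) {
                    double num = 0, den = 0;
                    for (double x : data) {
                        double d = (modes[i] - x) / bandwidth;
                        double w = Math.exp(-0.5 * d * d);   // Gaussian weight
                        num += w * x;
                        den += w;
                    }
                    modes[i] = num / den;
                }
            }
            return modes;  // group points whose modes land within a small epsilon
        }
    }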
Couldn't you just use hierarchical clustering with the difference in times of strikes as part of the distance metric?
It is a bit late, but I would still add this:
In R, there is a package called fpc, which has a method pamk() that provides you the clusters. With pamk(), you do not need to specify the number of clusters initially; it calculates the number of clusters in the input data itself.
Does anyone know of an algorithm that will group pictures into events based on the date the picture was taken? Obviously I can group by the date, but I'd like something a little more sophisticated that would (or might) be able to group pictures spanning multiple days, based on the frequency over a certain timespan. Consider the following groupings:
1/2/2009 15 photos
1/3/2009 20 photos
1/4/2009 13 photos
1/5/2009 19 photos
1/15/2009 5 photos
Potentially these would be grouped into two groups:
1/2/2009 -> 1/5/2009
1/15/2009
Obviously there will be some tolerance(s) that need to be established.
Is there any well-established way of doing this, other than inventing my own top-down approach?
You can apply pretty much any standard clustering technique to this; it's just a matter of defining your distance function correctly. When you are building your matrix of distances between photos, you should consider a combination of the physical distance between locations (if you have it) and the temporal distance between their creation timestamps. Normalise them and put them on separate dimensions, and you may even just be able to take a regular Euclidean distance.
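For instance, something like this sketch, where the field layout and the two scale constants are assumptions you would tune:

    public class PhotoDistance {
        // Combined distance: normalized spatial and temporal differences
        // treated as separate dimensions of one Euclidean distance.
        // a, b = {x, y, timestampSeconds} (hypothetical layout).
        static double distance(double[] a, double[] b,
                               double spaceScale, double timeScale) {
            double dx = (a[0] - b[0]) / spaceScale;
            double dy = (a[1] - b[1]) / spaceScale;
            double dt = (a[2] - b[2]) / timeScale;
            return Math.sqrt(dx * dx + dy * dy + dt * dt);
        }
    }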
Best of luck.
Just group together the pictures that were taken on consecutive days (with no intervening days on which no pictures were taken).
You might try to calculate the tolerance dynamically, based on how many clusters you want to create or how big (in absolute terms or as a percentage) you want them to be.
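A sketch of that rule (Java; assumes the photo dates are sorted ascending, and maxGapDays = 1 reproduces the "consecutive days" grouping above):

    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.List;

    public class EventGrouping {
        // Start a new event whenever the gap since the previous photo
        // exceeds maxGapDays (1 = group consecutive days together).
        static List<List<LocalDate>> group(List<LocalDate> sortedDates, long maxGapDays) {
            List<List<LocalDate>> events = new ArrayList<>();
            List<LocalDate> current = new ArrayList<>();
            for (LocalDate d : sortedDates) {
                if (!current.isEmpty()
                        && d.toEpochDay() - current.get(current.size() - 1).toEpochDay() > maxGapDays) {
                    events.add(current);
                    current = new ArrayList<>();
                }
                current.add(d);
            }
            if (!current.isEmpty()) events.add(current);
            return events;
        }
    }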
To get a useful clustering of pictures according to date you require the following:
1) The number of clusters should be variable and not fixed a priori to the clustering
2) The diameter of each cluster should not exceed a specific amount.
The clustering algorithm that best satisfies both requirements is the QT (quality threshold) clustering algorithm. From Wikipedia:
QT (quality threshold) clustering (Heyer, Kruglyak, Yooseph, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times.
Although it is mainly used for gene clustering, I think it would fit very well with what you need.
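Here is a sketch of the QT idea for the one-dimensional case (photo times as day numbers; it exploits the fact that in 1D a cluster with bounded diameter is a contiguous run after sorting, whereas the general algorithm works on an arbitrary distance matrix):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class QTCluster1D {
        // Repeatedly take the largest window whose diameter (max - min) stays
        // under the threshold, emit it as a cluster, and remove its points.
        static List<List<Double>> cluster(List<Double> points, double maxDiameter) {
            List<Double> rest = new ArrayList<>(points);
            Collections.sort(rest);
            List<List<Double>> clusters = new ArrayList<>();
            while (!rest.isEmpty()) {
                int bestStart = 0, bestLen = 1;
                for (int i = 0, j = 0; i < rest.size(); i++) {
                    if (j < i) j = i;
                    // Slide the window end as far as the diameter allows.
                    while (j + 1 < rest.size()
                            && rest.get(j + 1) - rest.get(i) <= maxDiameter) j++;
                    if (j - i + 1 > bestLen) { bestStart = i; bestLen = j - i + 1; }
                }
                clusters.add(new ArrayList<>(rest.subList(bestStart, bestStart + bestLen)));
                rest.subList(bestStart, bestStart + bestLen).clear();
            }
            return clusters;
        }
    }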
Try to detect the gaps instead of the clusters.