I am currently delevoping a Self-Organizing-Map prototype for clustering of BigData and don't understand one thing.
The SOM-algorithm updates the weights of the Best Matching Unit and its neighborhood to better fit the input vector. Does the algorithm somehow change the real position of the neuron on the lattice? I mean, if I define a square lattice (5x5) each neuron can be referenced by a two dimensional coordinate (for example 1/1 or 1/5). So what I ask is, if the SOM algorithm update the coordinate of the neuron (for example from 1/1 to 1.1/1.3).
If not, how does the software show up clusters? I mean some programs show a unified distance between neurons (for example black areas are those, where the distances between neurons are low and white areas are those, where the distances are high). So how does the software know, which Neurons are next to each other?
Weight vectors are updated but the positions of neurons in the lattice never change.
SOM is a topology preserving map. That is, if two vectors are close to one another in input space, so is the case for the map representation [1].
But sometime topographic errors occur.
[1]: Engelbrecht, A.P., 2007. Computational intelligence: an introduction. John Wiley & Sons.
Related
Please forgive me if I do not explain my question clearly in title.
Here may I show you two pictures as my example:
my question is described as follows: I have 2 or more different objects(In the pictures, two objects: circle and cross), each one is placed repeatedly with a fixed row/column distance (In the pictures, the circle has a distance of 4 and cross has a distance of 2) into a grid.
In the first picture, each of the two objects are repeated correctly without any interruptions(here interruption means one object may occupy another one's position), but the arrangement in the first picture is ununiform distributed; on the contrary, in the second picture, the two objects may have interruptions (the circle object occupies cross objects' position) but the picture is uniformly distributed.
My target is to get the placement as uniform as possible (the objects are still placed with fixed distances but may allow some occupations). Is there a potential algorithm for this question? Or are there any similar questions?
I have some immature thinkings on this problem that: 1. occupation may relate to least common multiple; 2. how to define "uniformly distributed" mathematically? Maybe there's no genetic solution but is there a solution for some special cases? (for example, 3 objects with distance of multiple of 2, or multiple of 3?)
Uniformity can be measured as sum of squared inverse distances(or distances to equilibrium distances). Because it has squared relation, any single piece that approaches others will have big fitness penalty in system so that the system will not tolerate too close piece and prefer a better distribution.
If you do not use squared (or higher orders) distance but simple distance, then system starts tolerating even overlapped pieces.
If you want to manually compute uniformity, then compute the standard deviation of distances. You'd say its perfect with 1 distance and 0 deviation but small enough deviation also acceptable.
I tested this only on a problem to fit 106 circles in a square thats 10x the size of circle.
I'm having a project need categorize 3D models based on the complexity.
By "complexity", I mean for example, 3D model of furniture in modern style has low complexity, but 3D model of royal style furniture has very high complexity.
All 3D models are mesh type. I only need the very rough estimate, the reliability is not required too high, but should be correct most of times.
Please guide me which way, or the algorithm for this purpose (not based on vertices count).
It the best if we can process inside Meshlab, but any other source is fine too.
Thanks!
Let's consider a sphere: it looks simple, but it can be made of many vertices. I don't think that counting vertices gives a good estimation of complexity. The spheres' vertices are very little diverse.
Let's consider the old vs simple and modern furniture: the old one has potentially many different vertices but their organization is not "simple".
I propose to measure complexity to consider:
the number of different angles (and solid angles) between edges
the number of different edges' length (eg., connected vertices distances)
So far so good. But we got here by counting global complexity. What if with the same set of edges and vertices, we order them and build something that changes in a monotonic manner ? Yes, we need also to take into account local complexity: say the complexity in a limited chunk of space.
An algorithm is taking form:
divide the space into smaller spaces
count sets of different edges by angles and length
You can imagine take into account several scales by ranging the size of space divisions, and count sets every time, and in the end multiply or add the results.
I think you got something interesting. The thing is that this algorithm is rather close to some methods do estimate dimension of a fractal object.
Papers (google scholar) about "estimate fractal dimension"
3D models are composed of vertices, and vertices are connected together by edges to form faces. A rough measure of complexity from a computation standpoint would be to count the vertices or faces.
This approach falls down when trying to categorize the two chairs. It's entirely possible to have a simple chair with more vertices and faces than the regal chair.
To address that limitation I would merge adjacent faces with congruent normal vectors. If the faces share 1 edge and have congruent normal vectors then they can be said to be planar to each other. This would have the effect of simplifying the 3D model. A simple object should have a lower number of vertices / faces after this operation than a more complex model. At least in theory.
I'm sure there's a name for this algorithm, but I don't know it.
I have a data set given by an RFID antenna on 2D x,y location of two people moving. One person is carrying 3 RFID tags while the other is carrying 4 tags. Both are moving along the y axis as below. Red and Cyan are the paths, two people are walking.
The location map on a x,y scale looks like below.
Ideally, Orange, Yellow, Blue and Gray lines (RFID x,y data points) should go on a positive horizontal line while the below Green, Dark blue and Sky blue lines should go on a negative horizontal line.
Question
Although the lines are not straight, a visual pattern is emerged which can cluster above zero lines together and below zero lines together. My question is what algorithm/method can be used to compare such patterns and cluster them together. (So ideally the answer should be, above 4 lines are in one cluster and below 3 lines are in another cluster.)
It's difficult to think it as a linear movement always as people can walk in non linear ways. So best fit lines would not work. Any suggestion or shading of light is thankfully appreciated.
You want to look at methods for clustering time series (1d) or trajectories (2d).
The approach for both is pretty much the same. First, you want to find a suitable distance metric(dissimilarity measure), second, you decide for a suitable clustering algorithm.
Possible Distance Metrics
Here are some example distances you could use, with some short arguments for each:
Euclidean Distance: Very basic
Dynamic Time Wraping (DTW): Can account for shifts
Longuest Common Subsequence (LCSS): Accounts for shifts and can handle outliers
Edit distance with Real Penalty (EDP): Accounts for shifts and can handle outliers
More details can for example be found in this paper. An implementation of the distances can be found here.
Possible Cluster Algorithms
K-Means
DBSCAN
Single Linkage clustering
You can usually combine any distances metric with any distance based clustering algorithm.
Opinion
Looking at your data I would try DTW as a distance metric. If you expect two clusters then K-Means with k=2 should work. Otherwise you could try Single Linkage Clustering, which will give you something similar to the image below.
As a newbie in Machine Learning, I have a set of trajectories that may be of different lengths. I wish to cluster them, because some of them are actually the same path and they just SEEM different due to the noise.
In addition, not all of them are of the same lengths. So maybe although Trajectory A is not the same as Trajectory B, yet it is part of Trajectory B. I wish to present this property after the clustering as well.
I have only a bit knowledge of K-means Clustering and Fuzzy N-means Clustering. How may I choose between them two? Or should I adopt other methods?
Any method that takes the "belongness" into consideration?
(e.g. After the clustering, I have 3 clusters A, B and C. One particular trajectory X belongs to cluster A. And a shorter trajectory Y, although is not clustered in A, is identified as part of trajectory B.)
=================== UPDATE ======================
The aforementioned trajectories are the pedestrians' trajectories. They can be either presented as a series of (x, y) points or a series of step vectors (length, direction). The presentation form is under my control.
It might be a little late but I am also working on the same problem.
I suggest you take a look at TRACLUS, an algorithm created by Jae-Gil Lee, Jiawei Han and Kyu-Young Wang, published on SIGMOD’07.
http://web.engr.illinois.edu/~hanj/pdf/sigmod07_jglee.pdf
This is so far the best approach I have seen for clustering trajectories because:
Can discover common sub-trajectories.
Focuses on Segments instead of points (so it filters out noise-outliers).
It works over trajectories of different length.
Basically is a 2 phase approach:
Phase one - Partition: Divide trajectories into segments, this is done using MDL Optimization with complexity of O(n) where n is the numbers of points in a given trajectory. Here the input is a set of trajectories and output is a set of segments.
Complexity: O(n) where n is number of points on a trajectory
Input: Set of trajectories.
Output: Set D of segments
Phase two - Group: This phase discovers the clusters using some version of density-based clustering like in DBSCAN. Input in this phase is the set of segments obtained from phase one and some parameters of what constitutes a neighborhood and the minimum amount of lines that can constitute a cluster. Output is a set of clusters. Clustering is done over segments. They define their own distance measure made of 3 components: Parallel distance, perpendicular distance and angular distance. This phase has a complexity of O(n log n) where n is the number of segments.
Complexity: O(n log n) where n is number of segments on set D
Input: Set D of segments, parameter E that sets neighborhood treshold and parameter MinLns that is the minimun number of lines.
Output: Set C of Cluster, that is a Cluster of segments (trajectories clustered).
Finally they calculate a for each cluster a representative trajectory, which is nothing else that a discovered common sub-trajectory in each cluster.
They have pretty cool examples and the paper is very well explained. Once again this is not my algorithm, so don't forget to cite them if you are doing research.
PS: I made some slides based on their work, just for educational purposes:
http://www.slideshare.net/ivansanchez1988/trajectory-clustering-traclus-algorithm
Every clustering algorithm needs a metric. You need to define distance between your samples. In your case simple Euclidean distance is not a good idea, especially if the trajectories can have different lengths.
If you define a metric, than you can use any clustering algorithm that allows for custom metric. Probably you do not know the correct number of clusters beforehand, then hierarchical clustering is a good option. K-means doesn't allow for custom metric, but there are modifications of K-means that do (like K-medoids)
The hard part is defining distance between two trajectories (time series). Common approach is DTW (Dynamic Time Warping). To improve performance you can approximate your trajectory by smaller amount of points (many algorithms for that).
Neither will work. Because what is a proper mean here?
Have a look at distance based clustering methods, such as hierarchical clustering (for small data sets, but you probably don't have thousands of trajectories) and DBSCAN.
Then you only need to choose an appropriate distance function that allows e.g. differences in time and spatial resolution of trajectories.
Distance functions such as dynamic time warping (DTW) distance can accomodate this.
This is good concept and having possibility for real-time applications. In my view, one can adopt any clustering but need to select appropriate dissimilarity measure, later need to think about computational complexity.
Paper (http://link.springer.com/chapter/10.1007/978-81-8489-203-1_15) used Hausdorff and suggest the technique for reducing complexity, and paper (http://www.cit.iit.bas.bg/CIT_2015/v-15-2/4-5-TCMVS%20A-edited-md-Gotovop.pdf) described the use of "Trajectory Clustering Technique Based on Multi-View Similarity"
I have a number of points on a relatively small 2-dimensional grid, which wraps around in both dimensions. The coordinates can only be integers. I need to divide them into sets of at most N points that are close together, where N will be quite a small cut-off, I suspect 10 at most.
I'm designing an AI for a game, and I'm 99% certain using minimax on all the game pieces will give me a usable lookahead of about 1 move, if that. However distant game pieces should be unable to affect each other until we're looking ahead by a large number of moves, so I want to partition the game into a number of sub-games of N pieces at a time. However, I need to ensure I select a reasonable N pieces at a time, i.e. ones that are close together.
I don't care whether outliers are left on their own or lumped in with their least-distant cluster. Breaking up natural clusters larger than N is inevitable, and only needs to be sort-of reasonable. Because this is used in a game AI with limited response time, I'm looking for as fast an algorithm as possible, and willing to trade off accuracy for performance.
Does anyone have any suggestions for algorithms to look at adapting? K-means and relatives don't seem appropriate, as I don't know how many clusters I want to find but I have a bound on how large clusters I want. I've seen some evidence that approximating a solution by snapping points to a grid can help some clustering algorithms, so I'm hoping the integer coordinates makes the problem easier. Hierarchical distance-based clustering will be easy to adapt to the wrap-around coordinates, as I just plug in a different distance function, and also relatively easy to cap the size of the clusters. Are there any other ideas I should be looking at?
I'm more interested in algorithms than libraries, though libraries with good documentation of how they work would be welcome.
EDIT: I originally asked this question when I was working on an entry for the Fall 2011 AI Challenge, which I sadly never got finished. The page I linked to has a reasonably short reasonably high-level description of the game.
The two key points are:
Each player has a potentially large number of ants
Every ant is given orders every turn, moving 1 square either north, south, east or west; this means the branching factor of the game is O(4ants).
In the contest there were also strict time constraints on each bot's turn. I had thought to approach the game by using minimax (the turns are really simultaneous, but as a heuristic I thought it would be okay), but I feared there wouldn't be time to look ahead very many moves if I considered the whole game at once. But as each ant moves only one square each turn, two ants cannot N spaces apart by the shortest route possibly interfere with one another until we're looking ahead N/2 moves.
So the solution I was searching for was a good way to pick smaller groups of ants at a time and minimax each group separately. I had hoped this would allow me to search deeper into the move-tree without losing much accuracy. But obviously there's no point using a very expensive clustering algorithm as a time-saving heuristic!
I'm still interested in the answer to this question, though more in what I can learn from the techniques than for this particular contest, since it's over! Thanks for all the answers so far.
The median-cut algorithm is very simple to implement in 2D and would work well here. Your outliers would end up as groups of 1 which you could discard or whatever.
Further explanation requested:
Median cut is a quantization algorithm but all quantization algorithms are special case clustering algorithms. In this case the algorithm is extremely simple: find the smallest bounding box containing all points, split the box along its longest side (and shrink it to fit the points), repeat until the target amount of boxes is achieved.
A more detailed description and coded example
Wiki on color quantization has some good visuals and links
Since you are writing a game where (I assume) only a constant number of pieces move between each clusering, you can take advantage of a Online algorithm to get consant update times.
The property of not locking yourself to a number of clusters is called Nonstationary, I believe.
This paper seams to have a good algorithm with both of the above two properties: Improving the Robustness of 'Online Agglomerative Clustering Method' Based on Kernel-Induce Distance Measures (You might be able to find it elsewhere as well).
Here is a nice video showing the algorithm in works:
Construct a graph G=(V, E) over your grid, and partition it.
Since you are interested in algorithms rather than libraries, here is a recent paper:
Daniel Delling, Andrew V. Goldberg, Ilya Razenshteyn, and Renato F. Werneck. Graph Partitioning with Natural Cuts. In 25th International Parallel and Distributed Processing Symposium (IPDPS’11). IEEE Computer
Society, 2011. [PDF]
From the text:
The goal of the graph partitioning problem is to find a minimum-cost partition P such that the size of each cell is bounded by U.
So you will set U=10.
You can calculate a minimum spanning tree and remove the longest edges. Then you can calculate the k-means. Remove another long edge and calculate the k-means. Rinse and repeat until you have N=10. I believe this algorithm is named single-link k-means and the cluster are similar to voronoi diagrams:
"The single-link k-clustering algorithm ... is precisely Kruskal's algorithm ... equivalent to finding an MST and deleting the k-1 most expensive edges."
See for example here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering
Consider the case where you only want two clusters. If you run k-means, then you will get two points, and the division between the two clusters is a plane orthogonal to the line between the centres of the two clusters. You can find out which cluster a point is in by projecting it down to the line and then comparing its position on the line with a threshold (e.g. take the dot product between the line and a vector from either of the two cluster centres and the point).
For two clusters, this means that you can adjust the sizes of the clusters by moving the threshold. You can sort the points on their distance along the line connecting the two cluster centres and then move the threshold along the line quite easily, trading off the inequality of the split with how neat the clusters are.
You probably don't have k=2, but you can run this hierarchically, by dividing into two clusters, and then sub-dividing the clusters.
(After comment)
I'm not good with pictures, but here is some relevant algebra.
With k-means we divide points according to their distance from cluster centres, so for a point Xi and two centres Ai and Bi we might be interested in
SUM_i (Xi - Ai)^2 - SUM_i(Xi - Bi)^2
This is SUM_i Ai^2 - SUM_i Bi^2 + 2 SUM_i (Bi - Ai)Xi
So a point gets assigned to either cluster depending on the sign of K + 2(B - A).X - a constant plus the dot product between the vector to the point and the vector joining the two cluster circles. In two dimensions, the dividing line between the points on the plane that end up in one cluster and the points on the plane that end up in the other cluster is a line perpendicular to the line between the two cluster centres. What I am suggesting is that, in order to control the number of points after your division, you compute (B - A).X for each point X and then choose a threshold that divides all points in one cluster from all points in the other cluster. This amounts to sliding the dividing line up or down the line between the two cluster centres, while keeping it perpendicular to the line between them.
Once you have dot products Yi, where Yi = SUM_j (Bj - Aj) Xij, a measure of how closely grouped a cluster is is SUM_i (Yi - Ym)^2, where Ym is the mean of the Yi in the cluster. I am suggesting that you use the sum of these values for the two clusters to tell how good a split you have. To move a point into or out of a cluster and get the new sum of squares without recomputing everything from scratch, note that SUM_i (Si + T)^2 = SUM_i Si^2 + 2T SUM_i Si + T^2, so if you keep track of sums and sums of squares you can work out what happens to a sum of squares when you add or subtract a value to every component, as the mean of the cluster changes when you add or remove a point to it.