Creating a cluster centroid robust to noise - algorithm

I'm working on a clustering algorithm to group similar ranges of real numbers. After I group them, I have to create one range for each cluster, i.e., a cluster centroid. For example, if one cluster contains the ranges <1,6>, <0,7> and <0,6>, then the cluster should be represented by the range <0,7>. The question is how to create such a resulting range. I was thinking of taking the min and max of all the values in the cluster, but that would make the algorithm very sensitive to noise. I should probably weight it somehow, but I'm not sure how. Any hints? Thanks.

Perhaps you can convert all ranges to their midpoints before running your clustering algorithm. That way you turn the problem into clustering points on a line. With the min/max approach, the centroid range could 'grow' and in the next iteration absorb ranges that perhaps should belong to another cluster.
midpoints = []
ranges.each do |r|
  # midpoint of the range <min, max>
  midpoints << r.min + (r.max - r.min) / 2.0
end
After the algorithm has finished, you can do as you previously suggested and take the min and max values of all the ranges in the cluster to create the range for that centroid.
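As a rough sketch of that pipeline in Python (illustrative only; scikit-learn's KMeans stands in for whatever clustering algorithm you are actually using):

# Sketch: cluster the interval midpoints, then build one range per cluster
# from the min/max of its member ranges. Assumes scikit-learn is installed.
import numpy as np
from sklearn.cluster import KMeans

ranges = [(1, 6), (0, 7), (0, 6), (40, 45), (41, 44)]   # toy data
midpoints = np.array([[(lo + hi) / 2.0] for lo, hi in ranges])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(midpoints)

centroid_ranges = {}
for label, (lo, hi) in zip(labels, ranges):
    cur_lo, cur_hi = centroid_ranges.get(label, (lo, hi))
    centroid_ranges[label] = (min(cur_lo, lo), max(cur_hi, hi))

print(centroid_ranges)   # e.g. {0: (0, 7), 1: (40, 45)}; label order may vary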

Related

What is the difference between clustering and matching?

For example: there's a pool of four elements, and in one scenario I want to generate pairs. What I do is measure the distance from each element to every other element, which yields a 4x4 distance matrix. Then the matching algorithm finds the two pairs with the lowest (or highest) weighted sum.
What does a clustering algorithm do? If I ask for two clusters, is the result the same or not?
Specifying the number of elements in a cluster (pairs, for example) doesn't make much sense. If you have been looking at k-means (or k-medoids), the k indicates how many clusters will be created in total. So, if you have 4 elements and use k = 2, you can get one cluster with 1 element and another cluster with 3 elements, depending on the data you have. In any case, clustering only 4 elements doesn't make much sense.
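To make that concrete, here is a tiny sketch (Python with scikit-learn, purely illustrative and not part of the original question) where asking for k = 2 on four points yields a 3/1 split rather than two pairs:

# Illustrative only: k fixes the number of clusters, not their sizes.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0.0], [0.1], [0.2], [10.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)   # three points share one label, the outlier gets the other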

How to find the recurrence formula for the number of ways of clustering n points into k clusters?

I'm stuck on this quiz problem. It asks for the recurrence for the number of ways that a set of n points can be clustered into k non-empty clusters.
My initial thought was that it should be S(n,k) = n * S(n,k-1), since for every increase of the number of clusters by one there should be n more ways to add a cluster to an existing clustering with k-1 clusters.
The picture attached is the actual question. Thanks a lot!
You can get k non-empty clusters containing n objects:
by adding the n-th object to any existing cluster (there are k of them, so k * S(n-1,k) variants),
or by making a new cluster containing only the n-th object, in addition to the k-1 existing clusters (S(n-1,k-1) variants).
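Putting the two cases together gives S(n,k) = k * S(n-1,k) + S(n-1,k-1), the recurrence for Stirling numbers of the second kind. A small Python sketch (the base cases are standard and were not spelled out above):

from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n objects into k non-empty clusters."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0 or k > n:
        return 0
    # add the n-th object to one of the k existing clusters,
    # or place it alone in a new cluster next to k-1 existing ones
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(4, 2))   # 7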

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an NxN grid and an array of lamp coordinates. Each lamp illuminates every square on its x axis, every square on its y axis, and every square that lies on its diagonals (think of a queen in chess). Given an array of query coordinates, determine whether each point is illuminated or not. The catch is that when a query is checked, all lamps adjacent to, or on, that query get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9.
It seems like I can get one but not both. I tried to get this down to logarithmic time but I can't seem to find a solution. I can reduce the space complexity, but then it's not fast, exponential in fact. Which should I focus on instead, speed or space? Also, if you have any input as to how you would solve this problem, please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs lie on by using the first point as the "origin" for both nw-se and ne-sw. The two diagonals through this point are both numbered zero. The nw-se diagonal numbers increase per cell toward, e.g., the northeast and decrease (going negative) toward the southwest. Similarly, the ne-sw numbers increase toward, e.g., the northwest and decrease (negative) toward the southeast.
Given the origin, it's easy to write constant time functions that go from (x,y) coordinates to the respective diagonal numbers.
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne diag #). You don't need to store these explicitly. Rather, you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty, so the query point is lit. If a list has only 3 or fewer entries, it's a constant-time operation to see whether they're adjacent.
This solution requires the input points to be represented in 4 lists. Since they already need to be represented in one list, you can argue that this algorithm requires only a constant factor of extra space with respect to the input (i.e., the same sort of cost as mergesort).
Run time is expected constant per query point for 4 hash table lookups.
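A rough Python sketch of the map building and per-query check described above (illustrative only; it uses the usual x - y / x + y diagonal numbering instead of an explicit origin, and it does not persist the "turn lamps off" step across queries):

from collections import defaultdict

def build_maps(lamps):
    # Index lamps by row, column, and both diagonals (the 4-map idea above).
    x_map, y_map = defaultdict(list), defaultdict(list)
    nwse_map, swne_map = defaultdict(list), defaultdict(list)
    for x, y in lamps:
        x_map[x].append((x, y))
        y_map[y].append((x, y))
        nwse_map[x - y].append((x, y))   # x - y is constant along one diagonal
        swne_map[x + y].append((x, y))   # x + y is constant along the other
    return x_map, y_map, nwse_map, swne_map

def is_lit(query, maps):
    qx, qy = query
    x_map, y_map, nwse_map, swne_map = maps
    lists = [x_map[qx], y_map[qy], nwse_map[qx - qy], swne_map[qx + qy]]
    for lamps_on_line in lists:
        # At most 3 squares on any one line are adjacent to (or on) the query,
        # so more than 3 lamps on the line guarantees one stays on.
        if len(lamps_on_line) > 3:
            return True
        for lx, ly in lamps_on_line:
            if abs(lx - qx) > 1 or abs(ly - qy) > 1:
                return True   # this lamp is not switched off by the query
    return False

maps = build_maps([(0, 0), (5, 5)])
print(is_lit((3, 3), maps))   # True: both lamps share the query's diagonal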
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient and easiest to run it on one big machine. With a billion lampposts and careful data structure choices, it wouldn't be hard to implement with 24 bytes per lamppost in an unboxed-structures language like C. So a machine with ~32 GB of RAM ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10-core machine ought to do a billion queries in well under a minute.
There is a very easy answer which works:
Create a grid of NxN.
For each lamp, increment the count of every cell that the lamp illuminates.
For each query, check whether the cell at that query has a value > 0.
For each adjacent cell containing a lamp, find all the cells that lamp illuminated and reduce their counts by 1.
This worked fine, but it failed the size limit when trying a 10000 x 10000 grid.
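For reference, a sketch of that counting approach in Python (illustrative only; as noted above it only fits in memory for small N, and it assumes a query is answered before the nearby lamps are switched off):

def lit_cells(n, x, y):
    # Every cell a lamp at (x, y) illuminates: its row, column and both diagonals.
    cells = set()
    for i in range(n):
        cells.add((x, i))
        cells.add((i, y))
        for j in (y - x + i, y + x - i):      # the two diagonals through (x, y)
            if 0 <= j < n:
                cells.add((i, j))
    return cells

def answer_queries(n, lamps, queries):
    count = [[0] * n for _ in range(n)]
    on = set(map(tuple, lamps))
    for x, y in on:
        for cx, cy in lit_cells(n, x, y):
            count[cx][cy] += 1
    answers = []
    for qx, qy in queries:
        answers.append(count[qx][qy] > 0)
        # switch off the lamp on the query cell and its eight neighbours
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                lamp = (qx + dx, qy + dy)
                if lamp in on:
                    on.remove(lamp)
                    for cx, cy in lit_cells(n, *lamp):
                        count[cx][cy] -= 1
    return answers

print(answer_queries(5, [(0, 0)], [(4, 4), (0, 1), (0, 2)]))   # [True, True, False]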

ClusterDump in Mahout 0.9

I have a question related to cluster dump in Mahout 0.9 while doing text clustering -
https://mahout.apache.org/users/clustering/clusteringyourdata.html
One case of cluster dump is to output the top k terms; for that you don't specify the parameter p (pointsDir).
The second case of cluster dump is where you specify the parameter p (pointsDir) and you get the points associated with each cluster.
Both outputs show exactly the same cluster ids, but the number of records shown in Case 1 (where the top terms are displayed) is different from the number of records appearing in Case 2 (where you get the points associated with a cluster).
Why does this happen? It's bizarre to see a different number of points associated with the same cluster, and I'm not sure which one is correct.
Has anyone seen this happening?
Thank you in advance!
Finally after searching a lot about this issue on the web, I found a link discussing this problem -
http://qnalist.com/questions/4874723/mahout-clusterdump-output
What caught my attention was this explanation below -
I think the discrepancy between the number (n=) of vectors reported by the cluster and the number of points actually clustered by the -cl option is normal.
* In the final iteration, points are assigned to (observed by) (classified as) each cluster based upon the distance measure and the cluster center computed from the previous iteration. The (n=) value records the number of points "observed by" the cluster in that iteration.
* After the final iteration, a new cluster center is calculated for each cluster. This moves the center by some amount, less than the convergence threshold, but it moves.
* During the subsequent classification (-cl) step, these new centers are used to classify the points for output. This will inevitably cause some points to be assigned to (observed by) (classified as) a different cluster, and so the output clusteredPoints will reflect this final assignment.
In small, contrived examples, the clustering will likely be more stable between the final iteration and the output of clustered points.

Choosing number of clusters in k means

I want to cluster a large sample of data, and for that I am using the kmeans function in MATLAB. The problem is that it simply partitions the data into however many clusters I specify.
How can I know which number of clusters is optimal?
I thought that getting an equal number of elements in each cluster would mean the clustering is optimal, but that never happens; rather, it will keep clustering the data for any number I give it.
Please help.
I read around and I think an answer to this could be: in k-means we are trying to partition the data according to the means, so theoretically the best partition would be one where each cluster has an equal number of points.
I used k-means++, which improves on plain k-means because it does not pick its initial centers completely at random, and iterated over the number of partitions until the partition sizes were almost equal. This was only approximate: for k = 3 I got sizes 2180, 729, 1219, and for k = 4 I got 30, 2422, 1556, 120, so I chose 3 as my final answer.
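A sketch of that procedure in Python (illustrative only; scikit-learn's KMeans uses k-means++ seeding by default, and "most balanced cluster sizes" is just the heuristic described above, not a general criterion for the best k):

# Try several values of k and keep the one whose cluster sizes are most
# nearly equal (the heuristic from the answer above). Illustrative sketch.
import numpy as np
from sklearn.cluster import KMeans

def most_balanced_k(X, k_values):
    best_k, best_spread = None, float("inf")
    for k in k_values:
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                        random_state=0).fit_predict(X)
        sizes = np.bincount(labels, minlength=k)
        spread = sizes.max() - sizes.min()    # smaller means more balanced
        if spread < best_spread:
            best_k, best_spread = k, spread
    return best_k

X = np.random.rand(1000, 2)                   # stand-in for your data
print(most_balanced_k(X, range(2, 6)))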

Resources