ClusterDump in Mahout 0.9 - hadoop

I have a question related to cluster dump in Mahout 0.9 while doing text clustering -
https://mahout.apache.org/users/clustering/clusteringyourdata.html
One use of cluster dump is to output the top k terms of each cluster; for that you don’t specify the parameter p (pointsDir).
The second use of cluster dump is where you do specify the parameter p (pointsDir) and you get the points associated with each cluster.
Both outputs show exactly the same cluster ids, but the number of records shown in Case 1 (where the top terms are displayed) is different from the number of records appearing in Case 2 (where you get the points associated with a cluster).
Why does this happen? It is bizarre to see a different number of points associated with the same cluster, and I am not sure which count is correct.
Has anyone seen this happening?
Thank you in advance!

Finally, after searching the web extensively for this issue, I found a link discussing the problem -
http://qnalist.com/questions/4874723/mahout-clusterdump-output
What caught my attention was the explanation below -
I think the discrepancy between the number (n=) of vectors reported by the cluster and the number of points actually clustered by the -cl option is normal.
* In the final iteration, points are assigned to (observed by) (classified as) each cluster based upon the distance measure and the cluster center computed from the previous iteration. The (n=) value records the number of points "observed by" the cluster in that iteration.
* After the final iteration, a new cluster center is calculated for each cluster. This moves the center by some amount, less than the convergence threshold, but it moves.
* During the subsequent classification (-cl) step, these new centers are used to classify the points for output. This will inevitably cause some points to be assigned to (observed by) (classified as) a different cluster, and so the output clusteredPoints will reflect this final assignment.
In small, contrived examples, the clustering will likely be more stable between the final iteration and the output of clustered points.
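To make that concrete, here is a small, self-contained Python sketch (an illustration of the same k-means mechanics, not Mahout code; the toy data and all names are mine) showing how the counts recorded during the final iteration can differ from the assignment produced afterwards with the updated centers:

# Toy 1-D k-means illustrating why the (n=) counts from the last iteration
# can differ from the counts produced by a classification pass run afterwards.
import random

def assign(points, centers):
    # label each point with the index of its nearest center
    return [min(range(len(centers)), key=lambda c: abs(p - centers[c])) for p in points]

def update(points, labels, centers):
    new_centers = []
    for c in range(len(centers)):
        members = [p for p, l in zip(points, labels) if l == c]
        new_centers.append(sum(members) / len(members) if members else centers[c])
    return new_centers

random.seed(1)
points = [random.gauss(0, 1) for _ in range(150)] + [random.gauss(2, 1) for _ in range(150)]
centers = [-1.0, 3.0]

for _ in range(3):                          # stop after a fixed number of iterations,
    labels = assign(points, centers)        # much as a convergence threshold would
    reported = [labels.count(c) for c in range(2)]   # analogous to the (n=) values
    centers = update(points, labels, centers)        # the centers still move a little

final_labels = assign(points, centers)      # analogous to the separate -cl classification
clustered = [final_labels.count(c) for c in range(2)]

print("counts from the final iteration (n=):", reported)
print("counts from the post-iteration classification (-cl):", clustered)
# Any points that flip are the ones near the boundary between the (moved) centers.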

Related

Process points spread apart in 2D in parallel

Problem
There are N points in 2D space (with coordinates on the order of 10^9). All of these points must be processed, once each.
Processing can use P parallel threads (with typical hardware, P ≈ 6).
The time it takes to process a point is different for each point and unknown beforehand.
All points being processed in parallel must be at least D apart from each other (Euclidean or any other distance measure is fine).
Attempts
I imagine the algorithm would consist of two parts:
Which points to schedule initially
Which new point to schedule (if possible) when a point finishes being processed
My solutions have not been much better than a naive method, which is simply to keep trying random points until one is at least D away from all points being processed.
(I have thought about forming P groups of points so that every element of one group is at least D away from every element of every other group, and then, when a point from a group finishes, taking the next point from that group. This only saves time in some scenarios, though, and I have not determined how to construct a good set of groups either.)
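For reference, here is a minimal Python sketch of the naive strategy mentioned above (process_point, the point set, and D are all placeholders): whenever a worker slot frees up, scan the remaining points for one that is at least D away from everything currently being processed.

# Minimal sketch: a greedy scheduler that keeps up to P points in flight at once,
# never two of them closer than D to each other.
import math, random, time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def process_point(p):
    time.sleep(random.uniform(0.001, 0.01))      # stand-in for the real, variable-cost work

def far_enough(p, in_flight, D):
    return all(math.dist(p, q) >= D for q in in_flight)

def run_all(points, P=6, D=1000.0):
    remaining = set(points)
    in_flight = {}                               # future -> point currently being processed
    with ThreadPoolExecutor(max_workers=P) as pool:
        while remaining or in_flight:
            # Fill free slots with any admissible point. This linear scan is the
            # naive part; a grid or other spatial index would speed it up.
            while len(in_flight) < P:
                candidate = next((p for p in remaining
                                  if far_enough(p, in_flight.values(), D)), None)
                if candidate is None:
                    break                        # nothing admissible until something finishes
                remaining.remove(candidate)
                in_flight[pool.submit(process_point, candidate)] = candidate
            if not in_flight:
                break                            # all points have been processed
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in done:                     # finished points free their exclusion zones
                del in_flight[fut]

points = [(random.uniform(0, 1e9), random.uniform(0, 1e9)) for _ in range(200)]
run_all(points)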

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an N x N grid and an array of lamp coordinates. Each lamp illuminates every square in its row, every square in its column, and every square on its diagonals (think of a queen in chess). Given an array of query coordinates, determine whether each query point is illuminated or not. The catch is that when a query is checked, all lamps adjacent to, or on, that query point get turned off. The ranges of the variables/arrays were roughly: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9.
It seems like I can get one but not both. I tried to get this down to logarithmic time but I can't seem to find a solution. I can reduce the space complexity, but then it isn't fast; exponential, in fact. Where should I focus instead, speed or space? Also, if you have any input as to how you would solve this problem, please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note that you can number all the diagonals the input points lie on by using the first point as the "origin" for both the nw-se and the ne-sw directions. The two diagonals through this point are both numbered zero. The nw-se diagonals are numbered increasing per cell toward, say, the northeast and decreasing (negative) toward the southwest. Similarly, the ne-sw diagonals are numbered increasing toward, say, the northwest and decreasing (negative) toward the southeast.
Given the origin, it's easy to write constant-time functions that map (x, y) coordinates to the respective diagonal numbers.
Now each lamp coordinate is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne diag #). You don't need to store these explicitly. Rather, you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list contains more than 3 lamps, removing the adjacent squares can't make it empty, so the query point is lit. If a list has 3 or fewer, it's a constant-time operation to check whether they're all adjacent.
This solution requires the input points to be represented in 4 lists. Since they need to be represented in one list anyway, you can argue that this algorithm requires only a constant factor of extra space with respect to the input (i.e. the same sort of cost as mergesort).
The run time is expected constant per query point: 4 hash-table lookups.
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient, and easiest, to run it on one big machine. With a billion lampposts and careful data-structure choices, it wouldn't be hard to implement with 24 bytes per lamppost in an unboxed-structures language like C, so a machine with ~32 GB of RAM ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10-core machine ought to do a billion queries in well under a minute.
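Here is a rough Python sketch of the 4-map idea (the map names and helpers are mine, it assumes lamp positions are distinct, and it follows the reading above in which lamps on or adjacent to the query are turned off before the lit check):

# Sketch of the 4-map idea: index lamps by row, column, and both diagonals.
from collections import defaultdict

def build_maps(lamps):
    x_map, y_map, nwse_map, swne_map = (defaultdict(set) for _ in range(4))
    for (x, y) in lamps:
        x_map[x].add((x, y))
        y_map[y].add((x, y))
        nwse_map[x - y].add((x, y))              # x - y is constant along a nw-se diagonal
        swne_map[x + y].add((x, y))              # x + y is constant along a sw-ne diagonal
    return x_map, y_map, nwse_map, swne_map

def adjacent(lamp, q):
    return abs(lamp[0] - q[0]) <= 1 and abs(lamp[1] - q[1]) <= 1

def query(q, maps):
    x_map, y_map, nwse_map, swne_map = maps
    qx, qy = q
    lines = (x_map[qx], y_map[qy], nwse_map[qx - qy], swne_map[qx + qy])
    # At most 3 cells of any one line are on or adjacent to the query, so a line
    # with more than 3 lamps always keeps at least one after the turn-off.
    lit = any(len(line) > 3 or any(not adjacent(l, q) for l in line) for line in lines)
    # Turn off every lamp on or adjacent to the query (this affects later queries).
    to_turn_off = [l for dx in (-1, 0, 1) for l in list(x_map[qx + dx]) if adjacent(l, q)]
    for (x, y) in to_turn_off:
        x_map[x].discard((x, y)); y_map[y].discard((x, y))
        nwse_map[x - y].discard((x, y)); swne_map[x + y].discard((x, y))
    return lit

maps = build_maps([(0, 0), (5, 5), (2, 7)])
print([query(q, maps) for q in [(3, 3), (1, 1)]])   # -> [True, True]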
There is a very simple approach that works (for small grids):
Create a grid of N x N counters.
For each lamp, increment the count of every cell that the lamp illuminates.
For each query, check whether the cell at that query has a value > 0.
Then, for each lamp adjacent to (or on) the query, find all the cells it illuminates and reduce their counts by 1.
This worked fine but hit the size limit when I tried a 10000 x 10000 grid.
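For reference, a Python sketch of that counting approach is below; the N x N array of counters is exactly what hits the size limit for large N:

# Dense-grid counting approach: only feasible when N*N counters fit in memory.
def solve_small(N, lamps, queries):
    grid = [[0] * N for _ in range(N)]           # O(N^2) memory: the limiting factor
    lamp_set = set(lamps)

    def line_cells(x, y):                        # every cell the lamp at (x, y) lights
        for i in range(N):
            yield (x, i)                         # its row
            yield (i, y)                         # its column
        for i in range(-N, N):
            if 0 <= x + i < N and 0 <= y + i < N:
                yield (x + i, y + i)             # nw-se diagonal
            if 0 <= x + i < N and 0 <= y - i < N:
                yield (x + i, y - i)             # sw-ne diagonal

    def add_lamp(x, y, delta):
        for (cx, cy) in line_cells(x, y):
            grid[cx][cy] += delta

    for (x, y) in lamp_set:
        add_lamp(x, y, +1)

    answers = []
    for (qx, qy) in queries:
        answers.append(grid[qx][qy] > 0)
        for dx in (-1, 0, 1):                    # turn off lamps on or adjacent to the query
            for dy in (-1, 0, 1):
                lamp = (qx + dx, qy + dy)
                if lamp in lamp_set:
                    lamp_set.remove(lamp)
                    add_lamp(*lamp, -1)
    return answers

print(solve_small(8, [(0, 0), (4, 4)], [(1, 1), (7, 7)]))   # -> [True, True]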

Percentage load balance thread requests

I have a pool of worker threads to which I send requests based on percentages. For example, worker 1 must process 60% of all requests, worker 2 must process 31%, and worker 3 processes the remaining 9%. I need to know, mathematically, how to scale the numbers down while maintaining the ratio, so that I don't have to send 60 requests to worker 1 before I start sending requests to worker 2. It sounds like a "linear scale" math approach. In any case, all input on this issue is appreciated.
One way to think about this problem makes it quite similar to the problem of drawing a sloped line on a pixel-based display, which can be done with Bresenham's algorithm.
First let's assume for simplicity that there are only 2 workers, and that they should take a fraction p (for worker 1) and (1-p) (for worker 2) of the incoming requests. Imagine that "Requests sent to worker 1" is the horizontal axis and "Requests sent to worker 2" is the vertical axis of a graph: what we want to do is draw a (pixelated) line in this graph that starts at (0, 0) and has slope (1-p)/p (i.e. it advances (1-p) units upwards for every p units it advances rightwards). When a new request comes in, a new pixel gets drawn. This new pixel will always be either immediately to the right of the previous pixel (if we assign the job to worker 1) or immediately above it (if we assign it to worker 2), so it's not quite like Bresenham's algorithm where diagonal movements are possible, but there are similarities.
With each new request that comes in, we have to assign that request to one of the workers, corresponding to drawing the next pixel rightwards or upwards from the previous one. I propose that a good way to pick the right direction is to pick the one that minimises an error function. The easiest thing to do is to take the slope of the line between (0, 0) and the point that would result from each of the 2 possible choices, and compare these slopes to the ideal slope (1-p)/p; then pick whichever one produces the lowest difference. This will cause the drawn pixels to "track" the ideal line as closely as possible.
To generalise this to more than 2 dimensions (workers), we can't use slope directly. If there are W workers, we need to come up with some function error(X, Y), where X and Y are both W-dimensional vectors, one representing the ideal direction (the ratios of requests to assign, analogous to the slope (1-p)/p earlier), the other representing the candidate point, and returning some number representing how different their directions are. Fortunately this is easy: we can take the cosine of the angle between two vectors by dividing their dot product by the product of their magnitudes, which is easy to calculate. This will be 1 if their directions are identical, and less than 1 otherwise, so when a new request arrives, all we need to do is perform this calculation for each worker 1 <= i <= W and see which one's error(X, Y[i]) is closest to 1: that's the worker to give the request to.
[EDIT]
This procedure will also adapt to changes in the ideal direction. But as it stands, it tries (as hard as it can) to make the overall ratios of every request assigned so far track the ideal direction, so if the procedure has been running a long time, then even a small adjustment in the target direction could result in large "swings" to compensate. In that case, when calling error(X, Y[i]), it might be better to compute the second argument using the difference between the latest pixel (request assignment) and the pixel from some number k (e.g. k=100) steps ago. (In the original algorithm, we are implicitly subtracting the starting point (0, 0), i.e. k is as large as possible.) This only requires you to keep the last k chosen endpoints. Picking k too large will mean you can still get large swings, while picking k too small might mean that the "line" drifts well off-course, with some workers never picked at all, because each assignment alters the direction so drastically. You might need to experiment to find a good k.
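Here is a minimal Python sketch of that idea (names are mine; it implements the basic version that tracks the overall direction from the origin, without the sliding-window refinement described in the edit):

# Assign each incoming request to the worker that keeps the cumulative
# assignment vector pointing as close as possible to the target ratios.
import math

def make_assigner(shares):                  # e.g. shares = [60, 31, 9]
    counts = [0] * len(shares)
    norm_shares = math.sqrt(sum(s * s for s in shares))

    def assign():
        best_worker, best_cos = None, -1.0
        for i in range(len(shares)):
            candidate = counts.copy()
            candidate[i] += 1               # what the totals would look like
            dot = sum(s * c for s, c in zip(shares, candidate))
            cos = dot / (norm_shares * math.sqrt(sum(c * c for c in candidate)))
            if cos > best_cos:              # direction closest to the ideal ratios
                best_worker, best_cos = i, cos
        counts[best_worker] += 1
        return best_worker

    return assign

assign = make_assigner([60, 31, 9])
picks = [assign() for _ in range(100)]
print([picks.count(i) for i in range(3)])   # roughly in the ratio 60 : 31 : 9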
To keep the assignments from clustering, associate a merit value with each worker's jobs, inversely proportional to its intended share: e.g., 31 * 9 per job for w1, 60 * 9 for w2, and 31 * 60 for w3. Start with no merits for any worker; the next job goes to the worker with the fewest merits, with the lower ordinal winning ties, and each worker accumulates merits for the jobs it does. (On overflow of one accumulator, subtract MAXVALUE - 31 * 60 from each.)
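And a short Python sketch of this merit scheme (the merit values follow the 60/31/9 example above; overflow handling is omitted because Python integers don't overflow):

# Merit-based dispatch: per-job merit is inversely proportional to each worker's
# share, so the worker with the smallest accumulated merit is the one "owed" work.
shares = [60, 31, 9]
merit_per_job = [31 * 9, 60 * 9, 31 * 60]   # inversely proportional to the shares
merits = [0, 0, 0]

def next_worker():
    w = min(range(len(merits)), key=lambda i: (merits[i], i))  # least merits, lowest ordinal on ties
    merits[w] += merit_per_job[w]
    return w

picks = [next_worker() for _ in range(100)]
print([picks.count(i) for i in range(3)])   # close to [60, 31, 9]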

Creating a cluster centroid prone to noise

I'm working on a clustering algorithm to group similar ranges of real numbers. After I group them, I have to create one range for that cluster, i.e., the cluster centroid. For example, if one cluster contains the ranges <1,6>, <0,7> and <0,6>, that means this cluster is for everything with values in <0,7>. The question is how to create such a resulting range. I was thinking of taking the min and max of all the values in the cluster, but that would make the algorithm very sensitive to noise. It should probably be weighted somehow, but I'm not sure how. Any hints? Thanks.
Perhaps you can convert all ranges to their midpoints before running your clustering algorithm. That way you convert your problem into clustering points on a line. Previously, the centroid range could 'grow' and in the next iteration consume more ranges that perhaps should belong to another cluster.
# map each range to its midpoint, then cluster the midpoints instead
midpoints = {}
for r in ranges
  midpoints[r] = r.min + (r.max - r.min) / 2.0
end
After the algorithm is finished you can do as you previously suggested and take the min and max values of all the ranges in the cluster to create the range for that centroid.

Choosing number of clusters in k means

I want to cluster a large sample of data, and for that I am using the k-means function in MATLAB. The problem is that it returns a matrix with all the data sorted into however many clusters I specify.
How can I know which number of clusters is optimal?
I thought that getting an equal number of elements in each cluster would be optimal, but this never happens; rather, it will happily cluster the data for any number I give it.
Please help...
I read around, and I think an answer to this could be: in k-means we are trying to partition the data according to the means as the data comes in, so theoretically the best partitioning would be the one where each partition has an equal number of points.
I used k-means++, which is a better algorithm than plain k-means because it does not initialise the centers purely at random, and I iterated over the number of partitions until the partition sizes were almost equal. This was an approximate judgement: for k = 3 I got 2180, 729, 1219, and for k = 4 I was getting 30, 2422, 1556, 120, so I chose 3 as my final answer.
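For what it's worth, a Python sketch of that heuristic is below (the question used MATLAB; here scikit-learn's KMeans is assumed, which uses k-means++ initialisation by default, and the balance measure, the relative spread of cluster sizes, is my own choice for illustration):

# Try several values of k and keep the one whose cluster sizes are most balanced.
# This implements the heuristic described above, not a general-purpose way to pick k.
import numpy as np
from sklearn.cluster import KMeans

def pick_k_by_balance(X, k_values=range(2, 8)):
    best_k, best_spread = None, float("inf")
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        sizes = np.bincount(labels, minlength=k)
        spread = sizes.std() / sizes.mean()      # relative imbalance of cluster sizes
        if spread < best_spread:
            best_k, best_spread = k, spread
    return best_k

X = np.random.rand(500, 2)
print(pick_k_by_balance(X))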
