How to do the final steps of the k-means algorithm

I’m having a problem with my program. I am trying to implement k-means (manually) by clustering a set of RGB values into 3 clusters. I don’t need help with coding, just with understanding. So far I have done this:
Created 3 cluster objects, each with a mean and an array to hold that cluster's members.
Imported the text file and saved the RGB values in an array.
Looped over the array and, for each stored RGB value, found the cluster mean closest to it using Euclidean distance.
Added each RGB value to the array of the cluster with the closest mean.
I have done research and I can’t seem to understand the next step. Research has suggested:
In each cluster, add all of the RGB values together, divide by the number of values in that cluster, then update the mean with that value; or
Find the average distance between all RGB values and the mean, then update the mean with that value; or
Update the mean each time you add an RGB value to the cluster.
I just can’t seem to understand the last steps, thanks.

Once you have a per-cluster array, recompute the "cluster center" as the average of the items in that cluster. Then, reassign every item to the appropriate cluster (the one whose "center" it's closest to after the recomputation). You're done when no item changes cluster (it's theoretically possible to end up in a situation where one item keeps flipping between two clusters, with generalized distance measures -- this can be detected to stop the loop anyway -- but I don't think it can happen with Euclidean distances).
IOW, that would be the first one of your three alternatives. I'm not even sure what you mean by the second alternative; the third one would likely not be stable, and depend on the arbitrary order of items, so I feel strongly against it.
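In code, one full iteration of that first alternative might look roughly like this (a minimal Python sketch; the function names and the tuple representation of RGB values are just assumptions for illustration):

    import math

    def update_means(clusters, old_means):
        # Average the RGB triples in each cluster; keep the old mean if a cluster is empty.
        return [tuple(sum(channel) / len(c) for channel in zip(*c)) if c else old_means[i]
                for i, c in enumerate(clusters)]

    def assign(values, means):
        # Put every RGB value into the cluster whose mean is closest (Euclidean distance).
        clusters = [[] for _ in means]
        for v in values:
            nearest = min(range(len(means)), key=lambda i: math.dist(v, means[i]))
            clusters[nearest].append(v)
        return clusters

You would then call assign and update_means in a loop until assign produces the same clusters twice, which is the stopping condition described above.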

Related

Algorithm to find outliers of DateTimes outside of a 2 day window

I have a set of arbitrary DateTime values that are input by users. The requirement is that the values are to be within a certain window, e.g. no more than 2 days apart from each other. There is no reference value to work from.
An unknown but small percentage (say < 5%) of them will be outside the 2-day window because of user error. At some point, the values are aggregated and processed, at which point the requirement is checked. Validation at input time is not practical. How do I determine the largest set of values that fulfill the requirement, so that I can report back the other, incorrect values that don't fulfill the requirement?
I know about determining the interquartile range. Can I somehow modify that algorithm to include the boundary condition? Or do I need a different algorithm?
One good "quick strike" solution from machine learning is the support vector machine (SVM). A one-class method will be relatively fast, and will identify clustered values vs. outliers with very good accuracy for this application. Otherwise ...
You do not want the mean of the dates: a single error could skew the mean right out of the rest of the distribution, such as today's date being 20 Aug 2109.
The median is likely a good starting guess for this distribution. Sort the values, grab the median, and then examine the distribution on either side. At roughly 24 hours each way, there should be a sudden difference in values. Those difference points will absolutely identify your proper boundaries.
In most data sets, you'll be able to find that point easily: look at the differences between adjacent values in your sorted list of dates.
Very simply:
Sort the list of dates
Make a new list, shifting one element left (i.e. delete the first element)
Subtract the two lists.
Move through the difference list; there will be a large cluster of small values in the middle, bounded by a pair of large jumps. The pair of large jumps will be 48 hours apart. Those points are your boundaries.
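A sliding-window variant of the same sort-and-look-at-the-gaps idea can also give you the answer directly. Here is a rough Python sketch (the function name and the 2-day default are just placeholders): it keeps the largest run of sorted dates whose total span fits in the window and reports everything else as outliers.

    from datetime import timedelta

    def split_outliers(dates, window=timedelta(days=2)):
        # Sort, then find the longest run of consecutive dates whose span stays
        # within `window`; everything outside that run is reported as an outlier.
        s = sorted(dates)
        best_start, best_end = 0, 0          # best run so far, end index exclusive
        start = 0
        for end in range(len(s)):
            while s[end] - s[start] > window:
                start += 1
            if end + 1 - start > best_end - best_start:
                best_start, best_end = start, end + 1
        return s[best_start:best_end], s[:best_start] + s[best_end:]

Everything returned in the second list is what you would report back as incorrect.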

Iterate twice through values in Reducer Hadoop

I read in a couple of places that the only way to iterate twice through the values in a Reducer is to cache those values.
But also, there is a limitation that all the values must fit in main memory in that case.
What if you need to iterate twice, but you don't have the luxury of caching the values in memory?
Is there some kind of workaround?
Maybe there are already some answers to this problem, but I'm new to Hadoop, so I'm hoping that some solution has been found since those questions were asked.
To be more concrete with my question, here is what I need to do:
Reducer gets a certain number of points (for example, points in 3D space with x, y, z coordinates)
One random point among them should be selected - let's call it firstPoint
Reducer should then find the point that is farthest from firstPoint; to do that it needs to iterate through all the values - this way we get secondPoint
After that, the reducer should find the point farthest from secondPoint, so there's a need to iterate through the dataset again - this way we get thirdPoint
Distance from thirdPoint to all other points needs to be calculated
Distances from secondPoint to all other points and distances from thirdPoint to all other points need to be saved, so additional steps can be performed.
It's not a problem to buffer these distances, since each distance is just a double, but a point could actually be a point in n-dimensional space, so each point could have n coordinates and the points themselves could take up too much space.
My original question was how to iterate twice, but it is really more general: how can you iterate multiple times through the values in order to perform the steps above?
It might not work for every case, but you could try running more reducers so that each one processes a small enough amount of data that you could then cache the values into memory.
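Once the values for a key are small enough to cache, the multi-pass logic itself is short. Here is a rough Python sketch of what the reduce-side computation could look like (the function name and the tuple representation of points are assumptions, not Hadoop API); the same structure translates to a Java Reducer that copies the Iterable into a List first, keeping in mind that Hadoop typically reuses the Writable instances handed to reduce(), so each value needs to be copied before it is stored.

    import math, random

    def process_points(points):
        cached = list(points)                       # one pass just to cache the values
        first = random.choice(cached)               # the random firstPoint
        second = max(cached, key=lambda p: math.dist(p, first))   # pass 1: farthest from firstPoint
        third = max(cached, key=lambda p: math.dist(p, second))   # pass 2: farthest from secondPoint
        dist_to_second = [math.dist(p, second) for p in cached]
        dist_to_third = [math.dist(p, third) for p in cached]
        return second, third, dist_to_second, dist_to_third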

Question about Backpropagation Algorithm with Artificial Neural Networks -- Order of updating

Hey everyone, I've been trying to get an ANN I coded to work with the backpropagation algorithm. I have read several papers on them, but I'm noticing a few discrepancies.
Here seems to be the super general format of the algorithm:
1. Give input
2. Get output
3. Calculate error
4. Calculate change in weights
5. Repeat steps 3 and 4 until we reach the input level
But here's the problem: The weights need to be updated at some point, obviously. However, because we're back propagating, we need to use the weights of previous layers (ones closer to the output layer, I mean) when calculating the error for layers closer to the input layer. But we already calculated the weight changes for the layers closer to the output layer! So, when we use these weights to calculate the error for layers closer to the input, do we use their old values, or their "updated values"?
In other words, if we were to put the step of updating the weights into my super general algorithm, would it be:
(Updating the weights immediately)
1. Give input
2. Get output
3. Calculate error
4. Calculate change in weights
5. Update these weights
6. Repeat steps 3, 4, 5 until we reach the input level
OR
(Using the "old" values of the weights)
1. Give input
2. Get output
3. Calculate error
4. Calculate change in weights
5. Store these changes in a matrix, but don't change these weights yet
6. Repeat steps 3, 4, 5 until we reach the input level
7. Update the weights all at once using our stored values
In this paper I read, in both abstract examples (the ones based on figures 3.3 and 3.4), they say to use the old values, not to immediately update the values. However, in their "worked example 3.1", they use the new values (even though what they say they're using are the old values) for calculating the error of the hidden layer.
Also, in my book "Introduction to Machine Learning by Ethem Alpaydin", though there is a lot of abstract stuff I don't yet understand, he says "Note that the change in the first-layer weight delta-w_hj, makes use of the second layer weight v_h. Therefore, we should calculate the changes in both layers and update the first-layer weights, making use of the old value of the second-layer weights, then update the second-layer weights."
To be honest, it really seems like they just made a mistake and all the weights are updated simultaneously at the end, but I want to be sure. My ANN is giving me strange results, and I want to be positive that this isn't the cause.
Anyone know?
Thanks!
As far as I know, you should update weights immediately. The purpose of back-propagation is to find weights that minimize the error of the ANN, and it does so by doing a gradient descent. I think the algorithm description in the Wikipedia page is quite good. You may also double-check its implementation in the joone engine.
You are usually backpropagating deltas, not errors. These deltas are calculated from the errors, but they do not mean the same thing. Once you have the deltas for layer n (counting from input to output) you use these deltas and the weights of layer n to calculate the deltas for layer n-1 (one closer to the input). The deltas only have a meaning for the old state of the network, not for the new state, so you should always use the old weights for propagating the deltas back to the input.
Deltas mean in a sense how much each part of the NN has contributed to the error before, not how much it will contribute to the error in the next step (because you do not know the actual error yet).
As with most machine-learning techniques it will probably still work if you use the updated weights, but it might converge slower.
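To make that concrete, here is a minimal sketch of one training step for a two-layer network with sigmoid units and squared error (the shapes and names are assumptions for illustration, not anything from the question): all deltas are computed from the old weights, and only then are both weight matrices updated.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_step(x, target, W1, W2, lr=0.1):
        # forward pass
        h = sigmoid(W1 @ x)                     # hidden activations
        y = sigmoid(W2 @ h)                     # output activations

        # deltas, computed entirely from the OLD weights
        delta_out = (y - target) * y * (1 - y)
        delta_hidden = (W2.T @ delta_out) * h * (1 - h)   # uses the old W2

        # only now are the weights updated, both layers at once
        W2 -= lr * np.outer(delta_out, h)
        W1 -= lr * np.outer(delta_hidden, x)
        return W1, W2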
If you simply train it on a single input-output pair my intuition would be to update weights immediately, because the gradient is not constant. But I don't think your book mentions only a single input-output pair. Usually you come up with an ANN because you have many input-output samples from a function you would like to model with the ANN. Thus your loops should repeat from step 1 instead of from step 3.
If we label your two methods as new->online and old->offline, then we have two algorithms.
The online algorithm is good when you don't know how many sample input-output relations you are going to see, and you don't mind some randomness in the way the weights update.
The offline algorithm is good if you want to fit a particular set of data optimally. To avoid overfitting the samples in your data set, you can split it into a training set and a test set. You use the training set to update the weights, and the test set to measure how good a fit you have. When the error on the test set begins to increase, you are done.
Which algorithm is best depends on the purpose of using an ANN. Since you talk about training until you "reach input level", I assume you train until the output exactly matches the target value in the data set. In this case the offline algorithm is what you need. If you were building a backgammon playing program, the online algorithm would be better because you have an unlimited data set.
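For comparison, the two schemes differ only in where the weight update happens; here is a rough sketch (grad here is just a stand-in for the per-sample backprop step, not any real library call):

    def train_online(samples, W, lr, grad):
        # online / stochastic: update the weights after every sample
        for x, t in samples:
            W = W - lr * grad(x, t, W)
        return W

    def train_offline(samples, W, lr, grad):
        # offline / batch: accumulate the gradient over the whole set, update once
        total = sum(grad(x, t, W) for x, t in samples)
        return W - lr * total / len(samples)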
In this book, the author talks about how the whole point of the backpropagation algorithm is that it lets you compute all of the weight updates efficiently in one go. In other words, using the "old values" is efficient. Using the new values is more computationally expensive, which is why people use the "old values" to update the weights.

Find areas in matrix..?

Let's say I have a very big matrix with 10000x10000 elements, all having the value '0'. Let's say there are some big 'nests' of '1's. Those areas might even be connected, but only very weakly, by a 'pipe' of '1's.
I want an algorithm that very quickly (and dirty, if necessary) finds these 'nests' of '1's. It shouldn't 'cut apart' two weakly connected 'nests'.
Any idea how I should do such an algorithm?
A pathfinding algorithm like A* (or something simpler like a BFS or DFS) may work in this case.
You can:
find starting points for your searches by locating small nests (ignoring pipes), e.g. at least a 3x3 block of 1's
then pathfind from there through 1's until you have exhausted your "connected component" (poetic license) inside the matrix
repeat, starting from another small block of 1's
I would say it depends on how the data is needed. If, given two points, you need to check if they are in the same block of 1's, I think #Jack's answer is best. This is also true if you have some knowledge of where blocks are initially, as you can use those as starting points for your algorithm.
If you don't have any other information, maybe one of these would be a possibility:
If, given a point, you wish to find all elements in the same block, a flood fill would be appropriate. Then you could cache each nest as you find it; when you get another point, first see if it's in a known nest, and if it isn't, do a flood fill to find its nest and add that to the cache.
As an implementation detail, as you traverse the matrix each row should have available the set of nests present on the previous row. Then you would only need to check new points against those nests, rather than the complete set, to determine if a new point is in a known set or not.
Be sure that you use a set implementation with a very low lookup cost such as a hashtable or possibly a Bloom filter if you can deal with the probabilistic effects.
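The flood fill itself is only a few lines as a BFS; here is a minimal sketch, assuming matrix is a list of lists of 0/1 values (the names are made up for the example):

    from collections import deque

    def flood_fill(matrix, start, visited):
        # BFS from `start`, returning the set of 1-cells connected to it.
        rows, cols = len(matrix), len(matrix[0])
        nest = set()
        queue = deque([start])
        visited.add(start)
        while queue:
            r, c = queue.popleft()
            nest.add((r, c))
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and matrix[nr][nc] == 1 and (nr, nc) not in visited):
                    visited.add((nr, nc))
                    queue.append((nr, nc))
        return nest

The shared visited set doubles as the cache described above: when a new point comes in, check whether it has already been visited before starting another fill.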
1. Turn the matrix into a black&white bitmap
2. Scale the matrix so that nests of size N become a single pixel (so if you look for 10x10 nests, scale by a factor of N=10).
3. Use the remaining pixels of the output to locate the nests. Use the center coordinate (multiplied by the factor above) to locate the same nest in the matrix.
4. Use a low-pass filter to get rid of all "pipes" that connect the nests.
5. Find the border of the nest with a contrast filter on the bitmap.
6. Create a bitmap which doesn't contain the nests (i.e. set all pixels of the nests to 0).
7. Use a filter that widens single pixels to grow the outline of the nests.
8. Bitwise AND the output of 7 and 5 to get the connection points of all pipes.
9. Follow the pipes to see how they connect the nests
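If you want to go the image-processing route without a full filter pipeline, a cheap approximation of the first three steps (bitmap, scale, locate) is to down-scale by block density and label what is left. Here is a rough sketch using numpy and scipy; the block size and the density threshold are arbitrary parameters, not anything prescribed above:

    import numpy as np
    from scipy import ndimage

    def find_nests(matrix, block=10, min_fill=0.5):
        # Down-scale the 0/1 matrix: each block x block tile becomes its fraction of 1's.
        m = np.asarray(matrix)
        h, w = (m.shape[0] // block) * block, (m.shape[1] // block) * block
        density = (m[:h, :w]
                   .reshape(h // block, block, w // block, block)
                   .mean(axis=(1, 3)))
        # Thin pipes mostly vanish at this resolution; dense tiles are nest candidates.
        labels, count = ndimage.label(density >= min_fill)
        if count == 0:
            return []
        centers = ndimage.center_of_mass(density >= min_fill, labels, range(1, count + 1))
        # Scale the tile coordinates back up to matrix coordinates.
        return [(int(r * block), int(c * block)) for r, c in centers]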

Averaging a set of points on a Google Map into a smaller set

I'm displaying a small Google map on a web page using the Google Maps Static API.
I have a set of 15 co-ordinates, which I'd like to represent as points on the map.
Due to the map being fairly small (184 x 90 pixels) and the upper limit of 2000 characters on a Google Maps URL, I can't represent every point on the map.
So instead I'd like to generate a small list of co-ordinates that represents an average of the big list.
So instead of having 15 sets, I'd end up with 5 sets, whose positions approximate the positions of the 15. Say there are 3 points that are in closer proximity to each other than to any other point on the map; those points would be collapsed into 1 point.
So I guess I'm looking for an algorithm that can do this.
Not asking anyone to spell out every step, but perhaps point me in the direction of a mathematical principle or general-purpose function for this kind of thing?
I'm sure a similar function is used in, say, graphics software, when pixellating an image.
(If I solve this I'll be sure to post my results.)
I recommend K-means clustering when you need to cluster N objects into a known number K < N of clusters, which seems to be your case. Note that one cluster may end up with a single outlier point and another with say 5 points very close to each other: that's OK, it will look closer to your original set than if you forced exactly 3 points into every cluster!-)
If you are searching for such functions/classes, have a look at MarkerClusterer and MarkerManager utility classes. MarkerClusterer closely matches the described functionality, as seen in this demo.
In general I think the area you need to search around in is "Vector Quantization". I've got an old book titled Vector Quantization and Signal Compression by Allen Gersho and Robert M. Gray which provides a bunch of examples.
From memory, the Lloyd iteration is a good algorithm for this sort of thing. It can take the input set and reduce it to a fixed-size set of points. Basically: uniformly or randomly distribute your output points around the space. Map each of your inputs to the nearest output point. Then compute the error (e.g. sum of distances or root-mean-squared error). Then, for each output point, set it to the center of the set of inputs that maps to it. This will move the point and possibly even change the set that maps to it. Perform this iteratively until no changes are detected from one iteration to the next.
Hope this helps.
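For what it's worth, a bare-bones Lloyd/k-means iteration over (lat, lng) pairs only takes a few lines of Python (the names and convergence test are just for illustration; for real map distances you would want something better than plain Euclidean distance on lat/lng):

    import math, random

    def reduce_points(points, k=5, max_iter=100):
        # points: list of (lat, lng) tuples; returns k representative points.
        centers = random.sample(points, k)
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[i].append(p)
            new_centers = [
                tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                if cluster else centers[i]
                for i, cluster in enumerate(clusters)
            ]
            if new_centers == centers:           # nothing moved: we're done
                break
            centers = new_centers
        return centers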

Resources