Elbow method for determining the number of clusters on Mahout - Hadoop

I'm using Mahout for clustering and I have implemented the elbow method for determining the number of clusters, so that I wouldn't have to specify it.
I have tried this on one machine, but now I'm having doubts when it comes to a cluster of computers.
I plan to use Oozie to simulate looping (running the clustering algorithm repeatedly, incrementing the number of clusters by one each time). I read that Oozie is used for DAGs, but I see no other way of doing this.
The question is, does this look like a sound approach? If not, any alternatives?
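For context, here is the kind of elbow detection I use once I have the within-cluster sum of squares (WCSS) for each k: pick the k whose point lies farthest from the straight line joining the first and last points of the WCSS curve. A minimal, self-contained Java sketch; the WCSS values in main are placeholders for what each Mahout run would produce:

    // Elbow heuristic: pick the k whose WCSS point lies farthest from the
    // straight line between the first and last points of the curve.
    public class ElbowFinder {

        // wcss[i] is the within-cluster sum of squares obtained with k = kMin + i.
        static int findElbow(double[] wcss, int kMin) {
            int n = wcss.length;
            // Line endpoints: (kMin, wcss[0]) and (kMin + n - 1, wcss[n - 1]).
            double x1 = kMin, y1 = wcss[0];
            double x2 = kMin + n - 1, y2 = wcss[n - 1];
            double dx = x2 - x1, dy = y2 - y1;
            double norm = Math.sqrt(dx * dx + dy * dy);

            int bestK = kMin;
            double bestDist = -1;
            for (int i = 0; i < n; i++) {
                double x0 = kMin + i, y0 = wcss[i];
                // Perpendicular distance from (x0, y0) to the line.
                double dist = Math.abs(dy * x0 - dx * y0 + x2 * y1 - y2 * x1) / norm;
                if (dist > bestDist) {
                    bestDist = dist;
                    bestK = (int) x0;
                }
            }
            return bestK;
        }

        public static void main(String[] args) {
            // Placeholder WCSS values collected from runs with k = 1..8.
            double[] wcss = {1200, 520, 280, 230, 205, 190, 180, 174};
            System.out.println("Elbow at k = " + findElbow(wcss, 1));
        }
    }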

Related

The design of Clustering using MapReduce

I have got a similarity matrix like this: ItemA, ItemB, Similarity.
I want to cluster the dataset using an algorithm such as k-means with MapReduce, but I don't know how many MapReduce jobs I should use or how to design them.
You cannot use k-means with a similarity matrix. End of story: k-means needs the similarity to the means, not between instances. But there are alternative algorithms. Unfortunately, PAM, for example, scales so badly that it does not pay off to run it on a cluster either.
Other than that, just experiment. Choose as many reducers as you have cores, for example, and as many mappers as your cluster can sustain (unless your data is too tiny; there should be several MB per mapper to make the startup cost pay off).
But I don't think you are ready for that question yet. First figure out what you want to do, then how to set parameters that may or may not arise at all.
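That said, when the time comes, the reducer count is just a job setting in the Hadoop Java API; a minimal sketch (the count of 8 is illustrative, the mapper count follows from the input splits):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Illustrative only: the reducer count is an explicit job setting,
    // while the mapper count follows from the number of input splits.
    public class JobSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "clustering");
            job.setNumReduceTasks(8); // e.g. one reducer per available core
            // ... set input/output paths, mapper/reducer classes, then submit
        }
    }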

Depth First Search using Map Reduce

I have successfully implemented the shortest-path algorithm (Breadth-First Search) in Hadoop MapReduce. However, I have a question:
Is it possible to do a Depth-First Search graph traversal using Hadoop MapReduce?
Any links?
The nature of Depth-First Search makes it inappropriate for MapReduce jobs, because you follow one strict path to the end before backtracking into another one. As a consequence, you can't properly exploit the scalability Hadoop provides. I'm not aware of a working implementation, and I'm pretty sure you won't find one that uses the MapReduce paradigm in a good way.
If you try to implement graph algorithms on Hadoop on your own, you might want to have a look at some useful frameworks like Apache Giraph, xrime or Pegasus. xrime also contains a shortest-path implementation which might be interesting for you.

Hadoop for password cracking

Hi, I came across this article and it made me wonder: how easy would it be for a hacker to crack passwords? What do you think, guys?
If you want to try out several permutations in a brute-force manner, I don't think that using Hadoop would give you any benefit. Hadoop is not something that fits all use cases, and it does not always perform well.
Computing permutations can be done in batches: just set different start and end parameters for each machine. The overhead involved in setting up a job, moving data across nodes, and cleaning up the job can surely be saved. I have seen that running separate processes over 5 nodes, pre-dividing the load equally, performed pretty well compared to MapReduce. Of course, I don't mean that MapReduce is bad; it's just that the scenario wasn't the right fit for getting the job done.
I found this Recursive Algorithm on Distributed Systems post an interesting way to run recursive algorithms on a distributed system. Permutation and combination algorithms can then be used to do some interesting stuff.
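To illustrate the batch approach: if you index the candidate passwords lexicographically, each machine can be handed a contiguous slice of the keyspace with no Hadoop involved. A rough Java sketch (the alphabet, password length, and worker count are made-up parameters):

    import java.math.BigInteger;

    // Splits a fixed-length keyspace over N workers by index range.
    public class KeyspaceSplitter {

        static final char[] ALPHABET = "abcdefghijklmnopqrstuvwxyz".toCharArray();

        // Converts a keyspace index back into the candidate password it denotes.
        static String candidate(BigInteger index, int length) {
            StringBuilder sb = new StringBuilder();
            BigInteger base = BigInteger.valueOf(ALPHABET.length);
            for (int i = 0; i < length; i++) {
                BigInteger[] divRem = index.divideAndRemainder(base);
                sb.append(ALPHABET[divRem[1].intValue()]);
                index = divRem[0];
            }
            return sb.reverse().toString();
        }

        public static void main(String[] args) {
            int length = 6;  // assumed password length
            int workers = 5; // number of machines

            BigInteger total = BigInteger.valueOf(ALPHABET.length).pow(length);
            BigInteger slice = total.divide(BigInteger.valueOf(workers));

            // Print the [start, end) range each worker should brute-force.
            for (int w = 0; w < workers; w++) {
                BigInteger start = slice.multiply(BigInteger.valueOf(w));
                BigInteger end = (w == workers - 1) ? total : start.add(slice);
                System.out.printf("worker %d: %s .. %s%n",
                        w, candidate(start, length), candidate(end.subtract(BigInteger.ONE), length));
            }
        }
    }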

Can I use Hadoop to train a neural network?

I want to train a neural network with the help of Hadoop. We know that when training a neural network, the weights of each neuron are altered every iteration, and each iteration depends on the previous one. I'm new to Hadoop and not quite familiar with the features it provides. Can I chain the iterations with the addDependingJob() method to express the dependencies? Or are there other tricks that can be used to implement the NN with the help of Hadoop?
Any advice will be highly appreciated.
Thanks and Best Regards.
You can write it yourself. If you know how to write backpropagation on a single core from scratch, it can easily be migrated to a MapReduce approach: an HDFS-backed cache should store the current neuron weights, each map task should evaluate its weight updates on its training instances, and then the reduce should sum all these updates and put them back into the cache.
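A rough sketch of the shape one such iteration could take, assuming weights are broadcast to every mapper and each mapper emits per-weight gradient contributions keyed by weight index. loadWeights and backprop are hypothetical stubs standing in for the weight I/O and the gradient computation, and the driver loop is elided:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // One training iteration: each mapper computes per-instance gradient
    // contributions against the current weights; the reducer sums them per weight.
    public class GradientSumJob {

        public static class GradientMapper
                extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {

            private double[] weights; // current weights, loaded once per task

            @Override
            protected void setup(Context context) {
                weights = loadWeights(); // read current weights from HDFS / distributed cache
            }

            @Override
            protected void map(LongWritable offset, Text instance, Context context)
                    throws IOException, InterruptedException {
                double[] gradient = backprop(weights, instance.toString());
                for (int i = 0; i < gradient.length; i++) {
                    context.write(new IntWritable(i), new DoubleWritable(gradient[i]));
                }
            }

            // Hypothetical stubs standing in for real weight I/O and backpropagation.
            private double[] loadWeights() { return new double[10]; }
            private double[] backprop(double[] w, String trainingInstance) {
                return new double[w.length];
            }
        }

        public static class GradientReducer
                extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {

            @Override
            protected void reduce(IntWritable weightIndex, Iterable<DoubleWritable> grads,
                                  Context context) throws IOException, InterruptedException {
                double sum = 0;
                for (DoubleWritable g : grads) {
                    sum += g.get();
                }
                // The driver then applies w[i] -= learningRate * sum and reruns the job.
                context.write(weightIndex, new DoubleWritable(sum));
            }
        }
    }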

Which data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?

Here's my scenario. Consider a set of events that happen at various places and times - as an example, consider someone high above recording the lightning strikes in a city during a storm. For my purpose, lightning strikes are instantaneous and can only hit certain locations (such as high buildings). Also imagine each strike has a unique id so one can reference it later. There are about 100,000 such locations in this city (as you can guess, this is an analogy, as my current employer is sensitive about the actual problem).
For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of clusters of more than one event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What counts as 'short' could be predefined for a given clustering attempt. That is, I can set it to, say, 3 minutes, then run the algorithm; later try with 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine a 'strength' of clustering and recommend that, for a given input, the most compact clustering is achieved by using a particular value for 'short', but this is not required initially.
For phase 2, I'd like to take into consideration the amplitude of the strike (i.e., a real number) and look for clusters that are both within a short time and with similar amplitudes.
I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified a priori. I'm not asking for someone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would be useful in order to save some time. Specifically, what clustering algorithms are appropriate when the number of clusters is unknown?
Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them per location. So each location has its own time series of events that can thus be analyzed independently.
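To make that concrete, here is a minimal Java sketch of the per-location grouping I have in mind (the timestamps are made-up examples and 'short' is set to 3 minutes): sort a location's strike times and start a new cluster whenever the gap to the previous strike exceeds the threshold, keeping only clusters of more than one event.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Groups one location's strike timestamps into clusters: a new cluster
    // starts whenever the gap to the previous strike exceeds the threshold.
    public class GapClustering {

        static List<List<Long>> cluster(List<Long> times, long shortMillis) {
            List<List<Long>> clusters = new ArrayList<>();
            if (times.isEmpty()) return clusters;

            List<Long> sorted = new ArrayList<>(times);
            Collections.sort(sorted);

            List<Long> current = new ArrayList<>();
            current.add(sorted.get(0));
            for (int i = 1; i < sorted.size(); i++) {
                if (sorted.get(i) - sorted.get(i - 1) > shortMillis) {
                    if (current.size() > 1) clusters.add(current); // keep only real clusters
                    current = new ArrayList<>();
                }
                current.add(sorted.get(i));
            }
            if (current.size() > 1) clusters.add(current);
            return clusters;
        }

        public static void main(String[] args) {
            // Example: strike times (ms) at a single location, 'short' = 3 minutes.
            List<Long> times = List.of(0L, 60_000L, 120_000L, 3_600_000L, 3_700_000L, 9_000_000L);
            System.out.println(cluster(times, 3 * 60_000L));
        }
    }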
Some technical details:
- as the dataset is not that large, it can fit all in memory.
- parallel processing is nice to have, but not essential. I only have a 4-core machine, and MapReduce and Hadoop would be too much.
- the language I'm most familiar with is Java. I haven't yet used R, and its learning curve would probably be too much for the time I've been given. I'll have a look at it anyway in my spare time.
- for the time being, using tools to run the analysis is ok, I don't have to produce just code. I'm mentioning this because probably Weka will be suggested.
- visualization would be useful. As the dataset is fairly large, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI, it's just a nice capability to use for checking the results produced with a tool.
Thank you. Questions that I found useful are: How to find center of clusters of numbers? statistics problem?, Clustering Algorithm for Paper Boys, Java Clustering Library, How to cluster objects (without coordinates), Algorithm for detecting "clusters" of dots
I would suggest you look into Mean Shift Clustering. The basic idea behind mean shift clustering is to take the data and perform a kernel density estimation, then find the modes in the density estimate; the regions of convergence of data points towards the modes define the clusters.
The nice thing about mean shift clustering is that the number of clusters does not have to be specified ahead of time.
I have not used Weka, so I am not sure if it has mean shift clustering. However if you are using MATLAB, here is a toolbox (KDE toolbox) to do it. Hope that helps.
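Since you mentioned Java, here is a minimal 1-D mean-shift sketch just to show the mechanics (Gaussian kernel; the bandwidth of 0.5 and the sample data are arbitrary). Points whose iterates converge to the same mode belong to the same cluster:

    import java.util.Arrays;

    // Minimal 1-D mean shift: each point is iteratively moved to the
    // weighted mean of its neighborhood under a Gaussian kernel.
    public class MeanShift1D {

        static double[] meanShift(double[] data, double bandwidth) {
            double[] modes = data.clone();
            for (int i = 0; i < modes.length; i++) {
                double x = modes[i];
                for (int iter = 0; iter < 100; iter++) {
                    double num = 0, den = 0;
                    for (double p : data) {
                        double d = (p - x) / bandwidth;
                        double w = Math.exp(-0.5 * d * d); // Gaussian kernel weight
                        num += w * p;
                        den += w;
                    }
                    double next = num / den;
                    if (Math.abs(next - x) < 1e-6) break; // converged to a mode
                    x = next;
                }
                modes[i] = x;
            }
            return modes; // points sharing (approximately) a mode share a cluster
        }

        public static void main(String[] args) {
            double[] data = {1.0, 1.1, 1.2, 5.0, 5.1, 9.7};
            // Each entry converges to the mode of its cluster.
            System.out.println(Arrays.toString(meanShift(data, 0.5)));
        }
    }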
Couldn't you just use hierarchical clustering with the difference in times of strikes as part of the distance metric?
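For phase 2, that metric could also fold in the amplitude; a small sketch of such a combined distance (the scale factors are made up and would need tuning):

    // Combined distance for hierarchical clustering: strikes are close if
    // they are close in time AND similar in amplitude. Scales are arbitrary.
    public class StrikeDistance {

        static double distance(long timeA, double ampA, long timeB, double ampB) {
            double dt = Math.abs(timeA - timeB) / 180_000.0; // time gap, in units of 'short' (3 min)
            double da = Math.abs(ampA - ampB) / 10.0;        // amplitude gap, assumed scale
            return Math.sqrt(dt * dt + da * da);             // Euclidean combination
        }

        public static void main(String[] args) {
            // Two strikes one minute and 2 amplitude units apart.
            System.out.println(distance(0L, 12.0, 60_000L, 14.0));
        }
    }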
It is too late, but I would still add it:
In R, there is a package fpc which has a method pamk() that provides you the clusters. Using pamk(), you do not need to specify the number of clusters initially; it computes the number of clusters in the input data itself.

Resources