K-Means clustering in OpenIMAJ library - hadoop

I'm not very experienced in machine learning and cluster analysis, but I have the following problem:
I have ~100kk-1000kk (roughly 100 million to 1 billion) pieces of data which I cannot load into memory all at once, and I need to divide them into a number of classes (1-10k or even 100k classes) for further analysis. To do that I've chosen the K-Means algorithm implemented in the OpenIMAJ library (the FloatKMeans class).
I understand that the K-Means algorithm can be divided into 2 phases:
Learning phase - where I pass in all the data I have to create/fill the classes
Assignment phase - where I can ask the cluster which class a given piece of data belongs to
I'm planning to build the cluster model using the Hadoop reduce phase, where I'll receive the data pieces one by one (that's why I cannot pass the data to the algorithm all at once).
My questions are:
Is the OpenIMAJ implementation optimal for such a 'big data' use case? Won't it take forever to calculate?
Is it possible to 'stream' the data into the algorithm during the Hadoop reduce phase in order to learn the cluster?
Is it possible to save the learned cluster (model) as bytes in order to pass the model to the next Hadoop job?
Is it OK to run the algorithm's assignment phase during Hadoop mapping?
Thanks for the help

K-Means clustering is an iterative algorithm that makes multiple passes over the data. In each pass, points are assigned to cluster centroids and then after all points have been assigned, the cluster centroids are recomputed to be the mean of the assigned points. You can't "stream" data to the algorithm in the traditional sense as you'll need to come back to it during the subsequent iterations.
Regarding the OpenIMAJ FloatKMeans implementation: yes, this can handle "big data" in the sense that it doesn't mind where it gets the data from - the DataSource instance that it takes as input can read data from disk if necessary. The only requirement is that you can hold all the centroids in memory during the runtime of the algorithm. The implementation is multi-threaded, so all CPU cores can be used during computation. There is example code here: https://github.com/openimaj/openimaj/blob/master/demos/examples/src/main/java/org/openimaj/examples/ml/clustering/kmeans/BigDataClusterExample.java.
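For reference, a minimal sketch of the in-memory path looks roughly like this (the factory-method signatures are written from memory and can differ between OpenIMAJ versions; for data that genuinely doesn't fit in RAM you would hand the clusterer a DataSource<float[]> as in the linked example):

import org.openimaj.ml.clustering.FloatCentroidsResult;
import org.openimaj.ml.clustering.kmeans.FloatKMeans;

public class KMeansSketch {
    public static void main(String[] args) {
        // Toy in-memory data: 10,000 points of dimension 128. For genuinely
        // big data, pass a disk-backed DataSource<float[]> instead.
        float[][] data = new float[10000][128];

        // Exact K-Means with 100 clusters; the number of iterations can also
        // be supplied to the factory method.
        FloatKMeans kmeans = FloatKMeans.createExact(100);
        FloatCentroidsResult result = kmeans.cluster(data);

        // Assignment phase: map a point to its nearest centroid.
        int clusterId = result.defaultHardAssigner().assign(data[0]);
        System.out.println("assigned to cluster " + clusterId);
    }
}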
The OpenIMAJ IOUtils.writeBinary(...) methods can be used to save the resultant cluster centroids in the FloatCentroidsResult object.
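Something along these lines should work for passing the model between jobs (the exact IOUtils.read signature is written from memory, so treat this as a sketch rather than gospel):

import java.io.File;
import java.io.IOException;
import org.openimaj.io.IOUtils;
import org.openimaj.ml.clustering.FloatCentroidsResult;

public class CentroidIOSketch {
    // Persist the learned centroids so a later Hadoop job can reload them.
    public static void save(FloatCentroidsResult result, File file) throws IOException {
        IOUtils.writeBinary(file, result);
    }

    // Reload them (e.g. in a Mapper's setup()) and run the assignment phase.
    public static int loadAndAssign(File file, float[] point) throws IOException {
        FloatCentroidsResult centroids = IOUtils.read(file, FloatCentroidsResult.class);
        return centroids.defaultHardAssigner().assign(point);
    }
}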
One of the biggest costs in K-Means is the computation of distances between each data point and each cluster centroid in order to find the closest. The cost of this is related to the dimensionality of the data and the number of centroids. If you've got a large number of centroids and high dimensional data, then using an approximate K-Means implementation can have big speed benefits at the cost of a slight loss in accuracy (see FloatKMeans.createKDTreeEnsemble() for example -- this uses an ensemble of KD-Trees to speed neighbour computations).
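Swapping to the approximate version is just a different factory call (class and method names again from memory):

import org.openimaj.ml.clustering.FloatCentroidsResult;
import org.openimaj.ml.clustering.kmeans.FloatKMeans;

public class ApproxKMeansSketch {
    public static void main(String[] args) {
        // Nearest-centroid lookups use an ensemble of KD-Trees rather than
        // exhaustive distance computations: faster for many clusters and
        // high-dimensional data, at a small cost in accuracy.
        FloatKMeans approx = FloatKMeans.createKDTreeEnsemble(1000);
        FloatCentroidsResult result = approx.cluster(new float[100000][128]);
        System.out.println(result.centroids.length + " centroids learned");
    }
}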
Regarding integration with Hadoop, it is possible to implement K-Means as a series of Map-Reduce tasks (each pair corresponds to an iteration of the algorithm). See this paper for a discussion: http://eprints.soton.ac.uk/344243/1/paper.pdf . If you want to go down this route, OpenIMAJ has a very rough implementation here, which you could build off: https://github.com/openimaj/openimaj/tree/master/hadoop/tools/HadoopFastKMeans. As mentioned in the linked paper, Apache Mahout also contains an implementation: https://mahout.apache.org. One problem with both of these implementations is that they require quite a lot of data to be transferred between the mappers and reducer (each mapper emits the current data point and its assigned cluster id). Depending on the extent of this, it could be faster to use a non-Hadoop implementation of the algorithm, but that would depend on what processing resources you have available and the nature of the dataset. The problem of data transfer between the map and reduce could probably also be reduced with a clever Hadoop Combiner that computes weighted centroids from subsets of the data and then passes these to a (modified) reducer to compute the actual centroids.
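To make the shape of one iteration concrete, here is a rough, hedged sketch that also includes the combiner idea from the last sentence (the record encodings and the "kmeans.centroids" configuration key are invented for illustration; this is not the HadoopFastKMeans or Mahout code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One K-Means iteration as a MapReduce pass. Input points are lines of
// comma-separated floats; the current centroids are passed via the job
// configuration for brevity (a real job would use the distributed cache).
// The combiner collapses each mapper's points into per-cluster (count, sum)
// records so far less data crosses the network.
public class KMeansIterationSketch {

    static float[] parse(String csv) {
        String[] parts = csv.split(",");
        float[] v = new float[parts.length];
        for (int i = 0; i < parts.length; i++) v[i] = Float.parseFloat(parts[i]);
        return v;
    }

    static String join(float[] v) {
        StringBuilder sb = new StringBuilder().append(v[0]);
        for (int i = 1; i < v.length; i++) sb.append(',').append(v[i]);
        return sb.toString();
    }

    // Accumulated (count, per-dimension sum) for one cluster.
    static class Partial {
        long count;
        float[] sum;
    }

    static Partial accumulate(Iterable<Text> values) {
        Partial p = new Partial();
        for (Text t : values) {
            String[] parts = t.toString().split(",", 2); // "count,v0,v1,..."
            float[] v = parse(parts[1]);
            if (p.sum == null) p.sum = new float[v.length];
            for (int i = 0; i < v.length; i++) p.sum[i] += v[i];
            p.count += Long.parseLong(parts[0]);
        }
        return p;
    }

    // Assigns each point to its nearest current centroid.
    public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
        private float[][] centroids;

        @Override
        protected void setup(Context ctx) {
            // Centroid rows separated by ';', e.g. "0.1,0.2;0.5,0.9" (illustrative).
            String[] rows = ctx.getConfiguration().get("kmeans.centroids").split(";");
            centroids = new float[rows.length][];
            for (int i = 0; i < rows.length; i++) centroids[i] = parse(rows[i]);
        }

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            float[] point = parse(value.toString());
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = 0;
                for (int i = 0; i < point.length; i++) {
                    double diff = point[i] - centroids[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            ctx.write(new IntWritable(best), new Text("1," + join(point)));
        }
    }

    // Combiner: merges a mapper's points into one (count, sum) record per cluster.
    public static class SumCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable cluster, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            Partial p = accumulate(values);
            ctx.write(cluster, new Text(p.count + "," + join(p.sum)));
        }
    }

    // Reducer: divides the sums by the counts to produce the new centroids.
    public static class CentroidReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable cluster, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            Partial p = accumulate(values);
            for (int i = 0; i < p.sum.length; i++) p.sum[i] /= p.count;
            ctx.write(cluster, new Text(join(p.sum)));
        }
    }
}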

Related

MinMax algorithm implementation in map-reduce paradigm

I have some data in HBase tables (a few billion records). I have to process them to score the stored documents. What are the possible algorithms that can be implemented and applied in the MapReduce paradigm?
I have tried to deploy a MinMax algorithm, but due to its requirements all the data is shifted to a single node in the reducer phase (to find the min and max values). Because of that, the GC overhead limit is reached, which was quite expected, as a single node does not have enough memory to process all the data in one go.
Is there any other option available for HBase document ranking (scoring) in the MapReduce paradigm?
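For what it's worth, min and max are associative, so the usual way around the single-reducer bottleneck is a Combiner that collapses each mapper's output to its local extremes before anything is shuffled; the lone reducer then only sees a handful of values instead of billions. A minimal, hedged sketch (the "min"/"max" key names and the assumption that mappers emit each score under both keys are purely illustrative):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Usable as both the Combiner and the Reducer, since its input and output
// record formats are identical (it is idempotent).
public class MinMaxCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
            throws IOException, InterruptedException {
        // Mappers emit every score twice: once under the key "min" and once
        // under "max". Each group is collapsed to a single extreme here.
        boolean isMin = key.toString().equals("min");
        double best = isMin ? Double.MAX_VALUE : -Double.MAX_VALUE;
        for (DoubleWritable v : values) {
            best = isMin ? Math.min(best, v.get()) : Math.max(best, v.get());
        }
        ctx.write(key, new DoubleWritable(best));
    }
}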

Does it make sense to solve Word Count on a Hadoop cluster?

Many tutorials on Hadoop MapReduce begin with the Word Count example. However, I remember from my distributed computing class (which was before Hadoop's birth) that computing in a distributed fashion results in a speed up only when the subtasks are of coarse granularity, which means that the time of computation exceeds the time of communication. In Word Count, the time complexity (if done with hash tables and assuming a constant limit on the word length) is linear. Hence it seems that paying the cost of transferring the input file to HDFS and of the subsequent Sort & Shuffling phase is not justified. Am I missing something?
Not clear what you are suggesting the alternative is, but WordCount is like printing Hello World in your favorite language.
It teaches you the basic concepts; it is not intended to be the prime example of how to use MapReduce, or really of how to optimize a Hadoop cluster (storing line-delimited text for analysis isn't where Hadoop shines).

Improve h2o DRF runtime on a multi-node cluster

I am currently running h2o's DRF algorithm on a 3-node EC2 cluster (the h2o server spans all 3 nodes).
My data set has 1M rows and 41 columns (40 predictors and 1 response).
I use the R bindings to control the cluster, and the RF call is as follows:
model=h2o.randomForest(x=x,
y=y,
ignore_const_cols=TRUE,
training_frame=train_data,
seed=1234,
mtries=7,
ntrees=2000,
max_depth=15,
min_rows=50,
stopping_rounds=3,
stopping_metric="MSE",
stopping_tolerance=2e-5)
For the 3-node cluster (c4.8xlarge, enhanced networking turned on), this takes about 240sec; the CPU utilization is between 10-20%; RAM utilization is between 20-30%; network transfer is between 10-50MByte/sec (in and out). 300 trees are built until early stopping kicks in.
On a single-node cluster, I can get the same results in about 80sec. So, instead of an expected 3-fold speed up, I get a 3-fold slow down for the 3-node cluster.
I did some research and found a few resources that were reporting the same issue (not as extreme as mine though). See, for instance:
https://groups.google.com/forum/#!topic/h2ostream/bnyhPyxftX8
Specifically, the author of http://datascience.la/benchmarking-random-forest-implementations/ notes that
While not the focus of this study, there are signs that running the distributed random forests implementations (e.g. H2O) on multiple nodes does not provide the speed benefit one would hope for (because of the high cost of shipping the histograms at each split over the network).
Also https://www.slideshare.net/0xdata/rf-brighttalk points at 2 different DRF implementations, where one has a larger network overhead.
I think that I am running into the same problems as described in the links above.
How can I improve h2o's DRF performance on a multi-node cluster?
Are there any settings that might improve runtime?
Any help highly appreciated!
If your Random Forest is slower on a multi-node H2O cluster, it just means that your dataset is not big enough to take advantage of distributed computing. There is an overhead to communicate between cluster nodes, so if you can train your model successfully on a single node, then using a single node will always be faster.
Multi-node is designed for when your data is too big to train on a single node. Only then will it be worth using multiple nodes. Otherwise, you are just adding communication overhead for no reason and will see the type of slowdown that you observed.
If your data fits into memory on a single machine (and you can successfully train a model w/o running out of memory), the way to speed up your training is to switch to a machine with more cores. You can also play around with certain parameter values which affect training speed to see if you can get a speed-up, but that usually comes at a cost in model performance.
As Erin says, often adding more nodes just adds the capability for bigger data sets, not quicker learning. Random forest might be the worst; I get fairly good results with deep learning (e.g. 3x quicker with 4 nodes, 5-6x quicker with 8 nodes).
In your comment on Erin's answer you mention the real problem is you want to speed up hyper-parameter optimization? It is frustrating that h2o.grid() doesn't support building models in parallel, one on each node, when the data will fit in memory on each node. But you can do that yourself, with a bit of scripting: set up one h2o cluster on each node, do a grid search with a subset of hyper-parameters on each node, have them save the results and models to S3, then bring the results in and combine them at the end. (If doing a random grid search, you can run exactly the same grid on each cluster, but it might be a good idea to explicitly use a different seed on each.)

Where to add specific function in the Map/Reduce Framework

I have a general question about the MapReduce framework.
I have a task which can be separated into several partitions. For each partition, I need to run a computation-intensive algorithm.
Then, according to the MapReduce framework, it seems that I have two choices:
Run the algorithm in the map stage, so that in the reduce stage there is no work to be done except collecting the results of each partition from the map stage and summarizing them
In the map stage, just divide the data and send the partitions to the reduce stage. In the reduce stage, run the algorithm first, and then collect and summarize the results from each partition.
Correct me if I misunderstand.
I am a beginner. I may not understand MapReduce very well. I only have basic parallel computing concepts.
You're actually really confused. In a broad and general sense, the map portion takes the task and divides it among some n nodes. Each of those n nodes works on its fraction of the whole task, and when they have finished computing their steps on their data, the reduce operation reassembles the results.
The REAL power of map-reduce is how scalable it is.
Given a dataset D processed on a map-reduce cluster m with n nodes under it, each node is mapped 1/n of D. The cluster m with its n nodes then reduces those pieces into a single result. Now, take a node q to be itself a cluster with p nodes under it. If m assigns q its 1/n share of D, q can in turn map that share across its p nodes. Those p nodes then reduce their results back to q, and q supplies its result to m alongside its neighbours.
Make sense?
In MapReduce, you have a Mapper and a Reducer. You also have a Partitioner and a Combiner.
Hadoop includes a distributed file system (HDFS) that partitions (or splits, you might say) a file into blocks of BLOCK SIZE. These partitioned blocks are placed on different nodes. So, when a job is submitted to the MapReduce framework, it divides that job such that there is a Mapper for every input split (for now, let's say an input split is a partitioned block). Since these blocks are distributed onto different nodes, these Mappers also run on different nodes.
In the Map stage,
The file is divided into records by the RecordReader; the definition of a record is controlled by the InputFormat that we choose. Every record is a key-value pair.
The map() of our Mapper is run for every such record. The output of this step is again key-value pairs.
The output of our Mapper is partitioned using the Partitioner that we provide, or the default HashPartitioner. Here, by partitioning, I mean deciding which key and its corresponding values go to which Reducer (if there is only one Reducer, it's of no use anyway).
Optionally, you can also combine/minimize the output that is being sent to the reducer. You can use a Combiner to do that. Note that the framework does not guarantee the number of times a Combiner will be called; it is only an optimization.
This is where your algorithm on the data is usually written. Since these tasks run in parallel, the map stage makes a good candidate for computation-intensive work.
After all the Mappers have completed running on all nodes, the intermediate data, i.e. the data at the end of the Map stage, is copied to the corresponding reducers.
In the Reduce stage, the reduce() of our Reducer is run on each record of data from the Mappers. Here a record comprises a key and all its corresponding values, not necessarily just one value. This is where you generally run your summarization/aggregation logic.
When you write your MapReduce job you usually think about what can be done on each record of data in both the Mapper and Reducer. A MapReduce program can just contain a Mapper with map() implemented and a Reducer with reduce() implemented. This way you can focus more on what you want to do with the data and not bother about parallelizing. You don't have to worry about how the job is split, the framework does that for you. However, you will have to learn about it sooner or later.
I would suggest you go through Apache's MapReduce tutorial or Yahoo's Hadoop tutorial for a good overview. I personally like Yahoo's explanation of Hadoop, but Apache's details are good and their explanation using the word count program is very nice and intuitive.
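To make that concrete, here is a minimal sketch of the word-count pattern those tutorials walk through; because addition is associative, the summing Reducer can also be registered as a Combiner:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
    // map(): called once per record (here, per line); emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // reduce(): called once per key with all its values; sums the counts.
    // This same class can be used as the Combiner.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}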
Also, for
I have a task which can be separated into several partitions. For each partition, I need to run a computation-intensive algorithm.
The Hadoop distributed file system has the data split onto multiple nodes, and the MapReduce framework assigns a task to every split. So, in Hadoop, the processing goes and executes where the data resides. You cannot define the number of map tasks to run; the data does. You can, however, specify/control the number of reduce tasks.
I hope I have comprehensively answered your question.

Knn search for large data?

I'm interested in performing a kNN search on a large dataset.
There are some libs: ANN and FLANN, but I'm interested in the question: how do you organize the search if you have a database that does not fit entirely into memory (RAM)?
I suppose it depends on how much bigger your index is than the available memory. Here are my first spontaneous ideas:
Supposing it was tens of times the size of the RAM, I would try to cluster my data using, for instance, hierarchical clustering trees (implemented in FLANN). I would modify the implementation of the trees so that they keep the branches in memory and save the leaves (the clusters) on disk. The appropriate cluster would then have to be loaded each time. You could then try to optimize this in different ways.
If it was not that much bigger (let's say twice the size of the RAM), I would separate the dataset into two parts and create one index for each. I would then need to find the nearest neighbour in each part and choose between them.
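A toy illustration of that second idea, with brute force standing in for a real ANN/FLANN index over each half:

public class TwoPartNnSketch {

    static double dist2(float[] a, float[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; d += t * t; }
        return d;
    }

    // Nearest neighbour within one partition (stand-in for a real index query).
    static float[] nearest(float[][] part, float[] query) {
        float[] best = part[0];
        double bestD = dist2(best, query);
        for (float[] p : part) {
            double d = dist2(p, query);
            if (d < bestD) { bestD = d; best = p; }
        }
        return best;
    }

    public static void main(String[] args) {
        float[][] partA = { {0f, 0f}, {5f, 5f} };
        float[][] partB = { {1f, 1f}, {9f, 9f} };
        float[] query = { 0.9f, 1.2f };

        float[] nnA = nearest(partA, query);
        float[] nnB = nearest(partB, query);
        // Merge step: the overall nearest neighbour is the closer of the two.
        float[] nn = dist2(nnA, query) <= dist2(nnB, query) ? nnA : nnB;
        System.out.println(nn[0] + "," + nn[1]); // prints 1.0,1.0
    }
}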
It depends if your data is very high-dimensional or not. If it is relatively low-dimensional, you can use an existing on-disk R-Tree implementation, such as Spatialite.
If it is a higher dimensional data, you can use X-Trees, but I don't know of any on-disk implementations off the top of my head.
Alternatively, you can implement locality-sensitive hashing with on-disk persistence, for example using mmap.
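As a small, hedged illustration of the LSH idea (random-hyperplane signatures; persisting the buckets on disk, e.g. via a memory-mapped file, is not shown):

import java.util.Random;

// Points whose signatures land in the same bucket are likely to be close in
// cosine distance, so only that bucket needs to be loaded from disk and scanned.
public class LshSketch {
    private final float[][] hyperplanes;

    public LshSketch(int numBits, int dimensions, long seed) {
        Random rng = new Random(seed);
        hyperplanes = new float[numBits][dimensions];
        for (float[] h : hyperplanes)
            for (int i = 0; i < dimensions; i++)
                h[i] = (float) rng.nextGaussian();
    }

    // One bit per hyperplane: which side of it the point falls on.
    public int signature(float[] point) {
        int sig = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0;
            for (int i = 0; i < point.length; i++) dot += hyperplanes[b][i] * point[i];
            if (dot >= 0) sig |= 1 << b;
        }
        return sig;
    }

    public static void main(String[] args) {
        LshSketch lsh = new LshSketch(16, 128, 42);
        int bucket = lsh.signature(new float[128]);
        System.out.println("bucket " + bucket);
    }
}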
