I have hit a real problem. I need to do k-means clustering for 5 million vectors, each with about 32 columns.
I tried Mahout, which requires Linux, but I am on Windows and am not allowed to use a Linux OS or any kind of emulator.
Can anyone suggest a k-means clustering implementation that scales to 5M vectors and converges quickly?
I have tested a few, but they don't scale: they are slow and take forever to complete.
Thanks
OK, so for anyone who wants to cluster large-scale datasets, the only way I found was to use Mahout. It requires a Linux platform, so I had to use VirtualBox, install Ubuntu on it, and then run Mahout there. Setting up Mahout is a lengthy procedure, but the two links I used are as follows.
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
I am currently running h2o's DRF algorithm on a 3-node EC2 cluster (the h2o server spans all 3 nodes).
My data set has 1m rows and 41 columns (40 predictors and 1 response).
I use the R bindings to control the cluster, and the RF call is as follows:
model = h2o.randomForest(x = x,
                         y = y,
                         ignore_const_cols = TRUE,
                         training_frame = train_data,
                         seed = 1234,
                         mtries = 7,
                         ntrees = 2000,
                         max_depth = 15,
                         min_rows = 50,
                         stopping_rounds = 3,
                         stopping_metric = "MSE",
                         stopping_tolerance = 2e-5)
For the 3-node cluster (c4.8xlarge, enhanced networking turned on), this takes about 240 sec; CPU utilization is between 10-20%, RAM utilization between 20-30%, and network transfer between 10-50 MB/sec (in and out). 300 trees are built before early stopping kicks in.
On a single-node cluster, I can get the same results in about 80sec. So, instead of an expected 3-fold speed up, I get a 3-fold slow down for the 3-node cluster.
I did some research and found a few resources that were reporting the same issue (not as extreme as mine though). See, for instance:
https://groups.google.com/forum/#!topic/h2ostream/bnyhPyxftX8
Specifically, the author of http://datascience.la/benchmarking-random-forest-implementations/ notes that
While not the focus of this study, there are signs that running the distributed random forests implementations (e.g. H2O) on multiple nodes does not provide the speed benefit one would hope for (because of the high cost of shipping the histograms at each split over the network).
Also https://www.slideshare.net/0xdata/rf-brighttalk points at 2 different DRF implementations, where one has a larger network overhead.
I think that I am running into the same problems as described in the links above.
How can I improve h2o's DRF performance on a multi-node cluster?
Are there any settings that might improve runtime?
Any help highly appreciated!
If your Random Forest is slower on a multi-node H2O cluster, it just means that your dataset is not big enough to take advantage of distributed computing. There is an overhead to communicate between cluster nodes, so if you can train your model successfully on a single node, then using a single node will always be faster.
Multi-node is designed for when your data is too big to train on a single node. Only then will it be worth using multiple nodes. Otherwise, you are just adding communication overhead for no reason and will see the kind of slowdown that you observed.
If your data fits into memory on a single machine (and you can successfully train a model w/o running out of memory), the way to speed up your training is to switch to a machine with more cores. You can also play around with certain parameter values which affect training speed to see if you can get a speed-up, but that usually comes at a cost in model performance.
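For illustration only, here is a sketch of the kind of parameter changes that usually trade a bit of accuracy for speed; the specific values are arbitrary examples rather than recommendations, and x, y and train_data are as in the question:
model_fast = h2o.randomForest(x = x,
                              y = y,
                              training_frame = train_data,
                              ignore_const_cols = TRUE,
                              seed = 1234,
                              mtries = 7,
                              ntrees = 2000,           # keep high and rely on early stopping
                              max_depth = 10,          # shallower trees than 15
                              min_rows = 100,          # larger leaves than 50
                              sample_rate = 0.6,       # train each tree on fewer rows
                              nbins = 16,              # coarser histograms per split
                              stopping_rounds = 3,
                              stopping_metric = "MSE",
                              stopping_tolerance = 2e-5)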
As Erin says, often adding more nodes just adds the capability for bigger data sets, not quicker learning. Random forest might be the worst; I get fairly good results with deep learning (e.g. 3x quicker with 4 nodes, 5-6x quicker with 8 nodes).
In your comment on Erin's answer you mention the real problem is you want to speed up hyper-parameter optimization? It is frustrating that h2o.grid() doesn't support building models in parallel, one on each node, when the data will fit in memory on each node. But you can do that yourself, with a bit of scripting: set up one h2o cluster on each node, do a grid search with a subset of hyper-parameters on each node, have them save the results and models to S3, then bring the results in and combine them at the end. (If doing a random grid search, you can run exactly the same grid on each cluster, but it might be a good idea to explicitly use a different seed on each.)
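A rough sketch of that per-node script follows; the host/port, file paths, grid_id, hyper-parameter subset and seeds are all placeholders, and x and y are as in the question. Run it on each node with a different seed and grid_id, then pull the saved models and result tables together afterwards.
library(h2o)

# Connect to the H2O instance running on this node (host/port are placeholders).
h2o.init(ip = "localhost", port = 54321)

# Placeholder path; each node loads the same training data.
train_data = h2o.importFile("s3://my-bucket/train.csv")

# Random search over a subset of hyper-parameters; give each node its own seed and grid_id.
grid = h2o.grid("randomForest",
                grid_id = "rf_grid_node1",
                x = x, y = y,
                training_frame = train_data,
                hyper_params = list(mtries = c(5, 7, 10),
                                    max_depth = c(10, 15, 20),
                                    min_rows = c(20, 50, 100)),
                search_criteria = list(strategy = "RandomDiscrete",
                                       max_models = 20,
                                       seed = 1001),   # e.g. 1002, 1003 on the other nodes
                ntrees = 2000,
                stopping_rounds = 3,
                stopping_metric = "MSE",
                stopping_tolerance = 2e-5)

# Save the sorted results and the models; copy them to S3 and combine across nodes at the end.
results = h2o.getGrid(grid_id = "rf_grid_node1", sort_by = "mse")
print(results)
for (model_id in unlist(results@model_ids)) {
  h2o.saveModel(h2o.getModel(model_id), path = "/tmp/rf_grid_node1", force = TRUE)
}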
I've tried to train a --oaa Vowpal Wabbit classifier on 10M+ training examples and found that it uses only one core. Is there any way to make it use all 12 cores?
VW uses two threads: one for loading and parsing the input data and one for the machine learning.
VW comes with a spanning_tree tool for parallel execution (AllReduce) of several VW instances on a cluster (e.g. Hadoop) or on a single machine (--span_server localhost).
That said, I think 12 cores are not enough for AllReduce to pay off. For the best results, you need to do hyper-parameter search anyway, so you can do it in parallel using the 12 cores.
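As a rough illustration of that idea, here is a sketch that simply drives 12 independent vw runs in parallel, using R's parallel package only because it happens to be a convenient driver; the training file, the number of classes for --oaa and the hyper-parameter grid are made-up placeholders:
library(parallel)

# Hypothetical hyper-parameter grid: one independent vw run per setting.
grid = expand.grid(l = c(0.1, 0.25, 0.5, 1.0),    # learning rates
                   b = c(24, 26, 28))             # feature-hash bits

run_vw = function(i) {
  args = c("-d", "train.vw",                        # placeholder training file
           "--oaa", "10",                           # placeholder number of classes
           "-l", grid$l[i],
           "-b", grid$b[i],
           "--passes", "3",
           "--cache_file", sprintf("cache_%d", i),  # separate cache per run
           "-f", sprintf("model_%d.vw", i))
  system2("vw", args,
          stdout = sprintf("vw_%d.log", i),
          stderr = sprintf("vw_%d.err", i))
}

# One vw process per core (mclapply forks, so use parLapply instead on Windows).
invisible(mclapply(seq_len(nrow(grid)), run_vw, mc.cores = 12))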
Is Hadoop a proper solution for jobs that are CPU-intensive and only need to process a small file of around 500 MB? I have read that Hadoop is aimed at processing so-called Big Data, and I wonder how it performs with a small amount of data (but a CPU-intensive workload).
I would mainly like to know whether a better approach exists for this scenario, or whether I should stick with Hadoop.
Hadoop is a distributed computing framework providing a MapReduce engine. If you can express your parallelizable, CPU-intensive application in this paradigm (or in any other supported by Hadoop modules), you may be able to take advantage of Hadoop.
A classic example of a Hadoop computation is the calculation of Pi, which doesn't need any input data. As you'll see here, Yahoo managed to determine the two-quadrillionth digit of Pi thanks to Hadoop.
However, Hadoop is indeed specialized for Big Data in the sense that it was developed for that purpose. For instance, it provides a file system designed to hold huge files. These huge files are chunked into many blocks across a large number of nodes, and to ensure data integrity each block is replicated to other nodes.
To conclude, I'd say that if you already have a Hadoop cluster at your disposal, you may want to take advantage of it.
If that's not the case, then, while I can't recommend anything specific since I don't know exactly what your needs are, I think you can find more lightweight frameworks than Hadoop.
Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs, possibly on many nodes. For this you should use a scalable language especially designed for the problem, in other words Scala. Using Scala with Spark is much, much easier and much, much faster than Hadoop.
If you don't have access to a cluster, it can still be a good idea to use Spark anyway so that you can use it more easily in the future. Or just use .par in Scala, which will parallelize your code and use all the CPUs on your local machine.
Finally, Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we face for data normalization: a need for parallel processing on cheap hardware and software, with ease of use, instead of going through all the special programming required for traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations; indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was to handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general-purpose business use case.
Thus, Hadoop in its commonly described form is not targeted at the use case you and we have. But Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a specially built hardware environment, PSIKLOPS, that is powerful for small clusters (1-10 nodes), with enough power at low cost to run 4-20 parallel jobs. We will be showcasing this in a series of webcasts by Inside Analysis titled Tech Lab, in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use for launching multiple concurrent containers of custom Java.
I want to run k-means clustering in Hadoop pseudo-distributed mode. I have 5 million vectors in a .mat file, with 38 numeric features per vector, like this:
0 0 1 0 0 0 0 0 0 0 0 0 ...
I've run the examples that I've found, like Reuters (https://mahout.apache.org/users/clustering/k-means-clustering.html) or synthetic data. I know I have to convert these vectors to a SequenceFile, but I don't know whether I have to do anything else first.
I'm using Mahout 0.7 and Hadoop 1.2.1.
Yes, you need a small preprocessing step.
Since the generated MAT file is a binary file, converting it into a text file (.txt), with each line being a vector of 38 feature values, would be the first step (a sketch of this follows below).
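For instance, if R happens to be available, a minimal sketch of that conversion using the R.matlab package (the file name and the variable name inside the .mat file are placeholders, and this assumes the file is not in the v7.3/HDF5 format):
library(R.matlab)   # assumption: R and the R.matlab package are available

mat = readMat("vectors.mat")   # placeholder file name
X = mat$vectors                # placeholder: whatever the 5M x 38 matrix is called inside the file

# One vector per line, 38 space-separated feature values, no row or column names.
write.table(X, file = "vectors.txt",
            row.names = FALSE, col.names = FALSE, sep = " ")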
Then, using seqdirectory (or writing your own SequenceFile writer to get it done) would be the next step, and all the other steps follow as in the Reuters example.
An example of writing your own SequenceFile writer is given in "How to convert .txt file to Hadoop's sequence file format".
I did the same for Mahout LDA, where I wrote my own SequenceFile writer and gave its output as input to the next step in the LDA process, namely seq2sparse.
Never use pseudo-distributed mode
Mahout only pays off if you have data that is way too large to be analyzed on a single computer, i.e. when you really need at least a dozen machines to hold and process the data.
The reason is the architecture: Mahout is built on top of MapReduce and relies on writing plenty of interim data to disk in order to be able to recover from crashes.
In pseudo-distributed mode, it cannot recover from such crashes well anyway.
Pseudo-distributed mode is okay if you want to learn installing and configuring Mahout, without having access to a real cluster. It is not reasonable to use for analyzing real data.
Instead, use the functionality built into Matlab, or use a clustering tool designed for single nodes, such as ELKI. It will usually outperform Mahout by an order of magnitude because it does not write everything to disk several times. In my experiments, these tools were able to outperform a 10-core Mahout cluster by a factor of 10 on a single core, because I/O cost completely dominates the runtime.
Benchmark yourself
If you don't trust me on this, benchmark it yourself. Load the Reuters data into Matlab and cluster it there. I'm pretty sure Matlab will make Mahout look like an old fad.
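If Matlab isn't to hand, the same single-node point can be checked with a quick R sketch at the scale of the question above (5 million rows x 38 features); the data here is synthetic and k is arbitrary, since the sketch is only meant to gauge runtime and memory:
set.seed(42)

# Synthetic stand-in with the same shape as the data in question: 5M x 38 doubles
# (~1.5 GB, so a few GB of RAM are needed).
n = 5e6
d = 38
X = matrix(runif(n * d), nrow = n, ncol = d)

# Plain Lloyd-style k-means; k and the iteration cap are arbitrary choices for the benchmark.
system.time(
  fit <- kmeans(X, centers = 20, iter.max = 10, algorithm = "Lloyd")
)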
I have a dataset on which I need to run PCA (Principal Component Analysis, a dimensionality reduction technique), which is easy to do using Weka.
Since the dataset is large, Weka runs into memory issues, which I hope can be resolved by linking Weka with Hadoop so that the algorithm runs via Weka on a server. Could anyone help me with this? How can I connect Weka with Hadoop to deal with the larger dataset? Please help!
Thank you.
Weka 3.7 has new packages for distributed processing in Hadoop. One of the jobs provided by these packages will compute a correlation (or covariance) matrix in Hadoop. The user can optionally have the job use the correlation matrix as input to a PCA analysis (this part runs outside of Hadoop) and produce a "trained" Weka PCA filter. This scales Weka's PCA analysis in the number of instances (but not in the number of original features since the PCA computation still happens locally on the client machine).
For more info on the Hadoop packages see:
http://markahall.blogspot.co.nz/2013/10/weka-and-hadoop-part-1.html
The distributedWekaHadoop package can be installed via the package manager in Weka 3.7.
Cheers,
Mark.
Depending on the algorithm, it may be quite complex to rewrite it to use Hadoop.
You can use Apache Mahout instead. It does have support for PCA.