First, I would like to thank the H2O team for a great product and rapid development/iteration.
I was testing H2O AutoML on a 4-machine cluster (40 cores, 256 GB of RAM, gigabit bandwidth).
For a 20 MB dataset I am noticing that the cluster is using a lot of network bandwidth and hardly touching the CPU. I was wondering if it makes sense for H2O to train one model per machine instead of trying to train every model on the entire cluster.
AutoML trains H2O models in sequence, so this advice applies to H2O models in general, not just AutoML: if your dataset is small enough, adding machines to your H2O cluster will only slow down training.
For a 20 MB dataset I am noticing that the cluster is using a lot of network bandwidth and hardly touching the CPU.
If you have a 20MB dataset, it's always going to be better to run H2O on a single machine. The overhead of using multiple machines is only worth it when your training frame won't fit into RAM on a single machine.
There is a longer explanation in another Stack Overflow answer I wrote here.
I was wondering if it makes sense for H2O to train one model per machine instead of trying to train every model on the entire cluster.
It does make sense for small data, but H2O was designed to scale to big data (with millions or hundreds of millions of rows), so training several models in parallel is not the design pattern that was used. To speed up the training process, you can use a single machine with more cores.
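For example, here is a minimal sketch of starting a single-node H2O instance from R with all local cores available (the memory value is hypothetical; size it to your machine's RAM):

library(h2o)

# Start a local, single-node H2O instance.
# nthreads = -1 means "use every available core"; max_mem_size is a
# hypothetical value, size it to your machine's RAM.
h2o.init(nthreads = -1, max_mem_size = "200G")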
Related
I set up a cluster with a 4-core (2 GHz) and a 16-core (1.8 GHz) virtual machine. Creating the cluster and connecting to it works without problems. But now I want to do some deep learning on the cluster, and I see a very uneven load across the two virtual machines: the one with 4 cores is always at 100% CPU usage while the 16-core machine is idle most of the time.
Do I have to do any additional configuration when creating the cluster? It seems odd to me that the stronger of the two machines is idle while the weaker one does all the work.
Best regards,
Markus
Two things to keep in mind here.
Your data needs to be large enough to take advantage of data parallelism. In particular, the number of chunks per column needs to be large enough for all the cores to have work to do. See this answer for more details: H2O not working on parallel
H2O-3 assumes your nodes are symmetric. It doesn't try to load balance work across the cluster based on capability of the nodes. Faster nodes will finish their work first and wait idle for the slower nodes to catch up. (You can see this same effect if you have two symmetric nodes but one of them is busy running another process.)
Asymmetry is a bigger problem for memory (where smaller nodes can run out of memory and fail entirely) than it is for CPU (where some nodes are just waiting around). So always make sure to start each H2O node with the same value of -Xmx.
You can limit the number of cores H2O uses with the -nthreads option. So you can try giving each of your two nodes -nthreads 4 and see if they behave more symmetrically with each using roughly four cores. In the case you describe, that would mean the smaller machine is roughly 100% utilized and the larger machine is roughly 25% utilized. (But since the two machines probably have different chips, the cores are probably not identical and won't balance perfectly, which is OK.)
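As a rough sketch (the heap size and cluster name here are hypothetical), that would mean starting each node with the same -Xmx and an explicit thread cap on its java command line:

# Run on each VM: same -Xmx everywhere, four H2O worker threads per node
java -Xmx12g -jar h2o.jar -name mycluster -nthreads 4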
[I'm ignoring the virtualization aspect completely, but CPU shares could also come into the picture depending on the configuration of your hypervisor.]
I am planning to spin up my development cluster for trend analysis for an infrastructure-monitoring application, which I plan to build using Spark for analysing failure trends and Cassandra for storing both the incoming data and the analysed data.
Consider collecting performance metrics from around 25,000 machines/servers (probably the same set of applications on different servers). I am expecting about 2 MB/sec of metrics from each machine, which I plan to push into a Cassandra table with timestamp and server as the primary key, and application along with some important metrics as the clustering key. I will be running Spark jobs on top of this stored data to analyse failure trends in the metrics.
Coming to the question: how many nodes (machines), and with what CPU and memory configuration, do I need to kick-start my cluster for the above scenario?
Cassandra needs a well-planned data model for things to run well. It is very much worth spending time planning things out at this stage, before you have a large data set and find out you would probably have been better off rearranging the data model!
The general rule of thumb is that you shape your model to the queries, while paying attention to avoiding things like very large rows, large deletes, batches and the like, which can carry big performance penalties.
The docs give a good start on planning and testing that you will probably find useful. I would also recommend the cassandra-stress tool. You can use it to push performance tests against your Cassandra cluster to check latencies and catch any performance problems. You can use your own schema too, which I personally think is super useful!
If you are using cloud-based hardware like AWS, then it's relatively easy to scale up/down and see what works best for you. You don't need to throw big hardware at Cassandra; it's easier to scale horizontally than vertically.
I'm assuming you are pulling the data back into a separate Spark cluster for the analytics side, so these nodes would be running plain Cassandra (with lower hardware specs). If, however, you are using the DataStax Enterprise version (where you can run nodes in Spark "mode"), then you will need beefier hardware for the additional load of Spark driver programs, executors and the like. Another good docs link is the DSE hardware recommendations.
I am currently running H2O's DRF algorithm on a 3-node EC2 cluster (the H2O server spans all 3 nodes).
My data set has 1M rows and 41 columns (40 predictors and 1 response).
I use the R bindings to control the cluster, and the RF call is as follows:
model <- h2o.randomForest(x = x,
                          y = y,
                          ignore_const_cols = TRUE,
                          training_frame = train_data,
                          seed = 1234,
                          mtries = 7,
                          ntrees = 2000,
                          max_depth = 15,
                          min_rows = 50,
                          stopping_rounds = 3,
                          stopping_metric = "MSE",
                          stopping_tolerance = 2e-5)
For the 3-node cluster (c4.8xlarge, enhanced networking turned on), this takes about 240 seconds; CPU utilization is between 10 and 20%, RAM utilization is between 20 and 30%, and network transfer is between 10 and 50 MB/sec (in and out). 300 trees are built before early stopping kicks in.
On a single-node cluster, I can get the same results in about 80 seconds. So, instead of the expected 3-fold speed-up, I get a 3-fold slow-down on the 3-node cluster.
I did some research and found a few resources that were reporting the same issue (not as extreme as mine though). See, for instance:
https://groups.google.com/forum/#!topic/h2ostream/bnyhPyxftX8
Specifically, the author of http://datascience.la/benchmarking-random-forest-implementations/ notes that
While not the focus of this study, there are signs that running the distributed random forests implementations (e.g. H2O) on multiple nodes does not provide the speed benefit one would hope for (because of the high cost of shipping the histograms at each split over the network).
Also, https://www.slideshare.net/0xdata/rf-brighttalk points to two different DRF implementations, one of which has a larger network overhead.
I think that I am running into the same problems as described in the links above.
How can I improve h2o's DRF performance on a multi-node cluster?
Are there any settings that might improve runtime?
Any help highly appreciated!
If your Random Forest is slower on a multi-node H2O cluster, it just means that your dataset is not big enough to take advantage of distributed computing. There is an overhead to communicate between cluster nodes, so if you can train your model successfully on a single node, then using a single node will always be faster.
Multi-node is designed for when your data is too big to train on a single node. Only then will it be worth using multiple nodes. Otherwise, you are just adding communication overhead for no reason and will see the type of slowdown that you observed.
If your data fits into memory on a single machine (and you can successfully train a model w/o running out of memory), the way to speed up your training is to switch to a machine with more cores. You can also play around with certain parameter values which affect training speed to see if you can get a speed-up, but that usually comes at a cost in model performance.
As Erin says, often adding more nodes just adds the capability for bigger data sets, not quicker learning. Random forest might be the worst; I get fairly good results with deep learning (e.g. 3x quicker with 4 nodes, 5-6x quicker with 8 nodes).
In your comment on Erin's answer, you mention that the real problem is that you want to speed up hyper-parameter optimization. It is frustrating that h2o.grid() doesn't support building models in parallel, one on each node, when the data will fit in memory on each node. But you can do that yourself, with a bit of scripting: set up one H2O cluster on each node, do a grid search with a subset of hyper-parameters on each node, have them save the results and models to S3, then bring the results in and combine them at the end. (If doing a random grid search, you can run exactly the same grid on each cluster, but it is a good idea to explicitly use a different seed on each.)
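As a rough illustration of that scripting approach (this is not a built-in H2O feature; the IP, hyper-parameter grid, and S3 path below are all hypothetical, and x, y, and train_data are reused from the question's code above, assuming the training frame has already been imported into each node's cluster), the script run against each single-node cluster could look roughly like this, varying only the seed per node:

library(h2o)

# Connect to the single-node H2O cluster already running on this machine
# (one such cluster is started per node beforehand; IP/port are hypothetical).
h2o.init(ip = "10.0.0.1", port = 54321, startH2O = FALSE)

node_seed <- 1  # use a different seed on each node

grid <- h2o.grid("randomForest",
                 grid_id = paste0("rf_grid_node_", node_seed),
                 x = x, y = y,
                 training_frame = train_data,
                 hyper_params = list(mtries = c(5, 7, 9),
                                     max_depth = c(10, 15, 20)),
                 search_criteria = list(strategy = "RandomDiscrete",
                                        max_models = 20,
                                        seed = node_seed))

# Save each model from this node's grid to a shared location (e.g. S3),
# then collect and compare all the results at the end.
for (model_id in grid@model_ids) {
  h2o.saveModel(h2o.getModel(model_id),
                path = "s3://my-bucket/grid-results/", force = TRUE)
}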
I have a 50 GB dataset which doesn't fit in the 8 GB of RAM on my work computer, but that machine does have a 1 TB local hard disk.
The link below from the official documentation mentions that Spark can use the local hard disk if the data doesn't fit in memory.
http://spark.apache.org/docs/latest/hardware-provisioning.html
Local Disks
While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn't fit in RAM, as well as to preserve intermediate output between stages.
For me, computation time is not a priority at all, but fitting the data onto a single computer's RAM/hard disk for processing is more important, due to a lack of alternative options.
Note:
I am looking for a solution which doesn't involve any of the items below:
Increase the RAM
Sample & reduce data size
Use cloud or cluster computers
My end objective is to use Spark MLLIB to build machine learning models.
I am looking for real-life, practical cases where people have successfully used Spark to operate on data that doesn't fit in RAM, in standalone/local mode on a single computer. Has someone done this successfully without major limitations?
Questions
SAS has a similar out-of-core processing capability, with which it can use both RAM and the local hard disk for model building, etc. Can Spark be made to work in the same way when the data is larger than RAM?
SAS persists the complete dataset to the hard disk in ".sas7bdat" format; can Spark do a similar persist to hard disk?
If this is possible, how do I install and configure Spark for this purpose?
Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
You can use various persistence models as per your need. MEMORY_AND_DISK is what will solve your problem. If you want better performance, use MEMORY_AND_DISK_SER, which stores the data in serialized form.
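For example, a minimal SparkR sketch of that approach (the file path and driver memory below are hypothetical); partitions that do not fit in RAM are spilled to the local disk:

library(SparkR)

# Single-machine Spark; size the driver memory to your available RAM.
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.driver.memory = "6g"))

df <- read.df("file:///data/big_dataset.csv", source = "csv",
              header = "true", inferSchema = "true")

# Keep partitions in memory when possible and spill the rest to local disk.
persist(df, "MEMORY_AND_DISK")

nrow(df)  # forces the data to be read and cached/spilled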
I guess that a 100 Mbit/s network interface will be a bottleneck for HDFS and will slow down HBase on top of it (max compaction speed of about 10 MB/s, etc.). Would this deployment make sense?
I am also thinking that now, when SSDs come into the game, even 1 Gbit/s network interfaces can still be a bottleneck, so maybe building a cluster with 100 Mbit/s should never be considered (even for HDDs)?
To keep it short:
You should never use an SSD in HDFS; this flash memory has a limited number of writes, and HDFS does many writes, mainly because of replication. If you are using HBase as a NoSQL DB, this will result in even more writes.
The bottlenecks are, as you said, the hard disk and the network. The network is an even bigger bottleneck because you are distributing the data, so it has to be replicated, and if you are running jobs, data may be copied over the network when it is not available locally (reducers have to copy a lot of data).
So you should definitely go for a better network than 10 Mbit or 100 Mbit. That applies to your switch and to the NICs on the nodes.
An HDD RAID will not result in higher write bandwidth; there are several benchmarks that prove that. Have a look at the HDFS wiki; it should be described there.
A 100 Mbit network is not likely to be a good setup for a Hadoop cluster; you can see Cisco's presentation from Hadoop World for some analysis of network usage. That said, depending on your actual load and cluster size, it might be workable, though you might want to make sure you actually need Hadoop if that is the case.
Regarding SSDs: they cost more per MB, and depending on your write load you may have to replace them sooner than HDDs, but they will save you electricity. I guess it wouldn't be cost-effective to use them in a large cluster (I don't know of anyone who did).
You can use SSDs for some of the disks, e.g. for the temporary space on the cluster (such as map/reduce intermediate results), to get the I/O benefits.
Whether or not your network will be the bottleneck depends on the kinds of jobs you are running. If you do text processing (e.g. running the Stanford NER or coreference suite), then a 100 Mbit/s network will be the least of your concerns. However, if you are doing a lot of I/O-intensive processing (most jobs with big reduce steps), then it will be. As always, it depends on your workload. But I think it is safe to say that a 100 Mbit network is the most likely culprit for a bottleneck, given recent processors and nodes with several disks.