How to run H2O from Zeppelin and working on data node - h2o

We use H2O over zeppelin(with sparkling water).
Zeppelin works on the edge machine with low resources.
While running H2O from Zeppelin I can see that the Zeppelin interpreter process is doing all the work (I suppose it's the algorithm processing) and takes a lot of resources from the edge node while the data node shows no action.
Is there a way to spread the H2O process to the data node?
Or, in the way H2O is working, can I use spark only to read the data and then process the algorithm on the client?

Related

Apache Storm vs Apache Airflow

What is the difference between the two? As far as I understand both of them are based on the concept of directed acyclic graph (DAG) and Storm processes data in real time and Airflow rather moves the whole thing from one stage to the other. Is it the only difference between them? What kind of jobs are both of them suitable for?
Airflow is an orchestrating engine like Azkaban or Oozie. Storm is a near real-time processing engine which has largely been replaced by Spark Streaming or Heron (Twitter’s 2nd generation replacement for Storm).

Is there any performance difference for ML Training between H2O Multi-node cluster and H2O Spark Cluster based on Sparkling Water?

I am curious about the cluster configuration environment in terms of the ML Training performance of H2O.
If there are three nodes, is there a performance difference between configuring a generic H2O Multi-node Cluster and configuring an H2O Spark Cluster based on Spark?
From our experiments, we conclude that there is no obvious performance difference between the two.
However, many of the H2O documents tell me that H2O Sparkling Water is more effective at ML Training.
Reference
H2O Multi-node Cluster: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/starting-h2o.html#flatfile
Your measured observation is correct.
There is no difference.

Amazon Emr - What is the need of Task nodes when we have Core nodes?

I am learning about Amazon EMR lately, and according to my knowledge the EMR cluster lets us choose 3 nodes.
Master which runs the Primary Hadoop daemons like NameNode,Job Tracker and Resource manager.
Core which runs Datanode and Tasktracker daemons.
Task which only runs TaskTracker only.
My question to you guys in why does EMR provide task nodes? Where as hadoop suggests that we should have Datanode daemon and Tasktracker daemon on the same node. What is Amazon's logic behind doing this? You can keep data in S3 stream it to HDFS on the core nodes, do the processing on HDFS other than sharing data from HDFS to task nodes which will increase IO over head in that case. Because as far as my knowledge in hadoop, TaskTrackers run on DataNodes which have data blocks for that particular task then why have TaskTrackers on different nodes?
According to AWS documentation [1]
The node types in Amazon EMR are as follows:
Master node: A node that manages the cluster by running software
components to coordinate the distribution of data and tasks among
other nodes for processing. The master node tracks the status of tasks
and monitors the health of the cluster. Every cluster has a master
node, and it's possible to create a single-node cluster with only the
master node.
Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your
cluster. Multi-node clusters have at least one core node.
Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
According to AWS documentation [2]
Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors.
Task nodes don't run the Data Node daemon, nor do they store data in HDFS.
Some Use cases are:
You can use Task nodes for processing streams from S3. In this case Network IO won't increase as the used data isn't on HDFS.
Task nodes can be added or removed as no HDFS daemons are running. Hence, no data on task nodes. Core nodes have HDFS daemons running and keep adding and removing new nodes isn't a good practice.
Resources:
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters
[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html#emr-plan-task
One use case is if you use spot instances as task nodes. If its cheap enough, it may be worth while to add some compute power to your EMR cluster. This would be mostly for non-sensitive tasks.
Traditional Hadoop assumes all your workload requires high I/O, with EMR you can choose instance type based on your workload. For high IO needs example up to 100Gbps go with C type or R type, and you can use placement groups. And keep your Core Nodes to Task nodes ratio to 1:5 or lower, this will keep the I/O optimal and if you want higher throughput select C's or R's as your Core and Task. (edited - explaining barely any perf loss with EMR)
Task node's advantage it can scale up/down faster and can minimize compute cost. Traditional Hadoop Cluster it's hard to scale either ways since slaves also part of HDFS.
Task nodes are optional since core nodes can run Map and Reduce.
Core nodes takes longer to scale up/down depending on the tasks hence given the option of Task node for quicker auto scaling.
Reference: https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/
The reason why Hadoop suggest that we should have DataNode and Tasktracker Daemons on the same nodes is because it wants our processing power as close to data as possible.
But there also comes Rack level optimization when you have to deal with multi-nodes cluster. In my point of view AWS reducing I/O overhead by providing task nodes in the same rack in which Datanodes exists.
And the reason to provide Task nodes are that we need more processing over our data than to just focusing on storing them on HDFS. We would always want more TaskTracker than the Daemon nodes. So AWS has provided you the opportunity to increase it using a complete node benefiting RackLevel optimization.
And the way you want to get data into your cluster(using S3 and only core nodes) is a good option if you want good performance but using only a transient cluster.

Ingesting data in elasticsearch from hdfs , cluster setup and usage

I am setting up a spark cluster. I have hdfs data nodes and spark master nodes on same instances.
Current setup is
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
All instances are same, 16gig dual core (unfortunately).
I have 3 more machines, again same specs.
Now I have three options
1. Just deploy es on these 3 machines. The cluster will look like
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
3-elasticsearch nodes
Deploy es master on 1, extend spark and hdfs and es on all other.
Cluster will look like
1-master (spark and hdfs)
1-master elasticsearch
8-spark workers, hdfs data nodes, es data nodes
My application is heavily use spark for joins, ml etc but we are looking for search capabilities. Search we definitely not needed realtime and a refresh interval of upto 30 minutes is even good with us.
At the same time spark cluster has other long running task apart from es indexing.
The solution need not to be one of above, I am open with experimentation if some one suggest. It would be handy for other dev's also once concluded.
Also I am trying with es hadoop, es-spark project but I felt ingestion is very slow if I do 3 dedicated nodes, its like 0.6 million records/minute.
The optimal approach here mostly depends on your network bandwidth and whether or not it's the bottleneck in your operation in my opinion.
I would just check whether my network links are saturated via say
iftop -i any or similar and check if that is the case. If you see data rates close to the physical capacity of your network, then you could try and run hdfs + spark on the same machines that run ES to save the network round trip and speed things up.
If network turns out not to be the bottleneck here, I would look into the way Spark and HDFS are deployed next.
Are your using all the RAM available (Java Xmx set high enough?, Spark memory limits? Yarn memory limits if Spark is deployed via Yarn?)
Also you should check whether ES or Spark is the bottleneck here, in all likelihood it's ES. Maybe you could spawn additional ES instances, 3 ES nodes feeding 6 spark workers seems very sub-optimal.
If anything, I'd probably try to invert that ratio, fewer Spark executors and more ES capacity. ES is likely a lot slower at providing the data than HDFS is at writing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce Spark worker count
Check if your network links are saturated, if so put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
Adding more ES nodes is the better bet of the two things you can do :)

Elasticsearch on Hadoop - Should ES nodes be Colocated with Hadoop DataNodes?

From the Elasticsearch for Hadoop documentation:
Whenever possible, elasticsearch-hadoop shares the Elasticsearch
cluster information with Hadoop to facilitate data co-location. In
practice, this means whenever data is read from Elasticsearch, the
source nodes IPs are passed on to Hadoop to optimize task execution.
If co-location is desired/possible, hosting the Elasticsearch and
Hadoop clusters within the same rack will provide significant network
savings.
Does this mean to say that ideally an Elasticsearch node should be colocated with every DataNode on the Hadoop cluster, or am I misreading this?
You may find this joint presentation by Elasticsearch and Hortonworks useful in answering this question:
http://www.slideshare.net/hortonworks/hortonworks-elastic-searchfinal
You'll note that on slides 33 and 34 they show multiple architectures - one where the ES nodes are co-located on the Hadoop nodes and another where you have separate clusters. The first option clearly gives you the best co-location of data which is very important for managing Hadoop performance. The second approach allows you to tune each separately and scale them independently.
I don't know that you can say one approach is better than the other as there are clearly tradeoffs. Running on the same node clearly minimizes data access latency at the expense of a loss of isolation and ability to tune each cluster separately.

Resources