Run multiple instances of Sparkling Water on the same cluster - h2o

Two concurrent H2OContexts created on the same driver seem to conflict with each other: when one is running, the other throws errors. Can we do some configuration such that two instances of Sparkling Water can run in parallel?

Pass the following arguments on the command line:
--conf spark.ext.h2o.client.port.base=26000
--conf spark.ext.h2o.node.port.base=26005
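The same properties can also be set programmatically before the H2OContext is created. A minimal Scala sketch, assuming a Sparkling Water 2.x-style API (the exact H2OContext.getOrCreate signature varies between releases, and the application name and port values here are just examples):

// Give each Spark application its own, non-overlapping H2O port range
// so that two Sparkling Water clouds can coexist on the same machines.
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext

val spark = SparkSession.builder()
  .appName("sparkling-water-app-1")  // hypothetical application name
  .config("spark.ext.h2o.client.port.base", "26000")
  .config("spark.ext.h2o.node.port.base", "26005")
  .getOrCreate()

val hc = H2OContext.getOrCreate(spark)

The second application would use a different, non-overlapping base port range (for example 27000/27005) so the two H2O clouds never collide.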

Sparkling Water is tied to its Spark cluster. If you want to be able to run multiple Sparkling Water clusters (H2OContexts), then you need to create multiple separate Spark clusters first.

Related

Not all nodes are utilized in Spark job

I am running spark-submit in cluster mode with YARN on Hadoop; this is all stock Apache. I tried the Pi Java example and found that of the 4 Spark slave nodes only one was used to do the actual computation (only one node had a log file and the output for the value of pi).
Trying another Python app, I found that at most two nodes were being used, that is, two out of the four available.
Is this normal behavior, or am I missing something? Thanks in advance.

Spark running on YARN - What does a real-life example's workflow look like?

I have been reading up on Hadoop, YARN, and Spark. What makes sense to me so far is summarized below.
Hadoop MapReduce: The client chooses an input file and hands it off to Hadoop (or YARN). Hadoop takes care of splitting the file based on the user's InputFormat and stores it on as many nodes as are available and configured. The client submits a job (map-reduce) to YARN, which copies the jar to the available data nodes and executes the job. YARN is the orchestrator that takes care of all the scheduling and running of the actual tasks.
Spark: Given a job, input, and a bunch of configuration parameters, it can run your job, which could be a series of transformations, and provide you the output.
I also understand that MapReduce is a batch-based processing paradigm and Spark is better suited for micro-batch or stream-based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how those two come together during an actual workflow. For example, when a client has a job to submit that reads a huge file and does a bunch of transformations, what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100 GB text file. Please include as much detail as possible.
Any help with this would be greatly appreciated.
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (128 MB by default).
That means your 100 GB file will be divided into 800 blocks. Each block is replicated and can be stored on different nodes in the cluster.
When reading the file with a Hadoop InputFormat, a list of splits (with their locations) is obtained first. Then one task is created per split, so you get 800 parallel tasks that are executed by the runtime.
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow when there is a chain of transformations that can be applied at once, as opposed to MapReduce, where there is always only a map and a reduce phase.
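To make the numbers concrete, here is that split arithmetic as a small Scala sketch (the 100 GB comes from the question and 128 MB is the HDFS default mentioned above; the real split size depends on the InputFormat):

// Split arithmetic for the 100 GB input file.
val fileSize  = 100L * 1024 * 1024 * 1024   // 100 GB
val blockSize = 128L * 1024 * 1024          // 128 MB default HDFS block size
val numSplits = math.ceil(fileSize.toDouble / blockSize).toInt
println(numSplits)                          // 800 -> roughly 800 parallel tasks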
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirements (memory, processors) is submitted, it is YARN's responsibility to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too many resources, the new job is made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers, plus one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker can process multiple tasks in parallel (depending on the configured number of cores per executor).
e.g.
If you request 100 Spark executors with 8 cores each, YARN tries to allocate 101 containers in the cluster to run the 100 Spark workers plus 1 Spark master (driver). Each of the workers will process 8 tasks in parallel (because of the 8 cores).
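As a worked sketch of that arithmetic in Scala (the 100-executor and 8-core figures are just the illustration above, not a recommendation):

// Container and parallelism math for the example above.
val numExecutors     = 100
val coresPerExecutor = 8
val containers       = numExecutors + 1                  // 101: 100 workers + 1 driver
val concurrentTasks  = numExecutors * coresPerExecutor   // 800 tasks in flight at once

With the 800 input splits from the earlier answer, such a cluster could process the whole file in a single wave of tasks.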

In Amazon EMR, what is the relationship between a core instance, a mapper, and a map slot?

I am confused about the relationship between core instances and the mappers each instance can have. How are these mappers created? If I set the core instance count to 0, so that only the master node is running, why can MapReduce jobs still run without any task nodes?
Thanks in advance.
The number of cores means how many processors each machine within a given cluster has, and each core can run a mapper.
You don't have to worry about the creation of the mappers, because the Hadoop framework will do it for you.
That's a really good question. My guess is that EMR is smart enough to set up the master node to run the MapReduce jobs in the event that there are no core or task nodes. That's a guess.
If you want to find out whether I'm right, spin up a cluster, then start a MapReduce job while keeping an eye on the Java processes via jps -lm, and see if any mapper processes get launched on the master node.

Comparison between Hadoop Classic and Yarn

I have two clusters, each running a different version of Hadoop. I am working on a POC where I need to understand how YARN provides the capability to run multiple applications simultaneously, which was not possible with the classic MapReduce framework.
Hadoop Classic:
I have a wordcount.jar file and executed it on a single cluster (2 mapper and 2 reducer slots). I started two jobs in parallel; the lucky one that started first got both mappers and completed its task, and then the second job started. This is the expected behavior.
Hadoop Yarn:
Same wordcount.jar with a different cluster (4 cores, so 4 machines in total). As YARN does not pre-assign mapper and reducer slots, any core can be used as a mapper or a reducer. Here too I submitted two jobs in parallel.
Expected behavior: both jobs should start with 2 mappers each, or whatever configuration the resource manager assigns, but at least both jobs should start.
Reality: one job starts with 3 mappers and 1 reducer; the second job waits until the first is completed.
Can someone please help me understand this behavior, and also whether this parallelism is best demonstrated on a multi-node cluster?
Not sure if this is the exact reason, but the classic Hadoop and YARN architectures use different schedulers. Classic Hadoop uses JobQueueTaskScheduler, while YARN uses the CapacityScheduler by default.

Specify the number of map-slots in concurrent hadoop jobs

I have a Hadoop system running. It has altogether 8 map slots available in parallel. The DFS block size is 128 MB.
Now suppose I have two jobs: both of them have large input files, say a hundred GB each. I want them to run in parallel in the Hadoop system. (Because the users do not want to wait; they want to see some progress.) I want the first one to take 5 map slots in parallel and the second one to run on the remaining 3. Is it possible to specify the number of map slots? Currently I use the command line to start it, as hadoop jar jarfile classname input output. Can I specify it on the command line?
Thank you very much for the help.
Resource allocation can be done using a scheduler. Classic Hadoop uses JobQueueTaskScheduler, while YARN uses the CapacityScheduler by default. According to the Hadoop documentation:
This document describes the CapacityScheduler, a pluggable scheduler for Hadoop which allows for multiple-tenants to securely share a large cluster such that their applications are allocated resources in a timely manner under constraints of allocated capacities.
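In practice, splitting capacity roughly 5-to-3 between two concurrent jobs means having an admin define two queues in the scheduler configuration and submitting each job to its queue. A hedged Scala sketch against the Hadoop 2.x Java API (the queue names here are hypothetical; on classic Hadoop 1.x the property is mapred.job.queue.name rather than mapreduce.job.queuename):

// Submit a MapReduce job to a named scheduler queue.
// Assumes an admin has already defined "bigQueue" (~5/8 of capacity)
// and "smallQueue" (~3/8) in the scheduler configuration.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
conf.set("mapreduce.job.queuename", "bigQueue")  // hypothetical queue name
val job = Job.getInstance(conf, "big-input-job")
// ... configure mapper, reducer, input and output paths as usual, then:
// job.waitForCompletion(true)

If the job's main class goes through ToolRunner, the same property can also be passed on the command line, e.g. hadoop jar jarfile classname -Dmapreduce.job.queuename=bigQueue input output.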
