How to estimate the number of Spark executors on a Hortonworks Hadoop cluster? - hadoop

I set up a Hortonworks Hadoop cluster:
Hortonworks version is 2.3.2.
1 NameNode, 1 Secondary NameNode, 10 DataNodes
Spark 1.4.1 is deployed on all data nodes.
YARN is installed.
When I run a Spark program, executors run on only 4 nodes rather than on all data nodes.
How do I estimate the number of Spark executors on such a Hadoop cluster?

The number of executors is whatever you request, not the number of nodes, and by default Spark on YARN requests only a small fixed number (the documentation lists 2 for --num-executors). If you want more, pass --num-executors x on the command line or set spark.executor.instances in the configuration. More details here:
https://spark.apache.org/docs/latest/running-on-yarn.html
Because Spark runs on Hortonworks Hadoop with YARN here, every node that should host executors must run a YARN NodeManager, and the machine you submit from needs the YARN client. Otherwise, executors cannot be scheduled there.
The number of nodes that actually run executors is limited by the smaller of the number of NodeManagers and the num-executors value.
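To make the estimate concrete, here is a rough sketch in Python of how one might bound the number of executors a cluster can host and turn that into spark-submit flags. The node sizes in the usage note are hypothetical, and the overhead defaults (10% of executor memory, at least 384 MB) are assumptions that should be checked against spark.yarn.executor.memoryOverhead for your Spark version. The helper names are made up for illustration. YARN's rounding of container sizes and the container used by the ApplicationMaster are not modelled.

```python
def estimate_executors(nodes, cores_per_node, mem_per_node_mb,
                       executor_cores, executor_mem_mb,
                       overhead_fraction=0.10, min_overhead_mb=384):
    """Rough upper bound on executors the NodeManagers could host.

    cores_per_node and mem_per_node_mb are the resources each NodeManager
    offers to YARN, not the physical totals of the machine.
    The overhead defaults are assumptions, not values taken from the thread.
    """
    # Each executor container needs its heap plus some off-heap overhead.
    overhead_mb = max(min_overhead_mb, int(executor_mem_mb * overhead_fraction))
    container_mb = executor_mem_mb + overhead_mb
    # A node is limited by whichever resource runs out first.
    per_node = min(cores_per_node // executor_cores,
                   mem_per_node_mb // container_mb)
    return nodes * per_node


def submit_flags(num_executors, executor_cores, executor_mem_mb):
    """Build the matching spark-submit flags as an argument list."""
    return ["--num-executors", str(num_executors),
            "--executor-cores", str(executor_cores),
            "--executor-memory", "%dm" % executor_mem_mb]
```

For example, with 10 NodeManagers that each offer 8 vcores and 16384 MB to YARN (hypothetical figures), calling estimate_executors(10, 8, 16384, 2, 4096) gives the bound, and submit_flags turns the chosen number into arguments for spark-submit. In practice you would request somewhat fewer executors than the bound to leave room for the ApplicationMaster and other applications.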

Related

YARN-Specify Which Application to be Run on Which Nodemanager

I have a Hadoop YARN cluster with one resourcemanager and 6 nodemanagers. I want to run both Flink and Spark applications on the cluster, so I have two main questions about YARN:
In the case of Spark, should I install and configure Spark on the resourcemanager and on each nodemanager? When I submit a Spark application to YARN, does a Spark cluster (master and slaves) have to be running in addition to the YARN resourcemanager and nodemanagers?
Can I configure YARN so that Flink runs only on specific nodemanagers?
Thanks
For the first question, that depends on whether you're using a packaged Hadoop distribution (Cloudera CDH or Hortonworks HDP, for example) or not. The distros will likely take care of this. If you're not using a distribution, you need to decide whether to run Spark on YARN or Spark stand-alone. With Spark on YARN, no Spark master or slaves need to be running; YARN launches the executors for you.
For the second question, you can restrict an application to specific Node Managers if you are using the Capacity Scheduler with the node-labelling feature enabled, which requires Hadoop 2.6 or higher.

How to set up Spark on multi-node Hadoop cluster?

I would like to install Hadoop HDFS and Spark on a multi-node cluster.
I was able to successfully install and configure Hadoop on the multi-node cluster. I have also installed and configured Spark on the master node.
Do I have to configure Spark on the slaves as well?
You don't need to. You're done. You have already done more than is required to submit Spark applications to Hadoop YARN (which I assume is your cluster manager).
Spark is a library for distributed computations on massive datasets and as such it belongs solely to your Spark applications (not any cluster you may use).
Time to spark-submit Spark applications!

What is the difference between multi node hadoop cluster and running hadoop on mesos?

I've built a multi-node Hadoop cluster, then I started studying Mesos and the ability to run Hadoop on a Mesos cluster, so here are my questions:
1) Should I run Hadoop on a Mesos cluster, or does it not matter?
2) What is the difference between them?
These are things at different layers. You can deploy a Hadoop cluster directly on a set of machines, so that those machines handle Hadoop jobs.
Or you can deploy a Mesos cluster first, and then deploy a Hadoop cluster, a Spark cluster, Kafka and other things on Mesos, so that you submit your Hadoop jobs to the Hadoop cluster and your Spark jobs to the Spark cluster.

Difference between Hadoop MR1, YARN and MR2?

Can someone please explain the difference between MR1, YARN and MR2?
My understanding is that MR1 has the following components:
NameNode,
Secondary NameNode,
DataNode,
JobTracker,
TaskTracker
and YARN has:
NodeManager,
ResourceManager
Does YARN include MR1 or MR2 (or are MR2 and YARN the same)?
Sorry if this is a basic question.
MRv1 uses the JobTracker to create and assign tasks to TaskTrackers, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager for each cluster, and each data node runs a Node Manager. In MapReduce MRv2, the functions of the JobTracker have been split between three services. The ResourceManager is a persistent YARN service that receives and runs applications (a MapReduce job is an application) on the cluster. It contains the scheduler, which, as previously, is pluggable. The MapReduce-specific capabilities of the JobTracker have been moved into the MapReduce Application Master, one of which is started to manage each MapReduce job and terminated when the job completes. The JobTracker function of serving information about completed jobs has been moved to the JobHistory Server. The TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a host. It is responsible for launching containers, each of which can house a map or reduce task.
YARN is a generic platform for any form of distributed application to run on, while MR2 is one such distributed application that runs the MapReduce framework on top of YARN.

hadoop yarn resource management

I have a Hadoop cluster with 10 nodes. HBase is deployed on 3 of the 10 nodes. There are two applications sharing the cluster.
Application 1 writes data to and reads data from Hadoop HDFS. Application 2 stores data in HBase. Is there a way in YARN to ensure that Hadoop M/R jobs launched by application 1 do not use the slots on the HBase nodes? I want only the HBase M/R jobs launched by application 2 to use the HBase nodes.
This is needed to ensure enough resources are available for application 2 so that the HBase scans are very fast.
Any suggestions on how to achieve this?
If you run HBase and your applications on YARN, the application masters (of HBase itself and of the MR jobs) can request up to the maximum available resources on the data nodes.
Are you aware of the Hortonworks project Hoya (HBase on YARN)?
In particular, one of its features is:
Run MR jobs while maintaining HBase’s low latency SLAs
