Hadoop YARN resource management

I have a Hadoop cluster with 10 nodes. On 3 of the 10 nodes, HBase is deployed. Two applications share the cluster.
Application 1 reads and writes data from/to HDFS. Application 2 stores data in HBase. Is there a way in YARN to ensure that the Hadoop M/R jobs launched
by application 1 do not use the slots on the HBase nodes? I want only the HBase M/R jobs launched by application 2 to use the HBase nodes.
This is needed to ensure enough resources are available for application 2 so that the HBase scans are very fast.
Any suggestions on how to achieve this?

If you run HBase and your applications on YARN, the application masters (of HBase itself and of the MR jobs) can request the available resources on the data nodes.
Are you aware of the Hortonworks project Hoya (HBase on YARN)?
One of its features in particular is:
Run MR jobs while maintaining HBase's low-latency SLAs
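Hoya aside, another route (if your Hadoop release supports it) is to steer application 1's jobs to a scheduler queue and/or YARN node label that excludes the three HBase nodes. Below is a minimal Scala driver sketch under those assumptions: the queue name "app1", the node label "generic" on the seven non-HBase nodes, and the scheduler configuration behind them are hypothetical, not something given in the question.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object App1JobSubmit {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Keep application 1's containers off the HBase nodes by submitting to a
    // queue that the scheduler config restricts to the non-HBase nodes.
    // The queue name "app1" is hypothetical.
    conf.set("mapreduce.job.queuename", "app1")
    // If YARN node labels are enabled and the non-HBase nodes carry a label
    // such as "generic", the job can also be pinned to them explicitly.
    conf.set("mapreduce.job.node-label-expression", "generic")

    val job = Job.getInstance(conf, "application-1-job")
    job.setJarByClass(App1JobSubmit.getClass)
    // Identity map-only job, just to keep the sketch self-contained.
    job.setNumReduceTasks(0)
    job.setOutputKeyClass(classOf[LongWritable])
    job.setOutputValueClass(classOf[Text])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

The data-locality caveat still applies: HBase region servers read HDFS blocks on their own nodes, so keeping other jobs off those nodes protects CPU and memory for the scans but does not change where the HBase data lives.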

Related

Writing to a Hadoop cluster while it is busy running MapReduce jobs

I know that Hadoop has the Fair Scheduler, where we can assign a job to a priority group and the cluster resources are allocated to the job based on priority. What I am not sure about, and what I am asking, is how a non-MapReduce program is prioritized by the Hadoop cluster. Specifically, how would writes to Hadoop through external clients (say, a standalone program that directly opens an HDFS file and streams data to it) be prioritized by Hadoop when the cluster is busy running MapReduce jobs?
The ResourceManager can only prioritize jobs submitted to it (such as MapReduce applications, Spark jobs, etc.).
Other than distcp, HDFS operations only interact with the NameNode and DataNodes, not the ResourceManager, so they are handled by the NameNode in the order they are received.
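To make the distinction concrete, here is a minimal sketch of the kind of external client described above, written in Scala against the Hadoop FileSystem API; the NameNode address and the output path are placeholders. A write like this talks only to the NameNode and DataNodes, so no YARN queue or priority ever applies to it.

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object StandaloneHdfsWriter {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // fs.defaultFS points at the NameNode; the ResourceManager is never contacted.
    conf.set("fs.defaultFS", "hdfs://namenode:8020") // placeholder address

    val fs = FileSystem.get(conf)
    // create() asks the NameNode where to place blocks, then streams the bytes
    // straight to the DataNodes -- no scheduling or prioritization by YARN.
    val out = fs.create(new Path("/tmp/external-client/sample.txt"))
    try {
      out.write("data streamed from an external client\n".getBytes(StandardCharsets.UTF_8))
    } finally {
      out.close()
      fs.close()
    }
  }
}
```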

Spark cluster - read/write on Hadoop

I would like to read data from Hadoop, process it in Spark, and write the result to Hadoop and Elasticsearch. I have a few worker nodes to do this.
Is a Spark standalone cluster sufficient, or do I need to make the Hadoop cluster use YARN or Mesos?
If standalone cluster mode is sufficient, should the jar file be set on all nodes, unlike in YARN or Mesos mode?
First of all, strictly speaking you do not write data to or read data from "Hadoop". It is HDFS (a component of the Hadoop ecosystem) that is responsible for reading and writing data.
Now coming to your questions:
Yes, it is possible to read data from HDFS, process it in the Spark engine, and then write the output to HDFS.
YARN, Mesos and Spark standalone are all cluster managers. You can use any one of them to manage the resources in your cluster, and that choice has nothing to do with Hadoop itself. But since you want to read and write data from/to HDFS, you need HDFS on the cluster, so it is simplest to install Hadoop on all of your nodes, which also installs HDFS on all nodes. Whether you then use YARN, Mesos or Spark standalone is your choice; all of them work with HDFS. I myself use Spark standalone for cluster management.
It is not clear which jar files you are talking about, but I assume they are Spark's. If so, then yes, you need to set the path to the Spark jars on each node so that there is no contradiction in paths when Spark runs.
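For what it is worth, a minimal Scala sketch of that read-process-write flow could look like the following; the HDFS paths and NameNode address are placeholders, and the Elasticsearch step is only noted as a comment because it depends on the elasticsearch-hadoop connector rather than on Spark itself. The master URL is set at submit time, so the same code runs under standalone, YARN or Mesos.

```scala
import org.apache.spark.sql.SparkSession

object HdfsRoundTrip {
  def main(args: Array[String]): Unit = {
    // The cluster manager (standalone, YARN, Mesos) is chosen via --master
    // at spark-submit time; nothing in this code depends on it.
    val spark = SparkSession.builder()
      .appName("hdfs-round-trip")
      .getOrCreate()

    // Read text data from HDFS (placeholder path).
    val lines = spark.read.textFile("hdfs://namenode:8020/input/logs")

    // Simple processing step: keep only error lines.
    val errors = lines.filter(_.contains("ERROR"))

    // Write the processed result back to HDFS (placeholder path).
    errors.write.text("hdfs://namenode:8020/output/errors")

    // Writing the same result to Elasticsearch would additionally require the
    // elasticsearch-hadoop (elasticsearch-spark) connector on the classpath.
    spark.stop()
  }
}
```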

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?

If I use the Spark standalone cluster manager and my data is distributed across an HDFS cluster, how will Spark know that the data is located locally on the nodes?
YARN is a resource manager. It deals with memory and processes, not with the workings of HDFS or data locality.
Since Spark can read from HDFS sources, and the NameNodes and DataNodes take care of all of that HDFS block data management outside of YARN, I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of integrating Spark into YARN?
Standalone mode has its own cluster manager/resource manager, which talks to the NameNode to find out where the blocks live; the client/driver then places tasks based on those results.
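As a small illustration of that point, the sketch below runs a plain HDFS read against a standalone master; the master URL and the path are placeholders. Spark records the block locations reported by the NameNode as preferred locations for each partition, and the locality level actually achieved (NODE_LOCAL, RACK_LOCAL, ANY) can be checked per task in the Spark UI.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    // Standalone master URL and HDFS path are placeholders.
    val conf = new SparkConf()
      .setAppName("locality-check")
      .setMaster("spark://standalone-master:7077")
    val sc = new SparkContext(conf)

    // Spark asks the NameNode for the block locations of this file and uses
    // them as preferred locations; if an executor runs on one of those hosts,
    // the corresponding task is scheduled NODE_LOCAL there.
    val rdd = sc.textFile("hdfs://namenode:8020/input/logs")
    println(s"partitions: ${rdd.getNumPartitions}")
    println(s"lines: ${rdd.count()}") // locality levels show up in the Spark UI

    sc.stop()
  }
}
```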

What is the difference between a mapreduce application and a yarn application?

A cluster which runs MapReduce 2 doesn't have a JobTracker; instead it is split into two separate components, the ResourceManager and the per-application ApplicationMaster. However, these details are transparent to the user, who doesn't need to know whether the cluster is running MapReduce 1 or 2 when submitting a MapReduce job.
The thing I cannot quite understand is a YARN application. How is it different from a regular MapReduce application? What is the advantage of running a MapReduce job as a YARN application? Could someone shed some light on that for me?
MR1 has the JobTracker and TaskTrackers, which take care of MapReduce applications.
In MR2, Apache separated the management of the map/reduce processes from the cluster's resource management by using YARN. YARN is a better resource manager than what we had in MR1, and it also enables versatility. MR2 is built on top of YARN.
Apart from MapReduce, we can run applications like Spark, Storm, HBase, Tez, etc. on top of YARN, which we cannot do with MR1.
The following is the architecture for MR1 and MR2:
MR1: HDFS <---> MR
MR2: HDFS <---> YARN <---> MR

Understanding Spark alongside Hadoop

In the setup I have, both Hadoop and Spark are running on the same network but on different nodes. We can run Spark alongside an existing Hadoop cluster by just launching it as a separate service. Will it show any improvement in performance?
I have thousands of files, around 10 GB, loaded into HDFS.
I have 8 nodes for Hadoop, and 1 master and 5 workers for Spark.
As long as the Spark workers run on the same nodes as the HDFS DataNodes, you get the advantage of data locality. You can launch the Spark service alongside Hadoop on those nodes as well.
