I am working on a solution providing run-time ressources addition to an Hadoop Yarn cluster. The purpose is to handle heavy peaks on our application.
I am not an expert and I need help in order to approve / contest what I understand.
Hadoop YARN
This application an run in a cluster-mode. It provides ressource management (CPU & RAM).
A spark a application, for example, ask for a job to be done. Yarn handles the request and provides an executor computing on the Yarn cluster.
HDFS - Data & Executors
The datas are not shared through executors, so they have to be stored in a file System. In my case : HDFS. That means I will have to run a copy of my spark streaming application in the new server (hadoop node).
I am not sure of this :
The yarn cluster and HDFS are different, writing on HDFS won't write on the new hadoop node local data (because it is not an HDFS node).
As I will only write on HDFS new data from a spark streaming application, creating a new application should not be a problem.
Submit the job to yarn
--- peak, resources needed
Instance new server
Install / configure Hadoop & YARN, making it a slave
Modifying hadoop/conf/slaves, adding it's ip adress (or dns name from host file)
Moddifying dfs.include and mapred.include
On host machine :
yarn -refreshNodes
bin/hadoop dfsadmin -refreshNodes
bin/hadoop mradmin -refreshNodes
Should this work ? refreshQueues sounds not really useful here as it seems to only take care of the process queue.
I am not sure if the running job will increase it's capacity. Another idea is to wait for the new ressources to be available and submit a new job.
Thanks for you help
Related
I know that Hadoop has the Fair Scheduler, where we can assign a job to some priority group and the cluster resources are allocated to the job based on priority. What I am not sure and what I ask is how a non map-red program is prioritized by the Hadoop cluster. Specifically how do the writes to Hadoop through external clients (say some standalone program which is directly opening HDFS file and streaming data to it) would be prioritized by Hadoop when the cluster is busy running map-red jobs.
The Resource Manager only can prioritize jobs submitted to it (such as MapReduce applications, Spark jobs, etc ...).
Other than distcp, HDFS operations only interact with the NameNode and Datanodes not the Resource Manager so they would be handled by the NameNode in the order they're received.
I would like to read data from hadoop, process on spark, and wirte result on hadoop and elastic search. I have few worker nodes to do this.
Spark standalone cluster is sufficient? or Do I need to make hadoop cluster to use yarn or mesos?
If standalone cluster mode is sufficient, should jar file be set on all node unlike yarn, mesos mode?
First of all, you can not write data in Hadoop or read data from Hadoop. It is HDFS (Component of Hadoop ecosystem) which is responsible for read/write of data.
Now coming to your question
Yes, it possible to read data from HDFS and process it in spark engine and then write the output on HDFS.
YARN, mesos and spark standalone all are cluster managers and you can use any one of them to do management of resources in your cluster and it had nothing to do with hadoop. But since you want to read and write data from/to HDFS then you need to install HDFS on cluster and thus it is better to install hadoop on your all nodes that will also install HDFS on all nodes. Now whether you want to use YARN, mesos or spark standalone that is your choice all will work with HDFS I myself use spark standalone for cluster management.
It is not clear about which jar files you are talking to but I assume it will be of spark then yes you need to set the path for spark jar on each node so that there will be no contradiction in paths when spark run's.
Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?
If I use Spark standalone cluster manager and have my data distributed in HDFS cluster, how will Spark know that data is located locally on the nodes?
YARN is a resource manager. It deals with memory and processes, and not with the workings of HDFS or data-locality.
Since Spark can read from HDFS sources, and the namenodes & datanodes take care of all that HDFS block data management outside of YARN, then I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of integrating Spark into YARN?
the stand alone mode has its own cluster manager/resource manager which will talk to name node for locality. client/driver will put tasks based on the results.
Well I am new to Apache Storm and after some search and read tutorials, I didn't get that how fault tolerance, load balancing and other resource manager duties takes place in Storm cluster? Should it be configured on top of YARN or it doest the resource management job itself? Does it have its HDFS part, or there should be an existing HDFS configured in a cluster first?
Storm can manage its resources by itself or can run on top of YARN. If you have a shared cluster (ie, with other system like Hadoop, Spark, or Flink running), using YARN should be the better choice to avoid resource conflicts.
About HDFS: Storm is independent of HDFS. If you want to run in on top of HDFS, you need to setup HDFS by yourself. Furthermore, Storm provides Spouts/Bolt to access HDFS: https://storm.apache.org/documentation/storm-hdfs.html
I have a Hadoop cluster with 10 nodes. Out of the 10 nodes, on 3 of them, HBase is deployed. There are two applications sharing the cluster.
Application 1 writes and reads data from hadoop HDFs. Application 2 stores data into HBase. Is there a way in yarn to ensure that hadoop M/R jobs launched
by application 1 do not use the slots on Hbase nodes? I want only the Hbase M/R jobs launched by application 2 to use the HBase nodes.
This is needed to ensure enough resources are available for application 2 so that the HBase scans are very fast.
Any suggestions on how to achieve this?
if you run HBase and your applications on Yarn, the application masters (of HBase itself and the MR Jobs) can request the maximum of available resources on the data nodes.
Are you aware of the hortonworks project Hoya = HBase on Yarn ?
Especially one of the features is:
Run MR jobs while maintaining HBase’s low latency SLAs