Do I need to use Spark with YARN to achieve NODE_LOCAL data locality with HDFS?

Do I need to use Spark with YARN to achieve NODE_LOCAL data locality with HDFS?
If I use Spark standalone cluster manager and have my data distributed in HDFS cluster, how will Spark know that data is located locally on the nodes?

YARN is a resource manager. It deals with memory and processes, not with the workings of HDFS or data locality.
Since Spark can read from HDFS sources, and the NameNodes and DataNodes take care of all the HDFS block data management outside of YARN, I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of integrating Spark into YARN?

The standalone mode has its own cluster manager/resource manager, which will talk to the NameNode for locality. The client/driver will then place tasks based on the results.
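For illustration, here is a minimal sketch of what that looks like in practice, assuming the Spark workers are colocated with the HDFS DataNodes; the master URL, NameNode host and file path are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    // Standalone cluster manager: no YARN involved. Host names are placeholders.
    val spark = SparkSession.builder()
      .appName("locality-check")
      .master("spark://spark-master:7077")
      .getOrCreate()

    // The HDFS client asks the NameNode for block locations, and Spark uses
    // them as preferred locations when scheduling tasks on the workers.
    val lines = spark.sparkContext.textFile("hdfs://namenode:8020/data/events.log")
    println(lines.count())

    // The "Locality Level" column on the Stages page of the Spark UI shows
    // whether tasks actually ran NODE_LOCAL, RACK_LOCAL or ANY.
    spark.stop()
  }
}
```

NODE_LOCAL only happens when a worker actually runs on the same host as a DataNode holding the block; otherwise tasks fall back to RACK_LOCAL or ANY.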

Related

Repartitioning in Hadoop Distributed File System (HDFS)

Is there a way to repartition data directly in HDFS? If you notice that your partitions are unbalanced (one or more is much bigger than the others), how can you deal with it?
I know that it could be done, e.g., in Apache Spark, but running a job just to repartition seems like overhead, or maybe it is a good idea?
Run the HDFS balancer, a tool that distributes HDFS blocks evenly across DataNodes.
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#balancer
In case you are running a Cloudera Manager- or Ambari-managed distribution, you can run the HDFS balancer from their web UI.
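Note that the balancer evens out block placement across DataNodes; if the goal is instead to even out the size of the files/partitions themselves, the Spark route mentioned in the question looks roughly like the sketch below. Paths and the target partition count are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object RewritePartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rewrite-partitions").getOrCreate()

    // Read the skewed dataset and write it back with a fixed number of
    // partitions, producing roughly evenly sized output files. The original
    // directory can then be swapped for the rebalanced one.
    val df = spark.read.parquet("hdfs:///warehouse/events")
    df.repartition(200)
      .write
      .mode("overwrite")
      .parquet("hdfs:///warehouse/events_rebalanced")

    spark.stop()
  }
}
```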

Spark cluster - read/write on hadoop

I would like to read data from Hadoop, process it on Spark, and write the result to Hadoop and Elasticsearch. I have a few worker nodes to do this.
Is a Spark standalone cluster sufficient, or do I need to make a Hadoop cluster to use YARN or Mesos?
If standalone cluster mode is sufficient, should the jar file be placed on all nodes, unlike in YARN or Mesos mode?
First of all, you cannot write data to or read data from "Hadoop" itself. It is HDFS (a component of the Hadoop ecosystem) that is responsible for reading and writing data.
Now coming to your questions:
Yes, it is possible to read data from HDFS, process it in the Spark engine, and then write the output to HDFS.
YARN, Mesos and Spark standalone are all cluster managers; you can use any one of them to manage the resources in your cluster, and that choice has nothing to do with Hadoop. But since you want to read and write data from/to HDFS, you need HDFS installed on the cluster, so it is better to install Hadoop on all your nodes, which will also give you HDFS on all nodes. Whether you use YARN, Mesos or Spark standalone is your choice; all of them will work with HDFS. I myself use Spark standalone for cluster management.
It is not clear which jar files you are talking about, but I assume you mean Spark's; then yes, you need to set the path to the Spark jars on each node so that there is no contradiction in paths when Spark runs.
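As a rough sketch of the HDFS-in, HDFS-and-Elasticsearch-out flow, assuming the elasticsearch-hadoop (elasticsearch-spark) connector is on the classpath (for example added via spark-submit --packages); host names, paths, column names and the index name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object HdfsToEs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hdfs-to-es")
      .config("es.nodes", "es-node1:9200")   // Elasticsearch hosts for the connector
      .getOrCreate()

    // Read input from HDFS and do some processing with Spark.
    val events = spark.read.json("hdfs:///input/events")
    val summary = events.groupBy("user").count()

    // Write the result back to HDFS ...
    summary.write.mode("overwrite").parquet("hdfs:///output/summary")

    // ... and also to Elasticsearch through the connector's DataFrame source.
    summary.write
      .format("org.elasticsearch.spark.sql")
      .option("es.resource", "summary/doc")
      .mode("append")
      .save()

    spark.stop()
  }
}
```

The same application jar works under standalone, YARN or Mesos; with spark-submit the jar and its --packages dependencies are shipped to the executors, so you do not have to copy your application jar to every node by hand.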

Does Apache Storm do the resource management job in the cluster by itself?

Well, I am new to Apache Storm, and after some searching and reading of tutorials I didn't get how fault tolerance, load balancing and other resource-manager duties take place in a Storm cluster. Should it be configured on top of YARN, or does it do the resource management job itself? Does it have its own HDFS part, or should there be an existing HDFS configured in the cluster first?
Storm can manage its resources by itself or can run on top of YARN. If you have a shared cluster (i.e., with other systems like Hadoop, Spark, or Flink running), using YARN should be the better choice to avoid resource conflicts.
About HDFS: Storm is independent of HDFS. If you want to run it on top of HDFS, you need to set up HDFS yourself. Furthermore, Storm provides Spouts/Bolts to access HDFS: https://storm.apache.org/documentation/storm-hdfs.html
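For reference, a minimal storm-hdfs bolt configuration along the lines of the linked documentation might look like this (written here in Scala against the Java API, for consistency with the other sketches; the HDFS URL, output path, delimiter and policies are placeholders):

```scala
import org.apache.storm.hdfs.bolt.HdfsBolt
import org.apache.storm.hdfs.bolt.format.{DefaultFileNameFormat, DelimitedRecordFormat}
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy

object HdfsBoltConfig {
  // Writes each tuple as a '|'-delimited line to HDFS, syncing the file every
  // 1000 tuples and rolling to a new file every 5 MB.
  val hdfsBolt: HdfsBolt = new HdfsBolt()
    .withFsUrl("hdfs://namenode:8020")
    .withFileNameFormat(new DefaultFileNameFormat().withPath("/storm/output/"))
    .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter("|"))
    .withSyncPolicy(new CountSyncPolicy(1000))
    .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB))
  // The bolt is then wired into a topology with TopologyBuilder.setBolt(...).
}
```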

Hadoop YARN resource management

I have a Hadoop cluster with 10 nodes. Out of the 10 nodes, on 3 of them, HBase is deployed. There are two applications sharing the cluster.
Application 1 writes and reads data from Hadoop HDFS. Application 2 stores data in HBase. Is there a way in YARN to ensure that Hadoop M/R jobs launched
by application 1 do not use the slots on the HBase nodes? I want only the HBase M/R jobs launched by application 2 to use the HBase nodes.
This is needed to ensure enough resources are available for application 2 so that the HBase scans are very fast.
Any suggestions on how to achieve this?
If you run HBase and your applications on YARN, the ApplicationMasters (of HBase itself and of the MR jobs) can request the maximum of the available resources on the data nodes.
Are you aware of the Hortonworks project Hoya (HBase on YARN)?
In particular, one of its features is:
Run MR jobs while maintaining HBase’s low latency SLAs

How to configure and use multiple master nodes in a Hadoop cluster?

Could anyone please tell us how to configure and use multiple master nodes in a Hadoop cluster?
If you are looking for multiple NameNodes, then check HDFS High Availability and HDFS Federation. Both are available in the 2.x Hadoop releases.
One more master in the 1.x Hadoop releases is the JobTracker, and there can be only one JobTracker in a cluster. BTW, the JobTracker functionality has been split up in the 2.x Hadoop releases. Check this for more details.
There might be some other alternative options as well, but it depends on the requirement for having multiple masters. Is it availability, scalability or something else?
