What is the difference between a multi-node Hadoop cluster and running Hadoop on Mesos?

I've built a multi-node Hadoop cluster, and then I started studying Mesos and the ability to run Hadoop on a Mesos cluster, so here are my questions:
1) Should I run Hadoop on a Mesos cluster, or does it not matter?
2) What is the difference between the two?

They are different things at different layers. You can deploy a Hadoop cluster directly on a set of machines, so those machines handle Hadoop jobs themselves.
Or you can deploy a Mesos cluster first, and then deploy a Hadoop cluster, a Spark cluster, Kafka and other frameworks on top of Mesos. You then submit your Hadoop jobs to the Hadoop cluster and your Spark jobs to the Spark cluster, while Mesos manages the machines' resources underneath. A rough sketch of the two layouts is below.
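As a minimal illustration (the host names and paths are placeholders, and the exact start-up commands depend on your installation):

    # Option 1: Hadoop deployed directly on the machines
    # (HDFS and YARN daemons run straight on each host)
    $HADOOP_HOME/sbin/start-dfs.sh       # NameNode + DataNodes
    $HADOOP_HOME/sbin/start-yarn.sh      # ResourceManager + NodeManagers

    # Option 2: Mesos deployed first, frameworks on top
    mesos-master --work_dir=/var/lib/mesos --ip=master-host            # on the master
    mesos-agent  --work_dir=/var/lib/mesos --master=master-host:5050   # on each worker
    # Hadoop, Spark, Kafka etc. are then started as Mesos frameworks,
    # e.g. a Spark job points at Mesos instead of YARN:
    spark-submit --master mesos://master-host:5050 /path/to/my-app.jar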

Related

YARN - Specify Which Application Runs on Which NodeManager

I have a Hadoop YARN cluster consisting of one ResourceManager and 6 NodeManagers. I want to run both Flink and Spark applications on the cluster, so I have two main questions about YARN:
In the case of Spark, should I install and configure Spark on the ResourceManager and on each NodeManager? When I want to submit a Spark application on YARN, does a separate Spark cluster (master and slaves) have to be running in addition to the YARN ResourceManager and NodeManagers?
Can I configure YARN so that Flink runs only on certain NodeManagers?
Thanks
For the first question, it depends on whether you're using a packaged Hadoop distribution (such as Cloudera CDH or Hortonworks HDP) or not. The distros will likely take care of this for you. If you're not using a distribution, you need to decide whether you want to run Spark on YARN or Spark standalone.
For the second question, you can pin applications to specific NodeManagers if you use the Capacity Scheduler with the node-labelling feature enabled, which requires Hadoop 2.6 or higher. A rough sketch follows.
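A minimal sketch of node labels, assuming Hadoop 2.6+ with the Capacity Scheduler (the label, host names and queue name are placeholders):

    # create a cluster-level node label and attach it to the chosen NodeManagers
    yarn rmadmin -addToClusterNodeLabels "flink"
    yarn rmadmin -replaceLabelsOnNode "nm-host1=flink nm-host2=flink"

    # then, in capacity-scheduler.xml, make a queue accessible to that label, e.g.
    #   yarn.scheduler.capacity.root.flinkq.accessible-node-labels = flink
    #   yarn.scheduler.capacity.root.flinkq.accessible-node-labels.flink.capacity = 100
    # and submit your Flink jobs to that queue so they run only on the labelled nodes.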

How to set up Spark on a multi-node Hadoop cluster?

I would like to install Hadoop HDFS and Spark on a multi-node cluster.
I was able to successfully install and configure Hadoop on the multi-node cluster, and I have also installed and configured Spark on the master node.
Do I have to configure Spark on the slaves as well?
You should not; you're done. You actually did more than you had to in order to submit Spark applications to Hadoop YARN (which I take to be the cluster manager here).
Spark is a library for distributed computation on massive datasets, and as such it belongs solely to your Spark applications, not to any cluster you may use.
Time to spark-submit your Spark applications!
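For instance, a submission against YARN might look like this (the class name, jar path and resource sizes are placeholders):

    # run the driver inside the YARN cluster; class and jar are placeholders
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --executor-memory 2g \
      --num-executors 4 \
      /path/to/my-app.jar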

Spark cluster - read/write on Hadoop

I would like to read data from Hadoop, process it in Spark, and write the results to Hadoop and Elasticsearch. I have a few worker nodes to do this.
Is a Spark standalone cluster sufficient? Or do I need to make a Hadoop cluster with YARN or Mesos?
If standalone cluster mode is sufficient, should the jar file be placed on all nodes, unlike in YARN or Mesos mode?
First of all, you cannot really write data "to Hadoop" or read data "from Hadoop": it is HDFS (a component of the Hadoop ecosystem) that is responsible for reading and writing data.
Now, coming to your questions:
Yes, it is possible to read data from HDFS, process it in the Spark engine, and then write the output back to HDFS.
YARN, Mesos and Spark standalone are all cluster managers; you can use any one of them to manage the resources of your cluster, and that choice has nothing to do with Hadoop itself. But since you want to read and write data from/to HDFS, you need HDFS installed on the cluster, so it is simplest to install Hadoop on all your nodes, which also puts HDFS on all of them. Whether you then use YARN, Mesos or Spark standalone is up to you; all of them work with HDFS (I use Spark standalone for cluster management myself).
It is not clear which jar files you mean, but assuming you mean your Spark application jar: yes, you need the jar available at the same path on each node so there is no mismatch in paths when Spark runs. An example submission is sketched below.
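For example, a standalone-mode submission reading from and writing to HDFS, shipping the Elasticsearch connector alongside the application jar, might look like this (host names, paths and the connector jar are placeholders):

    # master URL, HDFS paths and jar locations below are placeholders
    spark-submit \
      --master spark://spark-master:7077 \
      --jars /opt/jars/elasticsearch-hadoop-5.0.0.jar \
      --class com.example.EtlJob \
      /opt/jars/etl-job.jar \
      hdfs://namenode:8020/data/input \
      hdfs://namenode:8020/data/output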

How to estimate the number of Spark executors on a Hortonworks Hadoop cluster?

I set up a Hortonworks Hadoop cluster:
Hortonworks version is 2.3.2.
1 NameNode, 1 Secondary NameNode, 10 DataNode
Spark 1.4.1 is deployed on all DataNodes.
YARN is installed.
When I run a Spark program, executors run on only 4 nodes rather than on all the DataNodes.
How do I estimate the number of Spark executors on such a Hadoop cluster?
The number of executors you request is 4 by default. If you want to request more, you have to pass the --num-executors x flag on the command line or set spark.executor.instances in the configuration. More details here:
https://spark.apache.org/docs/latest/running-on-yarn.html
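For example (the class, jar and executor count are placeholders):

    # ask YARN for 10 executors explicitly on the command line...
    spark-submit --master yarn --num-executors 10 --class com.example.MyApp /path/to/my-app.jar
    # ...or set the equivalent property, e.g. in spark-defaults.conf:
    #   spark.executor.instances   10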
Because Spark here runs on Hortonworks Hadoop with YARN, every node that should run Spark executors must also have a YARN NodeManager deployed; otherwise no executors can be scheduled on it.
The actual number of executors is the minimum of the number of NodeManagers and num-executors.

Can an Oozie instance run jobs on multiple Hadoop clusters at the same time?

I have a development Hadoop cluster available for running test jobs as well as a production cluster. My question is: can I use Oozie to kick off workflow jobs on multiple clusters from a single Oozie instance?
What are the gotchas? I'm assuming I can just reconfigure the job tracker, namenode and fs location properties for my workflow depending on which cluster I want the job to run on.
Assuming the clusters are all running the same distribution and version of Hadoop, you should be able to.
As you note, you'll need to adjust the jobtracker and namenode values in your Oozie actions, along the lines of the sketch below.
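A minimal sketch, assuming the common convention of parameterising the workflow with ${jobTracker} and ${nameNode} and keeping one properties file per cluster (host names and ports are placeholders):

    # job-dev.properties (placeholder hosts)
    #   nameNode=hdfs://dev-namenode:8020
    #   jobTracker=dev-resourcemanager:8032
    # job-prod.properties
    #   nameNode=hdfs://prod-namenode:8020
    #   jobTracker=prod-resourcemanager:8032

    # submit the same workflow to either cluster from the one Oozie instance
    oozie job -oozie http://oozie-host:11000/oozie -config job-dev.properties -run
    oozie job -oozie http://oozie-host:11000/oozie -config job-prod.properties -run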
