How to set up Spark on multi-node Hadoop cluster? - hadoop

I would like to install Hadoop HDFS and Spark on a multi-node cluster.
I was able to install and configure Hadoop on the multi-node cluster successfully. I have also installed and configured Spark on the master node.
Do I have to configure Spark on the slaves as well?

Do I have to configure Spark on the slaves as well?
No, you don't. You're done. You have already done more than was needed to submit Spark applications to Hadoop YARN (which I assume is your cluster manager).
Spark is a library for distributed computations on massive datasets, and as such it belongs solely to your Spark applications (not to any cluster you may use).
Time to spark-submit Spark applications!
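For illustration, here is a minimal sketch of what a submission to YARN from the master node might look like, wrapped in Python so the pieces are easy to see. The jar path, main class, and resource sizes are hypothetical placeholders; the flags themselves are standard spark-submit options. It assumes HADOOP_CONF_DIR (or YARN_CONF_DIR) points at your Hadoop configuration so that spark-submit can find the ResourceManager.

```python
import shlex


def build_spark_submit(app_jar, main_class, num_executors=2, executor_memory="2g"):
    """Return the argv list for submitting a Spark application to YARN.

    app_jar and main_class are placeholders for your own application.
    """
    return [
        "spark-submit",
        "--master", "yarn",            # let YARN act as the cluster manager
        "--deploy-mode", "cluster",    # run the driver inside the cluster
        "--class", main_class,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        app_jar,
    ]


if __name__ == "__main__":
    # Hypothetical jar and class names, for illustration only.
    cmd = build_spark_submit("/path/to/my-app.jar", "com.example.MyApp")
    print(shlex.join(cmd))
```

The printed command can be pasted into a shell on the node where Spark is installed, or the list can be passed to subprocess.run.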

Related

YARN-Specify Which Application to be Run on Which Nodemanager

I have a Hadoop YARN cluster consisting of one ResourceManager and 6 NodeManagers. I want to run both Flink and Spark applications on the cluster, so I have two main questions about YARN:
In the case of Spark, should I install and configure Spark on the ResourceManager and on each NodeManager? When I submit a Spark application to YARN, does a Spark cluster (master and slaves) have to be running in addition to the YARN ResourceManager and NodeManagers?
Can I configure YARN so that Flink runs only on certain NodeManagers?
Thanks
For the first question, that depends on whether you're using a packaged Hadoop distribution (such as Cloudera CDH or Hortonworks HDP) or not. The distributions will likely take care of this. If you're not using a distribution, you need to decide whether you want to run Spark on YARN or Spark standalone.
For the second question, you can target specific NodeManagers if you are using the Capacity Scheduler with the node-labelling feature enabled, on Hadoop 2.6 or higher.
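As a rough sketch of how an application is tied to a node label, here is how the relevant settings could be generated for a Spark application on YARN. The label name is hypothetical and must already exist in the cluster and be assigned to the chosen NodeManagers; the two property names are Spark-on-YARN configuration properties. Flink would need its own equivalent setting, which is not shown here.

```python
def node_label_conf(label):
    """Return spark-submit --conf arguments that request a YARN node label.

    The label must already be defined in YARN and attached to nodes;
    this only builds the arguments, it does not create the label.
    """
    return [
        # Where the application master should run.
        "--conf", f"spark.yarn.am.nodeLabelExpression={label}",
        # Where the executors should run.
        "--conf", f"spark.yarn.executor.nodeLabelExpression={label}",
    ]


if __name__ == "__main__":
    # "spark_nodes" is a made-up label name for illustration.
    print(" ".join(node_label_conf("spark_nodes")))
```

These arguments would be added to an ordinary spark-submit command that uses --master yarn.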

monitoring spark cluster standalone mode with ganglia

I have installed Spark 2.0.2 pre-built for Hadoop 2.4 and later from here: https://spark.apache.org/downloads.html . Then I created a cluster of 1 master and 2 workers, and installed Ganglia on the 3 machines (gmetad and gmond on the master, gmond only on the workers). I need to monitor the Spark cluster's CPU, memory and disk usage while running a Spark application, to measure the performance of my cluster.
My question is how to integrate Ganglia with Spark: how do I see Spark metrics in the Ganglia web UI? I know that we must configure the metrics.properties file in $SPARK_HOME/conf to set up the Ganglia sinks. I did this, but I learned here that the LGPL package is required and is not included by default. How do I install it when I have a pre-built Spark? Should I rebuild Spark? How do I do that?
In the two links below, the Spark being used was built with mvn or sbt, which is not the same as what I have used (pre-built Spark):
Spark Monitoring with Ganglia and
How to integrate Ganglia for Spark 2.1 Job metrics, Spark ignoring Ganglia metrics
Thank you

Does Cloudera run on a cluster of machines with Hadoop?

I want to ask about using "Cloudera" to run Hadoop MapReduce.
Does Cloudera support running an application over a cluster of machines?
What does Cloudera actually offer when I use it in a virtual machine to run a Hadoop MapReduce application?
If Cloudera doesn't support running over a cluster of machines, how can I run a Hadoop app on a cluster of machines?
I am really confused about this :(
Thanks in advance.

Spark cluster - read/write on hadoop

I would like to read data from Hadoop, process it in Spark, and write the result to Hadoop and Elasticsearch. I have a few worker nodes to do this.
Is a Spark standalone cluster sufficient, or do I need to set up a Hadoop cluster to use YARN or Mesos?
If standalone cluster mode is sufficient, does the jar file have to be placed on every node, unlike in YARN or Mesos mode?
First, a point of terminology: you don't read from or write to Hadoop as such. HDFS, the storage component of the Hadoop ecosystem, is what handles reading and writing data.
Now, coming to your question:
Yes, it is possible to read data from HDFS, process it in the Spark engine, and then write the output to HDFS.
YARN, Mesos and Spark standalone are all cluster managers, and you can use any of them to manage the resources in your cluster. Mesos and Spark standalone are independent of Hadoop, while YARN ships as part of Hadoop. Since you want to read from and write to HDFS, you need HDFS on the cluster, so it is simplest to install Hadoop on all of your nodes, which also installs HDFS on them. Whether you then use YARN, Mesos or Spark standalone is your choice; all of them work with HDFS. I use Spark standalone for cluster management myself.
It is not clear which jar files you are talking about, but I assume you mean the Spark jars. In that case, yes: the Spark jars need to be available at the same path on each node, so that paths resolve consistently when Spark runs.
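To make the HDFS round trip concrete, here is a minimal PySpark sketch, assuming PySpark is installed and a standalone master is reachable. The host names, paths, and NameNode port (8020 is a common default, but check fs.defaultFS in your core-site.xml) are assumptions for illustration. Writing to Elasticsearch needs a separate connector and is not shown.

```python
def hdfs_uri(namenode_host, path, port=8020):
    """Build an hdfs:// URI. Host and port are assumptions; match your fs.defaultFS."""
    return f"hdfs://{namenode_host}:{port}/{path.lstrip('/')}"


def run_job(master_url, input_uri, output_uri):
    """Read text lines from HDFS, drop empty lines, write the rest back to HDFS."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master_url)          # e.g. a standalone master: spark://host:7077
        .appName("hdfs-roundtrip")
        .getOrCreate()
    )
    try:
        lines = spark.read.text(input_uri)            # one column named "value"
        non_empty = lines.filter(lines.value != "")   # stand-in for real processing
        non_empty.write.mode("overwrite").text(output_uri)
    finally:
        spark.stop()


if __name__ == "__main__":
    # Hypothetical hosts and paths, for illustration only.
    run_job(
        "spark://master-host:7077",
        hdfs_uri("namenode-host", "/data/input.txt"),
        hdfs_uri("namenode-host", "/data/output"),
    )
```

Only the master URL changes if you later switch from Spark standalone to YARN or Mesos; the HDFS URIs stay the same.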

What is the difference between multi node hadoop cluster and running hadoop on mesos?

I've built a multi-node Hadoop cluster, then I started studying Mesos and the ability to run Hadoop on a Mesos cluster. So here are my questions:
1) Should I run Hadoop on a Mesos cluster, or does it not matter?
2) What is the difference between them?
These are different things at different layers of the stack. You can deploy a Hadoop cluster directly on a set of machines, and those machines can then handle Hadoop jobs.
Or you can deploy a Mesos cluster first, and then deploy a Hadoop cluster, a Spark cluster, Kafka and other things on top of Mesos. You then submit your Hadoop jobs to the Hadoop cluster and your Spark jobs to the Spark cluster.
