I would like to set up a Spark YARN client (link). Do I need to install Hadoop, or is it enough to install only YARN (following this link)?
No, Spark does not require Hadoop to run. Apache Spark is an independent project that can run on its own. If you want, you can even run it without Apache YARN.
Spark supports three types of cluster manager: Mesos, YARN, and standalone. If you do not have YARN installed, it can use Mesos or standalone. Note that when you do not specify a master at all, spark-submit falls back to local mode (local[*]) rather than to a cluster manager. The links you mentioned are fine to use, but I think better resources are available on Google.
I have two Cloudera Hadoop clusters (prod and dev) and one client machine. This client machine is configured as a gateway node for the prod cluster.
From this I am able to submit a spark job to my prod cluster using
spark-submit --master yarn job_script.py
Now I would like to submit the same job to my dev cluster from this client machine.
I tried using
spark-submit --master yarn://<dev_resource_manager_ip>:8032 job_script.py
But this doesn't seem to work, and my job is still submitted to the prod cluster. How can I tell spark-submit to submit the job to the dev cluster's resource manager instead of the prod cluster's?
Create a directory containing all the Hadoop XML configuration files for the dev cluster, and point the HADOOP_CONF_DIR environment variable at it before running spark-submit.
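A minimal sketch of that override, written in Python so the environment change stays local to one submission. The helper names `build_submit` and `submit`, and the directory `/etc/hadoop/conf.dev`, are hypothetical; setting YARN_CONF_DIR as well is my assumption, to avoid a stale value left over from the prod gateway setup.

```python
import os
import subprocess


def build_submit(conf_dir, script, base_env=None):
    """Return (argv, env) for submitting `script` to the cluster
    whose client-side XML files live in `conf_dir`."""
    # Copy the environment so the calling shell/process is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # Point Spark at the dev cluster's configuration directory.
    env["HADOOP_CONF_DIR"] = conf_dir
    # Assumption: also override YARN_CONF_DIR in case the gateway sets it.
    env["YARN_CONF_DIR"] = conf_dir
    # The master stays plain "yarn"; the address comes from the XML files.
    argv = ["spark-submit", "--master", "yarn", script]
    return argv, env


def submit(conf_dir, script):
    """Run spark-submit with the overridden environment."""
    argv, env = build_submit(conf_dir, script)
    return subprocess.run(argv, env=env, check=True)
```

Calling `submit("/etc/hadoop/conf.dev", "job_script.py")` would then target the dev cluster, while the plain command on the gateway keeps going to prod.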
I have a Hadoop cluster deployed, and the client MapReduce program is running on another machine. How can I use that cluster?
If your jars are on a client machine, install the hadoop-client packages on that machine and put the cluster's configuration files in its conf folder, so that you can trigger jobs on the remote cluster from the client machine.
I have a cluster of two machines running Apache HBase and Apache Hadoop. I need to use Hue so that I can interact with HBase and HDFS through a GUI. I have installed it successfully on my machine (Ubuntu 14.04), but it shows nothing about HDFS, tables, etc., and gives errors like
1.oozie server is not running
2.could not connect to local:9090
HBase thrift server cannot be contacted
How do I configure Hue so that it connects to my running cluster?
I am trying to understand how Spark runs on YARN in cluster/client mode. I have the following questions.
Is it necessary for Spark to be installed on all the nodes in the YARN cluster? I think it should be, because the worker nodes execute tasks and must be able to interpret the code (Spark APIs) in the Spark application that the driver sends to the cluster.
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does client node have to install Hadoop when it is sending the job to cluster?
Adding to other answers.
Is it necessary that spark is installed on all the nodes in the yarn
cluster?
No, not if the Spark job is scheduled on YARN (in either client or cluster mode). A Spark installation on multiple nodes is needed only for standalone mode.
These are visualizations of the Spark application deployment modes.
[Figure: Spark standalone cluster]
In cluster mode, the driver runs on one of the Spark worker nodes, whereas in client mode it runs on the machine that launched the job.
[Figure: YARN cluster mode]
[Figure: YARN client mode]
[Table, image not preserved: a concise list of differences between these modes]
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side)
configuration files for the Hadoop cluster". Why does the client node have
to install Hadoop when it is sending the job to cluster?
Installing Hadoop is not mandatory, but the configuration files (not all of them) are! We can call such machines gateway nodes. There are two main reasons.
The configuration contained in HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
In YARN mode the ResourceManager's address is picked up from the Hadoop configuration (yarn-site.xml, which overrides the defaults in yarn-default.xml). Thus, the --master parameter is simply yarn.
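To check which cluster a given configuration directory points at, a small sketch like the following can read a property out of a Hadoop-style XML file. The function name `read_hadoop_property` and the sample host `dev-rm.example.com` are hypothetical; `yarn.resourcemanager.address` is the standard YARN property, and I am assuming it is set explicitly in the client's yarn-site.xml, which is not always the case (for example with ResourceManager HA).

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, name):
    """Return the value of property `name` from Hadoop-style XML text,
    or None if the property is not present."""
    root = ET.fromstring(xml_text)
    # Hadoop config files are a flat list of <property><name/><value/></property>.
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Hypothetical yarn-site.xml content for illustration only.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>dev-rm.example.com:8032</value>
  </property>
</configuration>"""

rm_address = read_hadoop_property(SAMPLE_YARN_SITE, "yarn.resourcemanager.address")
```

In practice you would pass the contents of the yarn-site.xml found in the directory named by HADOOP_CONF_DIR instead of the sample string.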
Update: (2017-01-04)
Spark 2.0+ no longer requires a fat assembly jar for production deployment.
We are running spark jobs on YARN (we use HDP 2.2).
We don't have Spark installed on the cluster. We only added the Spark assembly jar to HDFS.
For example to run the Pi example:
./bin/spark-submit \
--verbose \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 4 \
hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - This setting tells YARN where to get the Spark assembly from. If you don't set it, the jar is uploaded from the machine where you run spark-submit.
About your second question: the client node does not need Hadoop installed. It only needs the configuration files. You can copy the configuration directory from your cluster to your client.
1 - Spark follows a master/slave architecture. So on your cluster, you have to install a Spark master and N Spark slaves. You can run Spark in standalone mode, but using the YARN architecture will give you some benefits.
There is a very good explanation of it here : http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
2 - It is necessary if you want to use YARN or HDFS, for example, but as I said before, you can run Spark in standalone mode.
Let me try to cut to the chase and keep it short for the impatient.
6 components: 1. client, 2. driver, 3. executors, 4. application master, 5. workers, and 6. resource manager; 2 deploy modes; and 2 resource (cluster) managers.
Here's the relation:
Client
Nothing special; it is the one submitting the Spark app.
Worker, executors
Nothing special, one worker holds one or more executors.
Master and resource (cluster) manager
(no matter whether client or cluster mode)
in YARN, the resource manager and the (application) master sit on two different nodes;
in standalone, resource manager == master, the same process on the same node.
Driver
in client mode, it sits with the client
in YARN cluster mode, it sits with the application master (in this case, the client process can exit after submitting the app)
in standalone cluster mode, it sits with one of the workers
Voilà!
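The driver placement described above can be summed up as a small lookup, sketched here for illustration. The names `DRIVER_LOCATION` and `driver_location` are hypothetical, and reading "master" in YARN cluster mode as the YARN application master is my interpretation of the answer.

```python
# (cluster manager, deploy mode) -> where the driver runs.
DRIVER_LOCATION = {
    ("yarn", "client"): "client machine",
    ("yarn", "cluster"): "application master",
    ("standalone", "client"): "client machine",
    ("standalone", "cluster"): "one of the workers",
}


def driver_location(manager, deploy_mode):
    """Return where the Spark driver runs for the given combination."""
    key = (manager.lower(), deploy_mode.lower())
    if key not in DRIVER_LOCATION:
        raise ValueError(f"unsupported combination: {manager}/{deploy_mode}")
    return DRIVER_LOCATION[key]
```

For example, `driver_location("yarn", "cluster")` returns "application master", which is why the submitting machine does not have to stay connected in that mode.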