What is the master URL in EC2 spark cluster - amazon-ec2

I have a spark cluster launched using spark-ec2 script.
(EDIT: after login into the master), I can run spark jobs locally on the master node as :
spark-submit --class myApp --master local myApp.jar
But I can't seem to run the job in the cluster mode:
../spark/bin/spark-submit --class myApp --master spark://54.111.111.111:7077 --deploy-mode cluster myApp.jar
The ip address of the master is obtained from the AWS console.
I get the following errors:
WARN RestSubmissionClient: Unable to connect to server
Warning: Master endpoint spark://54.111.111.111:7077 was not a REST server. Falling back to legacy submission gateway instead.
Error connecting to master (akka.tcp://sparkMaster#54.111.111.111:7077).
Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#54.177.156.236:7077
No master is available, exiting.
How to submit to a EC2 spark cluster ?

When you run with --master local you are also not connecting to the master. You are executing Spark operations in the same JVM as the application. (See docs.)
Your application code may be wrong too. So first just try to run spark-shell on the master node. /root/spark/bin/spark-shell is configured to connect to the EC2 Spark master when started without flags. If that works, you can try spark-shell --master spark://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:7077 on your laptop. Be sure to use the external IP or hostname of the master machine.
If that works too, try running your application in client mode (without --deploy-mode cluster). Hopefully in the course of trying all these, you will figure out what was wrong with your original approach. Good luck!

This is nothing to do with EC2, I had similar error on my server. I was able to resolve it by overwriting spark-env.sh SPARK_MASTER_IP.

Related

how to switch between cluster types in Apache Spark

I'm trying to switch cluster manager from standalone to 'YARN' in Apache Spark that I've installed for learning.
I read following thread to understand which cluster type should be chosen
However, I'd like to know the steps/syntax to change the cluster type.
Ex: from Standalone to YARN or from YARN to Standalone.
In spark there is one function name as --master that can helps you to execute your script on yarn Cluster mode or standalone mode.
Run the application on local mode or standalone used this with spark-submit command
--master Local[*]
or
--master spark://192.168.10.01:7077 \
--deploy-mode cluster \
Run on a YARN cluster
--master yarn
--deploy-mode cluster
For more information kindly visit this link.
https://spark.apache.org/docs/latest/submitting-applications.html
If you are not running through command line then you can directly set this master on SparkConf object.
sparkConf.setMaster(http://path/to/master/url:port) in cluster mode
or
sparkConf.setMaster(local[*]) in client/local mode

Can't see Yarn Job when doing Spark-Submit on Yarn Cluster

I am using spark-submit for my job with the command below:
spark-submit script_test.py --master yarn --deploy-mode cluster
spark-submit script_test.py --master yarn-cluster --deploy-mode cluster
The job is working fine. I can see it under the Spark History Server UI. However, I cannot see it under the RessourceManager UI ( YARN).
I have the feeling that my job is not sent to the cluster but it is running only in one node. However, I see nothing wrong on the way I use the Spark-submit command.
Am-i wrong? How can I check it? Or send the job to yarn cluster?
When you are using --master yarn means that in some place you have configured the yarn-site with hosts, ports, and so on.
Maybe the machine where you are using the spark-submit doesn't know where is the Yarn master.
You could check your hadoop/yarn/spark config files, specially the yarn-site.xml to check if the host of the Resource Manager is correct or not.
Those files are in different folders depending on which distribution of Hadoop you are using. In HDP I guess they are in /etc/hadoop/conf
Hope it helps.

Spark submit to remote yarn

I have two clodera hadoop cluster (prod and dev) and one client machine. This client machine is configured to be a gateway node to the prod cluster.
From this I am able to submit a spark job to my prod cluster using
spark-submit --master yarn job_script.py
Now I would like to submit the same job to my dev cluster from this client machine.
I tried using
spark-submit --master yarn://<dev_resource_manager_ip>:8032 job_script.py
But this doesn't seem to work and my job is still getting submitted to prod cluster. How could I tell spark-submit to submit job to dev cluster resource manager instead of prod cluster.
Create directory with all Hadoop XMLs for dev cluster and override HADOOP_CONF_DIR environment variable before spark-submit.

H2O: unable to connect to h2o cluster through python

I have a 5 node hadoop cluster running HDP 2.3.0. I setup a H2O cluster on Yarn as described here.
On running following command
hadoop jar h2odriver_hdp2.2.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 512m -nodes 3 -output /user/hdfs/H2OTestClusterOutput
I get the following ouput
H2O cluster (3 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...
When I try to execute the command
h2o.init(ip="10.113.57.98", port=54321)
The process remains stuck at this stage.On trying to connect to the web UI using the ip:54321, the browser tries to endlessly load the H2O admin page but nothing ever displays.
On forcefully terminating the init process I get the following error
No instance found at ip and port: 10.113.57.98:54321. Trying to start local jar...
However if I try and use H2O with python without setting up a H2O cluster, everything runs fine.
I executed all commands as the root user. Root user has permissions to read and write from the /user/hdfs hdfs directory.
I'm not sure if this is a permissions error or that the port is not accessible.
Any help would be greatly appreciated.
It looks like you are using H2O2 (H2O Classic). I recommend upgrading your H2O to the latest (H2O 3). There is a build specifically for HDP2.3 here: http://www.h2o.ai/download/h2o/hadoop
Running H2O3 is a little cleaner too:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
Also, 512mb per node is tiny - what is your use case? I would give the nodes some more memory.

Submitting a Spark application to the virtualbox Spark Master

I have created a Spark application Hello World that works well locally through Eclipse IDE.
I would like to deploy remotely this application from my local machine to the virtualbox Cloudera machine, through the "spark-submit".
The command line used for that is:
C:\Users\S-LAMARTI\Desktop\AXA\Workspaces\AXA\helloworld\target>%SPARK_HOME%/spa
rk-submit --class com.saadlamarti.helloworld.App --master spark://192.168.56.102
:7077 --deploy-mode cluster helloworld-0.0.1-SNAPSHOT.jar
Unfortunately, the application doesn't work, and I get this message error:
15/10/12 12:20:40 WARN RestSubmissionClient: Unable to connect to server spark:/
/192.168.56.102:7077.
Warning: Master endpoint spark://192.168.56.102:7077 was not a REST server. Fall
ing back to legacy submission gateway instead.
Can someone have any idea, why is not working?
Remove the arguement --deploy-mode cluster and try again.
Check the master:8080,and then you can see two url,one is the client submit url,another is the rest for cluster.
Find your REST url, if you set the argument --deploy-mode cluster, you must set the argument --master spark:Rest url.

Resources