Apache Spark deployment issue (cluster mode) with Hive - hadoop

EDIT:
I'm developing a Spark application that reads data from multiple structured schemas, and I'm trying to aggregate the information across those schemas. My application runs well when I run it locally, but when I run it on a cluster I'm having trouble with the configuration (most probably with hive-site.xml) or with the submit-command arguments. I've looked at other related posts but couldn't find a solution SPECIFIC to my scenario. I've described in detail below which commands I tried and which errors I got. I'm new to Spark and might be missing something trivial, but I can provide more information to support my question.
Original Question:
I've been trying to run my Spark application on a 6-node Hadoop cluster bundled with HDP 2.3 components.
Here is the component information that might be useful in suggesting a solution:
Cluster information: 6-node cluster:
128 GB RAM
24 cores
8 TB HDD
Components used in the application
HDP - 2.3
Spark - 1.3.1
$ hadoop version:
Hadoop 2.7.1.2.3.0.0-2557
Subversion git@github.com:hortonworks/hadoop.git -r 9f17d40a0f2046d217b2bff90ad6e2fc7e41f5e1
Compiled by jenkins on 2015-07-14T13:08Z
Compiled with protoc 2.5.0
From source with checksum 54f9bbb4492f92975e84e390599b881d
Scenario:
I'm trying to use SparkContext and HiveContext together to take full advantage of Spark's real-time querying of its data structures, such as DataFrames. The dependencies used in my application are:
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.4.0</version>
</dependency>
Below are the submit commands and the corresponding error logs that I'm getting:
Submit Command1:
spark-submit --class working.path.to.Main \
--master yarn \
--deploy-mode cluster \
--num-executors 17 \
--executor-cores 8 \
--executor-memory 25g \
--driver-memory 25g \
--num-executors 5 \
application-with-all-dependencies.jar
Error Log1:
User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Submit Command2:
spark-submit --class working.path.to.Main \
--master yarn \
--deploy-mode cluster \
--num-executors 17 \
--executor-cores 8 \
--executor-memory 25g \
--driver-memory 25g \
--num-executors 5 \
--files /etc/hive/conf/hive-site.xml \
application-with-all-dependencies.jar
Error Log2:
User class threw exception: java.lang.NumberFormatException: For input string: "5s"
Since I don't have administrative permissions, I cannot modify the configuration. I could contact the IT engineer and have the changes made, but I'm looking for a solution that involves as few changes to the configuration files as possible!
Configuration changes were suggested here.
Then I tried passing various jar files as arguments as suggested in other discussion forums.
Submit Command3:
spark-submit --class working.path.to.Main \
--master yarn \
--deploy-mode cluster \
--num-executors 17 \
--executor-cores 8 \
--executor-memory 25g \
--driver-memory 25g \
--num-executors 5 \
--jars /usr/hdp/2.3.0.0-2557/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/2.3.0.0-2557/spark/lib/datanucleus-core-3.2.10.jar,/usr/hdp/2.3.0.0-2557/spark/lib/datanucleus-rdbms-3.2.9.jar \
--files /etc/hive/conf/hive-site.xml \
application-with-all-dependencies.jar
Error Log3:
User class threw exception: java.lang.NumberFormatException: For input string: "5s"
I didn't understand what happened with the following command and couldn't analyze the error log.
Submit Command4:
spark-submit --class working.path.to.Main \
--master yarn \
--deploy-mode cluster \
--num-executors 17 \
--executor-cores 8 \
--executor-memory 25g \
--driver-memory 25g \
--num-executors 5 \
--jars /usr/hdp/2.3.0.0-2557/spark/lib/*.jar \
--files /etc/hive/conf/hive-site.xml \
application-with-all-dependencies.jar
Submit Log4:
Application application_1461686223085_0014 failed 2 times due to AM Container for appattempt_1461686223085_0014_000002 exited with exitCode: 10
For more detailed output, check application tracking page:http://cluster-host:XXXX/cluster/app/application_1461686223085_0014Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e10_1461686223085_0014_02_000001
Exit code: 10
Stack trace: ExitCodeException exitCode=10:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 10
Failing this attempt. Failing the application.
Are there any other possible options? Any help would be highly appreciated. Please let me know if you need any other information.
Thank you.

The solution explained here worked for my case. There are two locations where hive-site.xml resides, which can be confusing: use --files /usr/hdp/current/spark-client/conf/hive-site.xml instead of --files /etc/hive/conf/hive-site.xml. I didn't have to add the jars for my configuration. I hope this helps someone struggling with a similar problem. Thanks.
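For reference, a corrected version of Submit Command2 with this path would look like the following (a sketch only; the class name, resource sizes, and jar are unchanged from the question, and the duplicated --num-executors flag from the original commands is dropped here):
spark-submit --class working.path.to.Main \
--master yarn \
--deploy-mode cluster \
--num-executors 5 \
--executor-cores 8 \
--executor-memory 25g \
--driver-memory 25g \
--files /usr/hdp/current/spark-client/conf/hive-site.xml \
application-with-all-dependencies.jar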

Related

Running spark2 job through Oozie shell action?

As mentioned in the title, I'm trying to run a shell action that kicks off a Spark job, but unfortunately I'm consistently getting the following error...
19/05/10 14:03:39 ERROR AbstractRpcClient: SASL authentication failed.
The most likely cause is missing or invalid credentials. Consider
'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed
to find any Kerberos tgt)]
java.io.IOException: Could not set up IO Streams to
<hbaseregionserver>
Fri May 10 14:03:39 BST 2019,
RpcRetryingCaller{globalStartTime=1557493419339, pause=100,
retries=2}, org.apache.hadoop.hbase.ipc.FailedServerException: This
server is in the failed servers list: <hbaseregionserver>
I've been playing around trying to get the script to take in the Kerberos ticket but having no luck. As far as I can tell, it's related to the Oozie job not being able to pass the Kerberos ticket on. Any ideas why it's not picking it up? I'm at a loss. The related code is below.
Oozie workflow action
<action name="sparkJ" cred="hive2Cred">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${oozieQueueName}</value>
</property>
</configuration>
<exec>run.sh</exec>
<file>/thePathToTheScript/run.sh#run.sh</file>
<file>/thePathToTheProperties/myp.properties#myp.properties</file>
<capture-output />
</shell>
<ok to="end" />
<error to="fail" />
</action>
Shell script
#!/bin/sh
export job_name=SPARK_JOB
export configuration=myp.properties
export num_executors=10
export executor_memory=1G
export queue=YARNQ
export max_executors=50
kinit -kt KEYTAB KPRINCIPAL
echo "[[[[[[[[[[[[[ Starting Job - name:${job_name},
configuration:${configuration} ]]]]]]]]]]]]]]"
/usr/hdp/current/spark2-client/bin/spark-submit \
--name ${job_name} \
--driver-java-options "-Dlog4j.configuration=file:./log4j.properties" \
--num-executors ${num_executors} \
--executor-memory ${executor_memory} \
--master yarn \
--keytab KEYTAB \
--principal KPRINCIPAL \
--supervise \
--deploy-mode cluster \
--queue ${queue} \
--files "./${configuration},./hbase-site.xml,./log4j.properties" \
--conf spark.driver.extraClassPath="/usr/hdp/current/hive-client/lib/datanucleus-*.jar:/usr/hdp/current/tez-client/*.jar" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=./jaas.conf -Dlog4j.configuration=file:./log4j.properties" \
--conf spark.executor.extraClassPath="/usr/hdp/current/hive-client/lib/datanucleus-*.jar:/usr/hdp/current/tez-client/*.jar" \
--conf spark.streaming.stopGracefullyOnShutdown=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=${max_executors} \
--conf spark.streaming.concurrentJobs=2 \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.yarn.security.tokens.hive.enabled=true \
--conf spark.yarn.security.tokens.hbase.enabled=true \
--conf spark.streaming.kafka.maxRatePerPartition=5000 \
--conf spark.streaming.backpressure.pid.maxRate=3000 \
--conf spark.streaming.backpressure.pid.minRate=200 \
--conf spark.streaming.backpressure.initialRate=5000 \
--jars /usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,/usr/hdp/current/hbase-client/lib/hbase-common.jar,/usr/hdp/current/hbase-client/lib/hbase-client.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol.jar,/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
--class myclass myjar.jar ./${configuration}
Many thanks for any help you can provide.

How to increase Permgen space for spark jobs getting launched via oozie

I am having an issue where my Spark executors are running out of PermGen space. I launch my Spark jobs using Oozie's spark action. The same jobs run without issue if I launch them manually using spark-submit.
I have tried increasing the PermGen space using each of these methods:
1. --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=2g as a command-line option in <spark-opts> in Oozie's workflow.xml
2. Increasing oozie.launcher.mapreduce.child.java.opts in the global <configuration> in Oozie's workflow.xml
3. Programmatically in the Spark code, when the SparkConf is created, using sparkConf.set("spark.executor.extraJavaOptions","-XX:MaxPermSize=2g")
The Spark UI executors page indicates that the property is successfully set. But when the executors are launched, they get the default of 256M in addition to the passed parameter, which appears in single quotes:
{{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m -Xmx6144m '-XX:MaxPermSize=2g' -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.driver.port=45057' -Dspark.yarn.app.container.log.dir=<LOG_DIR> -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.0.0.124:45057 --executor-id 3 --hostname ip-10-0-0-126.us-west-2.compute.internal --cores 4 --app-id application_1471652816867_0103 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
Any ideas on how to increase the PermGen space? I am running Spark 1.6.2 with Hadoop 2.7.2. The Oozie version is 4.2.
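For comparison, a manual launch that passes the same option would look something like this (a minimal sketch; the class and jar names are placeholders, not taken from the original job):
spark-submit --class com.example.MyJob \
--master yarn \
--deploy-mode cluster \
--conf spark.executor.extraJavaOptions=-XX:MaxPermSize=2g \
my-job.jar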

Running Spark on Yarn Client

I have recently set up a multinode Hadoop HA (NameNode & ResourceManager) cluster (3 nodes). The installation is complete and all daemons run as expected.
Daemon in NN1 :
2945 JournalNode
3137 DFSZKFailoverController
6385 Jps
3338 NodeManager
22730 QuorumPeerMain
2747 DataNode
3228 ResourceManager
2636 NameNode
Daemon in NN2 :
19620 Jps
3894 QuorumPeerMain
16966 ResourceManager
16808 NodeManager
16475 DataNode
16572 JournalNode
17101 NameNode
16702 DFSZKFailoverController
Daemon in DN1 :
12228 QuorumPeerMain
29060 NodeManager
28858 DataNode
29644 Jps
28956 JournalNode
I am interested in running Spark jobs on my YARN setup.
I have installed Scala and Spark on NN1 and I can successfully start Spark by issuing the following command:
$ spark-shell
Now, I have no knowledge about Spark, and I would like to know how I can run Spark on YARN. I have read that we can run it as either yarn-client or yarn-cluster.
Should I install Spark and Scala on all nodes in the cluster (NN2 & DN1) to run Spark on YARN in client or cluster mode? If not, how can I submit Spark jobs from the NN1 (primary NameNode) host?
I have copied the Spark assembly JAR to HDFS as suggested in a blog I read:
-rw-r--r-- 3 hduser supergroup 187548272 2016-04-04 15:56 /user/spark/share/lib/spark-assembly.jar
I also created a SPARK_JAR variable in my bashrc file. I tried to submit the Spark job as yarn-client but I end up with the error below. I have no idea if I am doing everything correctly or if other settings need to be done first.
[hduser#ptfhadoop01v spark-1.6.0]$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 2 --queue thequeue lib/spark-examples*.jar 10
16/04/04 17:27:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/04 17:27:51 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '2').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.
16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/04/04 17:27:58 WARN MetricsSystem: Stopping a MetricsSystem that is not running
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[hduser#ptfhadoop01v spark-1.6.0]$
Please help me resolve this and explain how to run Spark on YARN in client or cluster mode.
Now, I have no knowledge about Spark, and I would like to know how I can run Spark on YARN. I have read that we can run it as either yarn-client or yarn-cluster.
It's highly recommended that you read the official documentation of Spark on YARN at http://spark.apache.org/docs/latest/running-on-yarn.html.
You can use spark-shell with --master yarn to connect to YARN. You need to have the proper configuration files on the machine you run spark-shell from, e.g. yarn-site.xml.
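For example, a minimal way to point the shell at the cluster configuration might be (the /etc/hadoop/conf path is an assumption; use whichever directory actually contains your yarn-site.xml):
export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-shell --master yarn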
Should I install Spark and Scala on all nodes in the cluster (NN2 & DN1) to run Spark on YARN in client or cluster mode?
No. You don't have to install anything on the other nodes, since Spark will distribute the necessary files for you.
If not, how can I submit Spark jobs from the NN1 (primary NameNode) host?
Start with spark-shell --master yarn and see if you can execute the following code:
(0 to 5).toDF.show
If you see a table-like output, you're done. Else, provide the error(s).
I also created a SPARK_JAR variable in my bashrc file. I tried to submit the Spark job as yarn-client but I end up with the error below. I have no idea if I am doing everything correctly or if other settings need to be done first.
Remove the SPARK_JAR variable. Don't use it, as it's not needed and might cause trouble. Read the official documentation at http://spark.apache.org/docs/latest/running-on-yarn.html to understand the basics of Spark on YARN and beyond.
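For example (assuming the export was added to ~/.bashrc as described in the question):
# delete the "export SPARK_JAR=..." line from ~/.bashrc, then clear it from the current session
unset SPARK_JAR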
Adding this property to hdfs-site.xml solved the issue:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
In client mode you'd run it something like below, for a simple word count example:
spark-submit --class org.sparkexample.WordCount --master yarn-client wordcount-sample-plain-1.0-SNAPSHOT.jar input.txt output.txt
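A cluster-mode run would differ only in the master setting, e.g. (a sketch reusing the same example jar; note that in cluster mode the driver runs inside the cluster, so input/output paths should be on HDFS rather than the local filesystem):
spark-submit --class org.sparkexample.WordCount --master yarn-cluster wordcount-sample-plain-1.0-SNAPSHOT.jar input.txt output.txt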
I think you got the spark-submit command wrong there. There is no --master yarn set up.
I would highly recommend using an automated provisioning tool to set up your cluster quickly instead of a manual approach.
Refer to Cloudera or Hortonworks tools. You can use them to get set up in no time and submit jobs easily without doing all this configuration manually.
Reference: https://hortonworks.com/products/hdp/

Driver not found exception while querying oracle using spark submit in yarn-client mode

I am getting the following exception while submitting job to spark:
Exception in thread "main" java.sql.SQLException:
No suitable driver found for jdbc:oracle:thin:user/pass@x.x.x.x:1521:orcl
Command used to submit job:
bin/spark-submit --class com.mycompany.app.App
--master yarn-client
--conf spark.yarn.jar=hdfs://192.168.xxx.xx:9000/spark-assembly-1.3.1-hadoop2.6.0.jar
--driver-class-path hdfs://192.168.xxx.xx:9000/ojdbc7.jar
--jars hdfs://192.168.xxx.xx:9000/ojdbc7.jar
/home/impadmin/maventest-idw/my-app/target/my-app-1.0.jar
Any pointers?
--jars does not need to be used if you are using jars from HDFS, HTTP or FTP.
You can directly use SparkContext.addJar("path of jar file")

Running JAR in Hadoop on Google Cloud using Yarn-client

I want to run a JAR in Hadoop on Google Cloud using yarn-client.
I use this command on the master node of Hadoop:
spark-submit --class find --master yarn-client find.jar
but it returns this error:
15/06/17 10:11:06 INFO client.RMProxy: Connecting to ResourceManager at hadoop-m-on8g/10.240.180.15:8032
15/06/17 10:11:07 INFO ipc.Client: Retrying connect to server: hadoop-m-on8g/10.240.180.15:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
What is the problem? In case it is useful, this is my yarn-site.xml:
<?xml version="1.0" ?>
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/yarn-logs/</value>
<description>
The remote path, on the default FS, to store logs.
</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-m-on8g</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>5999</value>
<description></description>
</property>
</configuration>
In your case, it looks like the YARN ResourceManager may be unhealthy for unknown reasons; you can try to fix yarn with the following:
sudo sudo -u hadoop /home/hadoop/hadoop-install/sbin/stop-yarn.sh
sudo sudo -u hadoop /home/hadoop/hadoop-install/sbin/start-yarn.sh
However, it looks like you're using the Click-to-Deploy solution; Click-to-Deploy's Spark + Hadoop 2 deployment actually doesn't support Spark on YARN at the moment, due to some bugs and lack of memory configs. You'd normally run into something like this if you just try to run it with --master yarn-client out-of-the-box:
15/06/17 17:21:08 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
appMasterRpcPort: -1
appStartTime: 1434561664937
yarnAppState: ACCEPTED
15/06/17 17:21:09 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
appMasterRpcPort: -1
appStartTime: 1434561664937
yarnAppState: ACCEPTED
15/06/17 17:21:10 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
appMasterRpcPort: 0
appStartTime: 1434561664937
yarnAppState: RUNNING
15/06/17 17:21:15 ERROR cluster.YarnClientSchedulerBackend: Yarn application already ended: FAILED
15/06/17 17:21:15 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
15/06/17 17:21:15 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
The well-supported way to deploy a cluster on Google Compute Engine with Hadoop 2 and Spark configured to run on YARN is to use bdutil. You'd run something like:
./bdutil -P <instance prefix> -p <project id> -b <bucket> -z <zone> -d \
-e extensions/spark/spark_on_yarn_env.sh generate_config my_custom_env.sh
./bdutil -e my_custom_env.sh deploy
# Shorthand for logging in to the master
./bdutil -e my_custom_env.sh shell
# Handy way to run a socks proxy to make it easy to access the web UIs
./bdutil -e my_custom_env.sh socksproxy
# When done, delete your cluster
./bdutil -e my_custom_env.sh delete
With spark_on_yarn_env.sh Spark should default to yarn-client, though you can always re-specify --master yarn-client if you want. You can see a more detailed explanation of the flags available in bdutil with ./bdutil --help. Here are the help entries just for the flags I included above:
-b, --bucket
Google Cloud Storage bucket used in deployment and by the cluster.
-d, --use_attached_pds
If true, uses additional non-boot volumes, optionally creating them on
deploy if they don't exist already and deleting them on cluster delete.
-e, --env_var_files
Comma-separated list of bash files that are sourced to configure the cluster
and installed software. Files are sourced in order with later files being
sourced last. bdutil_env.sh is always sourced first. Flag arguments are
set after all sourced files, but before the evaluate_late_variable_bindings
method of bdutil_env.sh. see bdutil_env.sh for more information.
-P, --prefix
Common prefix for cluster nodes.
-p, --project
The Google Cloud Platform project to use to create the cluster.
-z, --zone
Specify the Google Compute Engine zone to use.