spark-submit not waiting for state FINISHED before exiting program - bash

I am submitting a Spark job to our YARN service via spark-submit. My understanding is that spark-submit should keep running until the application reaches the FINISHED state before moving on. However, when submitted through Bamboo, spark-submit exits almost immediately, the script goes straight to the wait, and then the SQL query runs. The SQL query should not run until the Spark job is 100% finished. I am not sure why spark-submit is not waiting. Any help is appreciated, thanks.
nohup spark-submit --name "${APP_NAME}" \
--class "${SPARK_CLASS_NAME}" \
--files jaas.conf,kafka.properties,distributed.properties,${KEYTAB},pools.xml \
--principal ${PRINCIPAL} \
--keytab ${KEYTAB_ALT} \
--conf "spark.driver.extraJavaOptions=${JVM_ARGS}" \
--conf "spark.executor.extraJavaOptions=${JVM_ARGS}" \
--conf spark.haplogic.env=${ENV} \
--conf spark.scheduler.allocation.file=${POOL_SCHEDULER_FILE} \
--conf spark.master=yarn \
--conf spark.submit.deployMode=cluster \
--conf spark.yarn.submit.waitAppCompletion=true \
--conf spark.driver.memory=$(getProperty "spark.driver.memory") \
--conf spark.executor.memory=$(getProperty "spark.executor.memory") \
--conf spark.executor.instances=$(getProperty "spark.executor.instances") \
--conf spark.executor.cores=$(getProperty "spark.executor.cores") \
--conf spark.yarn.maxAppAttempts=$(getProperty "spark.yarn.maxAppAttempts") \
--conf spark.dynamicAllocation.enabled=$(getProperty "spark.dynamicAllocation.enabled") \
--conf spark.yarn.queue=$(getProperty "spark.yarn.queue") \
--conf spark.memory.fraction=$(getProperty "spark.memory.fraction") \
--conf spark.memory.storageFraction=$(getProperty "spark.memory.storageFraction") \
--conf spark.eventLog.enabled=$(getProperty "spark.eventLog.enabled") \
--conf spark.serializer=org.apache.spark.serializer.JavaSerializer \
--conf spark.acls.enable=true \
--conf spark.admin.acls.groups=${USER_GROUPS} \
--conf spark.acls.enable.groups=${USER_GROUPS} \
--conf spark.ui.view.acls.groups=${USER_GROUPS} \
--conf spark.yarn.appMasterEnv.SECRETS_LIB_MASTER_KEY=${SECRETS_LIB_MASTER_KEY} \
${JARFILE_NAME} >> ${LOG_FILE} 2>&1 &
sleep 90

The issue was a bash one: the trailing & runs spark-submit in the background, so the script only blocks for the sleep 90, and Bamboo treats the step as complete long before the Spark job actually finishes.
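A minimal sketch of the fix, reusing the variables from the question (option list abridged): either drop nohup and the trailing & so spark-submit runs in the foreground, or keep the background job and wait on it. In cluster mode with spark.yarn.submit.waitAppCompletion=true, spark-submit itself only returns once the YARN application terminates.
# Option 1: foreground -- remove "nohup" and the trailing "&" above and drop
# the "sleep 90"; the script then cannot reach the SQL step until spark-submit
# returns, i.e. until the YARN application has finished.

# Option 2: keep the background job but block on it explicitly
# (reuse the full set of --conf flags from the command above):
nohup spark-submit --name "${APP_NAME}" --class "${SPARK_CLASS_NAME}" \
--conf spark.master=yarn \
--conf spark.submit.deployMode=cluster \
--conf spark.yarn.submit.waitAppCompletion=true \
${JARFILE_NAME} >> ${LOG_FILE} 2>&1 &
submit_pid=$!
wait "${submit_pid}"   # returns only when spark-submit exits
echo "spark-submit exited with status $?"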

Related

How to pass additional parameter in bash script

I want to design a pipeline for executing a program whose configuration can vary through its arguments. The developers do not want a dedicated variable for each argument; they want to be able to add multiple options through the pipeline itself. We are using bash, our development pipeline runs on gitlab-ci, and we use Octopus for UAT deployments.
example:
spark2-submit \
--master $MASTER \
--name $NAME \
--queue $QUEUE \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=12
As you can see in the example above, I want the flexibility to add more "--conf" parameters.
Should I use a placeholder parameter and append it to the end of the command?
for example:
spark2-submit \
--master $MASTER \
--name $NAME \
--queue $QUEUE \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=12 \
$additional_param
I am using GitLab for my code repo and Octopus for CI/CD, with bash for deployment. I am looking for a flexible approach that lets me use the full variable features of both Octopus and GitLab. What is your recommendation? Do you have a better suggestion?
This is what Charles is hinting at with "Lists of arguments should be stored in arrays":
spark2_opts=(
--master "$MASTER"
--name "$NAME"
--queue "$QUEUE"
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.executorIdleTimeout=12
)
# add additional options, through some mechanism such as the environment:
if [[ -n "$SOME_ENV_VAR" ]]; then
spark2_opts+=( --conf "$SOME_ENV_VAR" )
fi
# and execute
spark2-submit "${spark2_opts[@]}"
Bash array definitions can contain arbitrary whitespace, including newlines, so format for readability.
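For the Octopus/GitLab side, one hedged sketch (ADDITIONAL_SPARK_CONFS is a made-up pipeline variable name for illustration) is to pass any number of extra settings in a single space-separated variable and split it into the array:
# ADDITIONAL_SPARK_CONFS is a hypothetical pipeline variable, e.g.
# ADDITIONAL_SPARK_CONFS="spark.executor.memory=4g spark.executor.cores=2"
if [[ -n "$ADDITIONAL_SPARK_CONFS" ]]; then
    read -r -a extra_confs <<< "$ADDITIONAL_SPARK_CONFS"
    for conf in "${extra_confs[@]}"; do
        spark2_opts+=( --conf "$conf" )   # one --conf pair per setting
    done
fi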

Shell: How to stop a command line if it times out?

I want to stop the command if it runs for more than 1 minute and then continue with the next command.
for((part=0;part<=100;part++));
do
spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--conf spark.pyspark.python=myenv/bin/python3 \
python_demo.py $part
done
The spark-submit command submits my code to YARN. After a successful submission it keeps running until python_demo.py finishes. But I want to move on to the next submission as soon as one part has been submitted successfully.
Currently the shell runs like this:
spark-submit -> submit successfully (about 1 minute) -> run python_demo.py (it runs for a long time) -> spark-submit next part
Expected:
spark-submit -> if it runs for more than 1 minute (which means a successful submission) -> spark-submit next part
With the timeout command you can do something like this:
for((part=1;part<=100;part++))
do
timeout 60 2>/dev/null spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--conf spark.pyspark.python=myenv/bin/python3 \
python_demo.py $part
if [[ $? != 0 ]]
then
break
else
continue
fi
done
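One caveat worth hedging: when timeout kills the command it exits with status 124, which the snippet above treats as a failure and breaks out of the loop. If the intent is "a kill after 60 seconds means the submission went through, so move on", a variant could check for 124 explicitly (this is an adaptation, not the original answer):
for ((part=1; part<=100; part++))
do
    timeout 60 spark-submit \
        --verbose \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.pyspark.python=myenv/bin/python3 \
        python_demo.py $part 2>/dev/null
    status=$?
    if [[ $status -eq 124 ]]; then
        continue    # timed out after 60s: assume the job was submitted, go to the next part
    elif [[ $status -ne 0 ]]; then
        break       # spark-submit itself failed before the timeout
    fi
done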

spark-submit: command not found

A very simple question:
I am trying to use a bash script to submit Spark jobs, but it keeps complaining that it cannot find the spark-submit command.
However, when I copy the command out and run it directly in my terminal, it runs fine.
My shell is fish, and here is what I have in my fish config, ~/.config/fish/config.fish:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
Here's my bash script:
#!/usr/bin/env bash
SUBMIT_COMMAND="HADOOP_USER_NAME=hdfs spark-submit \
--master $MASTER \
--deploy-mode client \
--driver-memory $DRIVER_MEMORY \
--executor-memory $EXECUTOR_MEMORY \
--num-executors $NUM_EXECUTORS \
--executor-cores $EXECUTOR_CORES \
--conf spark.shuffle.compress=true \
--conf spark.network.timeout=2000s \
$DEBUG_PARAM \
--class com.fisher.coder.OfflineIndexer \
--verbose \
$JAR_PATH \
--local $LOCAL \
$SOLR_HOME \
--solrconfig 'resource:solrhome/' \
$ZK_QUORUM_PARAM \
--source $SOURCE \
--limit $LIMIT \
--sample $SAMPLE \
--dest $DEST \
--copysolrconfig \
--shards $SHARDS \
$S3_ZK_ZNODE_PARENT \
$S3_HBASE_ROOTDIR \
"
eval "$SUBMIT_COMMAND"
What I've tried:
The command runs perfectly fine in my Mac OS X fish shell when I copy it out literally and run it directly.
However, what I want to achieve is to run ./submit.sh -local, which executes the script above.
Any clues please?
You seem to be confused about what a fish alias is. When you run this:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
You are actually doing this:
function spark-submit
/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit $argv
end
That is, you are defining a fish function. Your bash script has no knowledge of that function. You need to either put that path in your $PATH variable or put a similar alias command in your bash script.
Make sure the Spark bin directory is added to your PATH:
export PATH=$PATH:/Users/{your_own_path_where_spark_installed}/bin
On macOS, open one of ~/.bash_profile, ~/.zprofile, or ~/.zshrc and add the command above to the file.
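Since the failure here is inside the bash script rather than the interactive shell, the export could equally go at the top of submit.sh itself; a minimal sketch, reusing the path from the question:
#!/usr/bin/env bash
# Make spark-submit resolvable inside this script, independent of any fish alias
export PATH="$PATH:/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin"

spark-submit --version    # quick sanity check that the binary is found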

Hadoop global variable with streaming

I understand that I can pass a global value to my mappers via the Job and the Configuration.
But how can I do that with Hadoop Streaming (Python in my case)?
What is the right way?
Based on the docs, you can specify a command-line option (-cmdenv name=value) to set environment variables on each distributed machine, which you can then use in your mappers/reducers:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input input.txt \
-output output.txt \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py \
-cmdenv MY_PARAM=thing_I_need
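The variable set with -cmdenv arrives as an ordinary environment variable in each task, so a Python mapper would read it with os.environ["MY_PARAM"]. As an illustration (a shell mapper used as a hypothetical stand-in for mapper.py, to keep all new snippets here in bash):
#!/usr/bin/env bash
# mapper.sh -- hypothetical stand-in for mapper.py; reads the -cmdenv variable
# MY_PARAM from the environment and tags every input line with it.
while IFS= read -r line; do
    printf '%s\t%s\n' "$MY_PARAM" "$line"
done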

Setting the number of Reducers for an Amazon EMR application

I am trying to run the wordcount example under Amazon EMR.
-1- First, I create a cluster with the following command:
./elastic-mapreduce --create --name "MyTest" --alive
This creates a cluster with a single instance and returns a jobID, let's say j-12NWUOKABCDEF
-2- Second, I start a Job using the following command:
./elastic-mapreduce --jobflow j-12NWUOKABCDEF --jar s3n://mybucket/jar-files/wordcount.jar --main-class abc.WordCount
--arg s3n://mybucket/input-data/
--arg s3n://mybucket/output-data/
--arg -Dmapred.reduce.tasks=3
My WordCount class belongs to the package abc.
This executes without any problem, but I am getting only one reducer, which means the parameter "mapred.reduce.tasks=3" is being ignored.
Is there any way to specify the number of reducers I want my application to use?
The "-D" and the "mapred.reduce.tasks=3" should be separate arguments.
Try launching the EMR cluster with the mapper and reducer counts set via the --bootstrap-action option:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args "-m,mapred.map.tasks=6,-m,mapred.reduce.tasks=3"
You can use the streaming Jar's built-in option of -numReduceTasks. For example with the Ruby EMR CLI tool:
elastic-mapreduce --create --enable-debugging \
--ami-version "3.3.1" \
--log-uri s3n://someBucket/logs \
--name "someJob" \
--num-instances 6 \
--master-instance-type "m3.xlarge" --slave-instance-type "c3.8xlarge" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
--stream \
--arg "-files" \
--arg "s3://someBucket/some_job.py,s3://someBucket/some_file.txt" \
--mapper "python27 some_job.py some_file.txt" \
--reducer cat \
--args "-numReduceTasks,8" \
--input s3://someBucket/myInput \
--output s3://someBucket/myOutput \
--step-name "main processing"
