Shell: How to stop a command if it times out?

I want to stop the command if it runs for more than 1 minute and then continue with the next command.
for((part=0;part<=100;part++));
do
spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--conf spark.pyspark.python=myenv/bin/python3 \
python_demo.py $part
done
The spark-submit command submits my code to YARN. After a successful submission, it keeps running until python_demo.py finishes. What I want instead is to move on and submit the next part as soon as one part has been submitted successfully.
Right now the shell runs like this:
spark-submit -> submitted successfully (about 1 minute) -> run python_demo.py (it runs for a long time) -> spark-submit the next part
Expected:
spark-submit -> if it has been running for more than 1 minute (which means the submission succeeded) -> spark-submit the next part

With the timeout command you can do something like this:
for ((part=1; part<=100; part++))
do
    timeout 60 2>/dev/null spark-submit \
        --verbose \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.pyspark.python=myenv/bin/python3 \
        python_demo.py $part
    status=$?
    # timeout exits with 124 when it kills the command after 60 seconds, which
    # here means the submission succeeded and the job is still running, so we
    # move on to the next part; any other non-zero status is a real failure.
    if [[ $status -ne 0 && $status -ne 124 ]]
    then
        break
    fi
done
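As an alternative worth knowing (not part of the original answer, but using a standard Spark-on-YARN property): in cluster mode you can tell spark-submit not to wait for the application to finish at all, so it returns as soon as the submission is accepted and no timeout is needed:
for ((part=0; part<=100; part++))
do
    spark-submit \
        --verbose \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.yarn.submit.waitAppCompletion=false \
        --conf spark.pyspark.python=myenv/bin/python3 \
        python_demo.py $part
done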

Related

How to pass additional parameter in bash script

I want to design a pipeline for executing a program that can take multiple configurations as arguments. The developers are not interested in having a variable for each argument; they want the option to add multiple variables through the pipeline. We are using bash, our development runs on GitLab CI, and we use Octopus for deployment to the UAT environment.
example:
spark2-submit \
--master $MASTER \
--name $NAME \
--queue $QUEUE \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=12
As you can see in the example above, I want the flexibility to add more "--conf" parameters.
Should I have a dummy parameter and add it to the end of this command?
For example:
spark2-submit \
--master $MASTER \
--name $NAME \
--queue $QUEUE \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=12 \
$additional_param
I am using GitLab for my code repo and Octopus for my CI/CD, with bash for deployment. I am looking for a flexible option that lets me use the full variable features of Octopus and GitLab. What is your recommendation? Do you have a better suggestion?
This is what Charles is hinting at with "Lists of arguments should be stored in arrays":
spark2_opts=(
--master "$MASTER"
--name "$NAME"
--queue "$QUEUE"
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.executorIdleTimeout=12
)
# add additional options, through some mechanism such as the environment:
if [[ -n "$SOME_ENV_VAR" ]]; then
spark2_opts+=( --conf "$SOME_ENV_VAR" )
fi
# and execute
spark2-submit "${spark2_opts[#]}"
Bash array definitions can contain arbitrary whitespace, including newlines, so format for readability.
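For example (a sketch, not from the original answer), if the extra --conf values are handed to the deployment script as positional arguments, they can be appended to the same array before the final call:
# hypothetical wrapper: every argument passed to the script becomes an extra --conf
# usage: ./deploy.sh spark.executor.memory=4g spark.sql.shuffle.partitions=200
for extra_conf in "$@"; do
    spark2_opts+=( --conf "$extra_conf" )
done
spark2-submit "${spark2_opts[@]}"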

spark-submit not waiting for state FINISHED before exiting program

I am submitting a Spark job to our YARN service via spark-submit. From my understanding, spark-submit should keep running until the application reaches the FINISHED state before moving on. However, once submitted through Bamboo, spark-submit exits and goes straight to the wait, after which the SQL query runs. But the SQL query shouldn't run until the Spark job is 100% finished. I'm not sure why my spark-submit is not waiting. Any help is appreciated, thanks.
nohup spark-submit --name "${APP_NAME}" \
--class "${SPARK_CLASS_NAME}" \
--files jaas.conf,kafka.properties,distributed.properties,${KEYTAB},pools.xml \
--principal ${PRINCIPAL} \
--keytab ${KEYTAB_ALT} \
--conf "spark.driver.extraJavaOptions=${JVM_ARGS}" \
--conf "spark.executor.extraJavaOptions=${JVM_ARGS}" \
--conf spark.haplogic.env=${ENV} \
--conf spark.scheduler.allocation.file=${POOL_SCHEDULER_FILE} \
--conf spark.master=yarn \
--conf spark.submit.deployMode=cluster \
--conf spark.yarn.submit.waitAppCompletion=true \
--conf spark.driver.memory=$(getProperty "spark.driver.memory") \
--conf spark.executor.memory=$(getProperty "spark.executor.memory") \
--conf spark.executor.instances=$(getProperty "spark.executor.instances") \
--conf spark.executor.cores=$(getProperty "spark.executor.cores") \
--conf spark.yarn.maxAppAttempts=$(getProperty "spark.yarn.maxAppAttempts") \
--conf spark.dynamicAllocation.enabled=$(getProperty "spark.dynamicAllocation.enabled") \
--conf spark.yarn.queue=$(getProperty "spark.yarn.queue") \
--conf spark.memory.fraction=$(getProperty "spark.memory.fraction") \
--conf spark.memory.storageFraction=$(getProperty "spark.memory.storageFraction") \
--conf spark.eventLog.enabled=$(getProperty "spark.eventLog.enabled") \
--conf spark.serializer=org.apache.spark.serializer.JavaSerializer \
--conf spark.acls.enable=true \
--conf spark.admin.acls.groups=${USER_GROUPS} \
--conf spark.acls.enable.groups=${USER_GROUPS} \
--conf spark.ui.view.acls.groups=${USER_GROUPS} \
--conf spark.serializer=org.apache.spark.serializer.JavaSerializer \
--conf spark.yarn.appMasterEnv.SECRETS_LIB_MASTER_KEY=${SECRETS_LIB_MASTER_KEY} \
${JARFILE_NAME} >> ${LOG_FILE} 2>&1 &
sleep 90
The issue was a gap in bash knowledge: the trailing & runs spark-submit in the background, so Bamboo treats the step as complete once the sleep finishes.

spark-submit: command not found

A very simple question:
I'm trying to use a bash script to submit Spark jobs, but somehow it keeps complaining that it cannot find the spark-submit command.
When I copy the command out and run it directly in my terminal, it runs fine.
My shell is fish; here's what I have in my fish config, ~/.config/fish/config.fish:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
Here's my bash script:
#!/usr/bin/env bash
SUBMIT_COMMAND="HADOOP_USER_NAME=hdfs spark-submit \
--master $MASTER \
--deploy-mode client \
--driver-memory $DRIVER_MEMORY \
--executor-memory $EXECUTOR_MEMORY \
--num-executors $NUM_EXECUTORS \
--executor-cores $EXECUTOR_CORES \
--conf spark.shuffle.compress=true \
--conf spark.network.timeout=2000s \
$DEBUG_PARAM \
--class com.fisher.coder.OfflineIndexer \
--verbose \
$JAR_PATH \
--local $LOCAL \
$SOLR_HOME \
--solrconfig 'resource:solrhome/' \
$ZK_QUORUM_PARAM \
--source $SOURCE \
--limit $LIMIT \
--sample $SAMPLE \
--dest $DEST \
--copysolrconfig \
--shards $SHARDS \
$S3_ZK_ZNODE_PARENT \
$S3_HBASE_ROOTDIR \
"
eval "$SUBMIT_COMMAND"
What I've tried:
I can run this command perfectly fine in my Mac OS X fish shell when I copy it out literally and run it directly.
However, what I want to achieve is to be able to run ./submit.sh -local, which executes the script above.
Any clues, please?
You seem to be confused about what a fish alias is. When you run this:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
You are actually doing this:
function spark-submit
/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit $argv
end
That is, you are defining a fish function. Your bash script has no knowledge of that function. You need to either put that path in your $PATH variable or put a similar alias command in your bash script.
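For instance (a sketch of both options, using the path from the question), at the top of submit.sh you could either extend PATH or call the binary through its full path:
#!/usr/bin/env bash
# Option 1: make spark-submit resolvable by extending PATH
export PATH="$PATH:/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin"
spark-submit --version

# Option 2: keep the full path in a variable and call it directly
SPARK_SUBMIT=/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit
"$SPARK_SUBMIT" --version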
Make sure the Spark bin directory is added to your PATH:
export PATH=$PATH:/Users/{your_own_path_where_spark_installed}/bin
On a Mac, open one of ~/.bash, ~/.zprofile, or ~/.zshrc and add the command above to the file.

Start server, run tests, stop server

I have a Makefile target that looks like the following
integration-test: git-hooks
java -Djava.library.path=$$(pwd)/test/integration/lib/DynamoDBLocal_lib \
-Djava.util.logging.config.file=/dev/null \
-Dorg.eclipse.jetty.LEVEL=WARN \
-Dlog4j.com.amazonaws.services.dynamodbv2.local.server.LocalDynamoDBServerHandler=OFF \
-jar $$(pwd)/test/integration/lib/DynamoDBLocal.jar \
-inMemory \
-port 8000 &
sleep 3
./node_modules/.bin/mocha --compilers coffee:coffee-script/register \
--reporter spec \
test/integration/main.coffee
ps -ef | grep [D]ynamoDBLocal_lib | awk '{print $$2}' | xargs kill
Here's what I'm doing:
the Java command starts a local instance of Amazon's DynamoDB.
I give it 3 seconds to start
I run my integration tests
I kill the database
What I would like is to kill the database regardless of whether the tests passed or not.
To do that, I suppose I need to capture the exit status of the test command and return it, whether the tests failed or succeeded.
What is happening is that if the tests pass, the database is correctly killed; if the tests fail, it is not.
I've read in the docs that you can prepend a - to a command to have make ignore a non-zero exit status; the problem is that if I do that, I don't know whether the tests failed, since $? will always be 0.
What's the usual practice in this scenario? I'm fine with splitting the target into multiple targets if that solves my issue.
Thank you.
You'll have to run the entire thing in a single shell, which means you'll need to use command separators (e.g., ;) and backslashes to connect the lines. Then you can store the result and exit with it:
integration-test: git-hooks
{ java -Djava.library.path=$$(pwd)/test/integration/lib/DynamoDBLocal_lib \
-Djava.util.logging.config.file=/dev/null \
-Dorg.eclipse.jetty.LEVEL=WARN \
-Dlog4j.com.amazonaws.services.dynamodbv2.local.server.LocalDynamoDBServerHandler=OFF \
-jar $$(pwd)/test/integration/lib/DynamoDBLocal.jar \
-inMemory \
-port 8000 & }; \
sleep 3; \
./node_modules/.bin/mocha --compilers coffee:coffee-script/register \
--reporter spec \
test/integration/main.coffee; \
r=$$?; \
ps -ef | grep [D]ynamoDBLocal_lib | awk '{print $$2}' | xargs kill; \
exit $$r
However, since everything is already running in a single shell, you can do even better by killing only the exact process you started instead of grepping ps:
integration-test: git-hooks
{ java -Djava.library.path=$$(pwd)/test/integration/lib/DynamoDBLocal_lib \
-Djava.util.logging.config.file=/dev/null \
-Dorg.eclipse.jetty.LEVEL=WARN \
-Dlog4j.com.amazonaws.services.dynamodbv2.local.server.LocalDynamoDBServerHandler=OFF \
-jar $$(pwd)/test/integration/lib/DynamoDBLocal.jar \
-inMemory \
-port 8000 & }; \
pid=$$!; \
sleep 3; \
./node_modules/.bin/mocha --compilers coffee:coffee-script/register \
--reporter spec \
test/integration/main.coffee; \
r=$$?; \
kill $$pid; \
exit $$r
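If this start/test/stop sequence outgrows the Makefile, the same pattern can live in a standalone script; here is a minimal sketch (not from the original answer, with the Java options trimmed) that uses a trap so the server is killed on every exit path:
#!/usr/bin/env bash
# start the local DynamoDB server in the background and remember its PID
java -jar test/integration/lib/DynamoDBLocal.jar -inMemory -port 8000 &
server_pid=$!
# kill the server on any exit path (test failure, Ctrl-C, script error)
trap 'kill "$server_pid" 2>/dev/null' EXIT
sleep 3
./node_modules/.bin/mocha --compilers coffee:coffee-script/register \
    --reporter spec \
    test/integration/main.coffee
# the script exits with mocha's status; the EXIT trap runs just before it does
exit $?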

Setting the number of Reducers for an Amazon EMR application

I am trying to run the wordcount example under Amazon EMR.
-1- First, I create a cluster with the following command:
./elastic-mapreduce --create --name "MyTest" --alive
This creates a cluster with a single instance and returns a jobID, let's say j-12NWUOKABCDEF
-2- Second, I start a Job using the following command:
./elastic-mapreduce --jobflow j-12NWUOKABCDEF --jar s3n://mybucket/jar-files/wordcount.jar --main-class abc.WordCount
--arg s3n://mybucket/input-data/
--arg s3n://mybucket/output-data/
--arg -Dmapred.reduce.tasks=3
My WordCount class belongs to the package abc.
This executes without any problem, but I am getting only one reducer, which means that the parameter "mapred.reduce.tasks=3" is being ignored.
Is there any way to specify the number of reducers that I want my application to use?
Thank you,
Neeraj.
The "-D" and the "mapred.reduce.tasks=3" should be separate arguments.
Try launching the EMR cluster with the reducers and mappers set via the --bootstrap-action option:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args "-m,mapred.map.tasks=6,-m,mapred.reduce.tasks=3"
You can use the streaming Jar's built-in option of -numReduceTasks. For example with the Ruby EMR CLI tool:
elastic-mapreduce --create --enable-debugging \
--ami-version "3.3.1" \
--log-uri s3n://someBucket/logs \
--name "someJob" \
--num-instances 6 \
--master-instance-type "m3.xlarge" --slave-instance-type "c3.8xlarge" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
--stream \
--arg "-files" \
--arg "s3://someBucket/some_job.py,s3://someBucket/some_file.txt" \
--mapper "python27 some_job.py some_file.txt" \
--reducer cat \
--args "-numReduceTasks,8" \
--input s3://someBucket/myInput \
--output s3://someBucket/myOutput \
--step-name "main processing"
