Special character in dataproc yarn properties - bash

I found this example of command to create dataproc cluster and setting some yarn properties.
gcloud dataproc clusters create cluster_name \
--bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.minimum-allocation-vcores=4--capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
I notice a particular string ^--^ before the property key-value: yarn:yarn.scheduler.minimum-allocation-vcores=4.
What does ^--^ mean? It is a sort of escape for --?
Where is this documented?

This is gcloud syntax for list and dictionary type values escaping.
It means that characters specified between ^ are treated as values and key-values delimiter for list and dictionary flags.

Related

How to pass additional parameter in bash script

I want to design a pipeline for executing a program that can have multiple configurations by argument. Developer is not interested to have each argument as a variable and they want to have the option to be able to add multiple variables by using pipeline. we are using bash and our development using gitlab-ci and we are using octopus for uat env deployment.
example:
spark2-submit \
--master $MASTER \
--name $NAME \
--queue $QUEUE \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=12
As you can see in the above example, I want to have flexibility in adding more "--conf" parameters.
should I have a dummy parameter and then add it to the end of this command?
for example:
spark2-submit \
--master $MASTER \
--name $NAME \
--queue $QUEUE \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=12 \
$additional_param
I am using Gitlab for my code repo and Octopus for my CICD. I am using bash for deployment. I am looking for a flexible option that I can use the full feature of the Octopus variable option and gitlab. what is your recommendation? do you have a better suggestion?
This is what Charles is hinting at with "Lists of arguments should be stored in arrays":
spark2_opts=(
--master "$MASTER"
--name "$NAME"
--queue "$QUEUE"
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.executorIdleTimeout=12
)
# add additional options, through some mechanism such as the environment:
if [[ -n "$SOME_ENV_VAR" ]]; then
spark2_opts+=( --conf "$SOME_ENV_VAR" )
fi
# and execute
spark2-submit "${spark2_opts[#]}"
bash array definitions can contain arbitrary whitespace, including newlines,
so format for readability

AWS list instances that are pending and running with one CLI call

I have an application which needs to know how many EC2 instances I have in both the pending and running states. I could run the two following commands and sum them to get the result, but that means my awscli will be making two requests, which is both slow and probably bad practice.
aws ec2 describe-instances \
--filters Name=instance-state-name,Values="running" \
--query 'Reservations[*].Instances[*].[InstanceId]' \
--output text \
| wc -l
aws ec2 describe-instances \
--filters Name=instance-state-name,Values="pending" \
--query 'Reservations[*].Instances[*].[InstanceId]' \
--output text \
| wc -l
Is there a way of combining these two queries into a single one, or another way of getting the total pending + running instances using a single query?
Cheers!
Shorthand syntax with commas:
Values=running,pending
You can add several filter values as follows:
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running,pending" \
--query 'Reservations[*].Instances[*].[InstanceId]' \
--output text \
| wc -l

spark-submit: command not found

A very simple question:
I try to use a bash script to submit spark jobs. But somehow it keeps complaining that it cannot find spark-submit command.
But when I just copy out the command and run directly in my terminal, it runs fine.
My shell is fish shell, here's what I have in my fish shell config: ~/.config/fish/config.fish:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
Here's my bash script:
#!/usr/bin/env bash
SUBMIT_COMMAND="HADOOP_USER_NAME=hdfs spark-submit \
--master $MASTER \
--deploy-mode client \
--driver-memory $DRIVER_MEMORY \
--executor-memory $EXECUTOR_MEMORY \
--num-executors $NUM_EXECUTORS \
--executor-cores $EXECUTOR_CORES \
--conf spark.shuffle.compress=true \
--conf spark.network.timeout=2000s \
$DEBUG_PARAM \
--class com.fisher.coder.OfflineIndexer \
--verbose \
$JAR_PATH \
--local $LOCAL \
$SOLR_HOME \
--solrconfig 'resource:solrhome/' \
$ZK_QUORUM_PARAM \
--source $SOURCE \
--limit $LIMIT \
--sample $SAMPLE \
--dest $DEST \
--copysolrconfig \
--shards $SHARDS \
$S3_ZK_ZNODE_PARENT \
$S3_HBASE_ROOTDIR \
"
eval "$SUBMIT_COMMAND"
What I've tried:
I could run this command perfectly fine on my Mac OS X fish shell when I copy this command literally out and directly run.
However, what I wanted to achieve is to be able to run ./submit.sh -local which executes the above shell.
Any clues please?
You seem to be confused about what a fish alias is. When you run this:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
You are actually doing this:
function spark-submit
/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit $argv
end
That is, you are defining a fish function. Your bash script has no knowledge of that function. You need to either put that path in your $PATH variable or put a similar alias command in your bash script.
Make sure this command is added to path:
export PATH=$PATH:/Users/{your_own_path_where_spark_installed}/bin
For mac, open either one of these files ~/.bash, ~/.zprofile, ~/.zshrc and add the command below in the file.

Do the attributes of sqoop command follow some syntactical order?

For example
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \ (or --num-mappers 10)
--where “city =’abcd’” \
--target-dir /whereque
is same as?
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--where “city =’abcd’” \
--target-dir /whereque
--m 1 \ (or --num-mappers 10)
I tried above two options and it worked. But my question is can we jumble up the attributes for all the cases?
Sqoop command usually follows this syntax:
sqoop command [GENERIC-ARGS] [TOOL-ARGS]
You cannot change the order of the usage. However you can change order of the tool-args.
For more have a look at documentation.
Actually, you don't have any generic argument in your code. Generic arguments are related to "configuration" settings. There are listed below:
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
Sqoop command is shown below:
sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Please see some points below on the order in which Command should be executed.
1.Generic arguments must always be placed after tool name
2.All generic arguments must always be placed before tool arguments
3.Generic arguments are always preceded by single dash(-) character.
4.Tool arguments are always preceded with 2 dashes(--), exception to which is single character arguments

Setting the number of Reducers for an Amazon EMR application

I am trying to run the wordcount example under Amazon EMR.
-1- First, I create a cluster with the following command:
./elastic-mapreduce --create --name "MyTest" --alive
This creates a cluster with a single instance and returns a jobID, lets say j-12NWUOKABCDEF
-2- Second, I start a Job using the following command:
./elastic-mapreduce --jobflow j-12NWUOKABCDEF --jar s3n://mybucket/jar-files/wordcount.jar --main-class abc.WordCount
--arg s3n://mybucket/input-data/
--arg s3n://mybucket/output-data/
--arg -Dmapred.reduce.tasks=3
My WordCount class belongs to the package abc.
This executes without any problem, but I am getting only one reducer.
Which means that the parameter "mapred.reduce.tasks=3" is ignored.
Is there any way to specify the number of reducers that I want my application to use ?
Thank you,
Neeraj.
The "-D" and the "mapred.reduce.tasks=3" should be separate arguments.
Try to launch the EMR cluster by setting reducers and mapper with --bootstrap-action option as
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args "-m,mapred.map.tasks=6,-m,mapred.reduce.tasks=3"
You can use the streaming Jar's built-in option of -numReduceTasks. For example with the Ruby EMR CLI tool:
elastic-mapreduce --create --enable-debugging \
--ami-version "3.3.1" \
--log-uri s3n://someBucket/logs \
--name "someJob" \
--num-instances 6 \
--master-instance-type "m3.xlarge" --slave-instance-type "c3.8xlarge" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
--stream \
--arg "-files" \
--arg "s3://someBucket/some_job.py,s3://someBucket/some_file.txt" \
--mapper "python27 some_job.py some_file.txt" \
--reducer cat \
--args "-numReduceTasks,8" \
--input s3://someBucket/myInput \
--output s3://someBucket/myOutput \
--step-name "main processing"

Resources