Setting the number of Reducers for an Amazon EMR application - hadoop

I am trying to run the wordcount example under Amazon EMR.
-1- First, I create a cluster with the following command:
./elastic-mapreduce --create --name "MyTest" --alive
This creates a cluster with a single instance and returns a job ID, let's say j-12NWUOKABCDEF.
-2- Second, I start a Job using the following command:
./elastic-mapreduce --jobflow j-12NWUOKABCDEF --jar s3n://mybucket/jar-files/wordcount.jar --main-class abc.WordCount
--arg s3n://mybucket/input-data/
--arg s3n://mybucket/output-data/
--arg -Dmapred.reduce.tasks=3
My WordCount class belongs to the package abc.
This executes without any problem, but I get only one reducer, which means that the parameter "mapred.reduce.tasks=3" is ignored.
Is there any way to specify the number of reducers that I want my application to use?
Thank you,
Neeraj.

The "-D" and the "mapred.reduce.tasks=3" should be separate arguments.
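As a rough illustration of that split, here is a dry-run sketch that builds the step command as a bash array and prints it instead of running it. The bucket, jar, and job flow ID are the placeholders from the question. Placing the -D pair before the input and output paths is an assumption on my part: it only matters if WordCount parses generic options (for example through ToolRunner), which the question does not confirm.

```shell
#!/usr/bin/env bash
# Dry-run sketch: builds the step command as an array and prints it.
jobflow="j-12NWUOKABCDEF"   # placeholder job flow ID from the question

step_cmd=(
  ./elastic-mapreduce --jobflow "$jobflow"
  --jar s3n://mybucket/jar-files/wordcount.jar
  --main-class abc.WordCount
  --arg -D                        # the flag on its own
  --arg mapred.reduce.tasks=3     # the property as the next argument
  --arg s3n://mybucket/input-data/
  --arg s3n://mybucket/output-data/
)

# Print instead of executing; run "${step_cmd[@]}" to submit for real.
printf '%s ' "${step_cmd[@]}"
echo
```

Each --arg carries exactly one token, so "-D" and "mapred.reduce.tasks=3" reach the main class as two separate arguments.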

Try launching the EMR cluster with the number of mappers and reducers set through the --bootstrap-action option:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args "-m,mapred.map.tasks=6,-m,mapred.reduce.tasks=3"

You can use the streaming Jar's built-in option of -numReduceTasks. For example with the Ruby EMR CLI tool:
elastic-mapreduce --create --enable-debugging \
--ami-version "3.3.1" \
--log-uri s3n://someBucket/logs \
--name "someJob" \
--num-instances 6 \
--master-instance-type "m3.xlarge" --slave-instance-type "c3.8xlarge" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
--stream \
--arg "-files" \
--arg "s3://someBucket/some_job.py,s3://someBucket/some_file.txt" \
--mapper "python27 some_job.py some_file.txt" \
--reducer cat \
--args "-numReduceTasks,8" \
--input s3://someBucket/myInput \
--output s3://someBucket/myOutput \
--step-name "main processing"

Related

How to pass additional parameter in bash script

I want to design a pipeline for executing a program that can take multiple configurations as arguments. The developers do not want each argument to be its own variable; they want to be able to add multiple settings through the pipeline. We use bash, GitLab CI for development, and Octopus for deployment to the UAT environment.
Example:
spark2-submit \
--master $MASTER \
--name $NAME \
--queue $QUEUE \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=12
As you can see in the above example, I want to have flexibility in adding more "--conf" parameters.
Should I have a dummy parameter and then add it to the end of this command?
For example:
spark2-submit \
--master $MASTER \
--name $NAME \
--queue $QUEUE \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=12 \
$additional_param
I am using GitLab for my code repo and Octopus for CI/CD, with bash for deployment. I am looking for a flexible option that lets me take full advantage of Octopus variables and GitLab. What is your recommendation? Do you have a better suggestion?
This is what Charles is hinting at with "Lists of arguments should be stored in arrays":
spark2_opts=(
--master "$MASTER"
--name "$NAME"
--queue "$QUEUE"
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.executorIdleTimeout=12
)
# add additional options, through some mechanism such as the environment:
if [[ -n "$SOME_ENV_VAR" ]]; then
spark2_opts+=( --conf "$SOME_ENV_VAR" )
fi
# and execute
spark2-submit "${spark2_opts[@]}"
Bash array definitions can contain arbitrary whitespace, including newlines,
so format them for readability.
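To cover the "add several --conf values from the pipeline" case, the same array approach can be extended. This is a minimal sketch, assuming the CI tool injects one variable holding comma-separated key=value pairs. The name EXTRA_SPARK_CONF and the comma separator are both hypothetical choices, and the separator would have to change if any value can itself contain a comma.

```shell
#!/usr/bin/env bash
# Hypothetical variable; in a real pipeline GitLab CI or Octopus would inject it.
EXTRA_SPARK_CONF="spark.executor.memory=4g,spark.executor.cores=2"

spark2_opts=( --conf spark.shuffle.service.enabled=true )

# Split the variable on commas into an array without touching the global IFS.
IFS=',' read -r -a extra_confs <<< "$EXTRA_SPARK_CONF"

for kv in "${extra_confs[@]}"; do
  if [[ -n "$kv" ]]; then
    spark2_opts+=( --conf "$kv" )   # one --conf flag per key=value pair
  fi
done

# Show the resulting argument list, one token per line.
printf '%s\n' "${spark2_opts[@]}"
```

The array can then be passed as "${spark2_opts[@]}" exactly as in the answer above, so each value stays a single argument even if it contains spaces.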

Special character in dataproc yarn properties

I found this example command for creating a Dataproc cluster and setting some YARN properties.
gcloud dataproc clusters create cluster_name \
--bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.minimum-allocation-vcores=4--capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
I notice a particular string ^--^ before the property key-value: yarn:yarn.scheduler.minimum-allocation-vcores=4.
What does ^--^ mean? Is it a sort of escape for --?
Where is this documented?
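This is gcloud's escaping syntax for list- and dictionary-type flag values.
The characters between the two ^ marks become the delimiter that separates list items (or key-value pairs) for that flag, replacing the default comma. Here, ^--^ makes -- the delimiter, so the property values themselves may contain commas. It is documented under gcloud topic escaping.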
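To make the rule concrete, here is a small bash emulation of the splitting behavior. It is only an illustration of the idea, not gcloud's actual parser, and the function name split_gcloud_list is made up for this sketch.

```shell
#!/usr/bin/env bash
# Emulates (roughly) how a list-type flag value is split.
# split_gcloud_list is a hypothetical helper, not part of gcloud.
split_gcloud_list() {
  local value="$1" delim="," newline=$'\n' rest out
  # A value of the form ^DELIM^rest switches the delimiter to DELIM.
  if [[ "$value" == ^*^* ]]; then
    rest="${value#^}"        # drop the leading ^
    delim="${rest%%^*}"      # text up to the next ^ is the delimiter
    value="${rest#*^}"       # everything after the closing ^ is the payload
  fi
  out=${value//"$delim"/$newline}   # replace each delimiter with a newline
  printf '%s\n' "$out"
}

# The --properties value from the question splits into two properties.
split_gcloud_list '^--^yarn:yarn.scheduler.minimum-allocation-vcores=4--capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator'
```

Without the ^--^ prefix the same helper would split on commas, which is why a custom delimiter is useful when a property value needs to contain one.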

Configure EMR Cluster for Fair Scheduling

I am trying to spin up an EMR cluster with fair scheduling so that I can run multiple steps in parallel. I see that this is possible via Data Pipeline (https://aws.amazon.com/about-aws/whats-new/2015/06/run-parallel-hadoop-jobs-on-your-amazon-emr-cluster-using-aws-data-pipeline/), but I already have cluster management and creation automated via an Airflow job that calls the AWS CLI, so it would be great to just update my configuration.
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge)
I think it may be achieved using the --configurations flag (https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html), but I am not sure of the correct names to use.
Yes, you are correct. You can use EMR configurations to achieve your goal. Create a JSON file like the one below:
yarn-config.json:
[
{
"Classification": "yarn-site",
"Properties": {
"yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
}
}
]
as per Hadoop Fair Scheduler docs
Then modify your AWS CLI command as follows:
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge \
--configurations file://yarn-config.json
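If the cluster is created from an automated job, it may be convenient to generate the configuration file in the same script. The following is a sketch under that assumption: the temporary file location and the variable names are hypothetical, and the script only prints the flag rather than calling the AWS CLI.

```shell
#!/usr/bin/env bash
# Sketch: write the yarn-site classification to a file, then print the flag to pass.
# config_file is a hypothetical path; any writable location works.
config_file="$(mktemp "${TMPDIR:-/tmp}/yarn-config.XXXXXX")"

cat > "$config_file" <<'EOF'
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
    }
  }
]
EOF

# The AWS CLI reads local files through the file:// prefix.
configurations_arg="file://$config_file"
echo "--configurations $configurations_arg"
```

The printed flag can be appended to the create-cluster command shown above in place of file://yarn-config.json.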

spark-submit: command not found

A very simple question:
I am trying to use a bash script to submit Spark jobs, but it keeps complaining that it cannot find the spark-submit command.
When I copy the command out and run it directly in my terminal, it runs fine.
My shell is fish shell, here's what I have in my fish shell config: ~/.config/fish/config.fish:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
Here's my bash script:
#!/usr/bin/env bash
SUBMIT_COMMAND="HADOOP_USER_NAME=hdfs spark-submit \
--master $MASTER \
--deploy-mode client \
--driver-memory $DRIVER_MEMORY \
--executor-memory $EXECUTOR_MEMORY \
--num-executors $NUM_EXECUTORS \
--executor-cores $EXECUTOR_CORES \
--conf spark.shuffle.compress=true \
--conf spark.network.timeout=2000s \
$DEBUG_PARAM \
--class com.fisher.coder.OfflineIndexer \
--verbose \
$JAR_PATH \
--local $LOCAL \
$SOLR_HOME \
--solrconfig 'resource:solrhome/' \
$ZK_QUORUM_PARAM \
--source $SOURCE \
--limit $LIMIT \
--sample $SAMPLE \
--dest $DEST \
--copysolrconfig \
--shards $SHARDS \
$S3_ZK_ZNODE_PARENT \
$S3_HBASE_ROOTDIR \
"
eval "$SUBMIT_COMMAND"
What I've tried:
I can run this command perfectly fine in my fish shell on Mac OS X when I copy it out literally and run it directly.
However, what I want is to be able to run ./submit.sh -local, which executes the script above.
Any clues please?
You seem to be confused about what a fish alias is. When you run this:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
You are actually doing this:
function spark-submit
/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit $argv
end
That is, you are defining a fish function. Your bash script has no knowledge of that function. You need to either put that path in your $PATH variable or put a similar alias command in your bash script.
Make sure the Spark bin directory is on your PATH:
export PATH=$PATH:/Users/{your_own_path_where_spark_installed}/bin
On a Mac, open one of ~/.bash_profile, ~/.zprofile, or ~/.zshrc and add the command above to the file.
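Since the script runs under bash regardless of the login shell, one option is to set the PATH inside the script itself and check that the command resolves before building the long submit command. This is a minimal sketch: SPARK_HOME is an assumed variable, and its default is simply the path from the question's alias.

```shell
#!/usr/bin/env bash
# SPARK_HOME is an assumed variable pointing at the unpacked Spark directory.
SPARK_HOME="${SPARK_HOME:-/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7}"
export PATH="$PATH:$SPARK_HOME/bin"

# command -v resolves executables on PATH; fish functions are invisible here.
if command -v spark-submit >/dev/null 2>&1; then
  echo "spark-submit found at: $(command -v spark-submit)"
else
  echo "spark-submit not found on PATH; check SPARK_HOME=$SPARK_HOME" >&2
fi
```

With the directory on PATH, the rest of submit.sh can call spark-submit by name without depending on any fish configuration.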

Hadoop streaming tasks on EMR always fail with "PipeMapRed.waitOutputThreads(): subprocess failed with code 143"

My hadoop streaming map-reduce jobs on Amazon EMR keep failing with the
following error:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
From what I have read online it appears that this is related to a SIGTERM being
sent to the task (See this thread here). I have tried experimenting with
--jobconf "mapred.task.timeout=X" but still receive the same error for values
of X even up to an hour. I have also tried reporting
reporter:status:<message> at regular intervals to the STDERR as described in
the streaming docs. This also however does nothing to prevent this error
occurring. As far as I can see my process starts and begins working initially as
I get the expected output being produced in log files. Each task attempt
however always ends in this error.
This is the code I am using to launch my streaming job with make:
instances = 50
type = m1.small
bid = 0.010
maptasks = 20000
timeout = 3600000
hadoop: upload_scripts upload_data
emr -c ~/.ec2/credentials.json \
--create \
--name "Run $(maptasks) jobs with $(timeout) minute timeout and no reducer" \
--instance-group master \
--instance-type $(type) \
--instance-count 1 \
--instance-group core \
--instance-type $(type) \
--instance-count 1 \
--instance-group task \
--instance-type $(type) \
--instance-count $(instances) \
--bid-price $(bid) \
--bootstrap-action $(S3-srpt)$(bootstrap-database) \
--args "$(database)","$(http)/data","$(hadoop)" \
--bootstrap-action $(S3-srpt)$(bootstrap-phmmer) \
--args "$(hadoop)" \
--stream \
--jobconf "mapred.map.tasks=$(maptasks)" \
--jobconf "mapred.task.timeout=$(timeout)" \
--input $(S3-data)$(database) \
--output $(S3-otpt)$(shell date +%Y-%m-%d-%H-%M-%S) \
--mapper '$(S3-srpt)$(mapper-phmmer) $(hadoop)/$(database) $(hadoop)/phmmer' \
--reducer NONE
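The code 143 in the error is consistent with the SIGTERM reading mentioned above: a shell reports a process killed by a signal as 128 plus the signal number, and SIGTERM is signal 15. The sketch below only reproduces that exit status locally with a stand-in process; it does not show why Hadoop terminates the task, which the question leaves open.

```shell
#!/usr/bin/env bash
# Start a long-running child as a stand-in for the streaming subprocess.
sleep 30 &
child_pid=$!

# Terminate it the way a supervisor would.
kill -TERM "$child_pid"

# Collect the exit status without aborting if the shell runs with set -e.
status=0
wait "$child_pid" || status=$?

# 128 + 15 (SIGTERM) = 143
echo "exit status: $status"
```

Seeing 143 therefore suggests the mapper was terminated from outside rather than failing on its own, so the task logs around the time of termination are the place to look for the reason.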

Resources