Configure EMR Cluster for Fair Scheduling - hadoop

I am trying to spin up an EMR cluster with fair scheduling so that I can run multiple steps in parallel. I see that this is possible via Data Pipeline (https://aws.amazon.com/about-aws/whats-new/2015/06/run-parallel-hadoop-jobs-on-your-amazon-emr-cluster-using-aws-data-pipeline/), but I already have cluster creation and management automated via an Airflow job calling the AWS CLI, so it would be great to just update my configuration.
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge
I think it may be achievable using the --configurations flag (https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html), but I am not sure of the correct classification and property names.

Yes, you are correct. You can use EMR configurations to achieve this. Create a JSON file like the one below:
yarn-config.json:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
    }
  }
]
as per the Hadoop Fair Scheduler docs.
Then modify your AWS CLI command as:
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge \
--configurations file://yarn-config.json
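Once the cluster is up, one quick way to confirm the setting took effect (a rough check, assuming SSH access to the master node; /etc/hadoop/conf is the usual location for the rendered configuration on EMR):
# show the scheduler class the cluster actually picked up
grep -A1 'yarn.resourcemanager.scheduler.class' /etc/hadoop/conf/yarn-site.xml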

Related

AWS list instances that are pending and running with one CLI call

I have an application which needs to know how many EC2 instances I have in both the pending and running states. I could run the following two commands and sum the results, but that means the AWS CLI makes two requests, which is both slow and probably bad practice.
aws ec2 describe-instances \
--filters Name=instance-state-name,Values="running" \
--query 'Reservations[*].Instances[*].[InstanceId]' \
--output text \
| wc -l
aws ec2 describe-instances \
--filters Name=instance-state-name,Values="pending" \
--query 'Reservations[*].Instances[*].[InstanceId]' \
--output text \
| wc -l
Is there a way of combining these two queries into a single one, or another way of getting the total pending + running instances using a single query?
Cheers!
You can pass several filter values at once using the shorthand syntax with commas:
Values=running,pending
For example:
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running,pending" \
--query 'Reservations[*].Instances[*].[InstanceId]' \
--output text \
| wc -l
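If you prefer to avoid the wc -l pipe, JMESPath can also count the matches for you (a variant sketch using the same filter; length() is a built-in JMESPath function):
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running,pending" \
--query 'length(Reservations[].Instances[])' \
--output text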

How to add Infrastructure Agent to Heroku applications

I am attempting to have the New Relic Infrastructure Agent monitor my Heroku applications.
The documentation says to run the following:
docker run \
-d \
--name newrelic-infra \
--network=host \
--cap-add=SYS_PTRACE \
--privileged \
--pid=host \
-v "/:/host:ro" \
-v "/var/run/docker.sock:/var/run/docker.sock" \
-e NRIA_LICENSE_KEY=[Key] \
newrelic/infrastructure:latest
But where do I actually run this, or put it, so that it monitors my Heroku apps?

Azure CLI Bash - attaching an app insights resource to api management

I can create an api management instance
az apim create \
--name "$apimName" \
--resource-group "$resourceGroupName" \
--publisher-name 'Publisher' \
--publisher-email 'Myemail@email.com.au' \
--tags "${resourceTags[@]}"
and I can create an app insights instance
az monitor app-insights component create \
--app "${apimName}-appins" \
--location "australiaeast" \
--resource-group "$resourceGroupName" \
--tags "${resourceTags[@]}"
How do I attach the app insights instance to the api management instance?
Try the command below, replacing the values with yours.
az resource create \
--resource-type 'Microsoft.ApiManagement/service/loggers' \
-g '<apim-group-name>' \
-n '<apim-name>/loggers/<appinsight-name>' \
--properties '{
  "loggerType": "applicationInsights",
  "description": null,
  "credentials": {
    "instrumentationKey": "<Instrumentation-Key-of-your-appinsight>"
  },
  "isBuffered": true,
  "resourceId": "/subscriptions/<subscription-id>/resourceGroups/<appinsight-group-name>/providers/microsoft.insights/components/<appinsight-name>"
}'
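Creating the logger registers Application Insights with API Management; to actually route gateway telemetry to it, the service also needs a diagnostic setting that references the logger. A rough sketch in the same placeholder style (the property names follow the ARM schema and are assumptions here; verify them against your API version):
az resource create \
--resource-type 'Microsoft.ApiManagement/service/diagnostics' \
-g '<apim-group-name>' \
-n '<apim-name>/diagnostics/applicationinsights' \
--properties '{
  "loggerId": "/subscriptions/<subscription-id>/resourceGroups/<apim-group-name>/providers/Microsoft.ApiManagement/service/<apim-name>/loggers/<appinsight-name>",
  "alwaysLog": "allErrors",
  "sampling": { "samplingType": "fixed", "percentage": 100 }
}'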

How does "knife ec2 server create"'s expansion of --run-list work?

I'm unable to bootstrap my server because "knife ec2 server create" keeps expanding my runlist to "roles".
knife ec2 server create \
-V \
--run-list 'role[pgs]' \
--environment $1 \
--image $AMI \
--region $REGION \
--flavor $PGS_INSTANCE_TYPE \
--identity-file $SSH_KEY \
--security-group-ids $PGS_SECURITY_GROUP \
--subnet $PRIVATE_SUBNET \
--ssh-user ubuntu \
--server-connect-attribute private_ip_address \
--availability-zone $AZ \
--node-name pgs \
--tags VPC=$VPC
This consistently fails because 'role[pgs]' is expanded to 'roles'. Why is this? Is there some escaping or alternative method I can use?
I'm currently working around this by bootstrapping with an empty run-list and then overriding the runlist by running chef-client once the node is registered.
This is a feature of bash: [] is a wildcard (glob) pattern, so the shell can expand it before knife ever sees it. You can escape the brackets using "\".
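A quick way to see what is happening (a toy illustration; it assumes a file named roles exists in the working directory, which is exactly the case where role[pgs] matches something):
touch roles
echo role[pgs]      # the glob matches the file, so this prints: roles
echo 'role[pgs]'    # single quotes suppress expansion: role[pgs]
echo role\[pgs\]    # escaping the brackets also works: role[pgs]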

Setting the number of Reducers for an Amazon EMR application

I am trying to run the wordcount example under Amazon EMR.
-1- First, I create a cluster with the following command:
./elastic-mapreduce --create --name "MyTest" --alive
This creates a cluster with a single instance and returns a job ID, let's say j-12NWUOKABCDEF.
-2- Second, I start a Job using the following command:
./elastic-mapreduce --jobflow j-12NWUOKABCDEF --jar s3n://mybucket/jar-files/wordcount.jar --main-class abc.WordCount \
--arg s3n://mybucket/input-data/ \
--arg s3n://mybucket/output-data/ \
--arg -Dmapred.reduce.tasks=3
My WordCount class belongs to the package abc.
This executes without any problem, but I am getting only one reducer, which means that the parameter "mapred.reduce.tasks=3" is ignored.
Is there any way to specify the number of reducers that I want my application to use?
Thank you,
Neeraj.
The "-D" and the "mapred.reduce.tasks=3" should be separate arguments.
Try launching the EMR cluster with the mapper and reducer defaults set via the --bootstrap-action option:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args "-m,mapred.map.tasks=6,-m,mapred.reduce.tasks=3"
You can use the streaming Jar's built-in option of -numReduceTasks. For example with the Ruby EMR CLI tool:
elastic-mapreduce --create --enable-debugging \
--ami-version "3.3.1" \
--log-uri s3n://someBucket/logs \
--name "someJob" \
--num-instances 6 \
--master-instance-type "m3.xlarge" --slave-instance-type "c3.8xlarge" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
--stream \
--arg "-files" \
--arg "s3://someBucket/some_job.py,s3://someBucket/some_file.txt" \
--mapper "python27 some_job.py some_file.txt" \
--reducer cat \
--args "-numReduceTasks,8" \
--input s3://someBucket/myInput \
--output s3://someBucket/myOutput \
--step-name "main processing"
