Running a script on all nodes of Hadoop in Amazon EMR - hadoop

How do you run a script on all nodes (master and slaves) of an Amazon EMR cluster? The script-runner.jar runs only on the NameNode.

You have the bootstrap option:
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can create custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR.
from the documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
It's as simple as placing a script to do the copying into S3, and then if you're starting EMR from the command line, add a parameter like this:
--bootstrap-action 's3://my-bucket/bootstrap.sh'
Or, if you're doing it through the web interface, just enter the location of the file as a "Custom action" under "Bootstrap Actions".
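As a rough sketch of what such a bootstrap script might contain: the version below assumes the node's role can be read from /mnt/var/lib/info/instance.json (the metadata file EMR writes on each node) and that the aws CLI is available on the node. The bucket and file names are placeholders, not taken from the answer above.

```shell
#!/usr/bin/env bash
# Sketch of a bootstrap script; bucket and file names are placeholders.

# Print "master" or "worker" based on the isMaster flag in the instance
# metadata file. The path can be overridden for testing.
node_role() {
  local info_file="${1:-/mnt/var/lib/info/instance.json}"
  if grep -Eq '"isMaster"[[:space:]]*:[[:space:]]*true' "$info_file"; then
    echo master
  else
    echo worker
  fi
}

# Copy a file from S3 onto the node (hypothetical bucket and key).
copy_from_s3() {
  aws s3 cp "s3://my-bucket/some-file" "$HOME/some-file"
}

# Only act when the EMR metadata file exists, i.e. on a cluster node.
if [[ -f /mnt/var/lib/info/instance.json ]]; then
  echo "bootstrapping a $(node_role) node"
  copy_from_s3
fi
```

Since bootstrap actions run on every node, the role check is only needed when the master and the workers should be treated differently.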

Related

Run custom shell script on all slave nodes in EMR

The AWS step documentation says steps only execute on the master. Does that mean that even if I am logged in to one of the slave nodes and execute the add-steps command there, the step would still be added to the master node only? How can I then execute a custom shell script on all the slave nodes? Bootstrapping is not an option, since the shell script requires emrfs-site.xml to already exist, which does not happen until the EMR cluster is completely up and running.
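You can use the "Custom JAR" step type to run "script-runner.jar", which executes a bash script as a step. Steps run on the master node, so a script that has to touch every node must reach the other nodes itself (for example over ssh):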
aws emr create-cluster --name ... --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
More info here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
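If the script needs to act on the worker nodes explicitly, one option is to fan out over ssh from the node where the step runs. The following is only a sketch under assumptions: it takes the worker hostnames from the Node-Id column of `yarn node -list`, and it assumes the user running it can ssh to the workers without a password prompt (for example with the cluster key pair available on the master). The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run a command on every worker node from the master.
# Assumes passwordless ssh from the master to the workers.

# Read `yarn node -list` output on stdin and print one hostname per
# RUNNING NodeManager (the Node-Id column has the form host:port).
running_hosts() {
  awk '$2 == "RUNNING" { split($1, parts, ":"); print parts[1] }'
}

# Run the given command on each worker in turn; return non-zero if the
# command failed on any host.
run_on_workers() {
  local host status=0
  while read -r host; do
    # -n keeps ssh from swallowing the host list on stdin.
    ssh -n -o StrictHostKeyChecking=no "$host" "$@" || status=1
  done < <(yarn node -list 2>/dev/null | running_hosts)
  return "$status"
}

# Example (hypothetical script location):
#   run_on_workers "aws s3 cp s3://mybucket/script-path/my_script.sh /tmp/ && bash /tmp/my_script.sh"
```

Running the hosts one after another keeps the output readable; for a large cluster a parallel tool such as pdsh may be a better fit.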

Modifying yarn config on EMR

I need to make a change to the YARN configuration on an EMR cluster.
Do I need to make the change only to the yarn-site.xml file on the Hadoop master? If so, how can I propagate the change to the datanodes? Do I just need to restart YARN as detailed here? I am using EMR 5.8.0.
https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/
You will need to identify which YARN daemon enforces that parameter and, if needed, restart that daemon accordingly.
Ex:
EMR Master has YARN ResourceManager
EMR Core has YARN Nodemanager
If you need to change a parameter that belongs to the YARN ResourceManager (like yarn.resourcemanager.*), then you might need to edit yarn-site only on the master and restart only the ResourceManager daemon.
If you want to change a parameter like yarn.nodemanager.*, then you will need to change yarn-site on all core nodes and might need to restart the NodeManager daemon on all core nodes.
As for changing this setting on all core nodes at once, there are many tools that can do it (such as Ansible, PDSH, or AWS SSM). EMR does not have an API for changing configurations on the fly. If you are provisioning a cluster with the desired configuration, use the EMR Configurations API: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
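For the provisioning case, a minimal sketch of a configurations file with a yarn-site classification might look like the following. The property and its value are only an example chosen for illustration, and the function name and file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: write an EMR configurations file that sets a yarn-site property
# at cluster creation. The property and value are examples only.
write_yarn_config() {
  local out_file="$1"
  cat > "$out_file" <<'EOF'
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
EOF
}

# Usage when provisioning (other create-cluster options omitted):
#   write_yarn_config configurations.json
#   aws emr create-cluster ... --configurations file://configurations.json
```

Settings supplied this way are applied by EMR on the relevant nodes at launch, so no manual edit of yarn-site.xml or daemon restart should be needed for a new cluster.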

How do I submit more than one job to Hadoop in a step using the Elastic MapReduce API?

Amazon EMR Documentation to add steps to cluster says that a single Elastic MapReduce step can submit several jobs to Hadoop. However, Amazon EMR Documentation for Step configuration suggests that a single step can accommodate just one execution of hadoop-streaming.jar (that is, HadoopJarStep is a HadoopJarStepConfig rather than an array of HadoopJarStepConfigs).
What is the proper syntax for submitting several jobs to Hadoop in a step?
As the Amazon EMR documentation says, you can create a cluster that runs some script my_script.sh on the master instance in a step:
aws emr create-cluster --name "Test cluster" --ami-version 3.11 --use-default-roles \
  --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
  --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
my_script.sh should look something like this:
#!/usr/bin/env bash
hadoop jar my_first_step.jar [mainClass] args... &
hadoop jar my_second_step.jar [mainClass] args... &
.
.
.
wait
This way, multiple jobs are submitted to Hadoop in the same step, but unfortunately the EMR interface won't be able to track them individually. To track them, use the Hadoop web interfaces, or simply ssh to the master instance and explore with mapred job.
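One caveat with the script above: a bare wait returns zero regardless of how the background jobs ended, so the script's exit status says nothing about job failures. A possible variation, sketched here with a hypothetical helper named run_parallel, waits on each job separately and returns non-zero if any of them failed.

```shell
#!/usr/bin/env bash
# Sketch: run several commands in parallel and fail if any of them fails,
# so the script's exit status reflects the jobs it launched.
run_parallel() {
  local pids=() pid cmd status=0
  for cmd in "$@"; do
    bash -c "$cmd" &      # start each command in the background
    pids+=("$!")          # remember its process ID
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || status=1   # wait returns that job's exit status
  done
  return "$status"
}

# Example with the jobs from the script above (jar names and arguments
# are the original placeholders):
#   run_parallel "hadoop jar my_first_step.jar ..." "hadoop jar my_second_step.jar ..."
```

Ending my_script.sh with a call like this lets a failed job surface as a failed script rather than passing silently.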

Apache Giraph on EMR

Has anyone tried Apache Giraph on EMR?
It seems to me the only requirements to run on EMR are to add proper bootstrap scripts to the Job Flow configuration. Then I should just need to use a standard Custom JAR launch step to launch the Giraph Runner with appropriate arguments for my Giraph program.
Any documentation or tutorial would be much appreciated, or just share your experience with Giraph on EMR.
Yes, I run Giraph jobs on EMR regularly, but I don't use "Job Flows". I log in to the master node manually and use it as a normal Hadoop cluster (I just submit the job with the hadoop jar command).
You are right, you need to add bootstrap scripts to run ZooKeeper and to add the ZooKeeper details to the core-site config. Here is how I did it:
Bootstrap actions -
Configure Hadoop s3://elasticmapreduce/bootstrap-actions/configure-hadoop --site-key-value, io.file.buffer.size=65536, --core-key-value, giraph.zkList=localhost:2181, --mapred-key-value, mapreduce.job.counters.limit=1200
Run if s3://elasticmapreduce/bootstrap-actions/run-if instance.isMaster=true, s3://hpc-chikitsa/zookeeper_install.sh
The contents of zookeeper_install.sh are :
#!/bin/bash
wget --no-check-certificate http://apache.mesi.com.ar/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
tar zxvf zookeeper-3.4.5.tar.gz
cd zookeeper-3.4.5
mv conf/zoo_sample.cfg conf/zoo.cfg
sudo bin/zkServer.sh start
Then copy your Giraph jar file to the master node (using scp), ssh to the master node, and submit the job using the hadoop jar command.
Hope that helps.
Here is a relevant mail-thread on giraph-user mailing list :
https://www.mail-archive.com/user%40giraph.apache.org/msg01240.html

hadoop cluster clarification

I am new to Hadoop and I am trying to run a Hadoop jar on Amazon EC2. I started my Amazon EC2 instance through the console, uploaded my files to the DFS, and was then able to run the job jar successfully and generate output on the instance.
But I am still confused about one part. I am not sure whether the job ran on a single machine in Amazon EC2 or on a cluster. How do I find the number of worker nodes involved in my jar run?
In some reference links I see that we have to use the launch-cluster command, for example "bin/hadoop-ec2 launch-cluster test-cluster 2". What is the difference between starting the instance from the console and using a command like launch-cluster?
