Run custom shell script on all slave nodes in EMR - hadoop

The AWS Step documentation says steps only execute on the master. Does that mean that even if I am logged in to one of the slave nodes and run the add-steps command there, the step would still be added on the master node only? How can I then execute a custom shell script on all the slave nodes? Bootstrapping is not an option, since the shell script requires emrfs-site.xml to already exist, which does not happen until the EMR cluster is fully up and running.

You can use "Custom JAR" step type to run "script-runner.jar" that will run any bash script on every cluster node:
aws emr create-cluster --name ... --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
More info here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
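If your cluster is already running, the same step can be submitted with add-steps instead of create-cluster; a minimal sketch, where the cluster ID and bucket path are placeholders:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]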

Related

Unable to start Mesos slave on single node cluster

From what I know, I can set up the Mesos master, slave, ZooKeeper, and Marathon on a single node.
But once I execute the command to start mesos-master, it occupies my shell and I have no way to continue executing other commands elsewhere. If I stop it so I can start mesos-slave, the problem is that mesos-master is no longer running.
Don't execute the commands directly from your shell; you want to start all of those components (ZooKeeper, mesos-master, mesos-slave, and Marathon) as services:
/etc/init.d/zookeeper start
start mesos-master
start mesos-slave
start marathon
I forget whether ZooKeeper creates the init script as part of the install for you; you may have to find it in the Hadoop docs.
As for the other three, they all use upstart, and you can find the configuration files in /etc/init/.
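To check that everything came up, you can query the services afterwards; a rough sketch, assuming the stock Ubuntu packages and upstart:
/etc/init.d/zookeeper status
sudo status mesos-master    # each upstart job should report start/running
sudo status mesos-slave
sudo status marathon
tail /var/log/upstart/mesos-master.log    # upstart job output lands here on Ubuntu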

How to create an Amazon EMR cluster from the command line in Ubuntu?

How do I create an Amazon EMR cluster from the command line in Ubuntu? I have the private key, access key, and the .pem file. Can anyone guide me on how to run the word count example from the command line?
You can use the AWS Command Line Interface (CLI) for this. http://docs.aws.amazon.com/cli/latest/userguide/installing.html
Once it is installed, you have to configure the tool using the 'aws configure' command and enter your access key ID and secret access key.
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
You will also need to enter the region where your EMR cluster (and other resources) will be launched.
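When you run 'aws configure', the prompts look roughly like this (the values shown are placeholders):
$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: us-east-1
Default output format [None]: json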
To create a cluster, use the 'create-cluster' command.
http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html
You don't need the .pem file for these steps.
Once the cluster is launched, you can add the word count demo as a 'step'.
Starting a cluster and running a hadoop job (a script in this case):
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html
Some examples of add-steps for an already running cluster are in this section:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
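Putting it together for word count: a minimal sketch of adding a streaming step to a running cluster, where the cluster ID, bucket, mapper script, and input/output paths are placeholders (-reducer aggregate uses Hadoop streaming's built-in aggregation reducer):
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps Type=STREAMING,Name=WordCount,ActionOnFailure=CONTINUE,Args=[-files,s3://mybucket/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://mybucket/input,-output,s3://mybucket/output]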

How do I submit more than one job to Hadoop in a step using the Elastic MapReduce API?

Amazon EMR Documentation to add steps to cluster says that a single Elastic MapReduce step can submit several jobs to Hadoop. However, Amazon EMR Documentation for Step configuration suggests that a single step can accommodate just one execution of hadoop-streaming.jar (that is, HadoopJarStep is a HadoopJarStepConfig rather than an array of HadoopJarStepConfigs).
What is the proper syntax for submitting several jobs to Hadoop in a step?
As the Amazon EMR documentation says, you can create a cluster that runs a script my_script.sh on the master instance as a step:
aws emr create-cluster --name "Test cluster" --ami-version 3.11 --use-default-roles \
--ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
my_script.sh should look something like this:
#!/usr/bin/env bash
hadoop jar my_first_step.jar [mainClass] args... &
hadoop jar my_second_step.jar [mainClass] args... &
.
.
.
wait
This way, multiple jobs are submitted to Hadoop in the same step, but unfortunately the EMR interface won't be able to track them. To track them, you should use the Hadoop web interfaces as shown here, or simply ssh to the master instance and explore with mapred job.
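For example, a quick sketch of the second option, where the key file and master public DNS are placeholders:
ssh -i ~/myKey.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com
mapred job -list                # lists the jobs submitted by the step
mapred job -status <job-id>     # details for one of them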

Running a script on all nodes of Hadoop in Amazon EMR

How do you run a script on all nodes (master and slaves) of an Amazon EMR cluster? The script-runner.jar runs only on the master node (NameNode).
You have the bootstrap option:
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can create custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR.
from the documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
It's as simple as placing your script in S3, and then, if you're starting EMR from the command line, adding a parameter like this:
--bootstrap-action 's3://my-bucket/bootstrap.sh'
Or if you're doing it through the web interface, just enter the location of the file as a "Custom action" under "Bootstrap Actions".
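As a minimal sketch of the whole flow (the bucket name and the file being staged are placeholders, and this assumes the AWS CLI is available on the nodes, as it is on EMR images), bootstrap.sh could look something like:
#!/usr/bin/env bash
# Runs on every node (master and slaves) before Hadoop starts
aws s3 cp s3://my-bucket/my-config-file /home/hadoop/my-config-file
and the cluster would then be started with something like:
aws emr create-cluster --name ... --bootstrap-actions Path=s3://my-bucket/bootstrap.sh,Name=CopyConfig ...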

hadoop cluster clarification

I am a newbie in Hadoop and I am trying to run a Hadoop jar on Amazon EC2. I started my Amazon EC2 instance through the console, uploaded my files to HDFS, and was then able to successfully run the job jar and generate output on the instance.
But I am still confused about one thing: I am not sure whether the job ran on a single machine in Amazon EC2 or on a cluster. How do I find the number of worker nodes involved in my jar run?
In some reference links I see that we have to use the launch-cluster command, for example "bin/hadoop-ec2 launch-cluster test-cluster 2". What is the difference between starting the instance from the console and using a command like launch-cluster?
