How do I submit more than one job to Hadoop in a step using the Elastic MapReduce API? - hadoop

The Amazon EMR documentation on adding steps to a cluster says that a single Elastic MapReduce step can submit several jobs to Hadoop. However, the Amazon EMR documentation for step configuration suggests that a single step can accommodate just one execution of hadoop-streaming.jar (that is, HadoopJarStep is a HadoopJarStepConfig rather than an array of HadoopJarStepConfigs).
What is the proper syntax for submitting several jobs to Hadoop in a step?

As the Amazon EMR documentation says, you can create a cluster that runs a script my_script.sh on the master instance in a step:
aws emr create-cluster --name "Test cluster" --ami-version 3.11 --use-default-roles \
--ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
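For the API side of the question, the same step can be written as a plain step definition. The sketch below builds the dictionary that boto3's add_job_flow_steps accepts in its Steps list. script_runner_step is a hypothetical helper name, and the bucket path is the placeholder from the command above. Note that the step still carries a single HadoopJarStep: the fan-out to several jobs happens inside the script, not in the step definition.

```python
SCRIPT_RUNNER_JAR = "s3://elasticmapreduce/libs/script-runner/script-runner.jar"


def script_runner_step(name, script_s3_path, action_on_failure="CONTINUE"):
    """Build one EMR step that runs a shell script on the master instance."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        # One step holds exactly one HadoopJarStep config, not a list of them.
        "HadoopJarStep": {
            "Jar": SCRIPT_RUNNER_JAR,
            "Args": [script_s3_path],
        },
    }


step = script_runner_step("CustomJAR", "s3://mybucket/script-path/my_script.sh")
# With boto3 installed and credentials configured, this could be submitted as:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
# (the cluster id above is a placeholder, not a real identifier)
print(step)
```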
my_script.sh should look something like this:
#!/usr/bin/env bash
hadoop jar my_first_step.jar [mainClass] args... &
hadoop jar my_second_step.jar [mainClass] args... &
.
.
.
wait
This way, multiple jobs are submitted to Hadoop in the same step. Unfortunately, the EMR interface won't be able to track them individually. To monitor them, use the Hadoop web interfaces, or simply ssh to the master instance and list them with mapred job -list.
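One caveat with the script above: in bash, wait with no arguments returns 0 regardless of how the background jobs exited, so a failed job does not change the script's exit status. The following is a minimal sketch of a launcher that starts several commands concurrently and reports a non-zero status if any of them fails. It assumes Python is available on the master instance. run_concurrently and overall_status are hypothetical helper names, and the demo commands stand in for the real hadoop jar invocations.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command at once, wait for all, and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


def overall_status(codes):
    """Collapse per-job exit codes into a single status: 0 only if all succeeded."""
    return 0 if all(code == 0 for code in codes) else 1


# Stand-ins for ["hadoop", "jar", "my_first_step.jar", ...] and so on.
demo_commands = [
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(3)"],
]
codes = run_concurrently(demo_commands)
print("exit codes:", codes, "overall:", overall_status(codes))
# A real launcher would finish with: sys.exit(overall_status(codes))
```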

Related

Running a Spark job with spark-submit across the whole cluster

I have recently set up a Spark cluster on Amazon EMR with 1 master and 2 slaves.
I can run pyspark, and submit jobs with spark-submit.
However, when I create a standalone job, like job.py, I create a SparkContext, like so:
sc=SparkContext("local", "App Name")
This doesn't seem right, but I'm not sure what to put there.
When I submit the job, I am sure it is not utilizing the whole cluster.
If I want to run a job against my entire cluster, say 4 processes per slave, what do I have to
a.) pass as arguments to spark-submit
b.) pass as arguments to SparkContext() in the script itself.
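Do not hard-code the master in the script: settings made in the code take precedence over spark-submit flags, so "local" keeps the job on a single machine. Create the Spark context without a master:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("App Name")
sc = SparkContext(conf=conf)
Then choose the master when you submit the program with spark-submit. For a Spark standalone cluster: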
./bin/spark-submit --master spark://<sparkMasterIP>:7077 code.py
For Mesos cluster
./bin/spark-submit --master mesos://207.184.161.138:7077 code.py
For YARN cluster
./bin/spark-submit --master yarn --deploy-mode cluster code.py
With --master yarn, the ResourceManager address is not given on the command line; it is read from the Hadoop configuration in HADOOP_CONF_DIR (or YARN_CONF_DIR).
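The question also asks for "4 processes per slave" on a cluster with two slaves. One plausible reading on YARN is two executors with four cores each, which spark-submit expresses with --num-executors and --executor-cores (a standalone cluster would use --total-executor-cores instead). The sketch below only assembles the command line. spark_submit_command is a hypothetical helper, and the sizing values are assumptions taken from the question rather than tuned settings.

```python
import shlex


def spark_submit_command(script, master="yarn", deploy_mode="cluster",
                         num_executors=2, executor_cores=4):
    """Return the spark-submit argv for a YARN run with explicit executor sizing."""
    return [
        "./bin/spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--num-executors", str(num_executors),    # one executor per slave (assumed)
        "--executor-cores", str(executor_cores),  # four concurrent tasks per executor
        script,
    ]


cmd = spark_submit_command("code.py")
# Print a copy-pasteable shell command.
print(" ".join(shlex.quote(part) for part in cmd))
```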

Pig job gets killed on Amazon EMR.

I have been trying to run a pig job with multiple steps on Amazon EMR. Here are the details of my environment:
Number of nodes: 20
AMI Version: 3.1.0
Hadoop Distribution: 2.4.0
The Pig script has multiple steps, and it spawns a long-running MapReduce job that has both a map phase and a reduce phase. After running for some time (sometimes an hour, sometimes three or four), the job is killed. The information on the resource manager for the job is:
Kill job received from hadoop (auth:SIMPLE) at
Job received Kill while in RUNNING state.
Obviously, I did not kill it :)
My question is: how do I go about identifying what exactly happened? How do I diagnose the issue? Which log files should I look at, and what should I grep for? Any help, even just on where the appropriate log files are, would be greatly appreciated. I am new to YARN/Hadoop 2.0.
There can be a number of reasons. Enable debugging on your cluster and check the stderr logs for more information.
aws emr create-cluster --name "Test cluster" --ami-version 3.9 --log-uri s3://mybucket/logs/ \
--enable-debugging --applications Name=Hue Name=Hive Name=Pig
More details here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
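Once the logs are in the --log-uri bucket, they can be copied locally (for example with aws s3 sync) and searched for the kill message. The sketch below is one way to do that search. It assumes the logs sit in a local directory, handles .gz files in case the logs are compressed, and uses the text "Kill" from the resource manager message above as the default search string. find_kill_events is a hypothetical helper name.

```python
import gzip
import os
import tempfile


def find_kill_events(log_dir, needle="Kill"):
    """Yield (path, line_number, line) for every log line containing `needle`."""
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Compressed logs are read transparently; everything else as plain text.
            opener = gzip.open if name.endswith(".gz") else open
            with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, number, line.rstrip("\n")


# Demo on a throwaway directory holding one sample line from the question.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "syslog"), "w", encoding="utf-8") as sample:
        sample.write("Job received Kill while in RUNNING state.\n")
    for path, number, line in find_kill_events(demo_dir):
        print(f"{os.path.basename(path)}:{number}: {line}")
```

The lines just before each hit, in the same file, are usually the place to start reading.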

Running a script on all nodes of Hadoop in Amazon EMR

How do you run a script on all nodes (master and slaves) on Amazon EMR? The script-runner.jar runs only on the NameNode.
You have the bootstrap option:
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can create custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR.
from the documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
It's as simple as placing a script to do the copying into S3, and then if you're starting EMR from the command line, add a parameter like this:
--bootstrap-action 's3://my-bucket/bootstrap.sh'
Or if you're doing it through the web interface, just enter the location of the file in as a "Custom action" in "Bootstrap Actions".
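If the cluster is created through the API instead, bootstrap actions are passed at creation time. As a rough sketch, this is the shape boto3's run_job_flow expects in its BootstrapActions parameter. bootstrap_action is a hypothetical helper, and the action name and S3 path are placeholders.

```python
def bootstrap_action(name, script_s3_path, args=()):
    """Build one entry for the BootstrapActions list of an EMR cluster request."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,  # script that every node runs at launch
            "Args": list(args),      # optional arguments passed to the script
        },
    }


actions = [bootstrap_action("Custom action", "s3://my-bucket/bootstrap.sh")]
# With boto3, this list would be passed as:
#   boto3.client("emr").run_job_flow(..., BootstrapActions=actions)
# alongside the other required cluster settings, which are omitted here.
print(actions)
```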

Apache Giraph on EMR

Has anyone tried Apache Giraph on EMR?
It seems to me the only requirements to run on EMR are to add proper bootstrap scripts to the Job Flow configuration. Then I should just need to use a standard Custom JAR launch step to launch the Giraph Runner with appropriate arguments for my Giraph program.
Any documentation/tutorial or if you could just share your experience with Giraph on EMR, that will be much appreciated.
Yes, I run Giraph jobs on EMR regularly, but I don't use "Job Flows". I manually log in to the master node and use it as a normal Hadoop cluster (I just submit the job with the hadoop jar command).
You are right, you need to add bootstrap scripts to run ZooKeeper and to add the ZooKeeper details to the core-site config. Here is how I did it:
Bootstrap actions -
Configure Hadoop s3://elasticmapreduce/bootstrap-actions/configure-hadoop --site-key-value, io.file.buffer.size=65536, --core-key-value, giraph.zkList=localhost:2181, --mapred-key-value, mapreduce.job.counters.limit=1200
Run if s3://elasticmapreduce/bootstrap-actions/run-if instance.isMaster=true, s3://hpc-chikitsa/zookeeper_install.sh
The contents of zookeeper_install.sh are :
#!/bin/bash
wget --no-check-certificate http://apache.mesi.com.ar/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
tar zxvf zookeeper-3.4.5.tar.gz
cd zookeeper-3.4.5
mv conf/zoo_sample.cfg conf/zoo.cfg
sudo bin/zkServer.sh start
Then copy your Giraph jar file to the master node (using scp), ssh to the master node, and submit the job using the hadoop jar command.
Hope that helps.
Here is a relevant mail-thread on giraph-user mailing list :
https://www.mail-archive.com/user%40giraph.apache.org/msg01240.html
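Before submitting a Giraph job, it can help to confirm that the bootstrap action actually left ZooKeeper running at the address configured in giraph.zkList (localhost:2181 above). ZooKeeper 3.4 answers the four-letter command ruok with imok when it is up. A minimal sketch of that check, with zookeeper_is_ok as a hypothetical helper name, meant to be run on the master node:

```python
import socket


def zookeeper_is_ok(host="localhost", port=2181, timeout=5.0):
    """Send ZooKeeper's 'ruok' command and report whether it answers 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"ruok")
            chunks = []
            # The server closes the connection after replying, so read until EOF.
            while True:
                data = conn.recv(16)
                if not data:
                    break
                chunks.append(data)
    except OSError:
        # Connection refused, timeout, or any other socket failure.
        return False
    return b"".join(chunks) == b"imok"


# Usage on the master node:
#   print(zookeeper_is_ok())
```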

hadoop cluster clarification

I am a newbie in Hadoop and I am trying to run a Hadoop jar on Amazon EC2. I started my Amazon EC2 instance through the console, uploaded my files to the DFS, and was then able to run the job jar successfully and generate output on the instance.
But I am still confused about one part. I am not sure whether the job ran on a single machine in Amazon EC2 or on a cluster. How do I find the number of worker nodes involved in my jar run?
In some reference links I see that we have to use the launch-cluster command, for example "bin/hadoop-ec2 launch-cluster test-cluster 2". What is the difference between starting the instance from the console and using a command like launch-cluster?
