I have been trying to run a Pig job with multiple steps on Amazon EMR. Here are the details of my environment:
Number of nodes: 20
AMI Version: 3.1.0
Hadoop Distribution: 2.4.0
The Pig script has multiple steps and spawns a long-running MapReduce job that has both a map phase and a reduce phase. After running for some time (sometimes an hour, sometimes three or four), the job is killed. The information on the resource manager for the job is:
Kill job received from hadoop (auth:SIMPLE) at
Job received Kill while in RUNNING state.
Obviously, I did not kill it :)
My question is: how do I go about identifying what exactly happened? How do I diagnose the issue? Which log files should I look at, and what should I grep for? Any help, even just on where the appropriate log files are, would be greatly appreciated. I am new to YARN/Hadoop 2.0.

There can be a number of reasons. Enable debugging on your cluster and look in the stderr logs for more information.
aws emr create-cluster --name "Test cluster" --ami-version 3.9 --log-uri s3://mybucket/logs/ \
--enable-debugging --applications Name=Hue Name=Hive Name=Pig
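Once the logs have been copied from the log URI to a local directory, a small script can narrow down where the kill was recorded. The following is only a minimal sketch: the two patterns are taken from the ResourceManager messages quoted in the question, and the function name find_kill_lines is made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Phrases taken from the ResourceManager messages quoted in the question.
KILL_PATTERNS = [
    re.compile(r"Kill job received from \S+"),
    re.compile(r"received Kill while in \w+ state"),
]


def find_kill_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line matching a kill pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in KILL_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it, open each downloaded log file and pass the file object to find_kill_lines; the lines just before a hit are usually the ones worth reading.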

Get list of executed jobs on Hadoop cluster after cluster reboot

I have a Hadoop cluster running version 2.7.4. For some reason, I had to restart my cluster. I need the job IDs of the jobs that were executed on the cluster before the reboot. The command mapred job -list only gives details of currently running or waiting jobs.
You can see a list of all jobs on the YARN ResourceManager web UI.
In your browser go to http://ResourceManagerIPAdress:8088/
The history is still listed on the YARN cluster I am currently testing on, even though I restarted the services several times.
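The same list can also be read programmatically through the ResourceManager REST endpoint /ws/v1/cluster/apps on the same port. The following is a minimal sketch, not a complete client: the helper names are made up for illustration, and whether applications from before a restart are still returned depends on how the cluster is configured.

```python
import json
from typing import Dict, List
from urllib.request import Request, urlopen


def apps_url(rm_host: str, port: int = 8088) -> str:
    """Build the URL of the ResourceManager applications endpoint."""
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps"


def parse_apps(payload: str) -> List[Dict[str, str]]:
    """Extract id, name, state and finalStatus from the JSON response."""
    data = json.loads(payload)
    # The endpoint returns {"apps": null} when there are no applications.
    apps = (data.get("apps") or {}).get("app", [])
    keys = ("id", "name", "state", "finalStatus")
    return [{key: app.get(key, "") for key in keys} for app in apps]


def to_job_id(application_id: str) -> str:
    """Map a YARN application ID to the matching MapReduce job ID."""
    prefix = "application_"
    if not application_id.startswith(prefix):
        raise ValueError(f"not an application ID: {application_id}")
    return "job_" + application_id[len(prefix):]


def list_apps(rm_host: str, port: int = 8088) -> List[Dict[str, str]]:
    """Fetch and parse the application list from a live ResourceManager."""
    request = Request(apps_url(rm_host, port),
                      headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return parse_apps(response.read().decode("utf-8"))
```

Calling list_apps with the ResourceManager address returns one dictionary per application; to_job_id only makes sense for MapReduce applications.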

gcloud console indicates job is running, while Hadoop application manager says it is finished

The job that I've submitted to a Spark cluster is not finishing. It stays pending forever, even though the logs say that the Spark Jetty connector has been shut down:
17/05/23 11:53:39 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector#4f67e3df{HTTP/1.1}{}
I run the latest Cloud Dataproc v1.1 (Spark 2.0.2) on YARN. I submit the Spark job via gcloud:
gcloud dataproc jobs submit spark --project stage --cluster datasys-stg \
--async --jar hdfs:///apps/jdbc-job/jdbc-job.jar --labels name=jdbc-job -- --dbType=test
The SparkPi example, submitted the same way, finishes correctly:
gcloud dataproc jobs submit spark --project stage --cluster datasys-stg --async \
--class org.apache.spark.examples.SparkPi --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 100
In the Hadoop application manager interface I see that my job finished with a successful result.
The Google Cloud console and the job list show it as still running until it is killed (the job ran for 20 hours before being killed, while Hadoop says it ran for 19 seconds).
Is there something I can monitor to see what is preventing gcloud from finishing the job?
I couldn't find anything to monitor that would show why my application was not finishing, but I found the actual problem and fixed it. It turned out I had abandoned threads in my application: I had a connection to RabbitMQ, and that seemed to create some threads that prevented the application from finally being stopped by gcloud.
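The job above is a JVM application and its code is not shown, so the following is only an illustration of the same class of problem in Python, where a live non-daemon thread likewise keeps the process from exiting. The thread name mq-consumer and both function names are made up for the example.

```python
import threading
from typing import List, Tuple


def lingering_threads() -> List[threading.Thread]:
    """Return live non-daemon threads other than the main thread."""
    main = threading.main_thread()
    return [
        thread for thread in threading.enumerate()
        if thread is not main and not thread.daemon and thread.is_alive()
    ]


def demo() -> Tuple[List[str], List[str]]:
    """Start a blocked worker, then stop it, listing lingering threads."""
    stop = threading.Event()
    # Stand-in for a client library thread that blocks until told to stop.
    worker = threading.Thread(target=stop.wait, name="mq-consumer")
    worker.start()
    before = [thread.name for thread in lingering_threads()]
    # Equivalent of closing the connection before the program ends.
    stop.set()
    worker.join()
    after = [thread.name for thread in lingering_threads()]
    return before, after
```

Listing the remaining threads just before the program is supposed to end is a cheap way to spot this; in the case described here, the fix was to shut down the RabbitMQ connection so its threads could finish.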

H2O: unable to connect to h2o cluster through python

I have a 5-node Hadoop cluster running HDP 2.3.0. I set up an H2O cluster on YARN as described here.
On running the following command
hadoop jar h2odriver_hdp2.2.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 512m -nodes 3 -output /user/hdfs/H2OTestClusterOutput
I get the following output
H2O cluster (3 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...
When I try to execute the command
h2o.init(ip="", port=54321)
The process remains stuck at this stage. When I try to connect to the web UI at ip:54321, the browser endlessly tries to load the H2O admin page, but nothing is ever displayed.
On forcefully terminating the init process I get the following error
No instance found at ip and port: Trying to start local jar...
However, if I use H2O from Python without setting up an H2O cluster, everything runs fine.
I executed all commands as the root user. The root user has permission to read from and write to the /user/hdfs HDFS directory.
I'm not sure whether this is a permissions error or whether the port is not accessible.
Any help would be greatly appreciated.
It looks like you are using H2O2 (H2O Classic). I recommend upgrading your H2O to the latest (H2O 3). There is a build specifically for HDP2.3 here: http://www.h2o.ai/download/h2o/hadoop
Running H2O3 is a little cleaner too:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
Also, 512 MB per node is tiny. What is your use case? I would give the nodes more memory.
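To settle the question of whether the port is reachable at all, a plain TCP check from the machine running Python can help before calling h2o.init. This is a minimal sketch using only the standard library; the function name port_is_open is made up, and a successful connection only shows that something is listening, not that H2O is answering correctly.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved host names.
        return False
```

Calling port_is_open with the IP of one of the H2O nodes and port 54321 should return True once the cluster is up; if it returns False, a firewall or a wrong address is a more likely cause than HDFS permissions.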

How do I submit more than one job to Hadoop in a step using the Elastic MapReduce API?

Amazon EMR Documentation to add steps to cluster says that a single Elastic MapReduce step can submit several jobs to Hadoop. However, Amazon EMR Documentation for Step configuration suggests that a single step can accommodate just one execution of hadoop-streaming.jar (that is, HadoopJarStep is a HadoopJarStepConfig rather than an array of HadoopJarStepConfigs).
What is the proper syntax for submitting several jobs to Hadoop in a step?
As the Amazon EMR documentation says, you can create a cluster that runs a script my_script.sh on the master instance in a step:
aws emr create-cluster --name "Test cluster" --ami-version 3.11 --use-default-roles \
--ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
my_script.sh should look something like this:
#!/usr/bin/env bash
hadoop jar my_first_step.jar [mainClass] args... &
hadoop jar my_second_step.jar [mainClass] args... &
This way, multiple jobs are submitted to Hadoop in the same step, but unfortunately the EMR interface won't be able to track them. To track them, use the Hadoop web interfaces as shown here, or simply ssh to the master instance and explore with mapred job.
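If the script should also report whether the jobs succeeded, one option is to launch them from a small Python wrapper instead of backgrounding them with &. This is a sketch under assumptions, not part of the EMR tooling: run_all and exit_code are made-up names, and the wrapper assumes Python is available on the master instance.

```python
import subprocess
from typing import List, Sequence


def run_all(commands: Sequence[Sequence[str]]) -> List[int]:
    """Start every command at once, then wait and collect exit codes."""
    processes = [subprocess.Popen(list(command)) for command in commands]
    return [process.wait() for process in processes]


def exit_code(codes: Sequence[int]) -> int:
    """Return 0 only if every command succeeded, otherwise 1."""
    return 0 if all(code == 0 for code in codes) else 1
```

Passing the two hadoop jar command lines from my_script.sh as argument lists to run_all starts both jobs in parallel, and exiting with exit_code of the result lets the step fail when either job fails.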

Submit Hadoop job on Cloudera

I am wondering whether we can set up a Cloudera cluster on Amazon and kick off a Hadoop job from my local Linux machine without ssh-ing into Amazon's nodes.
Is there anything like a client to do this communication?
The tips from the following tutorial really work. You should be able to put up a working Hadoop cluster in under 20 minutes, from cold iron to production ready, using just its guidance:
Hadoop Quickstart: Build a Cluster In The Cloud In 20 Minutes
It is really worth checking out.
You can install a Hadoop client on your local Linux machine and use the "hadoop jar" command with your own jar. Specify the mapred.job.tracker option on the command line, and the client will push your jar to the JobTracker, from where it is copied to all the TaskTrackers that will be used for this job.
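The command line described above can be assembled as follows. This is only a sketch: the function name is made up, the jar, class and host values you pass in are your own, and the -D option is picked up only if the job's main class parses generic options (for example by running through ToolRunner).

```python
from typing import List, Sequence


def hadoop_jar_command(jar: str, main_class: str, job_tracker: str,
                       args: Sequence[str] = ()) -> List[str]:
    """Build a 'hadoop jar' argument list that targets a remote JobTracker."""
    return [
        "hadoop", "jar", jar, main_class,
        # Generic options go after the main class, before the job's own args.
        "-D", f"mapred.job.tracker={job_tracker}",
        *args,
    ]
```

The resulting list can be handed to subprocess.run on a machine where the Hadoop client is installed; the JobTracker address is given as host:port.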
