Getting YARN action application ID in subsequent action - hadoop

I am running an Oozie workflow and doing map-only distributed model fitting within a map-reduce action. As there are many mappers, I have written code that compiles the YARN logs of all mapper tasks using yarn logs -applicationId application_x, where application_x is the parent application ID of all the map tasks. Now I want to make this summarization part of the workflow, so I need to get application_x, the application ID of the previous action, dynamically. Is there any way I can get this?

I have not tested this, but I think you can get this with a workflow EL function:
wf:actionExternalId(String node)
It returns the external Id for an action node, or an empty string if
the action has not been executed or has not completed yet.
So in a node after the map-reduce job has completed, you should be able to use something like:
wf:actionExternalId('mapred-node-name')
I suspect it will return job_xxx instead of application_xxx, but you can probably handle that OK.
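If that pans out, one way to wire it in (a sketch I have not run; the node name 'model-fit' and the script collect_logs.sh are made up for illustration) is a follow-up shell action that receives the external ID, rewrites the job_ prefix to application_, and calls yarn logs:
<action name="collect-logs">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>collect_logs.sh</exec>
        <!-- external id of the previous MR action, e.g. a job_... id -->
        <argument>${wf:actionExternalId('model-fit')}</argument>
        <file>collect_logs.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
where collect_logs.sh simply swaps the prefix before fetching the aggregated logs:
# collect_logs.sh (illustrative)
APP_ID=$(echo "$1" | sed 's/^job_/application_/')
yarn logs -applicationId "$APP_ID" > mapper_logs.txt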

Related

Oozie Re-run the workflow from the node, where it was last killed

I am using the MR action in Oozie, and I want to use the suspend/resume functionality Oozie provides on MR jobs.
The problem: when we call 'suspend' on Oozie, it doesn't kill the Hadoop job; it only updates the Oozie state while the Hadoop job continues to run to completion. I want the Hadoop job to be killed immediately when the Oozie workflow is suspended. I'm guessing that's not possible, right?
So I figured I can kill the Oozie workflow and re-run it.
But now the problem is, if there are 3 action nodes in my workflow, I don't want to re-run all of them.
Am I smoking here, or is it possible to access the last successful Oozie node (which is most likely stored in the MySQL database)?
Is there a better approach to my contrived use case?
Appreciate your help in advance,
P.S. I am not posting my workflow because it's a simple workflow with several map-reduce actions, no rocket science here.
The oozie command line interface allows for the flag "-rerun" to be used in the following context:
oozie job -oozie http://localhost:11000/oozie -config job.properties -rerun 14-20090525161321-oozie-joe
One issue with automating reruns is that you must supply a list of action nodes to skip in the job.properties file using the oozie.wf.rerun.skip.nodes variable.
The following is taken from the Oozie documentation:
Only one of the following two configurations is mandatory. Both should not be defined at the same time:
oozie.wf.rerun.skip.nodes
oozie.wf.rerun.failnodes
Skip nodes are a comma-separated list of action names. They can be any action nodes, including decision nodes.
The valid value of oozie.wf.rerun.failnodes is true or false.
If a secured Hadoop version is used, the following two properties need to be specified as well:
mapreduce.jobtracker.kerberos.principal
dfs.namenode.kerberos.principal.
To automate this, it might be worth your time to try appending to the properties file on completion of each action, using either a shell script or logic built directly into your MR jobs.
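For example, a hedged sketch of what the rerun setup could look like (the application path and the node names first-mr-action and second-mr-action are purely illustrative):
# job.properties used for the rerun
oozie.wf.application.path=hdfs:///user/joe/my-workflow
# skip the actions that already succeeded
oozie.wf.rerun.skip.nodes=first-mr-action,second-mr-action
and then:
oozie job -oozie http://localhost:11000/oozie -config job.properties -rerun 14-20090525161321-oozie-joe
Alternatively, setting oozie.wf.rerun.failnodes=true (instead of the skip list) reruns only the nodes that failed.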

What is the difference between job.submit and job.waitForCompletion in Apache Hadoop?

I have read the documentation so I know the difference.
My question, however, is: is there any risk in using .submit() instead of .waitForCompletion() if I want to run several Hadoop jobs on a cluster in parallel?
I mostly use Elastic MapReduce.
When I tried doing so, I noticed that only the first job was being executed.
If your aim is to run jobs in parallel, then there is certainly no risk in using job.submit(). The main reason job.waitForCompletion() exists is that its call returns only when the job has finished, and it returns the job's success or failure status, which can be used to decide whether further steps should run.
Now, getting back to you seeing only the first job being executed: this is because by default Hadoop schedules jobs in FIFO order. You can certainly change this behaviour, for example by switching to the Fair Scheduler or the Capacity Scheduler.
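As a rough sketch of the parallel-submit pattern (configureJob is a placeholder for your own mapper/reducer/input/output wiring, not a Hadoop API):
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParallelSubmit {
    public static void main(String[] args) throws Exception {
        List<Job> jobs = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            Job job = Job.getInstance(new Configuration(), "parallel-job-" + i);
            // configureJob(job, i); // set mapper, reducer, input/output paths, etc.
            job.submit();            // returns immediately; the job runs asynchronously
            jobs.add(job);
        }
        // wait for all of them, then inspect the outcome
        for (Job job : jobs) {
            while (!job.isComplete()) {
                Thread.sleep(5000);
            }
            System.out.println(job.getJobName() + " succeeded: " + job.isSuccessful());
        }
    }
}
Whether the jobs actually run side by side still depends on the scheduler and the available slots, as noted above.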

Oozie/Hadoop: How do I define an input dataset when it's more complex than just a static file?

I'm trying to run an existing Hadoop job using Oozie (I'm migrating from AWS).
In AWS MapReduce I submit jobs programmatically, so before a job is submitted, my code programmatically finds the input.
My input happens to be the last SUCCESSFUL run of another job. To find the last SUCCESSFUL run I need to scan an HDFS folder, sort by the timestamp embedded in the folder naming convention, and find the most recent folder with a _SUCCESS file in it.
How to do this is beyond my oozie-newbie comprehension.
Can someone simply describe for me what I need to configure in Oozie so I have some idea of what I'm attempting to reach for here?
Take a look at the following Oozie configuration: https://github.com/cloudera/cdh-twitter-example/blob/master/oozie-workflows/coord-app.xml
There is a tag called "done-flag" where you can put the _SUCCESS file in order to trigger a workflow, or in your case a map-reduce job. There are also parameters for scheduling the job:
${coord:current(1 + (coord:tzOffset() / 60))}
....
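For the "find the latest successful run" part, a hedged sketch of a coordinator dataset built around that done-flag could look like this (frequency, dates, and the path layout are assumptions, not taken from your setup):
<datasets>
    <dataset name="upstream-output" frequency="${coord:days(1)}"
             initial-instance="2013-01-01T00:00Z" timezone="UTC">
        <uri-template>${nameNode}/user/me/upstream/${YEAR}${MONTH}${DAY}</uri-template>
        <done-flag>_SUCCESS</done-flag>
    </dataset>
</datasets>
<input-events>
    <data-in name="input" dataset="upstream-output">
        <instance>${coord:current(0)}</instance>
    </data-in>
</input-events>
With the done-flag set to _SUCCESS, the coordinator action only materializes once that instance's folder contains the _SUCCESS file, which pushes the "wait for a successful run" check into Oozie instead of scanning HDFS yourself.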

Submitting the same coordinator job multiple times in oozie

I have a coordinator job in Oozie. It calls the workflow with a java action node.
If I submit this job only once, then it works perfectly. However, if I submit this job twice with the same start and end time, but a different arg1 to the Main class, then both the job instances hang in the "RUNNING" state and the logs look like this:
>>> Invoking Main class now >>>
Heart beat
Heart beat
Heart beat
Heart beat
...
If I kill one of the jobs, then the other one starts running again.
The documentation states that it is possible to submit multiple instances of the same coordinator job with different parameters: http://archive.cloudera.com/cdh/3/oozie/CoordinatorFunctionalSpec.html#a6.3._Synchronous_Coordinator_Application_Definition
"concurrency: The maximum number of actions for this job that can be running at the same time. This value allows to materialize and submit multiple instances of the coordinator app, and allows operations to catchup on delayed processing. The default value is 1."
So what am I doing wrong? I even saw two instances of the workflow action from the same job being in the "RUNNING" state which ran fine once the other job was killed.
OK, I found the issue. It was related to HBase concurrency and not enough task slots in the cluster. Setting the following property in the mapred-site.xml file fixes the issue:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>50</value>
</property>
It was similar to this issue : https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/v0BHtQ0hlBg
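As an aside, the concurrency value quoted from the spec is set in the coordinator's controls block; a minimal sketch with illustrative values:
<coordinator-app name="my-coord" frequency="${coord:days(1)}"
                 start="2013-01-01T00:00Z" end="2013-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <controls>
        <concurrency>2</concurrency>
    </controls>
    ....
</coordinator-app>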

Hadoop reuse Job object

I have a pool of Jobs from which I retrieve jobs and start them. The pattern is something like:
Job job = JobPool.getJob();
job.waitForCompletion(true);
JobPool.release(job);
I get a problem when I try to reuse a Job object, in the sense that it doesn't even run (most probably because its status is COMPLETED). So, in the following snippet, the second waitForCompletion call prints the statistics/counters of the job and doesn't do anything else.
Job jobX = JobPool.getJob();
jobX.waitForCompletion(true);
JobPool.release(jobX);
//.......
jobX = JobPool.getJob();
jobX.waitForCompletion(true); // <--- here the job should run, but it doesn't
Am I right when I say that the job doesn't actually run because Hadoop sees its status as completed and doesn't even try to run it? If yes, do you know how to reset a Job object so that I can reuse it?
The Javadoc includes this hint that a job should only run once:
The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException.
I think there's some confusion between the job and the view of the job. The latter is the thing you have got, and it is designed to map to at most one job running in Hadoop. The view of the job is fundamentally lightweight, and if creating that object is expensive relative to actually running the job... well, I've got to believe that your jobs are simple enough that you don't need Hadoop.
Using the view to submit a job is potentially expensive (copying jars into the cluster, initializing the job in the JobTracker, and so on); conceptually, the idea of telling the JobTracker to "rerun" or "copy; run" a job makes sense. As far as I can tell, there's no support for either of those ideas in practice, and I suspect that Hadoop isn't actually guaranteeing the retention policies that would support either use case.
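If the goal of the pool is just to amortize setup cost, a hedged workaround is to share the Configuration and create a fresh Job view per run; JobFactory and the commented-out configureJob below are made-up names for illustration, not part of Hadoop:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobFactory {
    private final Configuration baseConf;

    public JobFactory(Configuration baseConf) {
        this.baseConf = baseConf;
    }

    // a Job view maps to at most one submission, so hand out a new one per run
    public Job newJob(String name) throws Exception {
        Job job = Job.getInstance(new Configuration(baseConf), name);
        // configureJob(job); // mapper/reducer classes, input/output paths, etc.
        return job;
    }
}

// usage: "rerunning" means building a second view and submitting it
// Job first = factory.newJob("model-fit-1");
// first.waitForCompletion(true);
// Job second = factory.newJob("model-fit-2");
// second.waitForCompletion(true);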
