Submitting the same coordinator job multiple times in Oozie

I have a coordinator job in Oozie. It calls the workflow with a java action node.
If I submit this job only once, then it works perfectly. However, if I submit this job twice with the same start and end time, but a different arg1 to the Main class, then both job instances hang in the "RUNNING" state and the logs look like this:
>>> Invoking Main class now >>>
Heart beat
Heart beat
Heart beat
Heart beat
...
If I kill one of the jobs, then the other one starts running again.
The documentation states that it is possible to submit multiple instances of the same coordinator job with different parameters: http://archive.cloudera.com/cdh/3/oozie/CoordinatorFunctionalSpec.html#a6.3._Synchronous_Coordinator_Application_Definition
"concurrency: The maximum number of actions for this job that can be running at the same time. This value allows to materialize and submit multiple instances of the coordinator app, and allows operations to catchup on delayed processing. The default value is 1 ."
So what am I doing wrong? I even saw two instances of the workflow action from the same job in the "RUNNING" state; they ran fine once the other job was killed.
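For reference, the concurrency value from the spec is set in the controls block directly under the coordinator-app element; a minimal sketch:

<controls>
  <concurrency>2</concurrency> <!-- allow two actions in RUNNING state at once -->
</controls>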

OK, I found the issue. It was related to HBase concurrency and not enough task slots in the cluster. Setting the following property in the mapred-site.xml file fixes the issue:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>50</value>
</property>
It was similar to this issue: https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/v0BHtQ0hlBg

Related

How to monitor and send alerts for long-running jobs in Hadoop

I have a requirement where I need to monitor Hadoop jobs (Hive, MapReduce, Spark) that run for a long time in the cluster, say more than 3 hours. I know I can view all these jobs in the UI, but I need to check every hour or 30 minutes and send an email alert if a job has been running for more than 3 hours. Is there a way to do this?
My environment is HDP 2.6
Thanks in advance.
You can look into Oozie. Oozie allows you to configure alerts if a job exceeds its expected run-time.
In order to use this feature you'd have to submit your job as an Oozie workflow.
http://oozie.apache.org/docs/4.2.0/DG_Overview.html
https://oozie.apache.org/docs/4.3.0/DG_SLAMonitoring.html#SLA_Definition_in_Workflow
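A rough sketch of what the SLA block looks like in a workflow definition, based on the SLA monitoring doc linked above (times and contact address are placeholders; max-duration here matches the 3-hour threshold):

<workflow-app name="long-job-wf" xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2">
  <start to="long-job"/>
  <action name="long-job">
    <!-- your Hive/MapReduce/Spark action goes here -->
    <ok to="end"/>
    <error to="end"/>
  </action>
  <end name="end"/>
  <sla:info>
    <sla:nominal-time>${nominal_time}</sla:nominal-time>
    <sla:should-end>${180 * MINUTES}</sla:should-end>
    <sla:max-duration>${180 * MINUTES}</sla:max-duration>
    <sla:alert-events>end_miss,duration_miss</sla:alert-events>
    <sla:alert-contact>ops@example.com</sla:alert-contact>
  </sla:info>
</workflow-app>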
As tk421 mentions, Oozie is the "right" way to do this in the context of Hadoop.
However, if you do not require all that overhead, something simple like an on-demand watchdog timer might be sufficient (e.g. wdt.io). Basically the workflow is: send a start signal, start the job, and send an end signal when the job completes. If the end signal does not come in within the allotted amount of time, an email/SMS alert is dispatched. See the sketch after this answer.
This method would work for non-Hadoop workflows as well.
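A rough sketch of that wrapper pattern in Java; the wdt.io endpoint URLs here are hypothetical, so check the service's docs for the real URL scheme:

import java.net.HttpURLConnection;
import java.net.URL;

public class WatchdogWrapper {
    // Fire a GET at a watchdog endpoint; the response body is ignored.
    static void ping(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.getResponseCode();
        conn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        ping("https://wdt.io/example/my-hadoop-job/start"); // hypothetical start signal
        runJob();                                           // launch the job and block until done
        ping("https://wdt.io/example/my-hadoop-job/end");   // hypothetical end signal
        // If the end signal never arrives in time, the watchdog dispatches the email/SMS alert.
    }

    static void runJob() {
        // submit the Hadoop job here and wait for completion
    }
}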

How to execute the same job concurrently in Spring XD

We have the below requirement:
In Spring XD we have a job, let's assume the job name is MyJob,
which will be invoked by another process using the REST service of Spring XD; let's call that process OutsideProcess (a non-Spring XD process).
OutsideProcess invokes MyJob whenever a file is added to a location (let's call it FILES_LOC) that OutsideProcess is listening to.
In this scenario, let's assume that MyJob takes 5 minutes to complete.
At 10:00 AM a file is copied to FILES_LOC, and OutsideProcess triggers MyJob immediately (it will complete at approximately 10:05 AM).
At 10:01 AM another file is copied to FILES_LOC, and OutsideProcess triggers one more instance of MyJob at 10:01 AM. But the second instance gets queued and starts executing only after the first instance completes (at approximately 10:05 AM).
If we invoke different jobs at the same time they execute concurrently, but multiple instances of the same job do not.
Please let me know how I can execute multiple instances of the same job concurrently.
Thanks in advance.
The only thing I can think of is dynamic deployment of the job and triggering it right away. You can use the Spring XD REST template to create the job definition on the fly and launch it after sleeping a few seconds; a rough sketch follows. And make sure you undeploy/destroy the job when it completes successfully.
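A rough sketch, assuming the spring-xd-rest-client API (class and method names are from memory and may differ by XD version; the module name is a placeholder):

import java.net.URI;
import org.springframework.xd.rest.client.impl.SpringXDTemplate;

public class DynamicJobLauncher {
    public static void main(String[] args) throws Exception {
        SpringXDTemplate xd = new SpringXDTemplate(new URI("http://localhost:9393"));
        String name = "MyJob-" + System.currentTimeMillis();     // unique definition per launch
        xd.jobOperations().createJob(name, "myJobModule", true);  // create and deploy
        Thread.sleep(5000);                                       // let the deployment settle
        xd.jobOperations().launchJob(name, "");                   // fire the execution
        // once the execution completes successfully, clean up:
        // xd.jobOperations().destroy(name);
    }
}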
Another solution could be to create a few module instances of your job with different names and use them as your slave processes. You can query the status of these job module instances and launch the one that has finished, or queue on the one that was least recently launched.
Remember you can run jobs with partition support if applicable. This way you will finish your job faster and be able to run more jobs.

Quartz one time job on application startup

I am trying to integrate a Quartz job into my Spring application. I got this example from here. The example shows jobs executing at repeated intervals using a SimpleTrigger and at a specific time using a CronTrigger.
My requirement is to run the job only once on application startup. I removed the property repeatInterval, but the application throws an exception:
org.quartz.SchedulerException: Repeat Interval cannot be zero
Is there any way to schedule a job just once?
Thanks..
Found the answer here
Ignoring the repeatInterval and setting repeatCount = 0 does what I wanted.
Spring SimpleTriggerFactoryBean does the job: if you don't specify the start time, it will set it to 'now'.
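A minimal sketch of that in Spring Java config (bean and class names are illustrative):

import org.quartz.JobDetail;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.quartz.SimpleTriggerFactoryBean;

@Configuration
public class RunOnceQuartzConfig {
    // Trigger that fires exactly once; with no start time set, the factory defaults it to 'now'.
    @Bean
    public SimpleTriggerFactoryBean runOnceTrigger(JobDetail myJobDetail) {
        SimpleTriggerFactoryBean trigger = new SimpleTriggerFactoryBean();
        trigger.setJobDetail(myJobDetail);
        trigger.setRepeatCount(0); // fire once, no repeats
        return trigger;
    }
}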
Yet I think that a long-running one-time job should be considered an anti-pattern, since it will not work even in a 2-node cluster: if the node that runs the job goes down, nothing will be left to restart the job.
I prefer to have a job that repeats e.g. every hour, but annotated with @DisallowConcurrentExecution (see the sketch below). This way you guarantee that precisely one job will be running, both when the node that originally hosted the job is up, and after it goes down.
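A minimal sketch of that annotation on a Quartz job class (names are illustrative):

import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// Quartz will not start a second instance of this JobDetail while one is still running.
@DisallowConcurrentExecution
public class HourlyJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // do the work; pair this with a trigger that repeats e.g. every hour
    }
}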

Hadoop reuse Job object

I have a pool of Jobs from which I retrieve jobs and start them. The pattern is something like:
Job job = JobPool.getJob();
job.waitForCompletion(true);
JobPool.release(job);
I get a problem when I try to reuse a job object, in the sense that it doesn't even run (most probably because its status is COMPLETED). So, in the following snippet the second waitForCompletion call prints the statistics/counters of the job and doesn't do anything else.
Job jobX = JobPool.getJob();
jobX.waitForCompletion(true);
JobPool.release(jobX);
//.......
jobX = JobPool.getJob();
jobX.waitForCompletion(true); // <--- here the job should run, but it doesn't
Am I right when I say that the job doesn't actually run because Hadoop sees its status as completed and doesn't even try to run it? If yes, do you know how to reset a job object so that I can reuse it?
The Javadoc includes this hint that the jobs should only run once
The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException.
I think there's some confusion between the job and the view of the job. The latter is the thing that you have got, and it is designed to map to at most one job running in Hadoop. The view of the job is fundamentally lightweight, and if creating that object is expensive relative to actually running the job... well, I've got to believe that your jobs are simple enough that you don't need Hadoop.
Using the view to submit a job is potentially expensive (copying jars into the cluster, initializing the job in the JobTracker, and so on); conceptually, the idea of telling the JobTracker to "rerun" or "copy; run" makes sense. As far as I can tell, there's no support for either of those ideas in practice. I suspect that Hadoop isn't actually guaranteeing retention policies that would support either use case.
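In other words, build a fresh Job view per submission instead of pooling them. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API (job name and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FreshJobPerRun {
    // A submitted Job cannot be rerun, so create a new one for every submission.
    static boolean runOnce(Configuration conf, String in, String out) throws Exception {
        Job job = Job.getInstance(conf, "my-job");
        job.setJarByClass(FreshJobPerRun.class);
        // configure mapper/reducer and key/value classes here as usual
        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));
        return job.waitForCompletion(true); // blocks until this single run finishes
    }
}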

Should I use oozie to run a MapReduce task forever?

I have a MapReduce task (https://github.com/flopezluis/testing-hadoop) that reads the files in a folder and appends them to a zip. I need to run this task forever, so when it finishes processing them, it should run again. I'm reading about Oozie, but I'm not sure whether it's the best fit, because maybe it's too big for my problem.
In case Oozie is the best solution: if I write a coordinator to run every 10 minutes, what happens if the task takes more than 10 minutes? Does the coordinator wait to run the task again?
Explanation of the task
The folder is always the same. There are different zip files, one per key. The idea is to create the zip files step by step; I think this is faster than creating the zip file after all the files are processed.
The files contain something like this:
<info operationId="key1">
DATA1
</info>
<info operationId="key1">
DATA2
</info>
<info operationId="key2">
DATA3
</info>
So the zips will be like this:
key1.zip --> data1, data2
key2.zip --> data3
Thanks
You can use Oozie for this. Oozie has a setting that limits how many instances of a job can be running at once. If your first job isn't finished after ten minutes, then it will wait to run the next job.
From the Oozie documentation:
6.1.6. Coordinator Action Execution Policies
The execution policies for the actions of a coordinator job can be defined in the coordinator application.
• Timeout: A coordinator job can specify the timeout for its coordinator actions, this is, how long the coordinator action will be in WAITING or READY status before giving up on its execution.
• Concurrency: A coordinator job can specify the concurrency for its coordinator actions, this is, how many coordinator actions are allowed to run concurrently ( RUNNING status) before the coordinator engine starts throttling them.
• Execution strategy: A coordinator job can specify the execution strategy of its coordinator actions when there is backlog of coordinator actions in the coordinator engine. The different execution strategies are 'oldest first', 'newest first' and 'last one only'. A backlog normally happens because of delayed input data, concurrency control or because manual re-runs of coordinator jobs.
http://archive.cloudera.com/cdh/3/oozie-2.3.0-CDH3B4/CoordinatorFunctionalSpec.html#a6.1.6._Coordinator_Action_Execution_Policies
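A rough sketch of those controls in a coordinator definition, based on the spec linked above (name, dates, and app path are placeholders):

<coordinator-app name="zip-builder" frequency="${coord:minutes(10)}"
                 start="2012-01-01T00:00Z" end="2013-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <controls>
    <timeout>-1</timeout>         <!-- wait indefinitely instead of giving up -->
    <concurrency>1</concurrency>  <!-- at most one action RUNNING at a time -->
    <execution>FIFO</execution>   <!-- work through any backlog oldest-first -->
  </controls>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>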
Just wanted to also comment that you could have the coordinator job triggered off data arrival with a dataset, but I am not that familiar with datasets.
If all you need is to execute the same Hadoop job repeatedly on different input files, Oozie might be overkill. Installing and configuring Oozie on your testbed will also take some time. Writing a script that submits the Hadoop job repeatedly might be enough.
But anyway, Oozie can do that. If you set the concurrency to 1, there will be at most 1 Oozie coordinator action (which in your case should be a workflow that contains only one Hadoop job) in running status. But you can increase the concurrency threshold to allow more actions to execute concurrently.