I have a MapReduce task (https://github.com/flopezluis/testing-hadoop) that reads the files in a folder and appends them to a zip. I need to run this task forever, so when it finishes processing them it should run again. I've been reading about Oozie but I'm not sure whether it's the best fit; it may be too heavyweight for my problem.
In case Oozie is the best solution: if I write a coordinator to run every 10 minutes, what happens if the task takes more than 10 minutes? Does the coordinator wait before running the task again?
Explanation of the task
The folder is always the same. There are different zip files, one per key. The idea is to build each zip file step by step; I think this is faster than creating the zip files after all the files have been processed.
The files contain something like this:
<info operationId="key1">
DATA1
</info>
<info operationId="key1">
DATA2
</info>
<info operationId="key2">
DATA3
</info>
So the zips will be like this:
key1.zip --> data1, data2
key2.zip --> data3
Thanks
You can use Oozie for this. Oozie has a setting that limits how many instances of a job can be running at once. If your first job isn't finished after ten minutes, Oozie will wait before running the next one.
From the Oozie documentation:
6.1.6. Coordinator Action Execution Policies
The execution policies for the actions of a coordinator job can be defined in the coordinator application.
• Timeout: A coordinator job can specify the timeout for its coordinator actions, this is, how long the coordinator action will be in WAITING or READY status before giving up on its execution.
• Concurrency: A coordinator job can specify the concurrency for its coordinator actions, this is, how many coordinator actions are allowed to run concurrently ( RUNNING status) before the coordinator engine starts throttling them.
• Execution strategy: A coordinator job can specify the execution strategy of its coordinator actions when there is backlog of coordinator actions in the coordinator engine. The different execution strategies are 'oldest first', 'newest first' and 'last one only'. A backlog normally happens because of delayed input data, concurrency control or because manual re-runs of coordinator jobs.
http://archive.cloudera.com/cdh/3/oozie-2.3.0-CDH3B4/CoordinatorFunctionalSpec.html#a6.1.6._Coordinator_Action_Execution_Policies
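To make that concrete, here is a rough sketch of a coordinator using those controls (frequency is in minutes here, the control values are only examples, and ${workflowAppUri} is a placeholder you would define in job.properties):
<coordinator-app name="zip-folder-coord" frequency="10"
                 start="2014-01-01T00:00Z" end="2100-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <!-- give up on an action stuck in WAITING/READY for more than 60 minutes -->
    <timeout>60</timeout>
    <!-- never run two actions at the same time -->
    <concurrency>1</concurrency>
    <!-- FIFO = 'oldest first'; LIFO and LAST_ONLY are the other strategies -->
    <execution>FIFO</execution>
  </controls>
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
    </workflow>
  </action>
</coordinator-app>
With concurrency 1 and FIFO execution, an action that takes longer than 10 minutes simply delays the next materialized action instead of overlapping with it.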
Just wanted to also comment that you could have the coordinator job triggered off data arrival with a dataset, but I am not that familiar with datasets.
If all you need is to execute the same Hadoop job repeatedly on different input files, Oozie might be overkill; installing and configuring Oozie on your testbed will also take some time. Writing a script that submits the Hadoop job repeatedly might be enough.
But anyway, Oozie can do that. If you set the concurrency to 1, there will be at most one coordinator action (which in your case should be a workflow that contains only one Hadoop job) in RUNNING status at a time. You can increase the concurrency threshold to allow more actions to execute concurrently.
Related
I have a requirement where I need to monitor Hadoop jobs (Hive/MapReduce, Spark) that run for a long time, say around 3 hours, in the cluster. I know I can view all these jobs in the UI, but I need to check them every hour or every 30 minutes and send email alerts if a job has been running for more than 3 hours. Is there a way to do this?
My environment is HDP 2.6
Thanks in Advance....
You can look into Oozie. Oozie allows you to configure alerts if a job exceeds its expected run-time.
In order to use this feature you'd have to submit your job as an Oozie workflow.
http://oozie.apache.org/docs/4.2.0/DG_Overview.html
https://oozie.apache.org/docs/4.3.0/DG_SLAMonitoring.html#SLA_Definition_in_Workflow
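As a rough sketch (the property names, job configuration and email address below are only examples, not anything from your cluster), an SLA section attached to a workflow action could look something like this:
<workflow-app name="long-job-wf" xmlns="uri:oozie:workflow:0.5"
              xmlns:sla="uri:oozie:sla:0.2">
  <start to="long-running-step"/>
  <action name="long-running-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- the real job configuration (mapper/reducer classes, input/output dirs) goes here -->
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
    <!-- raise SLA events if this step is still running after 3 hours -->
    <sla:info>
      <sla:nominal-time>${nominalTime}</sla:nominal-time>
      <sla:should-end>${3 * HOURS}</sla:should-end>
      <sla:max-duration>${3 * HOURS}</sla:max-duration>
      <sla:alert-events>duration_miss,end_miss</sla:alert-events>
      <sla:alert-contact>ops-team@example.com</sla:alert-contact>
    </sla:info>
  </action>
  <kill name="fail">
    <message>Job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
Oozie then records the SLA misses for that action, and if the SLA email settings are enabled on the server, the alert contact gets notified.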
As tk421 mentions, Oozie is the "right" way to do this in the context of Hadoop.
However, if you do not require all that overhead, something simple like an on-demand watchdog timer might be sufficient (e.g. wdt.io). The workflow is basically: send the start signal, start the job, and send an end signal when the job completes. If the second signal does not come in within the allotted amount of time, an email/SMS alert is dispatched.
This method would work for non-Hadoop workflows as well.
I have a question about Apache Oozie and more specifically on the CDH distribution.
What happens to a coordinator when the workflow it uses has been modified?
For example, the workflow now uses an extra parameter which is automatically filled in by a variable. In theory this would not require any changes to the coordinator.
Do running coordinators still use the configuration of the initial workflow, or do they dynamically adapt to the new one? If they still use the old configuration, do I then need to define a new coordinator, or is resubmitting the same coordinator enough?
This is how it works: Every submitted coordinator has a fixed set of variables and parameters (config file). The -change option allows you to change the following attributes of the coordinator:
endtime: the end time of the coordinator job.
concurrency: the concurrency of the coordinator job.
pausetime: the pause time of the coordinator job.
Everything with the exception of the coordinator name, frequency, start time, end time and timezone can be changed with the -update option. For details see the official documentation:
http://oozie.apache.org/docs/4.3.0/DG_CommandLineTool.html#Updating_coordinator_definition_and_properties
In the config file you usually point to a coordinator file in HDFS, which in turn points to a workflow file in HDFS. If you change either of these in HDFS, the next time the coordinator triggers it will use the new/modified files. The same holds true for all files used in workflow actions, e.g. shell scripts, JAR files, ...
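To make that concrete, a coordinator definition typically references the workflow only by its HDFS path, along the lines of this sketch (${workflowAppUri}, ${startTime} and ${endTime} are made-up properties you would set in the config file):
<coordinator-app name="my-coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- HDFS directory containing workflow.xml -->
      <app-path>${workflowAppUri}</app-path>
      <configuration>
        <!-- parameters handed to the workflow; a newly added workflow
             parameter can simply be filled in here -->
        <property>
          <name>runTime</name>
          <value>${coord:nominalTime()}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
Since the workflow is looked up under that app-path each time an action is triggered, dropping a modified workflow.xml into that directory is enough; the coordinator itself does not need to be resubmitted for such a change.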
I have a sequence of mapreduce jobs that need to be run. I was wondering if there is any advantage of using Oozie for that, instead of having "one big driver" that will run that sequence?
I know that Oozie can be used to run multiple actions of different types, e.g. a Pig script, a shell script, an MR job, but I'm specifically interested in whether I should split my two jobs and run them with Oozie, or have a single jar do that.
Oozie is a scheduler - crude, poorly documented, but a scheduler.
• if you don't need scheduling per se, or if cron on an edge node is sufficient,
• if you want to handle your workflow logic by yourself (e.g. conditional branching, parallel executions with waiting for stragglers, calling generic sub-workflows with ad hoc parameters, e-mail alerts on errors (sketched below), <insert your pet feature here>) or don't need any fancy logic,
• if you handle your execution logs and state history by yourself, or don't care about history,
... well, then don't use a scheduler.
PS: you also have Luigi (Spotify) and Azkaban (LinkedIn) as alternative Hadoop schedulers.
[edit] extra point to consider: if your "driver" crashes for whatever reason, you may not have a chance to send an alert; but if run from Oozie, the crash will be detected eventually (may take as much as 30 min. in a corner case e.g. AM job self-destruction due to YARN RM failover)
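For reference, here is a rough sketch of what the "split into an Oozie workflow" option could look like for two chained MR jobs with an e-mail alert on errors (action bodies are trimmed to comments; names and the address are made up, and the email action needs an SMTP host configured in oozie-site.xml):
<workflow-app name="two-step-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="first-job"/>
  <action name="first-job">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- mapper/reducer classes, input/output dirs of job 1 -->
      </configuration>
    </map-reduce>
    <ok to="second-job"/>
    <error to="notify-failure"/>
  </action>
  <action name="second-job">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- mapper/reducer classes, input/output dirs of job 2 -->
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="notify-failure"/>
  </action>
  <!-- one of the "pet features" from the list above: e-mail alert on errors -->
  <action name="notify-failure">
    <email xmlns="uri:oozie:email-action:0.2">
      <to>ops-team@example.com</to>
      <subject>Workflow ${wf:id()} failed</subject>
      <body>Failed node: ${wf:lastErrorNode()}</body>
    </email>
    <ok to="fail"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed at ${wf:lastErrorNode()}</message>
  </kill>
  <end name="end"/>
</workflow-app>
With a single driver jar you would code the same chaining by hand and, as noted above, handle logs, history and crash detection yourself.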
I would like to store/alter a flag (it will change occasionally) at the end of a MapReduce job. This job will be scheduled to run every 30 minutes. At first it will store the flag, and then, when a validation fails in the job, it will alter the flag (I would like to keep this state for the next job), which will be checked at each execution of the job. I'm not sure what the best way to store this flag is.
To chain MapReduce jobs check this out: https://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
However, if you require the jobs to run every x minutes, try Oozie for scheduling them. If you are on AWS, check out Data Pipeline; it does exactly what you want.
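If you do go the Oozie route, one possible pattern for the flag itself (only a sketch: ${flagPath} is a made-up property holding a fully qualified HDFS path, and the real MR/validation steps are reduced to comments) is a zero-length marker file in HDFS, checked with a decision node via fs:exists and set or cleared with fs actions:
<workflow-app name="flag-demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="check-flag"/>
  <!-- branch on whether a previous run left the marker file behind -->
  <decision name="check-flag">
    <switch>
      <case to="clear-flag">${fs:exists(flagPath)}</case>
      <default to="set-flag"/>
    </switch>
  </decision>
  <!-- in the real workflow this branch would run the "flag was set" logic
       before (or instead of) simply deleting the marker -->
  <action name="clear-flag">
    <fs>
      <delete path="${flagPath}"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <!-- ... and this branch would run the MR job and only create the marker
       when its validation fails -->
  <action name="set-flag">
    <fs>
      <touchz path="${flagPath}"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Could not update flag at ${flagPath}</message>
  </kill>
  <end name="end"/>
</workflow-app>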
I have a coordinator job in Oozie. It calls the workflow with a java action node.
If I submit this job only once, then it works perfectly. However, if I submit this job twice with the same start and end time, but a different arg1 to the Main class, then both the job instances hang in the "RUNNING" state and the logs look like this:
>>> Invoking Main class now >>>
Heart beat
Heart beat
Heart beat
Heart beat
...
If I kill one of the jobs, then the other one starts running again.
The documentation states that it is possible to submit multiple instances of the same coordinator job with different parameters: http://archive.cloudera.com/cdh/3/oozie/CoordinatorFunctionalSpec.html#a6.3._Synchronous_Coordinator_Application_Definition
"concurrency: The maximum number of actions for this job that can be running at the same time. This value allows to materialize and submit multiple instances of the coordinator app, and allows operations to catchup on delayed processing. The default value is 1 ."
So what am I doing wrong? I even saw two instances of the workflow action from the same job being in the "RUNNING" state which ran fine once the other job was killed.
OK, I found the issue. It was related to HBase concurrency and not enough task slots in the cluster. Setting the following property in the mapred-site.xml file fixes the issue:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>50</value>
</property>
It was similar to this issue : https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/v0BHtQ0hlBg