Oozie: Does oozie generate output-events? - hadoop

In Oozie, input-events are pretty straightforward: if the specified file/folder is not present, the coordinator job is kept in WAITING state. But I could not understand what output-events does.
As per my understanding, the files/folders specified in the output-events tag should be created by Oozie if all the specified actions succeed. But that does not happen. I cannot find any relevant logs either, nor is the documentation clear about this.
So, the question is, does Oozie really create files/folders specified in output-events? Or does it just mention that these particular files/folders are created during the workflow and the responsibility of creation is on jobs, not on Oozie?
Relevant piece of code can be found at https://gist.github.com/venkateshshukla/de0dc395797a7ffba153

The official Oozie documentation for Oozie Coordinator is not very clear on the exact purpose of the output-events element. However, the book "Apache Oozie: The Workflow Scheduler for Hadoop" mentions the following:
During reprocessing of a coordinator, Oozie tries to help the retry attempt by cleaning up the output directories by default. For this, it uses the <output-events> specification in the coordinator XML to remove the old output before running the new attempt. Users can override this default behavior using the -nocleanup option.
So, in summary:
No, the files specified in output-events are not created automatically by Oozie; you need to create them in your Oozie workflow actions.
The output-events configuration tells Oozie which files your workflow actions will create, and Oozie uses that information to clean up those files when rerunning/reprocessing a coordinator.
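For reference, here is a minimal sketch of how output-events is usually wired to a dataset in a coordinator; the dataset name, paths, and frequency below are made up for illustration:
<coordinator-app name="example-coord" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2015-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- hypothetical dataset describing where the workflow is expected to write -->
    <dataset name="out" frequency="${coord:days(1)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/output/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <output-events>
    <data-out name="output" dataset="out">
      <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
      <configuration>
        <property>
          <!-- the workflow action must actually write here; Oozie only tracks it -->
          <name>outputDir</name>
          <value>${coord:dataOut('output')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
Oozie resolves ${coord:dataOut('output')} to the dataset URI and, on reprocessing, deletes it before the retry (unless -nocleanup is used); creating the data remains the job's responsibility.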

The actions always generate the data; these settings are just for control.
You'll find some examples here

Related

How to get the scheduler of an already finished job in Yarn Hadoop?

So I'm in this situation where I'm modifying mapred-site.xml and the scheduler-specific configuration files for Hadoop, and I just want to make sure that the modifications I have made to the default scheduler (FIFO) have actually taken effect.
How can I check which scheduler was applied to a job, or to a queue of jobs already submitted to Hadoop, using the job id?
Sorry if this doesn't make much sense, but I've looked around quite extensively to wrap my head around it and read a lot of documentation, yet I still cannot seem to find this fundamental piece of information.
I'm simply running word count as a job and changing scheduler settings in mapred-site.xml and yarn-site.xml.
For instance, I'm changing the property "yarn.resourcemanager.scheduler.class" to "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler", following instructions I found online.
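Roughly, the change in my yarn-site.xml looks like this (nothing else here, just that one property):
<!-- yarn-site.xml: switch the ResourceManager to the CapacityScheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>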
I'm also moving appropriate jar files specific to the schedulers to the correct directory.
For your reference, I'm using the "yarn" runtime mode, with Cloudera's distribution of Hadoop 2.
Thanks a ton for your help

Max limit of oozie workflows

Does anyone have any idea on what's the maximum limit of oozie workflows that can execute in parallel?
I'm running 35 workflows in parallel (or at least that's what the Oozie UI says: that they all started in parallel). All the subworkflows ingest files from local disk into HDFS and then run some validation checks on the file metadata. Simple as that.
However, I see some subworkflows fail during execution; the step in which they fail tries to put the files into the HDFS location, i.e., the process wasn't able to execute the hdfs dfs -put command. When I rerun these subworkflows, they run successfully.
I'm not sure what caused them to fail on hdfs dfs -put.
Any clues/suggestions on what could be happening?
The first limitation does not depend on Oozie but on the resources available in YARN to execute Oozie actions, since each action is executed in one map task. But this limit will not fail your workflows: they will just wait for resources.
The major limit we've faced, and the one that led to trouble, was on the callable queue of Oozie services. Sometimes, under heavy load created by plenty of coordinators submitting plenty of workflows, Oozie was losing more time processing its internal callable queue than running workflows :/
Check the oozie.service.CallableQueueService settings for information about this.
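To give an idea of the knobs involved, these are the relevant properties in oozie-site.xml; the values below are only examples, and the defaults in oozie-default.xml may differ between versions:
<!-- oozie-site.xml: callable queue tuning (example values only) -->
<property>
  <name>oozie.service.CallableQueueService.queue.size</name>
  <value>10000</value>
</property>
<property>
  <name>oozie.service.CallableQueueService.threads</name>
  <value>10</value>
</property>
<property>
  <name>oozie.service.CallableQueueService.callable.concurrency</name>
  <value>3</value>
</property>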

How do I add files to distributed cache in an oozie job

I am implementing an oozie workflow where, in the first job I am reading data from a database using sqoop and writing it to hdfs. In the second job I need to read a large amount of data and use the files I just wrote in job one to process the large data. Here's what I thought of or tried:
Assuming job one writes its files to some directory on HDFS, adding the files to the distributed cache in the driver class of job two will not work, as the Oozie workflow only knows about the mapper and reducer classes of the job. (Please correct me if I am wrong here.)
I also tried writing to the lib directory of the workflow, hoping the files would then be added to the distributed cache automatically, but I understand the lib directory is read-only while the job is running.
I also thought that if I could add the files to the distributed cache in the setup() of job two, then I could access them in the mapper/reducer. I am not aware of how one can add files in setup(); is it possible?
How else can I read the output files of the previous job in the subsequent job from the distributed cache? I am already using the input directory of job two to read the data that needs to be processed, so I cannot use that.
I am using Hadoop 1.2.1, Oozie 3.3.2 on Ubuntu 12.04 virtual machine.
Add the elements below to make files or archives available to your map-reduce action. Refer to the Oozie workflow documentation for details.
<file>[FILE-PATH]</file>
...
<archive>[FILE-PATH]</archive>
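As a concrete (hypothetical) sketch: if the sqoop action wrote its output under /user/me/sqoop-out, the map-reduce action of job two could pull one of those files into the distributed cache and expose it under a symlink name using the # syntax:
<map-reduce>
  ...
  <!-- hypothetical path to job one's output; "lookup.txt" is the local symlink name -->
  <file>/user/me/sqoop-out/part-m-00000#lookup.txt</file>
</map-reduce>
The mapper/reducer should then be able to open lookup.txt as a local file from its working directory.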
You can also pass input on the java action's command line, as shown below.
<main-class>org.apache.oozie.MyFirstMainClass</main-class>
<java-opts>-Dblah</java-opts>
<arg>argument1</arg>
<arg>argument2</arg>

How to trigger Oozie jobs on particular condition?

I have a folder where all my application log files get stored. Whenever a new log file is created in that folder, Oozie should immediately trigger a Flume job that puts the log file into HDFS.
How can I trigger an Oozie job when a new log file is created in the folder?
Any help on this topic is greatly appreciated!
That's not how Oozie works. Oozie is a scheduler, a bit like cron: you specify how often a workflow should run, and you can then add file availability as an additional condition.
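For example, a coordinator can be made to wait for new data by pointing an input-events dataset at the folder and using a done-flag file; the paths, names, and frequency below are placeholders:
<coordinator-app name="log-coord" frequency="${coord:hours(1)}"
                 start="2015-01-01T00:00Z" end="2015-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="logs" frequency="${coord:hours(1)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <!-- the coordinator action stays in WAITING until this flag file appears -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
Note that this still polls on the coordinator's schedule rather than reacting the instant a file lands, which is why it is closer to cron than to an event trigger.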
I think it's more a question of how you place the files in HDFS. You could always have a parameterized Oozie job that is invoked through the Oozie Java API, passing in the name of the newly created HDFS file from the client that writes to HDFS itself (unless you are streaming).
Every time an Oozie workflow is initiated it runs on a separate thread, which would allow you to call multiple Oozie instances with different parameters.

Best practices for using Oozie for Hadoop

I have been using Hadoop for quite a while now. After some time I realized I needed to chain Hadoop jobs and have some kind of workflow. I decided to use Oozie, but couldn't find much information about best practices. I would like to hear from more experienced folks.
Best Regards
The best way to learn Oozie is to download the examples tar file that comes with the distribution and run each of them. It has examples for MapReduce, Pig, and streaming workflows, as well as sample coordinator XMLs.
First run the normal workflows and, once you have those debugged, move on to running the workflows with a coordinator so that you can take it step by step. Lastly, one best practice is to make most of the variables in your workflow and coordinator configurable and supply them through a .properties file, so that you don't have to touch the XML often.
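As a small sketch of that last point (the action name and variables here are only illustrative), the workflow references variables instead of hard-coded values:
<!-- workflow.xml fragment: ${jobTracker} and ${nameNode} come from the .properties file -->
<action name="my-mr-action">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    ...
  </map-reduce>
  <ok to="end"/>
  <error to="fail"/>
</action>
The job is then submitted with something like oozie job -config job.properties -run, where job.properties defines jobTracker, nameNode, and any application-specific values.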
http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html
There is documentation for Oozie on GitHub and at Apache.
https://github.com/yahoo/oozie/wiki
http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html
http://incubator.apache.org/oozie/index.html
The Apache documentation is being updated and should be live soon.
