How to trigger Oozie jobs on a particular condition? - hadoop

I have a folder where all my application log files get stored. When a new log file is created in the folder, Oozie should immediately trigger a Flume job that puts the log file into HDFS.
How do I trigger an Oozie job when a new log file is created in the folder?
Any help on this topic is greatly appreciated!!!

That's not how Oozie works. Oozie is a scheduler, a bit like CRON. You first specify how often a workflow should run, and you can then add the availability of files (an HDFS directory) as an additional condition through the coordinator's datasets and input-events.
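For illustration, here is a minimal coordinator sketch (all names, paths and times are hypothetical) that checks every 15 minutes and only launches its workflow once the corresponding log directory exists in HDFS and contains a _SUCCESS done-flag:

<coordinator-app name="log-ingest-coord" frequency="${coord:minutes(15)}"
                 start="2015-01-01T00:00Z" end="2016-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="rawLogs" frequency="${coord:minutes(15)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
      <!-- hypothetical HDFS layout; Oozie polls HDFS, not the local folder -->
      <uri-template>hdfs:///data/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="rawLogs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/log-ingest/workflow.xml</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

Note that the dataset URI is an HDFS location: the coordinator's data dependency is evaluated against HDFS, so it cannot react directly to a file appearing in a local folder.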

I think it's more a question of how you place the files in HDFS. You could have a parameterized Oozie job that is invoked through the Oozie Java API, passing in the name of the file created on HDFS from the client that writes to HDFS itself (unless you are streaming).
Every time an Oozie workflow is initiated it runs on a separate thread, so you can launch multiple Oozie workflow instances with different parameters.
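As a hedged sketch of the parameterized side (the parameter name and paths are made up), the workflow can declare a property that the submitting client fills in, for example with OozieClient.createConfiguration() and setProperty() before calling run():

<workflow-app name="ingest-one-file" xmlns="uri:oozie:workflow:0.4">
  <parameters>
    <!-- hypothetical parameter; the submitting client sets it as a job property -->
    <property>
      <name>logFile</name>
    </property>
  </parameters>
  <start to="move-file"/>
  <action name="move-file">
    <fs>
      <!-- paths are examples only -->
      <move source="${logFile}" target="/data/staging/${wf:id()}"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Move failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>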

Related

Max limit of oozie workflows

Does anyone have any idea what the maximum number of Oozie workflows is that can execute in parallel?
I'm running 35 workflows in parallel (or at least that's what the Oozie UI shows: they all got started in parallel). All the subworkflows ingest files from local to HDFS and then run some validation checks on the file metadata. Simple as that.
However, I see some subworkflows fail during execution; the step in which they fail tries to put the files into an HDFS location, i.e., the process was not able to execute the hdfs dfs -put command. However, when I rerun these subworkflows they run successfully.
I am not sure what caused them to fail on hdfs dfs -put.
Any clues/suggestions on what could be happening?
The first limitation does not depend on Oozie but on the resources available in YARN to execute Oozie actions, since each action is executed in one map task. That limit will not fail your workflows, though: they will just wait for resources.
The major limit we've faced, and the one that led to trouble, was the callable queue of the Oozie services. Sometimes, under heavy load created by plenty of coordinators submitting plenty of workflows, Oozie was losing more time processing its internal callable queue than running workflows :/
Check the oozie.service.CallableQueueService settings for information about this.
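For reference, these are the kind of oozie-site.xml settings involved; the values shown are only illustrative (close to the usual defaults), not recommendations:

<!-- oozie-site.xml: callable queue tuning (illustrative values) -->
<property>
  <name>oozie.service.CallableQueueService.queue.size</name>
  <value>10000</value> <!-- maximum number of callables waiting in the queue -->
</property>
<property>
  <name>oozie.service.CallableQueueService.threads</name>
  <value>10</value> <!-- number of threads draining the queue -->
</property>
<property>
  <name>oozie.service.CallableQueueService.callable.concurrency</name>
  <value>3</value> <!-- maximum concurrent callables of the same type -->
</property>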

Oozie: Does oozie generate output-events?

In Oozie, input-events are pretty straightforward: if the specified file/folder is not present, the coordinator action is kept in WAITING state. But I could not understand what output-events does.
As per my understanding, the files/folders specified in the output-events tag should be created by Oozie once all specified actions are successful. But that does not happen. I cannot find any relevant logs either, nor is the documentation clear about this.
So the question is: does Oozie really create the files/folders specified in output-events? Or does it just record that these particular files/folders are created during the workflow, with the responsibility of creating them on the jobs, not on Oozie?
Relevant piece of code can be found at https://gist.github.com/venkateshshukla/de0dc395797a7ffba153
The official Oozie documentation for the Oozie Coordinator is not very clear about the exact purpose of the output-events element. However, the book "Apache Oozie: The Workflow Scheduler for Hadoop" mentions the following:
During reprocessing of a coordinator, Oozie tries to help the retry attempt by cleaning up the output directories by default. For this, it uses the <output-events> specification in the coordinator XML to remove the old output before running the new attempt. Users can override this default behavior using the -nocleanup option.
So, in summary:
No, the files specified in output-events are not automatically created by Oozie; you need to create them in your Oozie workflow actions.
The output-events configuration tells Oozie which files will be created by your workflow actions, and Oozie uses that information to clean up the old output when rerunning/reprocessing a coordinator.
The actions always generate the data; these settings are just for control (see the sketch below).
You'll find some examples here
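To make the summary concrete, here is a hedged coordinator sketch (dataset name, paths and dates are hypothetical). The workflow action is still what writes to ${coord:dataOut('output')}; Oozie only uses the declaration to clean up that directory on reprocessing:

<coordinator-app name="daily-agg-coord" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2015-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="dailyOutput" frequency="${coord:days(1)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/agg/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <output-events>
    <!-- Oozie does not create this path; the workflow action must write it.
         On reprocessing, Oozie deletes it unless -nocleanup is used. -->
    <data-out name="output" dataset="dailyOutput">
      <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/daily-agg/workflow.xml</app-path>
      <configuration>
        <property>
          <name>outputDir</name>
          <value>${coord:dataOut('output')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>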

how to load text files into hdfs through oozie workflow in a cluster

I am trying to load text/CSV files with Hive scripts through Oozie and schedule it on a daily basis. The text files are on the local Unix file system.
I need to put those text files into HDFS before executing the Hive scripts in an Oozie workflow.
In a real cluster we don't know which node the job will run on; it will run at random on any one of the nodes in the cluster.
Can anyone provide me with a solution?
Thanks in advance.
Not sure I understand what you want to do.
The way I see it, it can't work:
Oozie server has access to HDFS files only (same as Hive)
your data is on a local filesystem somewhere
So why don't you load your files into HDFS beforehand? The transfer may be triggered either when the files become available (a post-processing action in the upstream job) or at a fixed time (using Linux CRON).
You don't even need the Hadoop libraries on the Linux box if the WebHDFS service is active on your NameNode: just use CURL and an HTTP upload.

How do I add files to distributed cache in an oozie job

I am implementing an Oozie workflow where, in the first job, I read data from a database using Sqoop and write it to HDFS. In the second job I need to read a large amount of data and use the files I just wrote in job one to process it. Here's what I thought of or tried:
Assuming job one writes the files to some directory on HDFS, adding the files to the distributed cache in the driver class of job two will not work, as the Oozie workflow knows only about the mapper and reducer classes of the job. (Please correct me if I am wrong here.)
I also tried writing to the lib directory of the workflow, hoping the files would then be added to the distributed cache automatically, but I understood that the lib directory should be read-only while the job is running.
I also thought that if I could add the files to the distributed cache in the setup() of job two, then I could access them in the mapper/reducer. I am not aware of how one can add files in setup(); is it possible?
How else can I read the output files of the previous job in the subsequent job from the distributed cache? I am already using the input directory of job two to read the data that needs to be processed, so I cannot use that.
I am using Hadoop 1.2.1 and Oozie 3.3.2 on an Ubuntu 12.04 virtual machine.
Add the elements below to add files or archives to your map-reduce action; a fuller sketch follows these snippets. Refer to this documentation for details.
<file>[FILE-PATH]</file>
...
<archive>[FILE-PATH]</archive>
You can also pass input on the java action's command line, as shown below.
<main-class>org.apache.oozie.MyFirstMainClass</main-class>
<java-opts>-Dblah</java-opts>
<arg>argument1</arg>
<arg>argument2</arg>
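Putting this together for your case, here is a hedged sketch of the map-reduce action of job two, assuming job one wrote its output to a hypothetical path /user/etl/lookup/part-00000; the #lookup.txt fragment makes the file visible in each task's working directory under that name:

<action name="process-large-data">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.mapper.class</name>
        <value>com.example.MyMapper</value>   <!-- hypothetical mapper -->
      </property>
      <property>
        <name>mapred.reducer.class</name>
        <value>com.example.MyReducer</value>  <!-- hypothetical reducer -->
      </property>
      <property>
        <name>mapred.input.dir</name>
        <value>/data/large-input</value>
      </property>
      <property>
        <name>mapred.output.dir</name>
        <value>/data/output/${wf:id()}</value>
      </property>
    </configuration>
    <!-- output of job one, shipped via the distributed cache and readable as ./lookup.txt -->
    <file>/user/etl/lookup/part-00000#lookup.txt</file>
  </map-reduce>
  <ok to="end"/>
  <error to="fail"/>
</action>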

How can I use Oozie to copy remote files into HDFS?

I have to copy remote files into HDFS. I want to use Oozie because I need to run this job every day at a specific time.
Oozie can help you create a workflow. Using Oozie you can invoke an external action capable of copying files from your source to HDFS, but Oozie will not do it automatically; a daily coordinator sketch for the scheduling side follows the suggestions below.
Here are a few suggestions:
Use a custom program to write files to HDFS, for example using a SequenceFile.Writer.
Flume might help.
Use an integration component like camel-hdfs to move files to HDFS.
FTP the files to an HDFS node and then copy them from the local disk to HDFS.
Investigate more options that might be a good fit for your case.
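For the scheduling part, a minimal coordinator sketch (times and paths are made up) that fires a copy workflow once a day at a fixed time; whichever option you pick from the list above still has to live inside an action of that workflow:

<coordinator-app name="daily-copy-coord" frequency="${coord:days(1)}"
                 start="2015-06-01T02:00Z" end="2016-06-01T02:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- this workflow must contain the action that actually pulls the remote
           files and writes them to HDFS (e.g. a shell or ssh action wrapping
           one of the options above) -->
      <app-path>hdfs:///apps/daily-copy/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>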
