I am using Oozie workflows to import many tables from different Oracle servers. Currently, I have developed a workflow for each of the tables that I want to sqoop into Hadoop. Each workflow does a basic Sqoop import, then some transformation and creation of Hive tables.
Where I have got stuck is the scheduling: I can schedule one workflow to run, which is fine (I have done that), but then I want the rest of the workflows to execute on completion of the previous one.
I have been looking at bundles but have not managed to find a solution. I hope some of you can help.
Thanks.
You can create a parent (wrapper) workflow that calls each workflow in series, as part of each action's ok transition. This is documented as the sub-workflow action:
http://oozie.apache.org/docs/3.3.2/WorkflowFunctionalSpec.html#a3.2.6_Sub-workflow_Action
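A minimal sketch of such a wrapper workflow might look like the following (the workflow names, application paths and property names are just illustrative, not taken from your setup):

<workflow-app name="master-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="table-a-wf"/>

    <!-- First table's import/transform workflow -->
    <action name="table-a-wf">
        <sub-workflow>
            <app-path>${nameNode}/apps/oozie/table_a_workflow</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="table-b-wf"/>   <!-- chain to the next workflow on success -->
        <error to="fail"/>
    </action>

    <!-- Second table's workflow runs only after the first one succeeds -->
    <action name="table-b-wf">
        <sub-workflow>
            <app-path>${nameNode}/apps/oozie/table_b_workflow</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Sub-workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>

Each action's ok transition points at the next sub-workflow, so the existing workflows run in series, and a single coordinator or manual submission only needs to target this parent workflow.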
What I would like to do is make workflow and job metadata, such as start date, end date and status, available in a Hive table to be consumed by a BI tool for visualization purposes. I would like to be able to monitor, for example, whether a certain workflow tends to fail at certain hours, its success rate, and so on.
For this purpose I need access to the same data Hue is able to show in the job browser and the Oozie dashboard. For workflows specifically, I am looking for the name, submitter, status, and start and end time. The reason I want this is that, in my opinion, the tool lacks a general overview and good search.
The idea is that once I locate this data I will load it into Hive, either directly or through some processing steps.
Questions that I would like to see answered:
Is this data stored in HDFS, or is it scattered across the data nodes' local disks?
If it is stored in HDFS, where can I find it? If it is stored locally on the data nodes, how does Hue find and show it?
Assuming I can access the data, in what format should I expect it? Is it stored in general log files, or can I expect somewhat structured data?
I am using CDH 5.8
If jobs are submitted in ways other than through Oozie, my approach won't be helpful.
We collected all the job data from the Oozie server through the Oozie Java API and iterated over the coordinator information to get the required details.
You need to decide what kind of information you want to retrieve.
If all your jobs are submitted through a bundle, walk from the bundle to its coordinators and then to the workflows to find the information.
If you just want all the coordinator info, simply call the API with the number of coordinators to fetch and pull out the required fields.
We then loaded the fetched results into a Hive table, where one can filter for failed or timed-out coordinators and various other parameters.
You can start by looking at the example on the Oozie site:
https://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html#Java_API_Example
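As a very rough sketch of the approach (the server URL and the exact fields printed are assumptions on my side; the calls themselves are from the standard Oozie Java client API):

import org.apache.oozie.client.CoordinatorJob;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieJobDump {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server URL
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        // Fetch the first 100 coordinator jobs (empty filter = all jobs)
        for (CoordinatorJob coord : client.getCoordJobsInfo("", 1, 100)) {
            System.out.println(coord.getAppName() + "\t" + coord.getStatus());
        }

        // Fetch workflow jobs and print the fields to be loaded into Hive
        for (WorkflowJob wf : client.getJobsInfo("", 1, 100)) {
            System.out.println(wf.getAppName() + "\t" + wf.getUser() + "\t"
                    + wf.getStatus() + "\t" + wf.getStartTime() + "\t" + wf.getEndTime());
        }
    }
}

The tab-separated output can then be written to a file and loaded into a Hive table for querying from the BI tool.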
If you want to track the status of your jobs scheduled in Oozie, you should use the Oozie RESTful API or the Java API. I haven't worked with the Hue front end for operating Oozie, but I would guess it still uses the REST API behind the scenes. The API provides all the necessary information, and you can create a service that consumes this data and pushes it into a Hive table.
Another option is to access the Oozie database. As you probably know, Oozie keeps all the data about scheduled jobs in an RDBMS such as MySQL or Postgres. You can consume this information through a JDBC connector. An interesting approach would be to try to expose this information directly in Hive as a set of external tables through the JdbcStorageHandler. I am not sure whether it works, but it is worth a try.
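If you want to experiment with that last idea, a very rough sketch of such an external table is below. It uses Hive's JdbcStorageHandler (only available in newer Hive releases, not in CDH 5.8's Hive); the connection details and the column list for Oozie's WF_JOBS table are assumptions, so treat it purely as a starting point.

-- Hypothetical: expose Oozie's workflow-jobs table to Hive via JDBC.
-- Connection details and column names are placeholders to adapt to your Oozie database.
CREATE EXTERNAL TABLE oozie_wf_jobs (
  id         STRING,
  app_name   STRING,
  user_name  STRING,
  status     STRING,
  start_time TIMESTAMP,
  end_time   TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver"   = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url"      = "jdbc:mysql://oozie-db-host:3306/oozie",
  "hive.sql.dbcp.username" = "oozie",
  "hive.sql.dbcp.password" = "secret",
  "hive.sql.query"         = "SELECT id, app_name, user_name, status, start_time, end_time FROM WF_JOBS"
);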
I need some clarification regarding the Oozie launcher job.
1) Is the launcher job launched per workflow application (with several actions) or per action within a workflow application?
2) Use case: I have workflows that contain multiple shell actions (which internally execute Spark, Hive, Pig jobs, etc.). The reason for using shell is that additional parameters, like the partition date, can be computed using custom logic and passed to Hive via .q files.
Example Shell File:
hive -hiveconf DATABASE_NAME=$1 -hiveconf MASTER_TABLE_NAME=$2 -hiveconf SOURCE_TABLE_NAME=$3 -f $4
Example .q File:
use ${hiveconf:DATABASE_NAME};
insert overwrite table ${hiveconf:MASTER_TABLE_NAME} select * from ${hiveconf:SOURCE_TABLE_NAME};
I set the oozie.launcher.mapreduce.job.queuename and mapreduce.job.queuename to different queues to avoid starvation of task slots in a single queue. I also omitted the <capture-output></capture-output> in the corresponding shell action. However, I still see the launcher job occupying a lot of memory from the launcher queue.
Is this because the launcher job caches the log output that comes from Hive?
Is it necessary to give the launcher job enough memory when executing a shell action the way I am?
What would happen if I explicitly limited the launcher job memory?
I would highly appreciate it if someone could outline the responsibilities of the Oozie launcher job.
Thanks!
Is the launcher job launched per workflow application (with several actions) or per action within a workflow application?
The launcher job is launched per action in the workflow.
I would highly recommend using the respective Oozie actions (Hive, Pig, etc.) instead, because that lets Oozie handle your workflow and actions in a better manner.
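As a rough sketch of what your shell step could look like as a native Hive action (the schema version, queue names, memory value and variable names are illustrative; the oozie.launcher.* prefix is the usual way to configure the launcher job itself):

<action name="load-master-table">
    <hive xmlns="uri:oozie:hive-action:0.5">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- keep the small launcher in its own queue and cap its memory -->
            <property>
                <name>oozie.launcher.mapreduce.job.queuename</name>
                <value>launcher</value>
            </property>
            <property>
                <name>oozie.launcher.mapreduce.map.memory.mb</name>
                <value>1024</value>
            </property>
            <!-- the real Hive MR tasks go to the work queue -->
            <property>
                <name>mapreduce.job.queuename</name>
                <value>etl</value>
            </property>
        </configuration>
        <script>load_master_table.q</script>
        <!-- <param> values are available inside the script as ${DATABASE_NAME}, etc. -->
        <param>DATABASE_NAME=${databaseName}</param>
        <param>MASTER_TABLE_NAME=${masterTableName}</param>
        <param>SOURCE_TABLE_NAME=${sourceTableName}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>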
I am looking for a workflow tool to run complex map-reduce jobs. I have Oozie in mind but also want to explore Cascading. Is there any sample code or example that chains existing M/R jobs using the Cascading API? Also, can you provide a comparison of Oozie vs. Cascading?
Cascading and Oozie are not in the same category.
Oozie is a workflow scheduler.
Cascading is an API for creating workflows. It is agnostic about schedulers, i.e., it should run with whatever scheduler system that you use.
There is perhaps some confusion because the Oozie docs mention a "DAG", and both run atop Hadoop.
Also, Cascading has a notion of "data availability" in the checkpoint support, which is supported in Oozie, albeit differently.
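For a feel of the API side of the question, here is a minimal Cascading 2.x-style sketch that chains two trivial flows into a cascade; the HDFS paths and names are hypothetical, and real jobs would add operations to the pipes:

import java.util.Properties;

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class ChainedFlows {
    public static void main(String[] args) {
        Properties props = new Properties();

        // Flow 1: copy raw input to an intermediate location
        Tap rawIn = new Hfs(new TextLine(), "hdfs:///data/raw");
        Tap stage = new Hfs(new TextLine(), "hdfs:///data/stage");
        Pipe copy = new Pipe("copy");
        Flow stageFlow = new HadoopFlowConnector(props).connect(
                FlowDef.flowDef().setName("stage").addSource(copy, rawIn).addTailSink(copy, stage));

        // Flow 2: reads the output of flow 1
        Tap out = new Hfs(new TextLine(), "hdfs:///data/out");
        Pipe pass = new Pipe("pass");
        Flow outFlow = new HadoopFlowConnector(props).connect(
                FlowDef.flowDef().setName("out").addSource(pass, stage).addTailSink(pass, out));

        // The cascade works out the dependency (stage -> out) from the shared tap
        Cascade cascade = new CascadeConnector(props).connect(stageFlow, outFlow);
        cascade.complete();
    }
}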
Personally, I have played around with both to some extent. What I found interesting with Cascading is:
1) Concise and expressive in terms of simple concepts like flow, tap, pipe, etc.
2) A great TDD-based approach for local development and research.
3) A nice planner view (.dot file), which is useful once the project has grown, so maintenance is easy.
4) DSL-based approaches using Groovy, Scala, or Clojure, so there is no need to learn yet another language just for Hadoop.
5) Simple cloud deployment (e.g., Amazon support as a raw JAR deployment).
6) You can call anything, such as existing Pig or Hive scripts or other MR JARs, as long as they expose a Java API.
7) Great for ML- and NLP-related work.
I am a beginner to Hadoop.
As per my understanding, the Hadoop framework runs jobs in FIFO order (the default scheduling).
Is there any way to tell the framework to run the job at a particular time?
i.e., is there any way to configure it to run a job daily at, say, 3 PM?
Any input on this is greatly appreciated.
Thanks, R
What about calling the job from an external Java scheduling framework, like Quartz? Then you can run the job whenever you want.
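A minimal Quartz 2.x-style sketch (the job class body is just a placeholder for your own submission logic, and the cron expression below fires daily at 15:00):

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class DailyHadoopJob implements Job {
    @Override
    public void execute(JobExecutionContext context) {
        // Submit your Hadoop job here, e.g. via the Job/JobClient API or by shelling out
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(DailyHadoopJob.class)
                .withIdentity("dailyHadoopJob")
                .build();

        // Fire every day at 15:00 (3 PM)
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 15 * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}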
You might consider using Oozie (http://yahoo.github.com/oozie/). Among other things, it allows:
Frequency execution: Oozie workflow specification supports both data and time triggers. Users can specify execution frequency and can wait for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work with any of them, so probably nothing in your Hadoop configuration will change.
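For the concrete "daily at 3 PM" case, a sketch of a coordinator definition could look like this (the application path, dates and timezone are placeholders):

<coordinator-app name="daily-3pm" frequency="${coord:days(1)}"
                 start="2013-01-01T15:00Z" end="2099-12-31T15:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${nameNode}/apps/oozie/my_workflow</app-path>
        </workflow>
    </action>
</coordinator-app>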
How about having a script that executes your Hadoop job, and then using the at command to run it at a specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
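For example, a crontab entry along these lines (the script path is hypothetical) would run the job every day at 3 PM:

# m h dom mon dow command
0 15 * * * /home/hadoop/bin/run_my_job.sh >> /home/hadoop/logs/my_job.log 2>&1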
I'd use a commercial scheduling app and/or a custom workflow solution if cron does not cut it. We use a solution called JAMS, but keep in mind it's .NET-oriented.
I have been using Hadoop for quite a while now. After some time I realized I needed to chain Hadoop jobs and have some type of workflow. I decided to use Oozie, but couldn't find much information about best practices. I would like to hear from more experienced folks.
Best Regards
The best way to learn Oozie is to download the examples tar file that comes with the distribution and run each of them. It has examples for MapReduce, Pig, and streaming workflows, as well as sample coordinator XMLs.
First run the normal workflows, and once you have debugged those, move on to running the workflows with a coordinator so that you can take it step by step. Lastly, one best practice is to make most of the variables in your workflow and coordinator configurable and supplied through a component.properties file, so that you don't have to touch the XML often (a sample is below).
http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html
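For example, a purely illustrative properties file and submit command might look like this; the property names depend on your own workflow and coordinator definitions:

# component.properties (names are illustrative)
nameNode=hdfs://namenode-host:8020
jobTracker=jobtracker-host:8021
oozie.coord.application.path=${nameNode}/apps/oozie/my_app
databaseName=analytics
sourceTableName=staging_orders

and then:

oozie job -oozie http://oozie-host:11000/oozie -config component.properties -run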
There is documentation about Oozie on GitHub and from Apache:
https://github.com/yahoo/oozie/wiki
http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html
http://incubator.apache.org/oozie/index.html
The Apache documentation is being updated and should be live soon.