I want to schedule an Oozie job based on a folder, i.e.
I have a folder in an HDFS location, and every day one file is added to that folder with the name format date.txt (e.g., 20160802.txt).
I want an Oozie job to be triggered whenever a new file is added to that folder.
Please help me with this: how can I set up the scheduling for my use case?
Thanks in advance.
Oozie workflow jobs run based on regular time intervals and/or data availability, and in some cases they can be triggered by an external event. The Coordinator comes into play here.
You can use an Oozie coordinator to check the data dependency and trigger an Oozie workflow, using the Coordinator EL functions.
In your case, a file with a timestamp in its name is added to HDFS every day, so you can achieve this with a dataset.
From the documentation:
Example: a dataset produced once every day at 00:15 PST8PDT, with the done-flag set to empty:
<dataset name="logs" frequency="${coord:days(1)}"
initial-instance="2009-02-15T08:15Z" timezone="America/Los_Angeles">
<uri-template>
hdfs://foo:9000/app/logs/${market}/${YEAR}${MONTH}/${DAY}/data
</uri-template>
<done-flag></done-flag>
</dataset>
The dataset would resolve to the following URIs, and the Coordinator checks for the existence of the directory itself ([market] will be replaced with a user-supplied property):
hdfs://foo:9000/usr/app/[market]/2009/02/15/data
hdfs://foo:9000/usr/app/[market]/2009/02/16/data
hdfs://foo:9000/usr/app/[market]/2009/02/17/data
Please read the documentation; many examples are given there, and they are good. For your case, a minimal coordinator sketch is shown after the links below.
1. About Coordinators
2. Datasets
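For the folder-watch use case above, a minimal coordinator sketch could look like the following (the host, paths, dates, and the inputFile property name are assumptions made up for the example, not anything prescribed by Oozie). It declares a daily dataset whose URI matches the YYYYMMDD.txt naming, waits for that day's file to exist, and passes the resolved path to the workflow:

<coordinator-app name="daily-file-coord" frequency="${coord:days(1)}"
                 start="2016-08-02T00:00Z" end="2017-08-02T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="daily_txt" frequency="${coord:days(1)}"
             initial-instance="2016-08-02T00:00Z" timezone="UTC">
      <!-- resolves to e.g. hdfs://namenode:8020/data/incoming/20160802.txt (assumed path) -->
      <uri-template>hdfs://namenode:8020/data/incoming/${YEAR}${MONTH}${DAY}.txt</uri-template>
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="daily_txt">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://namenode:8020/apps/daily-wf</app-path>
      <configuration>
        <property>
          <name>inputFile</name>
          <!-- resolved HDFS path of the current day's file, handed to the workflow -->
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

With an empty <done-flag>, the coordinator checks for the existence of the dataset URI itself, so the workflow for a given day is only launched once that day's file has landed.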
I have Spark code that appends data from a Hive table to Parquet files partitioned on dates. The code runs absolutely correctly when executed from the Spark shell, and the Parquet files show exactly the same number of rows as the Hive table for the corresponding date.
However, when the same code is packaged into a jar file that is invoked by a spark-submit command, and that spark-submit command is scheduled to run daily at 9 AM via NiFi, the Parquet partition files end up with fewer rows. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:
• Data in the source Hive table is updated by approximately 4 AM.
• Initially, our NiFi job was scheduled to start running at 4:45 AM, but the number of records did not match. After a manual run from the Spark shell after 6 AM, the data was an exact match.
• Hence, we scheduled the job to run at 7 AM. After this change, the data was updated correctly via the NiFi job only when the record count was low (approx. 20,000 on weekends) compared to weekdays (in the range of 150,000 to more than 200,000 records). Again, a manual run was done to backfill the missing data.
• We then postponed the job to 9 AM. After doing this, there were 2 days when the number of records matched (between 160,000 and 200,000); however, since Jul-31 the data hasn't matched at all, irrespective of the number of records on any given day, and we are having to do a manual backfill every day.
We are unable to figure out any specific reason why the code runs correctly from the Spark shell at any time, but gives incorrect results when NiFi simply executes, on schedule, the spark-submit command that runs the jar containing the same Spark code.
Please help me understand why this is happening and how I can fix it.
P.S.: I have checked the NiFi log files and could not find any of the scheduled jobs reporting an error.
Following 1 and 2:
Different types of files arrive in my NFS directory from time to time. I would like to use Oozie or any other HDFS-based solution to trigger on the file-arrival event and copy each file to a specific HDFS location according to its type. What is the best way to do this?
"Best way" is a very subjective term. It largely depends on the kind of data, its frequency, and what should happen once the data arrives at the specific location.
Apache Flume can monitor a specific folder for data availability and push the data down to any sink, such as HDFS, as-is. Flume is good for streaming data, but it does only one specific job: moving data from place to place.
Oozie Coordinators, on the other hand, have a data-availability trigger, and with Oozie you can perform all sorts of ETL operations after the data arrives, using tools like Spark, Hive, and Pig, and push the results down to HDFS using shell actions. You can schedule jobs to run at specific times or frequencies, or have a job send you an email if something goes wrong, as in the sketch below.
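As a rough illustration of the Oozie side (the script name, property names, and email address below are invented placeholders), a workflow kicked off by such a data-availability trigger could run a shell action to route the arrived files and fall through to an email action if anything fails:

<workflow-app name="ingest-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="copy-to-hdfs"/>
  <action name="copy-to-hdfs">
    <shell xmlns="uri:oozie:shell-action:0.3">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- hypothetical script that copies each file to an HDFS directory based on its type -->
      <exec>copy_by_type.sh</exec>
      <argument>${inputDir}</argument>
      <file>copy_by_type.sh</file>
    </shell>
    <ok to="end"/>
    <error to="notify-failure"/>
  </action>
  <action name="notify-failure">
    <email xmlns="uri:oozie:email-action:0.2">
      <to>ops@example.com</to>
      <subject>Ingest workflow ${wf:id()} failed</subject>
      <body>Node ${wf:lastErrorNode()} failed: ${wf:errorMessage(wf:lastErrorNode())}</body>
    </email>
    <ok to="fail"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Ingest failed at ${wf:lastErrorNode()}</message>
  </kill>
  <end name="end"/>
</workflow-app>

A coordinator like the one shown earlier would sit in front of this workflow to provide the data-availability trigger and the schedule.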
I need to perform the following workflow on my Hadoop cluster.
New files are added to an HDFS directory, /export/ (multiple times a day)
Files arrive in two formats: *_A.csv and *_B.csv
Copy all *_A.csv into /hive/dumptable_a/
Copy all *_B.csv into /hive/dumptable_b/
Run a Hive insert query to load partitioned table A from dumptable_a
Run a Hive insert query to load partitioned table B from dumptable_b
Delete the data from /hive/dumptable_a/ and /hive/dumptable_b/
Can Oozie be set up to monitor /export/ for new files and kick off the workflow?
If Oozie cannot do this, or if it is not the right tool, what is the best alternative?
Yes, as Rahul mentioned, please look at the Oozie file-based coordinator, where you can find an example of how to use the <datasets> and <input-events> elements.
You can also look at an example in the Oozie documentation here. A rough sketch of the workflow that such a coordinator would kick off is given below.
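To give a feel for how the copy, load, and cleanup steps might hang together once the coordinator fires (the script names below are illustrative assumptions; only the paths come from the question), the triggered workflow could chain a shell action for the wildcard copies, Hive actions for the partitioned loads, and an fs action for the cleanup:

<workflow-app name="export-load-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="stage-csvs"/>
  <action name="stage-csvs">
    <shell xmlns="uri:oozie:shell-action:0.3">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- assumed helper script, e.g. an hdfs dfs -cp of /export/*_A.csv into /hive/dumptable_a/
           and of /export/*_B.csv into /hive/dumptable_b/ (the shell handles the wildcards) -->
      <exec>stage_csvs.sh</exec>
      <file>stage_csvs.sh</file>
    </shell>
    <ok to="load-a"/>
    <error to="fail"/>
  </action>
  <action name="load-a">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- assumed script holding the INSERT that loads partitioned table A from dumptable_a -->
      <script>load_table_a.hql</script>
    </hive>
    <ok to="load-b"/>
    <error to="fail"/>
  </action>
  <action name="load-b">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>load_table_b.hql</script>
    </hive>
    <ok to="cleanup"/>
    <error to="fail"/>
  </action>
  <action name="cleanup">
    <fs>
      <!-- removes the staging directories; recreate them, or delete only their contents,
           depending on how your layout is managed -->
      <delete path="${nameNode}/hive/dumptable_a"/>
      <delete path="${nameNode}/hive/dumptable_b"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Load failed at ${wf:lastErrorNode()}: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>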
I need to process a lot of files for specific dates. The only solution I have found is to launch a job N times, each time with a different dataset. The partitions used are based on yyyy, mm, dd. I have a Java action that generates the right partition to use for each piece of data.
My question is: how can I create a loop to launch my script N times? I currently work with an Oozie workflow.
Thanks
This sounds like a use case for coordinators.
You can declare datasets and let Oozie automatically start a workflow when a specific dataset instance is available; a minimal sketch of the idea follows.
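One way to picture the "loop" (the dates, paths, and the partitionDate property below are invented for the sketch): a coordinator with a daily frequency and a start/end range materializes one workflow run per day, and each run receives its own nominal date, so no explicit looping is needed. Data dependencies can be added with <datasets> and <input-events> exactly as in the earlier example.

<coordinator-app name="partition-coord" frequency="${coord:days(1)}"
                 start="2016-01-01T00:00Z" end="2016-01-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${nameNode}/apps/partition-wf</app-path>
      <configuration>
        <property>
          <name>partitionDate</name>
          <!-- one workflow run is materialized per day between start and end;
               each run sees its own nominal date -->
          <value>${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

Each materialized run can then hand ${partitionDate} to your Java action so it derives the matching yyyy/mm/dd partition, instead of you looping over the dates yourself.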
I ran a Pig workflow using Oozie. The job completed successfully, but now I want to know on which slave or master node the job ran. My input file is a 1.4 GB file which is distributed across the nodes (1 master and 2 slaves).
I also want to figure out how much time Pig spent executing on each node.
Thank you in advance
Point your web browser to "JobTracker_Machine:50030" and it will present you with the MapReduce web UI. There you'll find all the jobs you have run (Running, Completed, and Retired). Click on the job you want to analyze and it will give you all the information you need, including the node where a particular task ran and the time taken to finish the task.
HTH
Go to the Oozie web console and click on the workflow (the one that contains the Pig node). Clicking on the workflow job will open a dialog box (for your workflow) containing details of all the action nodes in the workflow. Select the Pig node (the one you want to analyse) and a detailed dialog box will appear containing the Job Tracker URL of that Pig job.
There you will find all the details you are looking for.