I have a requirement to run Pig jobs in sequence without manual interaction.
Could you please advise whether there is any way to automate Pig jobs, using Pig or some other tool?
Assume the jobs are:
JOB001
JOB002
JOB003
JOB004
JOB001 -- my first job --> after a successful run of JOB001, it should trigger JOB002
JOB002 --> after a successful run of JOB002, it should trigger JOB003
JOB003 --> after a successful run of JOB003, it should trigger JOB004.
Oozie is the tool for you.
Simply create a workflow connecting one Pig job to another.
Oozie is designed to schedule Hadoop jobs. See here for running Pig jobs in Oozie; the Quick Start guide will help you begin with Oozie.
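For example, a minimal workflow sketch chaining the first two jobs could look like the following (the script names JOB001.pig/JOB002.pig, the ${jobTracker}/${nameNode} parameters and the action names are assumptions for illustration; JOB003 and JOB004 would follow the same pattern):

<workflow-app name="pig-job-chain" xmlns="uri:oozie:workflow:0.5">
    <start to="JOB001"/>
    <action name="JOB001">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>JOB001.pig</script>
        </pig>
        <ok to="JOB002"/>   <!-- success: trigger the next job -->
        <error to="fail"/>
    </action>
    <action name="JOB002">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>JOB002.pig</script>
        </pig>
        <ok to="end"/>      <!-- JOB003 and JOB004 continue the chain the same way -->
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>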
Could someone please help me get the Oozie error logs into a Hive table when jobs fail? Please suggest an approach, as I am new to this.
Write a shell script to pick up the Oozie job logs using:
oozie job -oozie http://localhost:11000 -info <wfid>
oozie job -oozie http://localhost:11000 -log <wfid>
Redirect its output into a file, which you can then load into a Hive table, and use an Oozie shell action to trigger this step on failure.
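A rough sketch of such a script (the workflow id argument, the temporary and HDFS paths, and the Hive table name oozie_error_logs are assumptions for illustration; the target table is assumed to exist already):

#!/bin/bash
# Sketch: capture the info and log output of a failed Oozie workflow and load it into Hive.
# $1 is the workflow id; the paths and table name are placeholders.
WF_ID=$1
OOZIE_URL=http://localhost:11000

oozie job -oozie $OOZIE_URL -info $WF_ID  > /tmp/${WF_ID}.log
oozie job -oozie $OOZIE_URL -log  $WF_ID >> /tmp/${WF_ID}.log

# Move the file to HDFS and load it into the pre-created Hive table.
hadoop fs -put /tmp/${WF_ID}.log /user/oozie/failed_logs/${WF_ID}.log
hive -e "LOAD DATA INPATH '/user/oozie/failed_logs/${WF_ID}.log' INTO TABLE oozie_error_logs"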
I have a problem with Sqoop and Spark Streaming jobs running together.
When I start a Spark Streaming job and then a Sqoop job, the Sqoop job stays in the ACCEPTED state and cannot start. However, after killing the Spark job, the Sqoop job runs properly.
I really don't know what the problem is.
I want to run Pig in local mode, which is very easy:
pig -x local file.pig
My requirement is to run Pig in local mode from Oozie.
Is that possible? I think Oozie will automatically launch a map task first.
It's possible. When a Pig script is run by Oozie, it runs as a one-map map-reduce job that only executes the Pig script, which in turn launches other map-reduce jobs (when Pig is run in mapred mode).
It seems that the Pig action configuration doesn't allow running in local mode, but you can still run a Pig script in local mode using the shell action type. You only have to make sure that your script and your input and output data are in HDFS.
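A sketch of such a shell action (the wrapper script name, the <file> entries and the transitions are assumptions for illustration; the wrapper would simply call pig -x local file.pig):

<action name="pig-local">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>run_pig_local.sh</exec>
        <file>run_pig_local.sh</file>
        <file>file.pig</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>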
I don't think we can run Pig in local mode from Oozie; the comment Vishal wrote makes sense. In some cases, where the amount of data is small, it is better to run Pig in local mode. To do that, you can write a shell script and schedule it in crontab. Trying this through Oozie, to my knowledge, won't work well, because Oozie is meant to run on HDFS.
If you want Oozie to run on some data, it expects that data to be in HDFS (i.e. distributed), and you must have the Pig script in HDFS as well. I remember seeing a post from Alan Gates where he mentioned that Pig is designed to process data from/to HDFS, while Hive handles local-to-HDFS or HDFS-to-HDFS.
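For example, a crontab entry along these lines (the schedule, script path and log path are assumptions) would run the local-mode script every night:

# Run the local-mode Pig script at 2 AM every day, appending output to a log file.
0 2 * * * /home/user/run_pig_local.sh >> /home/user/pig_local.log 2>&1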
I have to run many (maybe 12) jobs in Hadoop 1.0.4. I want the first five to run in parallel; when they all finish, the next four should run in parallel, and finally the last three should run in parallel as well. How can I set this up in Hadoop 1.0.4? As it is, I see that all the jobs run one after another and not in parallel.
The JobControl API can be used for MR job dependencies. For complex workflows, Oozie or Azkaban is recommended. Here is Oozie vs Azkaban.
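A minimal JobControl sketch for the first two stages (the class name, the placeholder JobConf objects and the polling interval are assumptions; jobs 3-5 and the remaining stages would follow the same pattern):

// Sketch: chaining groups of jobs with the old-API JobControl
// (org.apache.hadoop.mapred.jobcontrol), as available in Hadoop 1.0.4.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ThreeStageFlow {
    public static void main(String[] args) throws Exception {
        JobControl control = new JobControl("three-stage-flow");

        // Stage 1: independent jobs; JobControl submits them in parallel.
        Job j1 = new Job(new JobConf());   // placeholder configuration
        Job j2 = new Job(new JobConf());   // placeholder configuration
        control.addJob(j1);
        control.addJob(j2);
        // ... j3, j4, j5 are created and added the same way ...

        // Stage 2: each job depends on all stage-1 jobs,
        // so it starts only after they have completed successfully.
        Job j6 = new Job(new JobConf());   // placeholder configuration
        j6.addDependingJob(j1);
        j6.addDependingJob(j2);
        control.addJob(j6);
        // ... j7, j8, j9 depend on stage 1 in the same way,
        // and stage-3 jobs would likewise depend on all stage-2 jobs ...

        // Run the controller in its own thread and wait for completion.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();
    }
}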
I have been trying to find information on how to submit Hadoop jobs through the command line.
I am aware of command (a.): hadoop jar jar-file main-class input output
There is also another command about which I am trying to find information, but haven't been able to, (b.): hadoop job -submit job-file
What is a "job-file" and how do I create one? What is the basic difference between commands (a.) and (b.)? Which is the better option?
Thanks in advance.
Here is an example of a job-file for running the wordcount Map-Reduce job.
Similarly, you can write a job-file for your own Map-Reduce jobs.
mapred.input.dir=data/file1.txt
mapred.output.dir=output
mapred.job.name=wordcount
mapred.mapper.class=edu.uci.ics.hyracks.examples.wordcount.WordCount$Map
mapred.combiner.class=edu.uci.ics.hyracks.examples.wordcount.WordCount$Reduce
mapred.reducer.class=edu.uci.ics.hyracks.examples.wordcount.WordCount$Reduce
mapred.input.format.class=org.apache.hadoop.mapred.TextInputFormat
mapred.output.format.class=org.apache.hadoop.mapred.TextOutputFormat
mapred.mapoutput.key.class=org.apache.hadoop.io.Text
mapred.mapoutput.value.class=org.apache.hadoop.io.IntWritable
mapred.output.key.class=org.apache.hadoop.io.Text
mapred.output.value.class=org.apache.hadoop.io.IntWritable
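This file would then be submitted with command (b.), e.g. (the file name wordcount.job is an assumption): hadoop job -submit wordcount.job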
For me, "hadoop jar" is the better option, because the configuration done in the job-file can easily be done in the program itself.
Thanks
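As a sketch of what doing that configuration in the program itself might look like, using the same classes as in the job-file above (the driver class name WordCountDriver is a placeholder; it would be launched with something like: hadoop jar wordcount.jar WordCountDriver data/file1.txt output):

// Sketch of an old-API (org.apache.hadoop.mapred) driver that sets in code
// the same properties as the job-file above; class names are taken from that file.
import edu.uci.ics.hyracks.examples.wordcount.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");                    // mapred.job.name

        conf.setMapperClass(WordCount.Map.class);        // mapred.mapper.class
        conf.setCombinerClass(WordCount.Reduce.class);   // mapred.combiner.class
        conf.setReducerClass(WordCount.Reduce.class);    // mapred.reducer.class

        conf.setInputFormat(TextInputFormat.class);      // mapred.input.format.class
        conf.setOutputFormat(TextOutputFormat.class);    // mapred.output.format.class

        conf.setOutputKeyClass(Text.class);              // mapred.output.key.class
        conf.setOutputValueClass(IntWritable.class);     // mapred.output.value.class

        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // mapred.input.dir
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // mapred.output.dir

        JobClient.runJob(conf);
    }
}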