How to launch N times a script Pig using oozie? - hadoop

I need to process a lot of files with some specifically date. I find only one solution, which is to launch a job N times with each time a different dataset. The partitions used is based on yyyy, mm, dd. I have a java action which generate the good partition to use for each data.
My question is, how can I create a loop to launch my script N times ? I work today with oozie workflow.
Thanks

This sounds like a use case for coordinators.
You can declare Datasets and let oozie automatically start a workflow when a specfic dataset instance is available.

Related

Append DataSets (.ds) using UNIX

I'm currently working on DataStage IBM and here's my problem:
I have to get a n numbers of datasets that's going to be in a folder and I have to append them in one DataSet (.ds).
Since I don't know how many datasets I will have and neither they full name, I can't use a DataStage job to deal with them. All I know is they will have the same metadata (because they will be generated in the same job).
I think I have to use a Shell Cmd to append them but I'm not a UNIX guy.
Thank you for everyone who reads so far.
You can use the same job. Specify Append mode (rather than Override) for the target Data Set; each time you run the job data will be added to the same Data Set. Be careful not to inadvertently create duplicates by processing the same source data twice. Use parameters to specify the source.

Hadoop schedule jobs to run sequentially (one job after other)?

Lets say I am resource constrained in my Hadoop environment and I don't want to schedule really long running jobs (ie takes days to complete). I am analyzing vast amount of past time series data. I want to schedule mapreduce jobs that take a day's worth of data at a time (which takes an hour to crunch).
So how do I schedule such that new job is submitted as soon as previous job is completed?
If you want a quick and simple approach you could just write a shell script that calls hadoop jar in sequence for each job you want to run.
If you want a more robust approach you could use Apache Oozie to define a workflow of jobs that will run your jobs in sequence. If you are new to Hadoop you may find it easiest to define and run your Oozie workflow using the Hue GUI.

Is there an Alternative for HBaseStorage in PIG

I am using HBaseStorage with -caching option in pig script as follows
HBaseStorage('countDetails:ansCount countDetails:divCount countDetails:unansCount countDetails:engCount countDetails:ineffCount countDetails:totalCount', '-caching 1000');
I can see this was reflecting in my job.xml
but I can see there is no time difference in it I am processing 10 million records and store data around 160mb in to HBase.
When I store the result in hdfs its taking 3 mins to process the same job takes 30mins to store into HBase.
I even tried by setting
SET hbase.client.scanner.caching 1000;
Please let me know how can I reduce the time.
Is there any alternative for HBaseStorage?
http://apmblog.compuware.com/2013/02/19/speeding-up-a-pighbase-mapreduce-job-by-a-factor-of-15/
the above blog says that I have to set hbase.client.scanner.caching in bootstrap scrip
I don't know how to do that
will it be enough If I set it in Hbase-conf.
Please help me out of this
hbase.client.scanner.caching points to number of rows that will be fetched when calling next on a scanner if it is not served from (local, client) memory.
Higher caching values will enable faster scanners but will eat up more memory and some calls of next may take longer and longer time when the cache is empty. Do not set this value such that the time between invocations is greater than the scanner timeout;
i.e. hbase.regionserver.lease.period This property is 1 min by default. Clients must
report in within this period else they are considered dead.
In my experience HBase doesn't perform very well with Pig. It you don't have requirement for random look-up then use only HDFS otherwie HBase MR job would be better option. Also, In Hadoop MR job, you can connect to Hbase(This option gave me the best performance).

Want to compare two consecutive jobs on Hadoop

I want to know if I can compare two consecutive jobs in Hadoop. If not I would appreciate if anyone can tell me how to proceed with that. To be precise, I want to compare the jobs in terms of what exactly two jobs did? The reason behind doing this is to create a statistics about how many jobs executed on Hadoop were similar in terms of the behavior. For example how many times same sorting function was executed on the same input.
For example if first job did something like SortList(A) and some other job did SortList(A)+Group(result(SortList(A)). Now, I am wondering if in Hadoop there is some mapping being stored somewhere like JobID X-> SortList(A).
So far, I thought of this problem as finding the entry point in Hadoop and try to understand how job is created and what information is being kept with a jobID and in what form (in a code form or some description) , but I was not able to figure it out successfully.
Hadoop's Counters might be a good place to start. You can define your own counter names (like each counter name is a data set you are working on) and increment that counter each time you perform a sort on it. Finding which data set you are working on, however, may be the more difficult task.
Here's a tutorial I found:
http://philippeadjiman.com/blog/2010/01/07/hadoop-tutorial-series-issue-3-counters-in-action/
No. Hadoop jobs are just programs. They can have any side effects. They can write ordinary files, hdfs file, or a database. Nothing in hadoop is recording all of their activities. All hadoop is manage the schedule and the flow of data.

Periodic hadoop jobs running (best practice)

Customers able to upload urls in any time to database and application should processes urls as soon as possible. So i need periodic hadoop jobs running or run hadoop job automatically from other application(any script identifies new links were added, generates data for hadoop job and runs job). For PHP or Python script, i could set up cronjob, but what is best practice for periodic hadoop jobs running (prepare data for hadoop, upload data, run hadoop job and move data back to database?
Take a look at Oozie, the new workflow system from Y!, which can run jobs based on different triggers. A good overflow is presented by Alejandro here: http://www.slideshare.net/ydn/5-oozie-hadoopsummit2010
If you want urls to be processed as soon as possible, you'll have them processed each at a time. My recommendation is to wait for some number of links (or MB of links, or for example 10 min, every day).
And batch process them (I do my processing daily, but that jobs takes few hours)

Resources