Oozie make-like behavior - hadoop

I'm currently developing a set of map reduce tasks that have to be run in a particular order. I'm looking to use Oozie to manage the dependencies and running of this workflow. There's one key feature that I need, though, and I can't find any documentation that suggests that it is possible.
Basically, I am looking for a way to set up an action that checks whether its output file is newer than its input file (and the associated map-reduce code) before executing. If the output is already up to date, the action would be skipped. This way, I could make a change to a script and have only that stage of the workflow (and any stages that depend on its output) run.
Does anyone know how I'd go about doing this?

How about using a shell action in Oozie? In it you can run a shell script that checks for a difference in the content of the defined file. On success of this action, go to the map-reduce action and continue your job; otherwise go to the fail case and kill your job.
Hope this idea helps, if this is what you are looking for.
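For illustration, here is a minimal sketch of such a check script, assuming the stage's input and output live on HDFS (the paths and the needs_run key are hypothetical placeholders). It compares HDFS modification times rather than file content, which is usually enough for make-like staleness checks:
#!/usr/bin/env bash
# %Y prints the HDFS modification time in milliseconds since the epoch.
INPUT=/data/stage1/input.csv
OUTPUT=/data/stage1/output
in_mtime=$(hadoop fs -stat %Y "$INPUT")
out_mtime=$(hadoop fs -stat %Y "$OUTPUT" 2>/dev/null || echo 0)
# Emit a flag for the shell action's <capture-output>; exit 0 either way
# so the action itself succeeds and a decision node does the branching.
if [ "$in_mtime" -gt "$out_mtime" ]; then
  echo "needs_run=true"
else
  echo "needs_run=false"
fi
A decision node can then branch on ${wf:actionData('check-node')['needs_run']} to either run the map-reduce action or skip ahead to the next stage.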

Related

How to detect errors in an oozie workflow before actually running it?

I am new to big data and was wondering if there is a way to detect errors in an Oozie workflow before running it? Something like a compilation-error check. Please help.
Thanks,
Annie
Use MiniOozie to run the workflow as a unit test. See the MiniOozie documentation; it includes some examples.
Oozie typically runs multiple scripts in sequence, with defined starts and stops.
I think stepping through the workflow one script at a time is really the only way to do a complete run before you schedule it.
You can validate the workflow.xml using the command line tool as below.
It performs an XML Schema validation on the specified workflow XML file.
oozie validate /path-to-workflow/workflow.xml
oozie validate -oozie http://<oozie-host>:11000/oozie /path-to-workflow/workflow.xml

Job parameterization in Jenkins

I want one job to pass a parameter to another job. The problem is that the first job only obtains the parameter after executing a shell command.
Any ideas how to manage it?
You should look into EnvInject Plugin - https://wiki.jenkins-ci.org/display/JENKINS/EnvInject+Plugin
You can use it to load parameters from a file (which can be created by your shell step). You may then use the loaded variables to trigger the next job / step.
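As a sketch, a shell build step could compute the value and write it to a properties file for EnvInject to pick up (the variable name and file name here are just illustrative):
# Execute-shell build step
BUILD_REV=$(git rev-parse --short HEAD)
echo "BUILD_REV=${BUILD_REV}" > downstream.props
Point EnvInject's "Inject environment variables" step at downstream.props; the loaded BUILD_REV can then be handed to the next job, e.g. via the Parameterized Trigger plugin.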

After submitting a .m batch job with Slurm, can I edit my .m file without changing my original submission?

Say I want to run a job on the cluster: job1.m
Slurm handles the batch jobs and I'm loading Mathematica to save the output file job1.csv
I submit job1.m and it is sitting in the queue. Now, I edit job1.m to have different variables and parameters, and tell it to save data to job1_edited.csv. Then I re-submit job1.m.
Now I have two batch jobs in the queue.
What will happen to my output files? Will job1.csv be data from the original job1.m file? And will job1_edited.csv be data from the edited file? Or will job1.csv and job1_edited.csv be the same output?
:(
Thanks in advance!
I am assuming job1.m is a Mathematica job, run from inside a Bash submission script. In that case, job1.m is read when the job starts so if it is modified after submission but before job start, the modified version will run. If it is modified after the job starts, the original version will run.
If job1.m is the submission script itself (so you run sbatch job1.m), that script is copied in a spool directory specific to the job so if it is modified after the job is submitted, it still will run the original version.
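To illustrate the first case, a typical wrapper submission script might look like the sketch below (the #SBATCH options and the math launcher are assumptions; adjust for your site):
#!/bin/bash
#SBATCH --job-name=job1
#SBATCH --time=01:00:00
#SBATCH --output=job1.log
# job1.m is read here, when the job *starts*, not when it was submitted,
# so edits made while the job waits in the queue change what actually runs.
math -script job1.m
One way to avoid surprises is to copy the .m file to a uniquely named snapshot at submission time (e.g. with a timestamp in the name) and have the wrapper run the snapshot instead.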
In any case, for reproducibility and traceability, it is better to make use of a workflow manager such as FireWorks or Bosco.

How to make whenever skip files in progress

We're using the Ruby gem whenever to manage large batches of import jobs. But what if a file is still being imported when the next cron job occurs?
For example:
12am: whenever starts an import cron job for import.csv
2am: import.csv is still being imported, but the next cron job scheduled by whenever comes due.
Would whenever skip that file or try to run it again? Any suggestions to make sure it doesn't try to process the same file twice?
Whenever is merely a frontend for the crontab. Whenever doesn't actually launch any of the processes; it writes a crontab that handles the actual scheduling and launching. Whenever cannot do what you're asking.
The crontab cannot do what you want either. It launches the process and that's it.
You need to implement the checking yourself in the process launched by cron. A common way of doing this is a lockfile, and I'm sure there are libraries for this (e.g. http://rubygems.org/gems/lockfile).
Depending on your situation you might be able to create other checks before launching the import.
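If you would rather handle it at the cron level than inside Ruby, flock(1) from util-linux gives the same lockfile effect; the lock path and rake task below are placeholders:
# Wrap the cron command (e.g. via a custom whenever job_type) with flock.
# -n makes a run exit immediately if the lock is already held, so an
# overlapping import is skipped rather than queued behind the first one.
flock -n /tmp/import.lock bundle exec rake import:run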
Well, this isn't really an issue with whenever.
However, you could rename the file you want to import when you start processing it (12am to 2am is a reasonable amount of time to do that) and move it to an archive directory once you are done processing, so there is no confusion (sketched below).
The next time the task runs, it should look only for files that do not match the in-progress naming pattern (as already suggested in one of the comments).
You might also want to add a task that checks for imports that might have failed (e.g. a file whose name includes the exact start time but that is still not archived after a whole day) and either create some kind of notification or just trigger the task again / rename the file so it is picked up again (depending on how well your rollback works).
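A shell sketch of that rename-then-archive pattern (the paths and the import_csv command are hypothetical):
for f in /imports/*.csv; do
  [ -e "$f" ] || continue              # nothing to import
  work="$f.processing.$(date +%s)"     # mark the file as in progress
  mv "$f" "$work" || continue          # a concurrent run may have claimed it
  import_csv "$work"                   # your actual import step
  mv "$work" "/imports/archive/$(basename "$f")"
done
Because mv is atomic within a single filesystem, two overlapping runs cannot both claim the same file.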

How to trigger a Hudson job by another job which is in a different Hudson

I have job A in Hudson A and job B in Hudson B. I want job B to trigger job A.
In your job B configuration, check the Trigger builds remotely (e.g., from scripts) checkbox and provide a token.
The help text there shows you the URL you can call to trigger a build from remote scripts (e.g. from a shell script in Hudson job A).
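For reference, that remote trigger is just an HTTP request against the job's build URL with the token, so a shell step along these lines works (host, job name, and token are placeholders):
wget -q -O /dev/null "http://hudson-b.example.com/job/B/build?token=MY_TOKEN"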
However, that would trigger job B no matter what the result of job A is.
Morechilli's answer is probably the best solution.
I haven't used Hudson but I would guess your simplest approach would be to use the URL trigger:
http://wiki.hudson-ci.org/display/HUDSON/URL+Change+Trigger
I think there is a latest-build URL that could be used for this.
In the latest versions of Hudson, the lastSuccessfulBuild/ HTML page contains the elapsed time since the build ran, which will be different for each call. This causes the URL Change Trigger to spin.
One fix is to use the xml, json, or python APIs to request only a subset of the information. Using the 'tree' request parameter, the following URL will return an XML document containing only the build number of the last successful build.
http://SERVER:PORT/job/JOBNAME/lastSuccessfulBuild/api/xml?tree=number
Using this URL restored the behavior I expected from the URL Change Trigger.
Personally, I find the easiest way to do this is to watch the build timestamp:
PROJECT_NAME/lastSuccessfulBuild/buildTimestamp
I'm using wget to trigger the build:
wget --post-data 'it-just-need-to-be-a-POST-request' \
     --auth-no-challenge --http-user=myuser --http-password=mypassword \
     http://jenkins.xx.xx/xxx/job/A/build?delay=0sec
There are other ways to trigger a build; see the REST and other APIs of Jenkins.
But this works great on Unix.
