How to run selected Azkaban jobs in parallel via a script? - hadoop

Since there are too many jobs on Azkaban, I have to test new jobs one by one manually.
Suppose I upload some new jobs: is it possible to write a Python (or any other language) script that fetches the dependencies between these jobs and then runs them on Azkaban in parallel?
For instance, say there are three jobs a, b, c, and b depends on a. They should be scheduled like this:
Start job a and job c.
When job a finishes, start job b.
I did not find any helpful info or API on the official Azkaban website (maybe I missed the relevant docs).
Any help is appreciated.
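For what it's worth, Azkaban does expose an AJAX API (login, fetchflowgraph, executeFlow, fetchexecflow) that a script can drive. Below is a rough Python sketch of the idea; the host, project, flow, and credentials are placeholders, and the "run one job by disabling all the others" trick, as well as the exact endpoint and field names, should be double-checked against your Azkaban version.

import json
import time
import requests
from concurrent.futures import ThreadPoolExecutor

AZKABAN = "https://azkaban.example.com:8443"  # placeholder host
PROJECT, FLOW = "myproject", "myflow"         # placeholder names

# Authenticate: action=login returns a session.id for later calls.
session_id = requests.post(AZKABAN, data={
    "action": "login", "username": "azkaban", "password": "secret",
}).json()["session.id"]

# Fetch the flow graph; each node lists its parent jobs under "in".
graph = requests.get(AZKABAN + "/manager", params={
    "ajax": "fetchflowgraph", "session.id": session_id,
    "project": PROJECT, "flow": FLOW,
}).json()
deps = {node["id"]: set(node.get("in", [])) for node in graph["nodes"]}

def run_job(job_id):
    # Run a single job by executing the flow with every other job
    # disabled, then poll the execution until it stops running.
    others = json.dumps([j for j in deps if j != job_id])
    execid = requests.get(AZKABAN + "/executor", params={
        "ajax": "executeFlow", "session.id": session_id,
        "project": PROJECT, "flow": FLOW, "disabled": others,
    }).json()["execid"]
    while True:
        status = requests.get(AZKABAN + "/executor", params={
            "ajax": "fetchexecflow", "session.id": session_id,
            "execid": execid,
        }).json()["status"]
        if status not in ("READY", "PREPARING", "RUNNING"):
            return status  # e.g. SUCCEEDED, FAILED, KILLED
        time.sleep(10)

# Wave-by-wave scheduler: run every job whose parents are all done.
done = set()
with ThreadPoolExecutor() as pool:
    while len(done) < len(deps):
        ready = [j for j, parents in deps.items()
                 if j not in done and parents <= done]
        list(pool.map(run_job, ready))  # block until the wave finishes
        done.update(ready)

Note that the wave-by-wave loop is a simplification: in the a/b/c example, b only starts once the whole first wave (a and c) has finished, whereas a fully event-driven scheduler would start b as soon as a alone finishes.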

Related

Pass variable from downstream Jenkins job to "parent" job

I've got Jenkins job A that triggers job B and afterwards executes a shell script.
Inside this shell script of Jenkins job A I want to use a variable set by Jenkins Job B.
How can I do this?
This can be accomplished in many ways. One way would be to configure Job A with a build step that triggers Job B and fetches variables from a file after Job B has finished. Then Job A can read those variables and use them in later steps.
There are several things to consider here, though. First of all, this requires Job B to finish before Job A can/should continue, so if you are thinking of parallel job execution this isn't ideal. Secondly, when dealing with env variables you will need a plugin to make variables available outside of the build step (exporting isn't enough); check out the EnvInject plugin. And thirdly, if job configuration is becoming this complex, there probably is a better way of doing it. With Jenkinsfile and, before that, the pipeline plugins, job orchestration has improved a lot, and passing parameters around is much easier in this new, shiny world. That being said, here is an example of something that works like what you are asking about.
Job A
As a build step, trigger Job B, and let Job A halt while Job B finishes
As the next build step, copy an artifact from another build (Job B latest stable), using the Copy Artifact Plugin.
Do something with the file, for example just print its content; it's now accessible in Job A.
Job B
Export a variable and save it to a file
Archive the written file and make it accessible to Job A.
This isn't pretty, but it works at least.
P.S. I'd recommend checking out the Jenkinsfile (https://jenkins.io/doc/book/pipeline/jenkinsfile/) option; it simplifies a lot after the initial learning curve.
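To make the two halves concrete, here is a minimal sketch of the file-based hand-off, with Python standing in for the shell build steps (the variable name MY_VAR and the file name vars.properties are made up):

# --- Job B build step: export the value into a properties-style file,
# --- then archive vars.properties as a build artifact.
with open("vars.properties", "w") as f:
    f.write("MY_VAR=some_value\n")  # MY_VAR is a hypothetical variable

# --- Job A build step: runs after the Copy Artifact step has fetched
# --- vars.properties from Job B's latest stable build.
env = {}
with open("vars.properties") as f:
    for line in f:
        key, _, value = line.strip().partition("=")
        if key:
            env[key] = value
print(env["MY_VAR"])  # the value set by Job B, now usable in Job A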

Oozie for multiple MapReduce jobs

I have a sequence of MapReduce jobs that need to be run. I was wondering if there is any advantage to using Oozie for that, instead of having "one big driver" that runs the sequence.
I know that Oozie can run multiple actions of different types, e.g. a Pig script, a shell script, an MR job, but concretely I'm asking: should I split my two jobs and run them using Oozie, or have a single jar do that?
Oozie is a scheduler - crude, poorly documented, but a scheduler.
If you don't need scheduling per se, or if cron on an edge node is sufficient;
if you want to handle your workflow logic by yourself (e.g. conditional branching, parallel executions with waiting for stragglers, calling generic sub-workflows with ad hoc parameters, e-mail alerts on errors, <insert your pet feature here>), or don't need any fancy logic;
if you handle your execution logs and state history by yourself, or don't care about history;
... well, don't use a scheduler.
PS: you also have Luigi (Spotify) and Azkaban (LinkedIn) as alternative Hadoop schedulers.
[edit] An extra point to consider: if your "driver" crashes for whatever reason, you may not have a chance to send an alert; but if it is run from Oozie, the crash will be detected eventually (it may take as much as 30 min. in a corner case, e.g. AM job self-destruction due to YARN RM failover).
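To make the trade-off concrete, a "one big driver" with home-grown alerting might look like the sketch below (the job commands, e-mail addresses, and SMTP host are all placeholders). Note the caveat above: if the driver process itself is killed, the alert never fires.

import smtplib
import subprocess
from email.message import EmailMessage

# Placeholder commands for the MapReduce jobs run in sequence.
JOBS = [["hadoop", "jar", "my-app.jar", "StepOne"],
        ["hadoop", "jar", "my-app.jar", "StepTwo"]]

def alert(text):
    # Assumes a local MTA; the addresses are made up.
    msg = EmailMessage()
    msg["Subject"] = "workflow failed"
    msg["From"] = "driver@example.com"
    msg["To"] = "ops@example.com"
    msg.set_content(text)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

for cmd in JOBS:
    if subprocess.run(cmd).returncode != 0:
        alert("job failed: " + " ".join(cmd))
        raise SystemExit(1)
# Caveat from above: if this driver is itself killed (OOM, node
# failure), alert() never runs -- a scheduler would still notice.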

How to execute the same job concurrently in Spring XD

We have the below requirement.
In Spring XD, we have a job, let's call it MyJob,
which is invoked by another process using the Spring XD REST service; let's call that process OutsideProcess (a non-Spring-XD process).
OutsideProcess invokes MyJob whenever a file is added to a location (let's call it FILES_LOC) that OutsideProcess is watching.
In this scenario, let's assume that MyJob takes 5 minutes to complete.
At 10:00 AM, a file is copied to FILES_LOC, and OutsideProcess triggers MyJob immediately (so it will complete at approximately 10:05 AM).
At 10:01 AM, another file is copied to FILES_LOC, and OutsideProcess triggers a second instance of MyJob at 10:01 AM. But this second instance gets queued and only starts executing once the first instance completes (at approximately 10:05 AM).
If we invoke different jobs at the same time they execute concurrently, but multiple instances of the same job do not execute concurrently.
Please let me know how can I execute the same job with multiple instances concurrently.
Thanks in advance.
The only thing I can think of is dynamic deployment of the job, triggering it right away. You can use the Spring XD REST template to create the job definition on the fly and launch it after sleeping a few seconds. Make sure you undeploy/destroy the job once it completes successfully.
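A rough Python sketch of that idea follows. The admin address is a placeholder, and the /jobs/definitions and /jobs/executions paths are my assumption about the admin REST API that the Spring XD REST template wraps; verify them against your XD version.

import time
import requests

XD_ADMIN = "http://localhost:9393"  # placeholder admin server address

def launch_one_off(definition):
    # Create a uniquely named, throwaway definition of the same job
    # module, deploy it, and launch it; the caller should destroy the
    # definition after the execution completes successfully.
    name = "myjob-%d" % int(time.time() * 1000)
    requests.post(XD_ADMIN + "/jobs/definitions",  # assumed endpoint
                  data={"name": name, "definition": definition,
                        "deploy": "true"})
    time.sleep(5)  # crude wait for the deployment to finish
    requests.post(XD_ADMIN + "/jobs/executions",   # assumed endpoint
                  data={"jobname": name})
    return name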
Another solution could be to create a few module instances of your job with different names and use them as your slave processes. You can query the status of these job module instances and launch whichever one has finished, or queue on the one that was least recently launched.
Remember you can run jobs with partition support if applicable. This way you will finish your job faster and be able to run more jobs.

Chaining Spring Batch Job

I have two jobs defined in two different XML files, say Job A and Job B.
I need to call Job B on successful completion of Job A.
What is the best approach to doing this?
I am pretty new to Spring Batch, so I'm looking for the best approach to handle this.
You can create a super job and execute your Job A and Job B as steps in this super job (Spring Batch's JobStep lets a step run a whole job), specifying that Job B should be executed on successful completion of Job A.
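A sketch of what the super job could look like in the batch XML namespace, assuming your existing jobs are registered in the context under the ids jobA and jobB (all ids here are made up); the transition to stepB only fires when stepA, and hence Job A, completes successfully:

<job id="superJob" xmlns="http://www.springframework.org/schema/batch">
    <!-- Runs the whole of Job A as a step (a JobStep under the hood). -->
    <step id="stepA" next="stepB">
        <job ref="jobA"/>
    </step>
    <!-- Only reached when stepA ends with COMPLETED status. -->
    <step id="stepB">
        <job ref="jobB"/>
    </step>
</job>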

Job step loop for LoadLeveler job scripts?

I'm using LoadLeveler to submit jobs on an IBM/BlueGene architecture. I read IBM's documentation and also gave Google a try, but I cannot find how to do the following, which I expect should be possible:
One can use the queue keyword to tell LoadLeveler that a new job step is described, so I could do something like
first_step
queue
second_step
queue
but what I fail to find is a way that does something like
loop job_id = 1,10
do_job_with_given_job_id
end
Do I have to write a "normal" shell script that in turn calls a LoadLeveler script a bunch of times, or is there some built-in loop mechanism? I know that other job managers can do this.
When this comes up, we normally just recommend writing a shell script which generates the job submission script or scripts; that's what I do for my own jobs. Do these steps have dependencies on each other?
Also, just out of curiosity, which schedulers/resource managers can queue multiple jobs within a loop in a submission script? Not the PBS-based ones...
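For illustration, the generate-and-submit approach recommended above could look like the following sketch (the executable path is a placeholder, and the # @ keywords should be checked against your site's LoadLeveler setup):

#!/usr/bin/env python
# Emit one LoadLeveler job step per job_id into a command file,
# then submit it with llsubmit, instead of looping in the script.
import subprocess

with open("jobs.cmd", "w") as f:
    f.write("# @ job_name = looped_job\n")
    for job_id in range(1, 11):
        f.write("# @ step_name = step_%d\n" % job_id)
        f.write("# @ executable = /path/to/do_job\n")  # placeholder
        f.write("# @ arguments = %d\n" % job_id)
        f.write("# @ queue\n")

subprocess.run(["llsubmit", "jobs.cmd"])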
