How to define job execution order in Talend Open Studio - etl

Is there any way to define jobs execution order in Talend open studio?
For example: job1 -> job2 -> job3 ...
There is a component named tParallelize in the Talend suite, but it's not available for Talend Open Studio.

The great thing about Talend is that any job can be called from within another job using the tRunJob component, so what you want can be achieved by creating a master job and, inside it, calling your jobs in the order you want:
tRunJob_1 (job1)
|
OnSubjobOk
|
tRunJob_2 (job2)
|
...
|
tRunJob_n (jobN)
This ensures that your jobs are called in that order, and that each job is called only if the previous one executed successfully.
tParallelize is used to run jobs in parallel, so it's not the same thing.
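For illustration only, the OnSubjobOk chain behaves roughly like the Java sketch below. The ChildJob stubs are hypothetical stand-ins for your jobs; in the Studio itself the chaining is configured graphically, not coded.

// Purely illustrative: models "run the next job only if the previous one succeeded".
public class MasterJob {

    // Stub representing a child job that returns 0 on success, non-zero on failure.
    interface ChildJob { int run(); }

    public static void main(String[] args) {
        ChildJob job1 = () -> 0;  // pretend job1 succeeds
        ChildJob job2 = () -> 0;  // pretend job2 succeeds
        ChildJob job3 = () -> 0;

        // OnSubjobOk semantics: stop the whole chain as soon as one job fails.
        for (ChildJob job : new ChildJob[] { job1, job2, job3 }) {
            if (job.run() != 0) {
                System.err.println("A child job failed; stopping the chain.");
                return;
            }
        }
    }
}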

Related

How to run selected Azkaban jobs in parallel via a script?

Since there are too many jobs on Azkaban, I have to test new jobs one by one manually.
Assume I upload some new jobs; is it possible to write a Python (or any other language) script to fetch the dependencies between these jobs and then run them on Azkaban in parallel?
For instance, there are three jobs a, b, c, and b depends on a. They are supposed to be scheduled like this:
Start job a and job c.
When job a finishes, start job b.
I did not find any helpful info or API on the Azkaban official website (Maybe I missed useful info).
Any help is appreciated.
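A rough sketch of one way to script this against Azkaban's documented AJAX API (login, executeFlow, fetchexecflow). The host, credentials, project name and the hard-coded a/b/c dependency are placeholders; in practice you would read the dependencies from your .job files instead of hard-coding them, and use a proper scheduler loop for more than a handful of flows.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch only: launches the independent flows (a, c) right away and launches b
// once a has succeeded, by polling Azkaban's AJAX API.
public class AzkabanParallelRunner {

    private static final String HOST = "https://azkaban.example.com:8443"; // placeholder
    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final ObjectMapper JSON = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        String sessionId = login("myuser", "mypassword");

        // b depends on a; c is independent, so a and c are started immediately.
        String execA = executeFlow(sessionId, "myproject", "a");
        executeFlow(sessionId, "myproject", "c");

        // Poll flow a until it succeeds, then start b.
        while (true) {
            String status = fetchFlowStatus(sessionId, execA);
            if ("SUCCEEDED".equals(status)) break;
            if ("FAILED".equals(status) || "KILLED".equals(status)) {
                throw new IllegalStateException("Flow a ended with status " + status);
            }
            Thread.sleep(10_000);
        }
        executeFlow(sessionId, "myproject", "b");
    }

    // POST /?action=login returns JSON containing "session.id".
    static String login(String user, String password) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(HOST + "/"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "action=login&username=" + user + "&password=" + password))
                .build();
        return field(send(req), "session.id");
    }

    // GET /executor?ajax=executeFlow returns JSON containing the new "execid".
    static String executeFlow(String sessionId, String project, String flow) throws Exception {
        String url = HOST + "/executor?ajax=executeFlow&session.id=" + sessionId
                + "&project=" + project + "&flow=" + flow;
        return field(send(HttpRequest.newBuilder(URI.create(url)).GET().build()), "execid");
    }

    // GET /executor?ajax=fetchexecflow returns JSON containing the flow "status".
    static String fetchFlowStatus(String sessionId, String execId) throws Exception {
        String url = HOST + "/executor?ajax=fetchexecflow&session.id=" + sessionId
                + "&execid=" + execId;
        return field(send(HttpRequest.newBuilder(URI.create(url)).GET().build()), "status");
    }

    static String send(HttpRequest req) throws Exception {
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    static String field(String json, String name) throws Exception {
        JsonNode node = JSON.readTree(json).get(name);
        if (node == null) throw new IllegalStateException("No '" + name + "' in: " + json);
        return node.asText();
    }
}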

Pass variable from downstream Jenkins job to "parent" job

I've got Jenkins job A that triggers job B and afterwards executes a shell script.
Inside this shell script of Jenkins job A I want to use a variable set by Jenkins Job B.
How can I do this?
This can be accomplished in many ways. One way would be to configure Job A with a build step that triggers Job B and then fetches a file of variables that Job B produced once it has finished. Job A can then read those variables and use them in later steps.
There are several things to consider here, though. First of all, this requires Job B to finish before Job A can/should continue, so if you are thinking of parallel job execution this isn't ideal. Secondly, when dealing with env variables you will need a plugin to make variables available outside of the build step (exporting isn't enough); check out the EnvInject plugin. And thirdly, if the job configuration is becoming complex, there probably is a better way of doing it. With the Jenkinsfile and, before that, the pipeline plugins, job orchestration has improved a lot, and passing parameters around is much easier in this new, shiny world. That being said, here is an example of something that works like what you are asking about.
Job A
As a build step, trigger Job B, and let Job A halt while Job B finishes
As the next build step, copy an artifact from another build (Job B latest stable), using the Copy Artifact Plugin.
Do something with the file, for example just print its content; it's now accessible in Job A.
Job B
Export a variable and save it to a file
Archive the written file and make it accessible to Job A.
This isn't pretty, but it works at least.
P.s. I'd recommend checking out the Jenkinsfile (https://jenkins.io/doc/book/pipeline/jenkinsfile/) options, it simplifies a lot after the initial learning curve.
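As a rough sketch of that pipeline route, a declarative Jenkinsfile for Job A might look something like the following. It assumes the Copy Artifact and Pipeline Utility Steps plugins are installed and that Job B archives a vars.properties file; the job and file names are placeholders.

// Hypothetical Jenkinsfile for Job A: trigger Job B, copy its archived
// vars.properties artifact, and read the variable it exported.
pipeline {
    agent any
    stages {
        stage('Run Job B') {
            steps {
                build job: 'JobB'   // blocks until Job B has finished
            }
        }
        stage('Use Job B variable') {
            steps {
                // Copy Artifact plugin; Job B must have archived vars.properties.
                copyArtifacts projectName: 'JobB', selector: lastSuccessful()
                script {
                    // Pipeline Utility Steps plugin.
                    def props = readProperties file: 'vars.properties'
                    echo "Value from Job B: ${props.MY_VAR}"
                }
            }
        }
    }
}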

How to execute the same job concurrently in Spring XD

We have the following requirement:
In Spring XD, we have a job, let's call it MyJob,
which is invoked by another process using the REST service of Spring XD; let's call that process OutsideProcess (a non-Spring-XD process).
OutsideProcess invokes MyJob whenever a file is added to a location (let's call it FILES_LOC) that OutsideProcess is listening to.
In this scenario, let's assume that MyJob takes 5 minutes to complete.
At 10:00 AM, a file is copied to FILES_LOC, so OutsideProcess triggers MyJob immediately (it will complete at approximately 10:05 AM).
At 10:01 AM, another file is copied to FILES_LOC, so OutsideProcess triggers one more instance of MyJob at 10:01 AM. But the second instance gets queued and only starts executing once the first instance completes (at approximately 10:05 AM).
If we invoke different jobs at the same time they are executed concurrently, but multiple instances of the same job are not.
Please let me know how I can execute the same job with multiple instances concurrently.
Thanks in advance.
The only thing I can think of is dynamic deployment of the job and triggering it right away. You can use the Spring XD REST template to create the job definition on the fly and launch it after sleeping a few seconds. And make sure you undeploy/destroy the job when it completes successfully.
Another solution could be to create a few module instances of your job with different names and use them as your slave processes. You can query the status of these job module instances and launch the one that has finished, or queue the one that was least recently launched.
Remember you can run jobs with partition support if applicable. This way you will finish your job faster and be able to run more jobs.
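A rough sketch of the dynamic-deployment suggestion above, assuming the spring-xd-rest-client's SpringXDTemplate. The package, method names and signatures are from memory and may differ between Spring XD versions, so treat this as pseudocode-level guidance; the admin URL, job name and definition are placeholders.

import java.net.URI;

import org.springframework.xd.rest.client.impl.SpringXDTemplate;

// Sketch: create a uniquely named definition of the same job, launch it, and
// destroy it afterwards, so several copies of "MyJob" can run at once.
public class DynamicJobLauncher {

    public static void main(String[] args) throws Exception {
        SpringXDTemplate xd = new SpringXDTemplate(new URI("http://xd-admin:9393"));

        // A unique name per launch avoids the single-instance queueing.
        String jobName = "MyJob-" + System.currentTimeMillis();

        // Same definition as the original MyJob; deploy it immediately.
        xd.jobOperations().createJob(jobName, "myjob --someOption=value", true);

        Thread.sleep(5000);                        // give the deployment a moment
        xd.jobOperations().launchJob(jobName, ""); // no extra job parameters

        // ...and once the execution has completed successfully:
        // xd.jobOperations().destroyJob(jobName);
    }
}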

Spring Batch - I have multiple jobs to be executed sequentially and pass results of first job to second

I have multiple Spring Batch jobs to be executed sequentially.
I need to pass the results of job1 to job2, where they will be processed, and then pass the data of job2 to job3, and so on; the results of job1 may be used all the way up to job5 (the last job), which writes the output.
Job1 reads from the DB and stores the results in a HashMap.
Job2 reads from a file and uses job1's HashMap for processing the results.
So please can anyone suggest the best solution for this? I am able to pass data between steps using the ExecutionContext and an ExecutionContextPromotionListener, but I am not sure how to do the same between multiple jobs.
Instead of having 5 jobs for your batch processing, you should have 5 steps in the same job. That is the best way to achieve what you are trying to do.
Spring Batch framework keeps the state of every step execution so in case one of your steps fails you can relaunch your job that will only process remaining steps. Of course there are customisation options to control how and when a step can be considered failing or relaunchable.
Spring Batch does not provide out-of-the-box support for this. It seems all you need is to configure some steps that are executed sequentially. You can also divide the steps using readers, processors and writers.
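Here is a minimal sketch of the single-job-with-steps approach, using Spring Batch Java config and the ExecutionContextPromotionListener the question already mentions. Bean names, the "lookupMap" key and the tasklet bodies are made up; replace them with your real readers/processors/writers.

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.listener.ExecutionContextPromotionListener;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch only: one job with sequential steps instead of five separate jobs.
// Step 1 reads from the DB into a map and promotes it to the job context;
// step 2 reads the file and pulls that map back out of the job context.
@Configuration
@EnableBatchProcessing
public class SequentialJobConfig {

    @Bean
    public Job myJob(JobBuilderFactory jobs, StepBuilderFactory steps) {
        return jobs.get("myJob")
                .start(dbToMapStep(steps))
                .next(fileProcessingStep(steps))
                // .next(...) further steps in the required order
                .build();
    }

    @Bean
    public Step dbToMapStep(StepBuilderFactory steps) {
        return steps.get("dbToMapStep")
                .tasklet((contribution, chunkContext) -> {
                    // Placeholder for the real DB read.
                    Map<String, String> lookup = new HashMap<>();
                    lookup.put("someKey", "someValue");
                    // Store it in the *step* execution context...
                    chunkContext.getStepContext().getStepExecution()
                            .getExecutionContext().put("lookupMap", lookup);
                    return RepeatStatus.FINISHED;
                })
                // ...and promote it to the *job* execution context for later steps.
                .listener(promotionListener())
                .build();
    }

    @Bean
    public Step fileProcessingStep(StepBuilderFactory steps) {
        return steps.get("fileProcessingStep")
                .tasklet((contribution, chunkContext) -> {
                    @SuppressWarnings("unchecked")
                    Map<String, String> lookup = (Map<String, String>)
                            chunkContext.getStepContext().getStepExecution()
                                    .getJobExecution().getExecutionContext().get("lookupMap");
                    // Placeholder for the real file processing that uses 'lookup'.
                    System.out.println("lookupMap from step 1: " + lookup);
                    return RepeatStatus.FINISHED;
                })
                .build();
    }

    @Bean
    public ExecutionContextPromotionListener promotionListener() {
        ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
        listener.setKeys(new String[] { "lookupMap" });
        return listener;
    }
}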

Should I use Oozie to run a MapReduce task forever?

I have a MapReduce task (https://github.com/flopezluis/testing-hadoop) that reads the files in a folder and appends them to a zip. I need to run this task forever, so when it finishes processing them, it should run again. I'm reading about Oozie but I'm not sure whether it's the best fit, because maybe it's too big for my problem.
In case Oozie is the best solution: if I write a coordinator to run every 10 minutes, what happens if the task takes more than 10 minutes? Does the coordinator wait to run the task again?
Explanation of the task
The folder is always the same. There are different zip files, one per key. The idea is to build the zip files step by step; I think this is faster than creating them after all the files have been processed.
The files contain something like this:
<info operationId="key1">
DATA1
</info>
<info operationId="key1">
DATA2
</info>
<info operationId="key2">
DATA3
</info>
So the zips will be like this:
key1.zip --> data1, data2
key2.zip --> data3
Thanks
You can use Oozie for this. Oozie has a setting that limits how many instances of a job can run at once. If your first job isn't finished after ten minutes, it will wait to run the next job.
From the Oozie documentation:
6.1.6. Coordinator Action Execution Policies
The execution policies for the actions of a coordinator job can be defined in the coordinator application.
• Timeout: A coordinator job can specify the timeout for its coordinator actions, this is, how long the coordinator action will be in WAITING or READY status before giving up on its execution.
• Concurrency: A coordinator job can specify the concurrency for its coordinator actions, this is, how many coordinator actions are allowed to run concurrently ( RUNNING status) before the coordinator engine starts throttling them.
• Execution strategy: A coordinator job can specify the execution strategy of its coordinator actions when there is backlog of coordinator actions in the coordinator engine. The different execution strategies are 'oldest first', 'newest first' and 'last one only'. A backlog normally happens because of delayed input data, concurrency control or because manual re-runs of coordinator jobs.
http://archive.cloudera.com/cdh/3/oozie-2.3.0-CDH3B4/CoordinatorFunctionalSpec.html#a6.1.6._Coordinator_Action_Execution_Policies
Just wanted to also comment that you could have the coordinator job triggered off data arrival with a dataset, but I am not that familiar with datasets.
If all you need is to execute the same Hadoop job repeatedly on different input files, Oozie might be overkill. Installing and configuring Oozie on your testbed will also take some time. Writing a script that submits the Hadoop job repeatedly might be enough.
But anyway, Oozie can do that. If you set the concurrency to 1, there will be at most 1 Oozie coordinator action (which in your case should be a workflow that contains only one Hadoop job) in RUNNING status. But you can increase the concurrency threshold to allow more actions to execute concurrently.
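As a sketch, those controls live in the coordinator application's XML. The app name, dates and workflow path are placeholders, and the 10-minute frequency matches the question; concurrency 1 plus a FIFO ('oldest first') execution strategy gives the wait-for-the-previous-run behaviour asked about above.

<!-- Sketch of a coordinator that materializes an action every 10 minutes but
     never runs more than one at a time; queued actions run oldest-first. -->
<coordinator-app name="zip-builder-coord" frequency="${coord:minutes(10)}"
                 start="2012-01-01T00:00Z" end="2100-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <timeout>-1</timeout>          <!-- never time out a waiting action -->
    <concurrency>1</concurrency>   <!-- at most one action in RUNNING status -->
    <execution>FIFO</execution>    <!-- 'oldest first' -->
  </controls>
  <action>
    <workflow>
      <!-- Placeholder path to the workflow that runs the zip-building MapReduce job -->
      <app-path>hdfs://namenode:8020/user/youruser/apps/zip-builder-wf</app-path>
    </workflow>
  </action>
</coordinator-app>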
