How to define job execution order in Talend?

I am new to Talend and I am trying to design a data flow that transfers data from a Postgres database to a Neo4j database. I am using the open-source "Talend Open Studio for Big Data", version 6.2.1. I need to implement an ordered job flow in which Job1 and Job2 are executed independently and Job3 starts if and only if both Job1 and Job2 have completed successfully.
I used the tRunJob component to implement the ordering; however, after executing the flow I noticed that Job3 does not wait for the two previous jobs to complete before starting its execution. What am I doing wrong here? Is this the right way to design ordered, dependent jobs in Talend?
P.S. Each of the tRunJobs has its own sub-flow; for example, the User_Import flow is:

To manage subjob synchronization, use tParallelize.
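Roughly speaking, tParallelize fires its outgoing subjobs in parallel and its Synchronize trigger waits for all of them before the next subjob starts, which is the Job1/Job2 -> Job3 dependency you describe. If it helps to see the underlying pattern outside of Talend, here is a plain-Java sketch (not Talend-generated code; runJob1/runJob2/runJob3 are placeholders for the child jobs) of running two tasks concurrently and starting a third only after both succeed:

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedJobs {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Launch Job1 and Job2 independently, like two tRunJobs under tParallelize.
        Future<?> job1 = pool.submit(OrderedJobs::runJob1);
        Future<?> job2 = pool.submit(OrderedJobs::runJob2);

        try {
            // get() blocks until each job finishes and rethrows any failure.
            job1.get();
            job2.get();
            // Both completed successfully: only now start Job3 (the "synchronize" step).
            runJob3();
        } catch (ExecutionException e) {
            System.err.println("A prerequisite job failed, Job3 was not started: " + e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Placeholder job bodies; in Talend these would be the child jobs run by tRunJob.
    private static void runJob1() { System.out.println("Job1 done"); }
    private static void runJob2() { System.out.println("Job2 done"); }
    private static void runJob3() { System.out.println("Job3 done"); }
}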

Related

Create One Time Scheduled Job, Run When Others Not Running

I want to create a scheduled job in Oracle 11g Express.
I am new to job scheduling, and my search so far points to chains, but since I want to create a job from a function that runs irregularly at a yet-unknown date, I believe chains won't work for my case.
The job will be created when a procedure finishes, which determines its scheduled date X.
The job will perform some critical changes, which is why I don't want it to start while other regularly scheduled jobs are running.
I want it to wait until the other jobs finish.
I only want to have it run once and then drop the job.
Is there some good practice for this case, or some option I have missed?
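One possible direction (a sketch only; the job name, procedure and connection details are hypothetical) is a one-time DBMS_SCHEDULER job created with auto_drop => TRUE, so it runs once at the computed date X and then removes itself, submitted here from JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateOneTimeJob {

    // Hypothetical connection details.
    private static final String URL = "jdbc:oracle:thin:@//localhost:1521/XE";

    // Anonymous PL/SQL block creating a one-shot, self-dropping job.
    private static final String PLSQL =
        "BEGIN " +
        "  DBMS_SCHEDULER.CREATE_JOB(" +
        "    job_name   => 'ONE_TIME_CRITICAL_JOB', " +          // hypothetical job name
        "    job_type   => 'PLSQL_BLOCK', " +
        "    job_action => 'BEGIN my_critical_proc; END;', " +   // hypothetical procedure
        "    start_date => SYSTIMESTAMP + INTERVAL '1' HOUR, " + // the date X your procedure computed
        "    enabled    => TRUE, " +
        "    auto_drop  => TRUE); " +                            // job disappears after its single run
        "END;";

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(URL, "user", "password");
             Statement stmt = con.createStatement()) {
            stmt.execute(PLSQL);
        }
    }
}

The "wait until other jobs finish" part is not handled by CREATE_JOB itself; one common approach is to have my_critical_proc first check USER_SCHEDULER_RUNNING_JOBS (or take an application lock) and only proceed once nothing else is running.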

Load and process data in parallel inside Hadoop

I am using Hadoop to process big data. I first load data to HDFS and then execute jobs, but this is sequential. Is it possible to do it in parallel? For example,
running 3 jobs and 2 data-loading processes for other jobs at the same time on my cluster.
Cheers
It is possible to run all the jobs in parallel in Hadoop if your cluster and jobs satisfy the criteria below:
1) The Hadoop cluster should be capable of running a reasonable number of map/reduce tasks (depending on the jobs) in parallel, i.e. it should have enough map/reduce slots.
2) If a job that is currently being run depends on data that is loaded through another process, you cannot run that data load and the job in parallel.
If your process satisfies the above conditions, you can run all the jobs in parallel.
Using Oozie you can schedule all the processes to run in parallel; Oozie's fork and join nodes let you run tasks in parallel.
If your cluster has enough resources to run the jobs in parallel, then yes. But make sure the work of each job doesn't interfere with the others: loading data at the same time that another running job is supposed to be using it won't work as you expect.
If there are not enough resources, Hadoop will queue the jobs until resources become available, depending on the configured scheduler.
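To make the "enough slots" point concrete, here is a minimal Java sketch (paths, job names and the map/reduce wiring are placeholders) that submits three MapReduce jobs without waiting in between, so they can run concurrently when the cluster has free slots:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelJobs {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder input/output paths and job names.
        Job job1 = buildJob(conf, "job-1", "/data/in1", "/data/out1");
        Job job2 = buildJob(conf, "job-2", "/data/in2", "/data/out2");
        Job job3 = buildJob(conf, "job-3", "/data/in3", "/data/out3");

        // submit() is non-blocking, so all three jobs enter the cluster together;
        // whether they actually run concurrently depends on free slots/containers
        // and the configured scheduler.
        job1.submit();
        job2.submit();
        job3.submit();

        // Now wait for all of them; waitForCompletion only polls once a job is submitted.
        // Use & rather than && so we wait on every job even if one fails.
        boolean ok = job1.waitForCompletion(true)
                   & job2.waitForCompletion(true)
                   & job3.waitForCompletion(true);
        System.exit(ok ? 0 : 1);
    }

    private static Job buildJob(Configuration conf, String name, String in, String out) throws Exception {
        Job job = Job.getInstance(conf, name);
        job.setJarByClass(ParallelJobs.class);
        // ... set your mapper/reducer/output classes here as for any other job ...
        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));
        return job;
    }
}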

How to write DataStage performance stats to a DB2 table?

My DataStage version is 8.5.
I have to populate a table in DB2 with DataStage performance data: something like job_name, start_time, finish_time and execution_date.
There is a master sequence with a LOT of jobs. The sequence itself runs once a day.
After every run of this sequence I must gather the performance values and load them into a table in DB2 for reporting purposes.
I'm new to DataStage and I don't have any idea how to make this work. My DataStage environment is Windows, so I can't work on it using shell scripts.
Is there some way to get this info from within DataStage?
I tried to build a server routine and get the data using DSGetJobInfo, but I got stuck on parameter issues (how to pass xx jobs as a list to it).
Sorry about my English; it's not my native language.
Thanks in advance.
Is your server also on Windows? I am confused since you said "My DataStage environment is Windows";
most of the time the servers are installed on Linux/Unix and the clients are Windows.
The best command to use would be the following (it should work on both Windows and Linux servers):
dsjob -jobinfo [project name] [job name]
The output would be something like:
Job Status : RUN OK (1)
Job Controller : not available
Job Start Time : Tue Mar 17 09:03:37 2015
Job Wave Number : 9
User Status : not available
Job Control : 0
Interim Status : NOT RUNNING (99)
Invocation ID : not available
Last Run Time : Tue Mar 17 09:09:00 2015
Job Process ID : 0
Invocation List : [job name]
Job Restartable : 0
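Since your environment is Windows-only, one option is to shell out to dsjob from a small program and push the parsed values into DB2 over JDBC. This is only a rough sketch: the ETL_JOB_STATS table, the connection details and the mapping of the reported timestamps to your columns are assumptions, and dsjob must be on the PATH of the machine it runs on:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LoadJobStats {

    public static void main(String[] args) throws Exception {
        String project = args[0];
        String jobName = args[1];

        // Run "dsjob -jobinfo <project> <job>" and capture its output.
        Process p = new ProcessBuilder("dsjob", "-jobinfo", project, jobName)
                .redirectErrorStream(true)
                .start();

        String status = null, startTime = null, lastRunTime = null;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                // Very naive parsing of the "Key : value" lines shown above.
                if (line.startsWith("Job Status"))           status      = value(line);
                else if (line.startsWith("Job Start Time"))  startTime   = value(line);
                else if (line.startsWith("Last Run Time"))   lastRunTime = value(line);
            }
        }
        p.waitFor();

        // Hypothetical DB2 connection and audit table; map the parsed values
        // to your own column definitions (here they are kept as text).
        String url = "jdbc:db2://db2host:50000/REPORTS";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO ETL_JOB_STATS (JOB_NAME, START_TIME, FINISH_TIME, STATUS) "
                   + "VALUES (?, ?, ?, ?)")) {
            ps.setString(1, jobName);
            ps.setString(2, startTime);
            ps.setString(3, lastRunTime);
            ps.setString(4, status);
            ps.executeUpdate();
        }
    }

    private static String value(String line) {
        // Everything after the first " : " separator.
        return line.substring(line.indexOf(':') + 1).trim();
    }
}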
After all these years I have found some ways to get a job's metadata, but none of them are as good as I wanted; all of them are somewhat clunky to implement and they fail often. I found three ways to get job metadata:
Query XMETA directly, on the tables that match the DATASTAGEX(*) naming.
Query DSODB, the database behind the Operations Console tool. It has all the log information about job runs, but the Operations Console must be enabled for the data to exist (turn on the AppWatcher process).
For both of the above you can build an ETL that reads from these databases and writes wherever you want.
And the last solution:
Call an after-job subroutine that calls a script which writes the job's results to a custom table.
If this data is needed only for reporting and analysis, the first two solutions are just fine. For more specific behavior, the third one is necessary.
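For the DSODB route, the extraction can be a plain JDBC query. Treat the table and column names below as assumptions from memory (JobExec/JobRun and their run timestamps); verify them against the DSODB schema of your release before relying on this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadDsodb {

    public static void main(String[] args) throws Exception {
        // Hypothetical connection; DSODB is often DB2, but it can sit on another RDBMS.
        String url = "jdbc:db2://dsodbhost:50000/DSODB";

        // Table/column names are from memory and may differ in your release.
        String sql =
            "SELECT e.PROJECTNAME, e.JOBNAME, r.RUNSTARTTIMESTAMP, r.RUNENDTIMESTAMP, r.RUNMAJORSTATUS " +
            "FROM DSODB.JOBRUN r JOIN DSODB.JOBEXEC e ON r.JOBID = e.JOBID " +
            "WHERE r.RUNSTARTTIMESTAMP >= CURRENT TIMESTAMP - 1 DAY"; // last 24 hours

        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                // Write these rows to your own reporting table instead of printing them.
                System.out.printf("%s.%s %s -> %s (%s)%n",
                        rs.getString(1), rs.getString(2),
                        rs.getTimestamp(3), rs.getTimestamp(4), rs.getString(5));
            }
        }
    }
}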
What you are asking about is the ETL audit process, which is one of the mainstays of ETL development. I am surprised that your ETL design does not already have one.
Querying XMETA: in my experience across multiple DataStage environments, I have not seen companies use the XMETA DB to pull out job performance information.
Why? Because DataStage jobs are not recommended to access the XMETA DB, considering that XMETA holds the important metadata about DS. Your DataStage administrator will probably also not agree to provide access to XMETA.
The old and most trusted way of capturing run metadata is to develop multiple-instance, runtime-column-propagation transformations and also a few audit tables in the database of your choice.
My idea:
1. Create a table like ETL_RUN_STATS which has fields like JOB_NAME, STARTED_TS, FINISHED_TS, STATUS etc.
2. Now create your multiple-instance audit jobs and include them in your DS master sequences.
If your DS sequence looks like this now:
START ------> MAIN_DSJOB -------> SUCCESS
then after adding your audit jobs it should look like this:
START ----> AUDIT_JOB(started) -------> MAIN_DSJOB ------> AUDIT_JOB(finished) -------> SUCCESS
You can include as much functionality as you need in the AUDIT jobs to capture more runtime information.
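If it helps, the two AUDIT jobs above only need to issue one insert and one update against that table. A rough JDBC sketch of what each call amounts to (the ETL_RUN_STATS table follows the idea above; the connection details are hypothetical and the job name would come in as a parameter):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class AuditCalls {

    private static final String URL = "jdbc:db2://db2host:50000/AUDITDB"; // hypothetical

    // Called by AUDIT_JOB(started): open a run record.
    public static void jobStarted(String jobName) throws Exception {
        try (Connection con = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO ETL_RUN_STATS (JOB_NAME, STARTED_TS, STATUS) VALUES (?, ?, 'RUNNING')")) {
            ps.setString(1, jobName);
            ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
            ps.executeUpdate();
        }
    }

    // Called by AUDIT_JOB(finished): close the latest open record for this job.
    public static void jobFinished(String jobName, String status) throws Exception {
        try (Connection con = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "UPDATE ETL_RUN_STATS SET FINISHED_TS = ?, STATUS = ? " +
                 "WHERE JOB_NAME = ? AND FINISHED_TS IS NULL")) {
            ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
            ps.setString(2, status);
            ps.setString(3, jobName);
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws Exception {
        jobStarted("MAIN_DSJOB");
        // ... MAIN_DSJOB runs here ...
        jobFinished("MAIN_DSJOB", "FINISHED_OK");
    }
}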
I am suggesting this only because your DS version, 8.5, is really old.
The newer versions of DS have a lot of built-in features to access this information. Maybe you can convince your manager to upgrade DS :)
Let me know how it works

How to find the node where the pig job is running

I ran a Pig workflow using Oozie. The job completed successfully, but now I want to know on which slave or master node the job ran. My input file is a 1.4 GB file which is distributed across the nodes (1 master and 2 slaves).
I also want to figure out how much time the Pig job spent executing on each node.
Thank you in advance.
Point your web browser to "JobTracker_Machine:50030" and it will present you with the MapReduce web UI. Here you'll find all the jobs you have run (Running, Completed and Retired). Click on the job you want to analyze and it will give you all the information you need, including the node where a particular task ran and the time taken to finish the task.
HTH
Go to the Oozie web console and click on the workflow (the one which contains the Pig node). Clicking on the workflow job will open a dialog box (for your workflow) containing details of all the action nodes in the workflow. Select the Pig node (the one you want to analyse) and a detailed dialog box will appear containing the Job Tracker URL of that Pig job.
There you will find all the details you are looking for.
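The same lookup can also be scripted against the Oozie client API instead of clicking through the console. A minimal sketch (the Oozie URL and the workflow job id are placeholders): each action's external id is the Hadoop job id it launched, and the console URL is the JobTracker page where you can drill into the individual task attempts to see which node ran them and for how long:

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowAction;
import org.apache.oozie.client.WorkflowJob;

public class FindPigJob {

    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL and workflow job id.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
        WorkflowJob wf = oozie.getJobInfo("0000001-150317090337-oozie-oozi-W");

        for (WorkflowAction action : wf.getActions()) {
            // externalId is the Hadoop job launched by the action (e.g. the pig node);
            // consoleUrl points at the JobTracker page with the task/node details.
            System.out.printf("%s [%s] hadoopJob=%s console=%s%n",
                    action.getName(), action.getType(),
                    action.getExternalId(), action.getConsoleUrl());
        }
    }
}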

Periodic hadoop jobs running (best practice)

Customers are able to upload URLs to the database at any time, and the application should process the URLs as soon as possible. So I need Hadoop jobs to run periodically, or to run a Hadoop job automatically from another application (some script identifies that new links were added, generates the data for the Hadoop job and runs the job). For a PHP or Python script I could set up a cron job, but what is the best practice for periodically running Hadoop jobs (prepare the data for Hadoop, upload the data, run the Hadoop job and move the data back to the database)?
Take a look at Oozie, the new workflow system from Y!, which can run jobs based on different triggers. A good overview is presented by Alejandro here: http://www.slideshare.net/ydn/5-oozie-hadoopsummit2010
If you want the URLs to be processed as soon as possible, you'll end up processing them one at a time. My recommendation is to wait for some number of links (or some number of MB of links, or a fixed interval, for example 10 minutes or once a day),
and batch process them (I do my processing daily, but that job takes a few hours).
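If you go the batching route rather than an Oozie coordinator, the cron-like part can be a simple scheduled loop that checks for new links and submits the workflow. A sketch under those assumptions (checkForNewLinks, the Oozie URL and the workflow path are placeholders; the submission uses the standard OozieClient API):

import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.oozie.client.OozieClient;

public class PeriodicRunner {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Every 10 minutes: if enough new URLs have accumulated, launch the Hadoop workflow.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                if (checkForNewLinks()) {          // placeholder: query your database
                    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
                    Properties conf = oozie.createConfiguration();
                    conf.setProperty(OozieClient.APP_PATH,
                            "hdfs://namenode/apps/url-processing/workflow.xml"); // placeholder path
                    conf.setProperty("nameNode", "hdfs://namenode:8020");
                    conf.setProperty("jobTracker", "jobtracker:8021");
                    String jobId = oozie.run(conf);
                    System.out.println("Submitted workflow " + jobId);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 10, TimeUnit.MINUTES);
    }

    private static boolean checkForNewLinks() {
        // Placeholder: e.g. SELECT COUNT(*) FROM urls WHERE processed = 0
        return false;
    }
}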
