Running the same job in parallel in DataStage - parallel-processing

How can we run the same job with different job parameters in parallel from a sequencer in DataStage?
If it is possible by instantiating the job, how is that done?
Thanks in advance!

You can use the Allow Multiple Instance option, which creates instances based on the Invocation ID, so multiple instances of the same job can run in parallel on the DataStage server with different parameters. The Allow Multiple Instance setting is available in the job properties.

You can then create a sequence job, put the job in it twice, and assign each occurrence its own parameter values.
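As a rough sketch of how such instances can also be started outside the sequencer (the project, job, parameter, and invocation IDs below are made up), the dsjob command-line client takes an invocation ID appended to the job name:

dsjob -run -mode NORMAL -param SourceSystem=EMEA MyProject LoadSales.EMEA
dsjob -run -mode NORMAL -param SourceSystem=APAC MyProject LoadSales.APAC

Each call starts a separate instance of the multi-instance job with its own parameter values, and the instances run in parallel.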

Related

Should I create only SYNC jobs in SQLake?

Should we always be just creating sync jobs as our general rule of thumb in Upsolver SQLake?
In most cases, yes, you want to use sync jobs. The only case where you don't want a sync job is when a table has an input that you don't want to wait for.
Example: you have 5 jobs that write to a table and some jobs that read from that table. If you don't want the entire pipeline to get stuck when one of the 5 jobs is stuck, then your pipeline needs to be unsynced (or at least the specific job that you think may get stuck should be unsynced).
Note: UNSYNC is not a keyword. CREATE JOB creates an unsynced job by default; CREATE SYNC JOB creates a sync job.
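As a rough sketch of the syntax (the connection, bucket, catalog, and table names are made up, and the exact job options depend on your SQLake version), the only difference between the two forms is the SYNC keyword:

CREATE SYNC JOB load_orders_raw
    CONTENT_TYPE = JSON
AS COPY FROM S3 my_s3_connection
    BUCKET = 'my-bucket'
    PREFIX = 'orders/'
INTO my_catalog.my_schema.orders_raw;

Dropping SYNC from the first line gives the default, unsynced behaviour described in the note above.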

Apache NiFi - Generic Execution Flows?

I'm trying to figure out if the following scenarios are possible:
I have hundreds of tables that need to use the same flow, but with different intervals, source hostnames, and destinations.
How do I build such a flow? I also can't figure out how to use dynamic hosts/schemas/table names...
We maintain a table with all this info, but how do we drive the execution from it in NiFi?
If I need to load a file onto multiple clusters (a different cluster per table) in parallel, how can this be achieved?
Thanks!
The solution I found is to use an external scheduler (like Airflow) together with a ListenHTTP processor.
You can then send any data you wish to that listener, parse it, and use it as parameters/attributes in the rest of the flow.
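For example, the external scheduler (or any client) can POST the per-table parameters to the ListenHTTP endpoint. This is only a sketch: the host, port, and JSON field names are assumptions, and "contentListener" is just the processor's default base path.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TriggerNifiFlow {
    public static void main(String[] args) throws Exception {
        // Per-table parameters, e.g. read from the metadata table mentioned in the question.
        String body = "{\"sourceHost\":\"db1.example.com\",\"schema\":\"sales\",\"table\":\"orders\"}";

        // ListenHTTP's default base path is "contentListener"; host and port are assumptions.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://nifi-host:8081/contentListener"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("NiFi responded with HTTP " + response.statusCode());
    }
}

Inside the flow, an EvaluateJsonPath processor can then lift those fields into flow-file attributes, so the same downstream processors work for any host/schema/table combination.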

How can I execute a MapReduce jar programmatically from another Java program?

I have a MapReduce program (DataProfiler.jar) which performs some data profiling, taking the table name and column family name as command-line parameters:
hadoop jar DataProfiler.jar -D tableName=MyTable -D columnFamilyName=CF1
Is there a way I could wrap this in another Java program, so that I can execute the jar for all the tables (by connecting to the database and getting a list of all the tables and columns)?
Thanks!
Instead of calling the MapReduce jar from a separate Java program, I would suggest writing the MapReduce driver class in a slightly more generic way.
Let's call this class JobRunner. You can define member variables that hold the table and column information to process, then set up the MapReduce configuration and start the job. Technically you achieve the same thing, just in a slightly different way, and I think it is a better approach than shelling out to a jar to start the MapReduce job.
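A minimal sketch of that idea, assuming Hadoop's mapreduce API. The metadata lookup and the input/output paths are placeholders, and the identity Mapper stands in for the real profiling mapper from DataProfiler.jar.

import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobRunner {

    private final String tableName;
    private final String columnFamilyName;

    public JobRunner(String tableName, String columnFamilyName) {
        this.tableName = tableName;
        this.columnFamilyName = columnFamilyName;
    }

    // Configures and submits one profiling job for this table / column family.
    public boolean run() throws Exception {
        Configuration conf = new Configuration();
        conf.set("tableName", tableName);               // same keys the jar received via -D
        conf.set("columnFamilyName", columnFamilyName);

        Job job = Job.getInstance(conf, "profile-" + tableName);
        job.setJarByClass(JobRunner.class);
        job.setMapperClass(Mapper.class);               // placeholder: swap in the real profiling mapper
        job.setOutputKeyClass(LongWritable.class);      // matches the identity Mapper placeholder
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/" + tableName));      // assumed layout
        FileOutputFormat.setOutputPath(job, new Path("/profile/" + tableName)); // assumed layout
        return job.waitForCompletion(true);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder for the JDBC lookup that lists every table and column family to profile.
        List<String[]> tablesAndColumns =
                Collections.singletonList(new String[] { "MyTable", "CF1" });
        for (String[] tc : tablesAndColumns) {
            new JobRunner(tc[0], tc[1]).run();
        }
    }
}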

Passing Parameters to MapReduce Program

I need to pass some parameters to my map program. The values for these parameters need to be fetched from a database, and they are dynamic. I know how to pass the parameters using the Configuration API. If I write JDBC code in the driver/client to retrieve these values from the database and then set them through the Configuration API, how many times will that code be executed? Will the driver code be distributed and executed on every data node where the Hadoop framework decides to run the MR program?
What is the best way to do this?
The driver code is not distributed; it runs once, on the client machine that submits the job.
I suggest fetching the data outside the MapReduce program and then passing it in as a parameter.
Say you have a script that launches the job: fetch the data from the database into a variable and then pass that variable to the hadoop job.
I think this will do what you need.
If the data you need is big (more than a few kilobytes), Configuration may not be suitable. A better alternative is to use Sqoop to fetch the data from the database into HDFS, then use the Hadoop distributed cache so that your map or reduce code can read the data without any parameters being passed in.
You can retrieve the values from DB in the driver code. The driver code will execute only once per Job.
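A sketch of that driver-side approach, with the JDBC URL, query, and configuration key invented for illustration: the lookup runs once on the submitting client, and every map task reads the value back from the job Configuration.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParameterizedJob {

    // The mapper reads the value back from the job Configuration; no JDBC runs on the cluster.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String threshold;

        @Override
        protected void setup(Context context) {
            threshold = context.getConfiguration().get("my.dynamic.threshold");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the dynamic value while processing each record.
            context.write(new Text(threshold), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Runs exactly once, on the client machine that submits the job.
        // The JDBC URL, credentials, and query are illustrative only.
        try (Connection c = DriverManager.getConnection("jdbc:mysql://dbhost/params", "user", "pw");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT value FROM job_params WHERE name = 'threshold'")) {
            if (rs.next()) {
                conf.set("my.dynamic.threshold", rs.getString(1));
            }
        }

        Job job = Job.getInstance(conf, "parameterized-job");
        job.setJarByClass(ParameterizedJob.class);
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}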

Job action string too long

I'm trying to create a job that will sync two databases at midnight. There are 10 tables that need to be synced, and it's a very long PL/SQL script. When I set this script as the JOB_ACTION and try to create the job, I get "string value too long for attribute job action". What do you suggest I do? Should I separate the script into 10 pieces? Isn't there a way to make the job run the code as a script? If I do it manually, all 10 anonymous blocks get executed one after another. I need something that will, in effect, press F5 for me at midnight.
What you need is a DBMS_SCHEDULER chain, in which each action is a separate step, and the steps can be executed at the same time.
http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_sched.htm
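A rough sketch of such a chain (the program, chain, and step names are made up, and only two of the ten steps are shown). Each step wraps one of your anonymous blocks in a stored procedure, so no single job_action string has to hold the whole script:

BEGIN
  -- One program per table-sync block (repeat for the remaining tables).
  DBMS_SCHEDULER.CREATE_PROGRAM(
    program_name   => 'sync_table1_prog',
    program_type   => 'STORED_PROCEDURE',
    program_action => 'sync_pkg.sync_table1',
    enabled        => TRUE);
  DBMS_SCHEDULER.CREATE_PROGRAM(
    program_name   => 'sync_table2_prog',
    program_type   => 'STORED_PROCEDURE',
    program_action => 'sync_pkg.sync_table2',
    enabled        => TRUE);

  DBMS_SCHEDULER.CREATE_CHAIN(chain_name => 'nightly_sync_chain');
  DBMS_SCHEDULER.DEFINE_CHAIN_STEP('nightly_sync_chain', 'step1', 'sync_table1_prog');
  DBMS_SCHEDULER.DEFINE_CHAIN_STEP('nightly_sync_chain', 'step2', 'sync_table2_prog');

  -- Start all steps together; end the chain once they have completed.
  DBMS_SCHEDULER.DEFINE_CHAIN_RULE('nightly_sync_chain', 'TRUE', 'START step1, step2');
  DBMS_SCHEDULER.DEFINE_CHAIN_RULE('nightly_sync_chain',
                                   'step1 COMPLETED AND step2 COMPLETED', 'END');
  DBMS_SCHEDULER.ENABLE('nightly_sync_chain');

  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'nightly_sync_job',
    job_type        => 'CHAIN',
    job_action      => 'nightly_sync_chain',
    repeat_interval => 'FREQ=DAILY; BYHOUR=0',
    enabled         => TRUE);
END;
/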
