Setting global properties in a Composed Task so they are accessible in each subtask - spring

Setting 'common' properties for child tasks is not working
The SCDF version I'm using is 2.9.6.
I want to make a CTR A-B-C, where each task does the following:
A : SQL select on a source DB
B : process the DB data that A fetched
C : SQL insert on a target DB
The simplest way to make this work seems to be defining a shared work directory path "some_work_directory" and passing it as an application property to A, B, and C. Under {some_work_directory}, I would store each task's result as a file, like select.result, process.result, insert.result, and access them in sequence. If the preceding data is missing, I could assume something went wrong and make the task exit with 1.
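A minimal sketch of that file-based handoff (the directory layout and file names are the ones proposed above; the helper class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResultHandoff {

    // Read the preceding task's result file; return null if it is missing,
    // so the caller can treat that as a failure and exit with code 1.
    static String readPrecedingResult(Path workDir, String fileName) throws IOException {
        Path result = workDir.resolve(fileName);
        if (!Files.exists(result)) {
            return null; // precedent data missing -> caller should System.exit(1)
        }
        return Files.readString(result);
    }

    // Write this task's result so the next task in the CTR can pick it up.
    static void writeResult(Path workDir, String fileName, String content) throws IOException {
        Files.createDirectories(workDir);
        Files.writeString(workDir.resolve(fileName), content);
    }
}
```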
================
I tried a composed task instance QWER with two tasks, A and B, created from the same application "global". This simple application prints the test.value application property to the console; it defaults to "test" when no other value is given.
If I set test.value on the Global tab of the SCDF launch builder, it is interpreted as app.*.test.value in the composed task's log. However, the logs of child tasks A and B show that neither picks up this configuration from the parent; both fail to resolve the value given at launch time.
If I set test.value as a row in the launch builder and pass a value to A and B, as I would for a non-composed task, this fails as well. I know this is not the 'global' behavior I need, but it suggests that CTR is not working correctly with the SCDF launch builder.
The only workaround I found is to manually set app.QWER.A.test.value=AAAAA and app.QWER.B.test.value=BBBBB in the launch free text. This way, the input is converted to app.QWER-A.app.global4.test.value=AAAAA and app.QWER-B.app.global4.test.value=BBBBB, and the values print correctly.
I understand that this way I can set detailed configurations for each child task at launch time. However, if I just want to set some 'global' value that all tasks in one CTR instance share, there seems to be no feasible way.
Am I missing something? Thanks in advance for any information.

CTR will orchestrate the execution of a collection of tasks. There is no implicit data transfer between tasks. If you want the data from A to be the input to B and the output of B to become the input of C, you can create one Task/Batch application that has readers and writers connected by a processor, OR you can create a stream application for B and use a JDBC source and sink for A and C.

Related

How to get the current max-workers value?

In my build.gradle script, I have an Exec task to start a multithreaded process, and I would like to limit the number of threads in the process to the value of max-workers, which can be specified on the command line with --max-workers, or in gradle.properties with org.gradle.workers.max, and possibly other ways. How should I get this value to then pass on to my process?
I have already tried reading the org.gradle.workers.max property, but that property doesn't exist if it is not set by the user and instead max-workers is defined by some other means.
You may be able to access this value using the API: project.getGradle().getStartParameter().getMaxWorkerCount().

Spring Batch - Is it possible to write a custom restart in spring batch

I have been writing a Spring Batch job in which I have to perform some error handling.
I know that Spring Batch has its own way of handling errors and restarts.
However, when the batch fails and restarts, I want to pass my own values, conditions and parameters (for the restart) which need to be applied before starting/executing the first step.
So, is it possible to write such a custom restart in Spring Batch?
UPDATE1: (Providing a better explanation of the above question.)
Let's say that the input to my reader in Step 1 is in the following format:
Table with following columns:
CompanyName1 -> VehicleId1
CN1 -> VID2
CN1 -> VID3
.
.
CN1 -> VID30
CN2 -> VID1
CN2 -> VID2
.
.
CNn -> VIDn
The reader reads this table row by row with a chunk size of 1 (so each item retrieved is one CN -> VID row), processes it, and writes it to a File object.
This goes on until all the CN1 data has been written into the File object. When the reader sends a row with a company name of type CN2, the File object created earlier (for company name CN1) is stored in a remote location. File object creation then continues for CN2 until we encounter CN3, at which point the CN2 File object is sent to remote storage, and the process continues.
Now, once you understand this, here's a catch.
Let's say the writer is currently writing data for company name 2 (CN2) and vehicle ID VID20 (CN2 -> VID20) into the File object. Then, for some reason, we have to stop the job, or the job fails. In that case, the saved instance will be CN2 -> VID20, so the next time the job runs, it will start from CN2 -> VID20.
As you might have guessed, the 19 entries before CN2 -> VID20 that had been written into the File object were lost permanently when the File object was destroyed, and were never sent to the remote location.
So my question here is this:
Is there a way where I can write my custom restart for the batch where I could tell the job to start from CN2->VID1 instead of CN2->VID20?
If you could think of any other way to handle this scenario then such suggestions are also welcome.
Since you want to write the data of each company in a separate file, I would use the company name as a job parameter. With this approach:
Each job will read the data of a single company and write it to a file
If a job fails, you can restart it and it will resume where it left off (Spring Batch truncates the file to the last known written offset before writing new data), so there is no need for a custom restart "policy".
No need to set the chunk size to 1; that is not efficient. You can use a reasonable chunk size with this approach.
If the number of companies is small enough, you can run jobs manually. Otherwise, you can always get the distinct values with something like select distinct(company_name) from yourTable and write a script/loop to launch batch jobs with different parameters. Bottom line, make each job do one thing and do it well.
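A minimal sketch of that launch loop in plain Java (the actual Spring Batch JobLauncher call is left as a comment; the row layout and the "companyName" parameter name are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class PerCompanyLauncher {

    // Derive the distinct company names from the table rows
    // (equivalent to: select distinct(company_name) from yourTable).
    static List<String> distinctCompanies(List<String[]> rows) {
        LinkedHashSet<String> companies = new LinkedHashSet<>();
        for (String[] row : rows) {
            companies.add(row[0]); // row = {companyName, vehicleId}
        }
        return new ArrayList<>(companies);
    }

    // Build one job parameter set per company; each set identifies one job run.
    static List<Map<String, String>> buildJobParameters(List<String> companies) {
        List<Map<String, String>> paramSets = new ArrayList<>();
        for (String company : companies) {
            // In the real job you would launch here, e.g.:
            // jobLauncher.run(exportJob, new JobParametersBuilder()
            //         .addString("companyName", company).toJobParameters());
            paramSets.add(Map.of("companyName", company));
        }
        return paramSets;
    }
}
```

Because the company name is part of the job parameters, each company gets its own job instance, and a failed instance restarts only that company's file.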

How to reserve resource for a certain task in spark?

I have two kinds of tasks in Spark: A and B.
In spark.scheduler.pool, I have two pools: APool and BPool.
I want task A to always be executed in APool while B runs in BPool.
The resources in APool are reserved for A.
Because task B may take too many resources to execute, A has to wait every time B is running. I want there to always be some resources available for A, no matter when a task is submitted.
I am using Spark with Java in standalone mode. I submit the job like javaRDD.map(..).reduce... The javaRDD is a subclass extended from JavaRDD. Tasks A and B have different RDD classes, like ARDD and BRDD. They run in the same Spark application.
The procedure is: the app starts up -> a Spark application is created, but no job runs -> I click "run A" on the app UI, and ARDD runs -> I click "run B" on the app UI, and BRDD runs in the same Spark application as A.
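For reference, pinning jobs to pools within one application is usually done with Spark's fair scheduler: enable it with spark.scheduler.mode=FAIR, define the pools in a fairscheduler.xml like the sketch below, and call sc.setLocalProperty("spark.scheduler.pool", "APool") on the thread that submits A's jobs (and likewise for B). The minShare values here are illustrative only:

```xml
<!-- fairscheduler.xml: illustrative pool definitions; minShare reserves
     a minimum number of cores for APool even while BPool is busy -->
<allocations>
  <pool name="APool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="BPool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```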

Hive execution hook

I need to hook custom execution code into Apache Hive. Please let me know if somebody knows how to do it.
The current environment I am using is given below:
Hadoop : Cloudera version 4.1.2
Operating system : Centos
Thanks,
Arun
There are several types of hooks depending on at which stage you want to inject your custom code:
Driver run hooks (Pre/Post)
Semantic analyzer hooks (Pre/Post)
Execution hooks (Pre/Failure/Post)
Client statistics publisher
If you run a script, the processing flow looks as follows:
Driver.run() takes the command
HiveDriverRunHook.preDriverRun()
(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
Driver.compile() starts processing the command: creates the abstract syntax tree
AbstractSemanticAnalyzerHook.preAnalyze()
(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
Semantic analysis
AbstractSemanticAnalyzerHook.postAnalyze()
(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
Create and validate the query plan (physical plan)
Driver.execute() : ready to run the jobs
ExecuteWithHookContext.run()
(HiveConf.ConfVars.PREEXECHOOKS)
ExecDriver.execute() runs all the jobs
For each job at every HiveConf.ConfVars.HIVECOUNTERSPULLINTERVAL interval:
ClientStatsPublisher.run() is called to publish statistics
(HiveConf.ConfVars.CLIENTSTATSPUBLISHERS)
If a task fails: ExecuteWithHookContext.run()
(HiveConf.ConfVars.ONFAILUREHOOKS)
Finish all the tasks
ExecuteWithHookContext.run() (HiveConf.ConfVars.POSTEXECHOOKS)
Before returning the result HiveDriverRunHook.postDriverRun() ( HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
Return the result.
For each of the hooks I indicated the interfaces you have to implement. In brackets there is the corresponding configuration property key you have to set in order to register the class at the beginning of the script.
E.g.: setting the PreExecution hook (9th stage of the workflow):
HiveConf.ConfVars.PREEXECHOOKS -> hive.exec.pre.hooks :
set hive.exec.pre.hooks=com.example.MyPreHook;
Unfortunately these features aren't really documented, but you can always look into the Driver class to see the evaluation order of the hooks.
Remark: I assumed Hive 0.11.0 here; I don't think that the Cloudera distribution differs (too much).
A good start --> http://dharmeshkakadia.github.io/hive-hook/ (there are examples).
Note: the Hive CLI shows the hook messages on the console; if you execute from Hue, add a logger and you can see the results in the HiveServer2 log role.

Run @Scheduled task only on one WebLogic cluster node?

We are running a Spring 3.0.x web application (.war) with a nightly @Scheduled job in a clustered WebLogic 10.3.4 environment. However, as the application is deployed to each node (using the deployment wizard in the AdminServer's web console), the job is started on each node every night and thus runs multiple times concurrently.
How can we prevent this from happening?
I know that libraries like Quartz allow coordinating jobs in a clustered environment by means of a database lock table, and I could even implement something like this myself. But since this seems to be a fairly common scenario, I wonder whether Spring already comes with an option to easily circumvent this problem without my having to add new libraries to the project or put in manual workarounds.
We are not able to upgrade to Spring 3.1 with configuration profiles, as mentioned here
Please let me know if there are any open questions. I also asked this question on the Spring Community forums. Thanks a lot for your help.
We only have one task that sends a daily summary email. To avoid extra dependencies, we simply check whether the hostname of each node corresponds to a configured system property.
private boolean isTriggerNode() throws UnknownHostException {
    String triggerHostname = System.getProperty("trigger.hostname");
    String hostName = InetAddress.getLocalHost().getHostName();
    return hostName.equals(triggerHostname);
}

public void execute() throws UnknownHostException {
    if (isTriggerNode()) {
        // send email
    }
}
We are implementing our own synchronization logic using a shared lock table inside the application database. This allows all cluster nodes to check if a job is already running before actually starting it itself.
Be careful: with the solution of implementing your own synchronization logic using a shared lock table, you still have a concurrency issue when two cluster nodes read from/write to the table at the same time.
It is best to perform the following steps in one DB transaction:
- read the value in the shared lock table
- if no other node is holding the lock, take the lock
- update the table to indicate that you have taken the lock
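A sketch of that check-and-take step. The lock store is simulated with an in-memory map here; in a real cluster the same compare-and-set would be a SELECT ... FOR UPDATE plus UPDATE inside one DB transaction. All class and method names are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class JobLockTable {

    // jobName -> owning node; stands in for the shared DB lock table
    private final ConcurrentMap<String, String> locks = new ConcurrentHashMap<>();

    // Atomically take the lock if no other node holds it. The three steps
    // (read, check, update) collapse into one atomic operation, which is
    // exactly what the single DB transaction achieves.
    public boolean tryAcquire(String jobName, String nodeName) {
        return locks.putIfAbsent(jobName, nodeName) == null;
    }

    // Release the lock; only the owning node's release has any effect.
    public void release(String jobName, String nodeName) {
        locks.remove(jobName, nodeName);
    }
}
```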
I solved this problem by making one of the boxes the master. Basically, set an environment variable on one of the boxes, like master=true, and read it in your Java code through System.getenv("master"). If it is present and true, then run your code.
Basic snippet:
@Scheduled
void process() {
    boolean master = Boolean.parseBoolean(System.getenv("master"));
    if (master) {
        // your logic
    }
}
You can try using the TimerManager (the Job Scheduler in a clustered environment) from WebLogic as the TaskScheduler implementation (TimerManagerTaskScheduler). It should work in a clustered environment.
Andrea
I've recently implemented a simple annotation library, dlock, to execute a scheduled task only once over multiple nodes. You can simply do something like below.
@Scheduled(cron = "59 59 8 * * *" /* Every day at 8:59:59am */)
@TryLock(name = "emailLock", owner = NODE_NAME, lockFor = TEN_MINUTE)
public void sendEmails() {
List<Email> emails = emailDAO.getEmails();
emails.forEach(email -> sendEmail(email));
}
See my blog post about using it.
You don't need to synchronize your job start using a DB.
In a WebLogic application you can get the name of the instance where the application is running:
String serverName = System.getProperty("weblogic.Name");
Simply put a condition to execute the job:
if (serverName.equals(".....")) {
    execute my job;
}
If you want to bounce your job from one machine to the other, you can get the current day of the year: if it is odd you execute the job on one machine, if it is even you execute it on the other.
This way you load a different machine every day.
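A sketch of that odd/even alternation (the server names are hypothetical; in practice you would compare against System.getProperty("weblogic.Name") as shown above):

```java
import java.time.LocalDate;

public class AlternatingTrigger {

    // Pick which server runs the job on the given date:
    // odd day of year -> serverA, even day of year -> serverB.
    static String triggerServerFor(LocalDate date, String serverA, String serverB) {
        return (date.getDayOfYear() % 2 == 1) ? serverA : serverB;
    }

    // A node runs the job only if it is today's trigger server.
    static boolean shouldRunToday(String myServerName, LocalDate date,
                                  String serverA, String serverB) {
        return myServerName.equals(triggerServerFor(date, serverA, serverB));
    }
}
```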
We can keep the other machines in the cluster from running the batch job by using the following cron string. It will not fire until 2099.
0 0 0 1 1 ? 2099
