spring xd pass data between jobs in composed job - spring-xd

What is the best way to pass parameters between Spring XD jobs within a composed job? Can I get the parent's (composed job's) execution context to set job parameters and make them available to the subsequent jobs?

In short...don't. There is no common component between the jobs within a composed job. That feature is really intended for job orchestration. You're better off handling any shared state between the jobs on your own.

Related

Identify a spring-batch job instance with incrementer

Let me briefly describe what I want and what I - maybe - already know.
I want spring-batch to run an async job; in the future, more jobs.
The job gets two parameters: an external id and a year.
The job should be able to be restarted after completion because the user wants to run a job with the same parameters again and again.
Only one job should be executed with the same parameters at the same time.
From outside (web interface) it should be possible to query whether a job is running, by job name and parameters.
The querier could be different from the job starter, so an instance or execution id is not available.
I know that a job instance is the representation of the job (name) and its parameters, and - as you commented - I cannot rerun a job with the same parameters if the instance/execution is marked completed, unless I use an incrementer.
But this changes the parameters by adding a run.id. Now the job is restartable, but neither I nor spring-batch itself can identify a running job instance (by name and original parameters) anymore, because every job run results in a new instance.
And the question "why would one restart a successfully completed job instance?" is easy to answer: the user outside doesn't know about jobs/instances/executions. The user will start some data processing for a year again and again, and it's my task to make that possible :).
So it would be nice if spring-batch can let the user know "the job with your original parameters is still running".
Question:
What would be a good solution for my needs?
I haven't tried anything yet, only thought about it. Maybe I can write my own JobDao for my query? But that will not solve the run-one-instance-at-a-time problem. Or I can customize the JdbcJobInstanceDao or SimpleJobRepository? Maybe I must add my own job_key which contains only the original parameters?
To correctly understand the answer I am going to give to your question, it is important to know the difference and the relation between a job, a job instance and a job execution in Spring Batch. The "Domain Language of Batch" section of the reference documentation explains that in detail with examples.
The job should be able to be restarted after completion.
This is not possible by design, or more precisely, a job instance cannot be restarted after completion by design (think of it as "why would one restart a successfully completed job instance?").
From outside (web interface) it should be possible to query whether an instance is running, by job name and parameters. The querier could be different from the job starter, so an instance or execution id is not available.
The JobExplorer is the API you are looking for. You can ask for job instances and job executions as needed.
Question: What would be a good solution for my needs?
In your case, you receive an external ID and a year as a job execution request. Those two parameters can be used as identifying parameters to define job instances. With this in place, if a job instance fails, you can restart it by using the same parameters.
I see no need for an incrementer in your case. The incrementer is useful for jobs whose instances can be defined as a "sequence" that can be "incremented". I see no need to create a custom DAO or JobRepository either; you should be able to implement your requirement with the built-in components by correctly defining what a job instance is.
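As an illustration of identifying parameters, here is a minimal sketch (names like externalId, year, job and jobLauncher are assumptions for the example, not from the original post). Both parameters are identifying by default, so together they define the job instance; re-running with the same values will restart that instance only if it previously failed:

JobParameters params = new JobParametersBuilder()
        .addString("externalId", externalId) // identifying by default
        .addLong("year", (long) year)        // identifying by default
        .toJobParameters();
// same (externalId, year) => same job instance; throws the usual
// JobExecutionAlreadyRunning/JobInstanceAlreadyComplete checked exceptions
jobLauncher.run(job, params);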
For my use case I have to check whether an execution for a job/parameters combination is running. The parameters here are without the run.id of an incrementer. This check must be done before a job run and via an explicit REST call. Normally spring-batch checks for running executions, but because of the incrementer every job instance is unique, so it will never find any.
So I created a bean with a check method and made use of jobExplorer.findRunningJobExecutions(jobName). The result can then be compared with the supplied parameters by iterating over JobExecution.getJobParameters().getParameters().
The bean can be used in the REST method and in a custom implementation of JobLauncher.run().
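A rough sketch of such a check bean, assuming a Spring Batch 4 style API and constructor injection of the JobExplorer (class and method names are made up for illustration):

import java.util.Map;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameter;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.stereotype.Component;

@Component
public class RunningJobChecker {

    private final JobExplorer jobExplorer;

    public RunningJobChecker(JobExplorer jobExplorer) {
        this.jobExplorer = jobExplorer;
    }

    // true if an execution of jobName with (at least) the given original parameters is running
    public boolean isRunning(String jobName, JobParameters originalParameters) {
        for (JobExecution execution : jobExplorer.findRunningJobExecutions(jobName)) {
            Map<String, JobParameter> running = execution.getJobParameters().getParameters();
            boolean allMatch = originalParameters.getParameters().entrySet().stream()
                    .allMatch(e -> e.getValue().equals(running.get(e.getKey())));
            if (allMatch) {
                return true; // extra parameters such as run.id on the running execution are ignored
            }
        }
        return false;
    }
}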
Another solution would be to store the increment separately for each job/parameters combination. But I don't want to do this, not least because I think a framework like spring-batch should do this for me, or should support me by reusing/restarting a completed job instance.

Spring Scheduling Quartz and thousands of jobs

According to the business logic of my Spring Boot application with Quartz Scheduling and MongoDB as the job persistence store, every user of the system can create a postponed job that must be executed at some point in time. The user chooses the time at which it must be executed.
Right now I'm thinking about the approach where every user will create a dedicated JobDetail for every postponed job, something like this:
schedulerFactoryBean.getScheduler().addJob(jobDetail(), true, true);
The issue I can potentially see here is that with this approach I can quickly create thousands of jobs in the Quartz scheduler. I have never scheduled such a number of jobs in Spring Scheduling with Quartz before and don't know how the system will handle it. Is it a good idea to implement the system this way, and will Spring Scheduling with Quartz handle that many jobs without problems?
Yes, Quartz itself can handle thousands of jobs and triggers without any issues.
If you are going to have many jobs executing concurrently, just make sure that you configure Quartz with a sufficient number of worker threads. The number of worker threads should typically be equal to the maximum number of jobs that can be running concurrently, plus a small buffer (10% or so) just in case.
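For example, if at most about 20 postponed jobs should ever run at once, that rule of thumb suggests roughly 22-25 worker threads. A sketch of setting this through Spring's SchedulerFactoryBean (the thread count value is only a placeholder, and your setup may configure Quartz differently):

import java.util.Properties;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.quartz.SchedulerFactoryBean;

@Configuration
public class QuartzConfig {

    @Bean
    public SchedulerFactoryBean schedulerFactoryBean() {
        SchedulerFactoryBean factory = new SchedulerFactoryBean();
        Properties quartzProperties = new Properties();
        // roughly: max concurrently running jobs + ~10% buffer
        quartzProperties.setProperty("org.quartz.threadPool.threadCount", "25");
        factory.setQuartzProperties(quartzProperties);
        return factory;
    }
}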
From what you write I assume that your jobs will be one-off jobs, i.e. each job will be executed only once. If that is the case, Quartz can automatically discard your jobs as soon as they finish executing unless your jobs are marked as durable. Quartz automatically removes non-durable jobs if they are not scheduled to run in the future. This feature may help you reduce the total number of registered jobs.
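A sketch of scheduling such a one-off, non-durable job (PostponedJob, userId and userChosenDate are placeholders): once the single trigger has fired and no future trigger references the job, Quartz removes it.

import java.util.Date;

import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;

public void scheduleUserJob(Scheduler scheduler, String userId, Date userChosenDate) throws SchedulerException {
    JobDetail jobDetail = JobBuilder.newJob(PostponedJob.class) // PostponedJob: placeholder Job implementation
            .withIdentity("postponed-" + userId)                // placeholder identity
            .storeDurably(false)  // non-durable: discarded once no trigger references it
            .build();
    Trigger trigger = TriggerBuilder.newTrigger()
            .forJob(jobDetail)
            .startAt(userChosenDate) // the user-chosen execution time
            .build();
    scheduler.scheduleJob(jobDetail, trigger);
}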
I hope this helps. If not, please ask.

Multiple Spring Batch instances for robustness and scalability

My batch use case looks like a common pattern, yet I'm not sure if Spring Batch is designed to work as I expect. Many thanks in advance for clarifications and suggestions.
There can be a number of spring-batch based applications responsible for task processing. Processing requests come in via a REST resource, hence a ThreadPoolTaskExecutor is used (as discussed here). The JobRepository is JDBC-based, and all instances share exactly the same configuration.
What I want to achieve is a situation where each node can process any job that has been submitted. This way I can scale my solution out as load grows - I would simply add new instances to process queued requests. The solution is also robust: if any node dies, its tasks are automatically handled by a different node.
But apparently it does not work this way with Spring Batch. Each node only handles the tasks that were submitted to that very node. Even if node A has 1000 items in its queue and node B is doing nothing, B will not take any of A's load.
That's because of how SimpleJobLauncher works - it simply queues a task onto its local taskExecutor after creating the respective JobExecution:
JobExecution jobExecution = this.jobRepository.createJobExecution(job.getName(), jobParameters);
this.taskExecutor.execute(new Runnable() { /* ... runs job.execute(jobExecution) on this node ... */ });
I don't think I need job partitioning - it does not seem to be the use case here. So how do I achieve robustness and scalability with Spring Batch?
Thanks

Chaining Spring Batch Job

I have two jobs defined in two different XML files, say Job A and Job B.
I need to call Job B on successful completion of Job A.
What is the best approach for doing this?
I am pretty new to spring-batch so looking for the best approach to handle this.
You can create a super job and run your Job A and Job B as steps within it, specifying that Job B should be executed only on successful completion of Job A.
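A minimal Java-config sketch of that idea, assuming jobA and jobB are the two jobs already defined in your XML files and available as beans (all other names here are illustrative): each job is wrapped in a job step, and the second step only runs if the first completes successfully.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SuperJobConfig {

    @Bean
    public Job superJob(JobBuilderFactory jobs, StepBuilderFactory steps,
                        Job jobA, Job jobB, JobLauncher jobLauncher) {
        // wrap each existing job in a step of the parent ("super") job
        Step stepA = steps.get("stepA").job(jobA).launcher(jobLauncher).build();
        Step stepB = steps.get("stepB").job(jobB).launcher(jobLauncher).build();
        return jobs.get("superJob")
                .start(stepA)
                .next(stepB) // runs only if stepA (i.e. Job A) completed successfully
                .build();
    }
}

In XML configuration the equivalent is a step element with a nested job reference, e.g. a step containing <job ref="jobA"/>.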

Is hadoop's job ThreadSafe?

Does anyone know whether org.apache.hadoop.mapreduce.Job is thread-safe? In my application I create a thread for each job and then call waitForCompletion. I also have another monitor thread that checks every job's state with isComplete.
Is that safe? Are jobs thread-safe? The documentation doesn't seem to mention anything about it...
Thanks
Udi
Unlike the other answerers, I also use threads to submit jobs in parallel and wait for their completion. You just have to use one Job instance per thread. If you share the same Job instance across multiple threads, you have to take care of the synchronization yourself.
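A rough sketch of that approach (mapper/reducer classes and input/output paths are omitted; all names and the job count are illustrative). For simplicity the completion check happens after all submitting threads have finished, whereas a real monitor thread would poll while the jobs run:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParallelJobSubmitter {

    public static void main(String[] args) throws Exception {
        int jobCount = 3; // illustrative
        ExecutorService pool = Executors.newFixedThreadPool(jobCount);
        List<Job> jobs = new ArrayList<>();

        for (int i = 0; i < jobCount; i++) {
            Job job = Job.getInstance(new Configuration(), "job-" + i);
            // ... set mapper/reducer classes and input/output paths here ...
            jobs.add(job);
            // each thread works with its own Job instance, so no extra synchronization is needed
            pool.submit(() -> {
                job.waitForCompletion(true);
                return null;
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait until all submitting threads are done

        // check each job's state (assumes every job was actually submitted)
        for (Job job : jobs) {
            System.out.println(job.getJobName() + " complete? " + job.isComplete());
        }
    }
}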
Why would you want to write a separate thread for each job? What exactly is your use case?
You can run multiple jobs in your Hadoop cluster. Do you have dependencies between the multiple jobs?
Suppose you have 10 jobs running and 1 job fails: would you then need to re-run the 9 successful ones?
Finally, the job tracker will take care of scheduling multiple jobs on the Hadoop cluster. If you do not have dependencies between jobs, then you should not be worried about thread safety. If you do have dependencies, then you may need to rethink your design.
Yes, they are. Actually, the input files are split into blocks, and each block is processed on a separate node. All the map tasks run in parallel, and their output is fed to the reducer after they are done. There is no question of synchronization as you would think about it in a multi-threaded program. In a multi-threaded program all the threads run on the same box, and since they share some of the data you have to synchronize them.
In case you need another kind of parallelism at the map-task level, you can override the run() method in your mapper and work with multiple threads there. The default implementation calls setup(), then calls map() once per record to process, and finally calls cleanup() once.
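For reference, a sketch of roughly what the default run() loop does and where a custom multi-threaded variant would hook in (the class and generic type names are placeholders):

import java.io.IOException;

import org.apache.hadoop.mapreduce.Mapper;

public class CustomRunMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            while (context.nextKeyValue()) {
                // the default implementation processes each record in-line;
                // a multi-threaded variant would hand records to a worker pool here
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}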
Hope this helps someone!
If you are checking whether the jobs have finished, I think you are a bit confused about how MapReduce works. You ought to let Hadoop do that for itself.
