Let me briefly describe what I want and what I - maybe - know.
I want Spring Batch to run an asynchronous job, and more jobs in the future.
The job gets two parameters: an external id and a year.
The job should be able to be restarted after completion because the user wants to run a job with the same parameters again and again.
Only one job should be executed with the same parameters at the same time.
From outside (web interface) it should be possible to query if a job is running by job name and parameters.
The querier could be different from the job starter so an instance or execution id is not present.
I know that a job instance represents the job (name) plus its parameters and, as you commented, I cannot rerun a job with the same parameters if the instance/execution is marked completed, unless I use an incrementer.
But this changes the parameters by adding a run.id. Now the job can be run again, but neither I nor Spring Batch can identify a running job instance (by name and original parameters) anymore, because every job run results in a new instance.
And the question "why would one restart a successfully completed job instance?" is easy to answer: the user outside doesn't know about jobs, instances or executions. The user will start some data processing for a year again and again. And it's my task to make it possible :).
So it would be nice if spring-batch can let the user know "the job with your original parameters is still running".
Question:
What would be a good solution for my needs?
I haven't tried anything yet, but I have thought about it. Maybe I can write my own JobDao for my query? But this will not solve the run-instance-at-same-time problem. Or I can customize the JdbcJobInstanceDao or SimpleJobRepository? Maybe I must add my own job_key which contains only the original parameters?
To correctly understand the answer I am going to give to your question, it is important to know the difference and understand the relation between a job, a job instance and a job execution in Spring Batch. The "The Domain Language of Batch" section of the reference documentation explains that in detail with examples.
The job should be able to be restarted after completion.
This is not possible by design, or more precisely, a job instance cannot be restarted after completion by design (think of it like "why would one restart a successfully completed job instance?").
From outside (web interface) it should be possible to query if an instance is running by job name and parameters. The querier could be different from the job starter so an instance or execution id is not present.
The JobExplorer is the API you are looking for. You can ask for job instances and job executions as needed.
Question: What would be a good solution for my needs?
In your case, you receive an external ID and a year as a job execution request. Those two parameters can be used as identifying parameters to define job instances. With this in place, if a job instance is failed, you can restart it by using the same parameters.
I see no need for an incrementer in your case. The incrementer is useful for jobs for which the instances can be defined as a "sequence" that can be "incremented". I see no need to create a custom DAO or JobRepository either; you should be able to implement your requirement with the built-in components by correctly defining what a job instance is.
For my use case I have to check whether an execution for a job/parameters combination is running. The parameters here exclude the run.id added by an incrementer. This check must be done before a job run and on an explicit REST call. Normally Spring Batch checks for running executions itself, but because of the incrementer every job instance is unique and it will never find any.
So I created a bean with a check method that uses jobExplorer.findRunningJobExecutions(jobName);. The result can then be compared with the requested parameters by iterating over JobExecution.getJobParameters().getParameters().
The bean can be used in the REST method and in a custom implementation of JobLauncher.run().
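The comparison step of that check can be sketched in plain Java. This is only an illustration under assumptions: each running execution is modelled as a Map<String, String> of its parameters, whereas in a real application those maps would come from jobExplorer.findRunningJobExecutions(jobName) and JobExecution.getJobParameters().getParameters(), whose values are parameter objects rather than strings. The class and method names (RunningJobCheck, withoutRunId, isRunning) are made up for this sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: compares job parameters while ignoring the incrementer's run.id. */
final class RunningJobCheck {

    /** Name of the parameter added by the incrementer, as mentioned in the post. */
    static final String RUN_ID = "run.id";

    /** Returns a copy of the given parameters without run.id. */
    static Map<String, String> withoutRunId(Map<String, String> parameters) {
        Map<String, String> copy = new HashMap<>(parameters);
        copy.remove(RUN_ID);
        return copy;
    }

    /**
     * Returns true if any running execution has the same parameters as the
     * request once run.id is ignored on both sides.
     * Each list entry stands in for the parameters of one running execution.
     */
    static boolean isRunning(List<Map<String, String>> runningExecutions,
                             Map<String, String> requested) {
        Map<String, String> wanted = withoutRunId(requested);
        return runningExecutions.stream()
                .map(RunningJobCheck::withoutRunId)
                .anyMatch(wanted::equals);
    }
}
```

The same method could back both the REST query and the pre-launch guard, so that the two never disagree about what "same parameters" means.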
Another solution would be to store the increment separately for each job/parameters combination. But I don't want to do this, not least because I think a framework like Spring Batch should do this for me or at least support me in reusing/restarting a completed job instance.
In my Spring Boot application, a cron job (which runs every 5 minutes) needs to process 2000 products in my database.
Right now, processing these 2000 products takes more than 5 minutes. I ran into the issue where the second cron job starts while the first one has not completed yet.
Is there out-of-the-box functionality in Spring/cron that makes it possible to synchronize these jobs and wait for the previous job to complete before starting the next one?
Please advise how to properly implement this kind of system. The following technologies are also available: Neo4j, MongoDB, Kafka. Please advise how to properly design/implement this functionality using Spring/cron on its own or together with the mentioned technologies.
1) You may try to use @Scheduled(fixedDelay = 5*60*1000). It guarantees that the next invocation happens 5 minutes after the previous one has finished. But this may break your scheduling requirements.
2) You can limit the underlying executor's pool size to 1 thread, so the next invocation has to wait until the previous one is finished. But this, again, can break the logic, since it affects all periodic tasks invoked by @Scheduled.
3) You can use Quartz instead of Spring's native @Scheduled. It's more complicated to configure, but it allows you to achieve the desired behaviour via the @DisallowConcurrentExecution annotation or via setting JobDetail::isConcurrentExectionDisallowed in your job details.
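The fixed-delay idea from option 1, combined with the single thread from option 2, can be illustrated with the JDK's own scheduler. This is a sketch in plain Java, not Spring code: ScheduledExecutorService.scheduleWithFixedDelay starts the delay only after the previous run has finished, so runs never overlap even when the task takes longer than the delay. The class name FixedDelayDemo, the method maxOverlap and the timings are made up for the illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo: a slow task scheduled with a fixed delay never overlaps itself. */
final class FixedDelayDemo {

    /**
     * Waits for at least `runs` executions of a task that sleeps for taskMillis,
     * scheduled with delayMillis between the end of one run and the start of
     * the next, and returns the highest number of simultaneously active runs seen.
     */
    static int maxOverlap(int runs, long taskMillis, long delayMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);

        scheduler.scheduleWithFixedDelay(() -> {
            // Record how many runs are active right now.
            maxActive.accumulateAndGet(active.incrementAndGet(), Math::max);
            try {
                Thread.sleep(taskMillis); // stands in for processing the products
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                active.decrementAndGet();
                done.countDown();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);

        try {
            done.await(); // block until the requested number of runs has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return maxActive.get();
    }
}
```

The trade-off is the one named in option 1: start times drift, because each run begins a fixed delay after the previous one ends rather than on a fixed clock schedule.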
Is there any way to get the name and status of all steps in a Job from a JobExecution instance? Something similar to JobExecution#getStepExecutions(), but that method only returns the completed steps when I call it.
I need to know whether a certain step is going to be part of a job and whether it has completed. I need to know this in, for example, JobExecutionListener#beforeJob.
Steps aren't registered until you are actually about to start them. Otherwise flow control (e.g. going to Step B vs C based on the exit code of Step A) wouldn't work.
So yes, you can get all steps that have been registered, but they won't all be registered at job startup.
We have the following requirement.
In Spring XD, we have a job; let's assume the job name is MyJob.
It is invoked by another process through the REST service of Spring XD; let's call that process OutsideProcess (a non-Spring XD process).
OutsideProcess invokes MyJob whenever a file is added to a location (let's call it FILES_LOC) that OutsideProcess is listening to.
In this scenario, let's assume that MyJob takes 5 minutes to complete.
At 10:00 AM, a file is copied to FILES_LOC, and OutsideProcess triggers MyJob immediately (it will complete at approximately 10:05 AM).
At 10:01 AM, another file is copied to FILES_LOC, and OutsideProcess triggers one more instance of MyJob at 10:01 AM. But the second instance is queued and only starts executing once the first instance has completed (at approximately 10:05 AM).
If we invoke different jobs at the same time, they are executed concurrently, but multiple instances of the same job are not executed concurrently.
Please let me know how I can execute multiple instances of the same job concurrently.
Thanks in advance.
The only thing I can think of is dynamic deployment of the job and triggering it right away. You can use SpringXD Rest template to create the job definition on the fly and launch them after sleeping a few seconds. And make sure you undeploy/destroy the job when the job completes successfully.
Another solution could be to create a few module instances of your job with different names and use them as your slave processes. You can query status of these job module instances and launch the one that is finished or queue the one that is least recently launched.
Remember you can run jobs with partition support if applicable. This way you will finish your job faster and be able to run more jobs.
A certain number of jobs need to be executed in sequence, such that the result of one job is the input to the next. There's also a loop in one part of the job chain. Currently, I'm running this sequence using wait for completion, but I'm going to start this sequence from a web service, so I don't want to get stuck waiting for the response. I want to start the sequence and return.
How can I do that, considering that the jobs depend on each other?
The typical approach I follow is to use an Oozie workflow to chain the sequence of jobs, passing the dependent inputs to them accordingly.
I used a shell script to invoke the Oozie job.
I am not sure about loops within an Oozie workflow, but the link below describes a way to implement loops within the workflow. Hope it helps.
http://zapone.org/bernadette/2015/01/05/how-to-loop-in-oozie-using-sub-workflow/
Apart from this, the JobControl class is also a good option if the jobs need to run in sequence, and it requires less effort to implement. A loop would be easy to do, since it would be written entirely in Java code.
http://gandhigeet.blogspot.com/2012/12/hadoop-mapreduce-chaining.html
https://cloudcelebrity.wordpress.com/2012/03/30/how-to-chain-multiple-mapreduce-jobs-in-hadoop/
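Independently of Oozie or JobControl, the "start the sequence and return" part can be sketched with the JDK alone. The following is an illustration under assumptions, not Hadoop code: each stage is a plain function from input to output, standing in for "configure a job, wait for it, return its output path", and the chain runs on a background executor so the caller gets a future back immediately. The names JobChain and start are made up for this sketch.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.UnaryOperator;

/** Hypothetical helper: runs dependent stages in order on a background executor. */
final class JobChain {

    /**
     * Starts the chain and returns immediately. Each stage receives the result
     * of the previous one; the returned future completes with the final result,
     * or exceptionally if any stage throws.
     */
    static CompletableFuture<String> start(String input,
                                           List<UnaryOperator<String>> stages,
                                           ExecutorService executor) {
        CompletableFuture<String> result = CompletableFuture.completedFuture(input);
        for (UnaryOperator<String> stage : stages) {
            // thenApplyAsync runs the stage on the executor once the previous stage is done.
            result = result.thenApplyAsync(stage, executor);
        }
        return result;
    }
}
```

A web service handler could call start(...), store the returned future under some request id, and answer right away; a loop in the chain can live inside a single stage as ordinary Java code.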
I have a pool of Jobs from which I retrieve jobs and start them. The pattern is something like:
Job job = JobPool.getJob();
job.waitForCompletion(true);
JobPool.release(job);
I run into a problem when I try to reuse a job object: it doesn't even run (most probably because its status is COMPLETED). So, in the following snippet the second waitForCompletion call prints the statistics/counters of the job and doesn't do anything else.
Job jobX = JobPool.getJob();
jobX.waitForCompletion(true);
JobPool.release(jobX);
//.......
jobX = JobPool.getJob(); // the pool hands back the same, already completed object
jobX.waitForCompletion(true); // <--- here the job should run, but it doesn't
Am I right in saying that the job doesn't actually run because Hadoop sees its status as completed and doesn't even try to run it? If yes, do you know how to reset a job object so that I can reuse it?
The Javadoc includes this hint that jobs should only run once:
The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException.
I think there's some confusion between the job and the view of the job. The latter is the thing that you have got, and it is designed to map to at most one job running in Hadoop. The view of the job is fundamentally lightweight, and if creating that object is expensive relative to actually running the job... well, I've got to believe that your jobs are simple enough that you don't need Hadoop.
Using the view to submit a job is potentially expensive (copying jars into the cluster, initializing the job in the JobTracker, and so on); conceptually, the idea of telling the jobtracker to "rerun " or "copy ; run ", makes sense. As far as I can tell, there's no support for either of those ideas in practice. I suspect that hadoop isn't actually guaranteeing retention policies that would support either use case.
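If the view maps to at most one run, one way to keep a pool is to pool the recipe for building a job rather than the job object itself, and create a fresh object per run. The sketch below illustrates that idea with stand-in types only: OneShotJob is a hypothetical class that mimics the "does nothing the second time" behaviour described in the question (it is not Hadoop's Job, and its return value is a simplification), and FreshJobPool is an equally hypothetical pool.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a submit-once handle; NOT Hadoop's Job class. */
final class OneShotJob {
    private boolean submitted;

    /**
     * Does the work on the first call only. Returns true if work was done,
     * false if this handle had already been used (a simplification chosen
     * for the sketch).
     */
    boolean waitForCompletion(boolean verbose) {
        if (submitted) {
            return false; // already completed: nothing runs again
        }
        submitted = true;
        return true;
    }
}

/** Hypothetical pool that stores how to build a job and hands out a new handle per run. */
final class FreshJobPool {
    private final Supplier<OneShotJob> recipe;

    FreshJobPool(Supplier<OneShotJob> recipe) {
        this.recipe = recipe;
    }

    /** Builds a new handle each time instead of recycling a completed one. */
    OneShotJob getJob() {
        return recipe.get();
    }
}
```

With real Hadoop jobs the supplier would rebuild the job from its configuration; what that costs depends on the setup work involved, which this sketch does not model.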