Are StepExecutionContext and writer instance variables thread-safe in Spring Batch?

I am deriving a value in a step listener and sharing it with the ItemWriter through the StepExecutionContext. If there are multiple instances of that job running, is it thread-safe to keep those job-specific parameters in Spring Batch's StepExecutionContext?
Another question: I have a counter as an instance variable in the ItemWriter and increment it in write(). Will this counter be thread-safe?

I believe this is about more than thread-safety.
If you want to have a single instance of a reader/writer/listener, you have to make sure it is stateless.
If you need to keep state in those artifacts, you have to make sure a new instance is created for every job execution; one way is to declare them as step-scoped.
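As an illustration, here is a minimal sketch of a step-scoped writer (Spring Batch 4-style ItemWriter signature; the class name and context key are hypothetical) that keeps its counter per step execution and pulls a value from the step execution context:

```java
import java.util.List;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.ItemWriter;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
@StepScope // a fresh instance per step execution, so the state below is not shared across runs
public class CountingWriter implements ItemWriter<String> {

    private final String derivedValue;
    private int counter; // per-execution state, safe only because the bean is step-scoped

    public CountingWriter(@Value("#{stepExecutionContext['derivedValue']}") String derivedValue) {
        this.derivedValue = derivedValue; // late-bound from the value the listener stored
    }

    @Override
    public void write(List<? extends String> items) {
        counter += items.size();
        // ... write the items, using derivedValue as needed ...
    }
}
```

Note that if the step itself runs with multiple threads, even a step-scoped counter would need synchronization.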

Related

Identify a spring-batch job instance with incrementer

Let me describe briefly what I want and what I (maybe) know.
I want Spring Batch to run an async job; in the future, more jobs.
The job gets two parameters: an external id and a year.
The job should be able to be restarted after completion because the user wants to run a job with the same parameters again and again.
Only one job should be executed with the same parameters at the same time.
From outside (web interface) it should be possible to query if a job is running by job name and parameters.
The querier could be different from the job starter, so an instance or execution id is not present.
I know that a job instance is the representation of the job (name) and the parameters, and, as you commented, I cannot rerun a job with the same parameters if the instance/execution is marked completed, unless I use an incrementer.
But this changes the parameters by adding a run.id. Now a job is restartable, but neither I nor Spring Batch itself can identify a running job instance (by name and original parameters) anymore, because every job run results in a new instance.
And the question "why would one restart a successfully completed job instance?" is easy to answer: the user outside doesn't know about jobs/instances/executions. The user will start some data processing for a year again and again, and it's my task to make it possible :).
So it would be nice if spring-batch can let the user know "the job with your original parameters is still running".
Question:
What would be a good solution for my needs?
I haven't tried anything yet, but I have thought about it. Maybe I can write my own JobDao for my query? But this will not solve the run-instance-at-the-same-time problem. Or I could customize the JdbcJobInstanceDao or SimpleJobRepository? Maybe I have to add my own job_key which contains only the original parameters?
To correctly understand the answer I am going to give to your question, it is important to know the difference and understand the relation between a job, a job instance and a job execution in Spring Batch. The "The Domain Language of Batch" section of the reference documentation explains that in detail with examples.
The job should be able to be restarted after completion.
This is not possible by design, or more precisely, a job instance cannot be restarted after completion by design (think of it like "why would one restart a successfully completed job instance?").
From outside (web interface) it should be possible to query whether an instance is running by job name and parameters. The querier could be different from the job starter, so an instance or execution id is not present.
The JobExplorer is the API you are looking for. You can ask for job instances and job executions as needed.
Question: What would be a good solution for my needs?
In your case, you receive an external ID and a year as a job execution request. Those two parameters can be used as identifying parameters to define job instances. With this in place, if a job instance has failed, you can restart it by using the same parameters.
I see no need for an incrementer in your case. The incrementer is useful for jobs whose instances can be defined as a "sequence" that can be "incremented". I see no need to create a custom DAO or JobRepository either; you should be able to implement your requirement with the built-in components by correctly defining what a job instance is.
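For instance, a minimal sketch of launching with those two identifying parameters (the method and variable names are illustrative):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

// externalId and year are the identifying parameters, so launching again with
// the same values targets the same job instance (restartable if it failed).
public void launch(JobLauncher jobLauncher, Job job, String externalId, long year) throws Exception {
    JobParameters parameters = new JobParametersBuilder()
            .addString("externalId", externalId)
            .addLong("year", year)
            .toJobParameters();
    jobLauncher.run(job, parameters);
}
```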
For my use case I have to check whether an execution for a job/parameters combination is running. The parameters here are without the run.id of an incrementer. This check must be done before a job run and via an explicit REST call. Normally Spring Batch checks for running executions, but because of the incrementer every job instance is unique and it will never find any.
So I created a bean with a check method and made use of jobExplorer.findRunningJobExecutions(jobName). The result can then be compared with the supplied parameters by iterating over JobExecution.getJobParameters().getParameters().
The bean can be used in the REST method and in a custom implementation of JobLauncher.run().
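A minimal sketch of such a check bean (Spring Batch 4-style API; the class and method names are illustrative):

```java
import java.util.Map;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameter;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.stereotype.Component;

@Component
public class RunningJobChecker {

    private final JobExplorer jobExplorer;

    public RunningJobChecker(JobExplorer jobExplorer) {
        this.jobExplorer = jobExplorer;
    }

    // True if any running execution of the job carries all of the given
    // identifying parameters; extra ones such as run.id are simply ignored.
    public boolean isRunning(String jobName, Map<String, JobParameter> identifying) {
        for (JobExecution execution : jobExplorer.findRunningJobExecutions(jobName)) {
            Map<String, JobParameter> actual = execution.getJobParameters().getParameters();
            if (actual.entrySet().containsAll(identifying.entrySet())) {
                return true;
            }
        }
        return false;
    }
}
```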
Another solution would be to store the increment separately for each job/parameters combination. But I don't want to do this, not least because I think a framework like Spring Batch should do this for me, or support me by reusing/restarting a completed job instance.

Is Spring Batch's default behavior to process the next item only after the first item has finished?

After reading this article about the possibilities of scaling and parallel processing in Spring Batch, we were wondering: what is the out-of-the-box behavior of Spring Batch?
Let's say our job has a reader, 5 processing steps, and a writer.
Will Spring Batch read one item, pass it through all 5 steps, write it, and only then move on to the next item? Something like a giant for loop?
Or is there some parallelism, so that while item A moves on to step 2, item B is read and handed to step 1?
I think you are misunderstanding how Spring Batch works. Let me start with that, then go into parallelism.
A chunk-based step in Spring Batch consists of an ItemReader, an optional ItemProcessor, then an ItemWriter. Each of these obviously supports composition (Spring Batch provides some components for using composition in both the ItemProcessor and ItemWriter phases). Within that step, Spring Batch reads items until a given condition is met (typically the chunk size). Then that list is iterated over, passing each item to the ItemProcessor. Finally, a list of all of the results from the ItemProcessor calls is passed in a single call to the ItemWriter. The concept of reading once, then doing multiple steps, then writing really isn't how Spring Batch works. The closest we get would be a single ItemReader, then using composition to create a chain of ItemProcessor calls, then a single call to an ItemWriter.
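As a hedged illustration of that composition, a processor chain behind a single chunk-oriented step could be wired like this (the item types and delegate processors are hypothetical):

```java
import java.util.Arrays;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.context.annotation.Bean;

// Each item read by the single ItemReader flows through both delegates in
// order before being collected into the chunk handed to the ItemWriter.
@Bean
public CompositeItemProcessor<Order, EnrichedOrder> processor(
        ItemProcessor<Order, Order> validateProcessor,
        ItemProcessor<Order, EnrichedOrder> enrichProcessor) {
    CompositeItemProcessor<Order, EnrichedOrder> composite = new CompositeItemProcessor<>();
    composite.setDelegates(Arrays.asList(validateProcessor, enrichProcessor));
    return composite;
}
```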
With that being said, Spring Batch provides a number of parallelism options. There are five different options for scaling Spring Batch jobs. I won't go into details about each because that's beyond the scope of this and clearly discussed in other StackOverflow questions as well as the documentation. However, the list is as follows:
Multithreaded steps - Here each chunk (block of items processed within a transaction) is executed within a different thread using Spring's TaskExecutor abstraction.
Parallel steps - Here a batch job executes multiple, independent steps in parallel, again using Spring's TaskExecutor abstraction to control the threads used.
AsyncItemProcessor/AsyncItemWriter - Here each call to the ItemProcessor is executed in its own thread. The resulting Future is passed to the AsyncItemWriter, which unwraps the Future, and the results are persisted (see the sketch after this list).
Partitioning - Spring Batch allows you to partition a data set into multiple partitions that are then executed in parallel either via local threading mechanisms or remotely.
Remote chunking - The last option is to have a master reading the data, then sending it to a pool of workers for processing and writing.
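As a hedged sketch of the AsyncItemProcessor/AsyncItemWriter option (both come from spring-batch-integration; the Input/Output types and delegate beans are hypothetical):

```java
import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

// Wraps a delegate processor so each process() call runs on its own thread.
public AsyncItemProcessor<Input, Output> asyncProcessor(ItemProcessor<Input, Output> delegate) {
    AsyncItemProcessor<Input, Output> async = new AsyncItemProcessor<>();
    async.setDelegate(delegate);
    async.setTaskExecutor(new SimpleAsyncTaskExecutor());
    return async;
}

// Unwraps the Futures produced above before handing the results to the delegate writer.
public AsyncItemWriter<Output> asyncWriter(ItemWriter<Output> delegate) {
    AsyncItemWriter<Output> async = new AsyncItemWriter<>();
    async.setDelegate(delegate);
    return async;
}
```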

Spring integration: testing poller dependent logic

I wonder how I could write Spring tests to assert a logic chain which is triggered by a SourcePollingChannelAdapter.
What comes to my mind:
use Thread.sleep(), which is a really bad idea for tests
have another, test-only version of the Spring context where I replace all pollable channels with direct ones; this requires a lot of work
Are there any common ways to force trigger poller within test?
Typically we use QueueChannel in our tests and wait for the messages via its receive(10000) method. This way, independently of the source of data, our test method thread is blocked until data has arrived.
The SourcePollingChannelAdapter is triggered by the TaskScheduler, so the whole flow logic is executed on a separate thread from the test method. This is why your idea about replacing channels won't help. Thread.sleep() might have some value, but QueueChannel.receive(10000) is much more reliable because we wait at most those 10 seconds.
Another way to block test-case comes from the standard CountDownLatch, which you would countDown() somewhere in the flow and wait for it in the test method.
There is one other way to test: run a loop with a short sleep between iterations and check some condition to exit and verify. That may be useful in the case of a poller with a database at the end: we would perform a SELECT in that loop until the desired state is reached.
You can find some additional info in the Reference Manual.
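A minimal sketch of the QueueChannel approach (JUnit 4; the channel name is illustrative and must be declared as a QueueChannel at the end of the flow in the test configuration):

```java
import static org.junit.Assert.assertNotNull;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.integration.channel.QueueChannel;
import org.springframework.messaging.Message;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@ContextConfiguration // points at the flow's configuration class(es)
public class PollerFlowTest {

    @Autowired
    private QueueChannel outputChannel; // terminal channel of the poller-driven flow

    @Test
    public void pollerDeliversMessage() {
        // Blocks up to 10 seconds and returns as soon as the flow produces a message.
        Message<?> received = outputChannel.receive(10000);
        assertNotNull(received);
    }
}
```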

Throttle calls to ItemProcessor in Spring batch

I've created a Spring Batch job which reads from a flat file and processes the data using an ItemProcessor before writing it to the DB using an ItemWriter; everything so far works fine.
The problem: now I need to control the number of times the process() method is called. My ItemProcessor calls some API with the details, and the API takes some time to respond (not sure about the timeout), hence I should not overload the API with new messages. I need to control the calls to the API, e.g. if X calls in Y seconds are reached, I need to wait for Z seconds before resuming.
I am not sure how to achieve this in Spring Batch. I am looking at implementing a ChunkListener in the processor to track the calls, but I am looking for a better approach.
You do not need a listener to do this.
If you do not define an asynchronous TaskExecutor in your step, then the whole processing is completely sequential.
It will read an item, process it, read the next item, process it, until it has read and processed as many items as you defined in your commit size (the size of your chunks). After that, it puts those items into a list and forwards this list to the writer. This process is executed until all elements have been read, processed, and finally written.
If you would like to process your chunks in parallel, then you can define an asynchronous TaskExecutor.
If you define an AsyncTaskExecutor in your step, you are able to configure the number of threads this TaskExecutor manages/creates. Moreover, you can also define the throttle limit of your step, which defines the number of chunks that can be processed in parallel.
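A minimal sketch of such a step configuration (Spring Batch 4-style builders; the item types and the reader/processor/writer beans are hypothetical):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Bean
public Step apiStep(StepBuilderFactory stepBuilderFactory) {
    return stepBuilderFactory.get("apiStep")
            .<Input, Output>chunk(100)               // commit size: items per chunk
            .reader(reader())
            .processor(processor())                  // the processor that calls the API
            .writer(writer())
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .throttleLimit(4)                        // at most 4 chunks in flight at once
            .build();
}
```

Note that this parallelizes the work but does not rate-limit the API calls themselves; a fixed throttle limit only caps concurrency, not calls per second.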

How to set information on a thread context using ASM?

Each time a new thread is created / picked out of a thread pool, I want to be able to set some context information on it.
I'd appreciate guidelines on the best approach to do this using ASM.
Thanks,
Nadav
This requires you to instrument the methods that start a thread or that are traversed when a thread pool triggers a new task. That makes this straightforward:
You can instrument the Thread::start method for setting such a value for a Thread.
You can instrument the ThreadPoolExecutor::beforeExecute method to set a Thread state for a ThreadPoolExecutor. It is not possible to do this for generic Executors as they do not necessarily need to be backed by a Thread.
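To make the effect concrete, here is the same hook illustrated in plain Java rather than via ASM instrumentation (the class and context names are hypothetical):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// beforeExecute runs on the worker thread just before each task, which is the
// spot the ASM-instrumented code would target to set per-thread context.
public class ContextSettingExecutor extends ThreadPoolExecutor {

    public static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    public ContextSettingExecutor(int poolSize) {
        super(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    protected void beforeExecute(Thread t, Runnable r) {
        CONTEXT.set("context for " + t.getName()); // the instrumented logic would go here
        super.beforeExecute(t, r);
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        CONTEXT.remove(); // clean up so pooled threads do not leak context between tasks
        super.afterExecute(r, t);
    }
}
```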
