How to use chunk processing with Spring Batch? - spring

I'm using Spring Batch for the first time. I tried some examples and read through documentation. But I have still questions:
Can I skip one phase in chunk-oriented processing? For example: I fetch data from the database, process it and determine that I need more; can I skip the write phase and execute the next step's read phase? Should I use a Tasklet instead?
How to implement a conditional flow?
Thank you very much,
Florian

You can skip items simply by throwing an exception that has been declared as a "skippable exception", like this:
<step id="step1">
<tasklet>
<chunk reader="reader" writer="writer"
commit-interval="10" skip-limit="10">
<skippable-exception-classes>
<include class="com.myapp.batch.MyException"/>
</skippable-exception-classes>
</chunk>
</tasklet>
</step>
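For illustration, this is what throwing that skippable exception could look like from an item processor (the processor class and item type below are made up and not part of the configuration above; the exception can equally be thrown from the reader or writer):

package com.myapp.batch;

import org.springframework.batch.item.ItemProcessor;

// Illustrative processor: the item type and validation rule are invented; the
// point is that throwing MyException causes the offending item to be skipped
// instead of failing the step (up to skip-limit="10" skips).
public class SkippingProcessor implements ItemProcessor<String, String> {

    @Override
    public String process(String item) throws Exception {
        if (item.trim().isEmpty()) {
            // MyException is the class declared in <skippable-exception-classes>
            // (assumed here to be an Exception subclass defined elsewhere).
            throw new MyException("Cannot process item: " + item);
        }
        return item.trim();
    }
}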
Conditional flow can easily be implemented by deciding on the ExitStatus of a step execution:
<job id="job">
<step id="step1" parent="s1">
<next on="*" to="stepB" />
<next on="FAILED" to="stepC" />
</step>
<step id="stepB" parent="s2" next="stepC" />
<step id="stepC" parent="s3" />
</job>
Read the documentation to gain deeper knowledge on these topics: http://docs.spring.io/spring-batch/reference/html/configureStep.html
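If the routing logic gets more involved than matching exit codes in the XML, Spring Batch also lets you plug in a JobExecutionDecider and reference it from a <decision> element (one of the related answers below does exactly that). A minimal sketch, with the routing condition made up:

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

public class MyDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // Route based on the previous step's outcome (any custom condition works here).
        if (ExitStatus.FAILED.getExitCode().equals(stepExecution.getExitStatus().getExitCode())) {
            return FlowExecutionStatus.FAILED;
        }
        return FlowExecutionStatus.COMPLETED;
    }
}

The status string returned here is what the on="..." attributes of the <decision> element are matched against.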

Related

Defining 2 splits to run a set of steps in parallel

I have a job configuration where I load a set of files in parallel. After that set of files is loaded, I also want to load another set of files in parallel, but only after the first set is completely loaded (the second set has referential fields to the first set). I thought I could use a second split, but I never got it working; in the XSD it seems you can define more than one split, and obviously a flow does not help me with my requirement.
So how do I define two sets of parallel flows which run in sequence to each other?
<job>
<split>
<flow>
<step next="step2"/>
<step id="step2"/>
</flow>
<flow>
<step ...>
</flow>
</split>
<split ../>
Asoub was right: it is simply possible. I did a simple config and it worked, so it seems my original configuration has some other issue that causes problems when defining 2 splits.
Simple config I used:
<batch:job id="batchJob" restartable="true">
<batch:split id="x" next="y">
<batch:flow>
<batch:step id="a">
<batch:tasklet allow-start-if-complete="true">
<batch:chunk reader="itemReader" writer="itemWriter" commit-interval="2"/>
</batch:tasklet>
</batch:step>
</batch:flow>
<batch:flow>
<batch:step id="b">
<batch:tasklet allow-start-if-complete="true">
<batch:chunk reader="itemReader" writer="itemWriter" commit-interval="2"/>
</batch:tasklet>
</batch:step>
</batch:flow>
</batch:split>
<batch:split id="y" next="e">
<batch:flow>
<batch:step id="c">
<batch:tasklet allow-start-if-complete="true">
<batch:chunk reader="itemReader" writer="itemWriter" commit-interval="2"/>
</batch:tasklet>
</batch:step>
</batch:flow>
<batch:flow>
<batch:step id="d">
<batch:tasklet allow-start-if-complete="true">
<batch:chunk reader="itemReader" writer="itemWriter" commit-interval="2"/>
</batch:tasklet>
</batch:step>
</batch:flow>
</batch:split>
<batch:step id="e">
<batch:tasklet allow-start-if-complete="true">
<batch:chunk reader="itemReader" writer="itemWriter" commit-interval="2"/>
</batch:tasklet>
</batch:step>
</batch:job>
INFO: Job: [FlowJob: [name=batchJob]] launched with the following parameters: [{random=994444}]
Nov 23, 2016 11:33:24 PM org.springframework.batch.core.job.SimpleStepHandler handleStep
INFO: Executing step: [a]
Nov 23, 2016 11:33:24 PM org.springframework.batch.core.job.SimpleStepHandler handleStep
INFO: Executing step: [b]
Nov 23, 2016 11:33:24 PM org.springframework.batch.core.job.SimpleStepHandler handleStep
INFO: Executing step: [c]
Nov 23, 2016 11:33:24 PM org.springframework.batch.core.job.SimpleStepHandler handleStep
INFO: Executing step: [d]
Nov 23, 2016 11:33:24 PM org.springframework.batch.core.job.SimpleStepHandler handleStep
INFO: Executing step: [e]
Nov 23, 2016 11:33:25 PM org.springframework.batch.core.launch.support.SimpleJobLauncher run
INFO: Job: [FlowJob: [name=batchJob]] completed with the following parameters: [{random=994444}] and the following status: [COMPLETED]
As I said in the comments, "So how do I define two sets of parallel flows which run in sequence to each other?" doesn't make sense per se: you can't start two steps both in parallel and sequentially.
Still, I think you want to "start loading file2 in step2 when file1 in step1 has finished loading", which means that loading a file occurs in the middle of a step. I see two ways of solving this.
Let's say this is your configuration:
<job id="job1">
<split id="split1" task-executor="taskExecutor" next="step3">
<flow>
<step id="step1" parent="s1"/>
</flow>
<flow>
<step id="step2" parent="s2"/>
</flow>
</split>
<step id="step3" parent="s4"/> <!-- not important here -->
</job>
<beans:bean id="taskExecutor" class="org.spr...SimpleAsyncTaskExecutor"/>
But this will start both of your steps in parallel immediately. You need to prevent the start of step2. So you need to use a delegate in step2's reader that will initially hold off loading file2 and wait for a signal to start reading. And somewhere in the code of step1, where you consider loading to be done, you send a signal to step2's delegate reader to start loading file2.
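A minimal sketch of that delegate reader, using a CountDownLatch as the signal (the latch, how it is shared with step1, and the delegate are assumptions; any signalling mechanism would do):

import java.util.concurrent.CountDownLatch;

import org.springframework.batch.item.ItemReader;

// Sketch of the "delegate reader" idea: step2's reader delegates to the real
// file2 reader but blocks until step1 signals that file1 is fully loaded.
public class GatedItemReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;
    private final CountDownLatch fileOneLoaded; // shared with step1's code

    public GatedItemReader(ItemReader<T> delegate, CountDownLatch fileOneLoaded) {
        this.delegate = delegate;
        this.fileOneLoaded = fileOneLoaded;
    }

    @Override
    public T read() throws Exception {
        fileOneLoaded.await(); // step1 calls countDown() once file1 is loaded
        return delegate.read();
    }
}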
The second solution: you create your own task executor which starts step1 and waits for the signal from step1 before starting step2. It's basically the first solution, but you wait for the signal in your custom executor rather than in a delegate reader (you can copy the source code of SimpleAsyncTaskExecutor to get an idea).
This comes at a cost: if step1 never reaches the part where it signals step2 to start loading, your batch will hang forever (an exception during loading could cause this). As for signal mechanisms, Java has a lot of ways to do this (wait() and notify(), locks, semaphores, CountDownLatch, maybe a non-standard library).
I don't think there is some kind of parallel step trigger in Spring Batch (but if there is, someone will post it).
As I already said while commenting on your question, you need 2 splits: the first one loads the set of files A, and the second one the set of files B.
<job id="job1">
<split id="splitForSet_A" task-executor="taskExecutor" next="splitForSet_B">
<flow><step id="step1" parent="s1"/></flow>
<flow><step id="step2" parent="s2"/></flow>
<flow><step id="step3" parent="s3"/></flow>
</split>
<split id="splitForSet_B" task-executor="taskExecutor" next="stepWhatever">
<flow><step id="step4" parent="s4"/></flow>
<flow><step id="step5" parent="s5"/></flow>
<flow><step id="step6" parent="s6"/></flow>
</split>
<step id="stepWhatever" parent="sx"/>
</job>
Steps 1, 2 and 3 will run in parallel (and load file set A); then, once they're all done, the second split (splitForSet_B) will start and run steps 4, 5 and 6 in parallel. A split is basically a step that contains steps running in parallel.
You just need to specify in each step which file it will be using (so the files used by steps in the first split differ from those used by steps in the second split).
I'd use two partitioned steps. Each partitioner would be responsible for identifying the files in its respective set for the concurrent child steps to process:
<job>
<step name="loadFirstSet">
<partition partitioner="firstSetPartitioner">
<handler task-executor="asyncTaskExecutor" />
<step name="loadFileFromSetOne>
<tasklet>
<chunk reader="someReader" writer="someWriter" commit-interval="#{jobParameters['commit.interval']}" />
</tasklet>
</step>
</partition>
</step>
<step name="loadSecondSet">
<partition partitioner="secondSetPartitioner">
<handler task-executor="asyncTaskExecutor" />
<step name="loadFileFromSecondSet>
<tasklet>
<chunk reader="someOtherReader" writer="someOtherWriter" commit-interval="#{jobParameters['another.commit.interval']}" />
</tasklet>
</step>
</partition>
</step>
</job>
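For reference, a partitioner such as firstSetPartitioner could look roughly like this (the file names and the context key are made up; Spring Batch's MultiResourcePartitioner covers the same one-file-per-partition case out of the box):

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class FirstSetPartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        // Hypothetical file list; in practice you would scan the first set's directory.
        String[] files = { "file:/data/setA/a1.csv", "file:/data/setA/a2.csv" };
        int i = 0;
        for (String file : files) {
            ExecutionContext context = new ExecutionContext();
            // The child step's reader picks this up via #{stepExecutionContext['inputFile']}.
            context.putString("inputFile", file);
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}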

In spring batch is there a way to force the execution to a step if the skip limit has been surpassed

I have a simple batch process with a skip limit set. When the skip limit is surpassed, the job fails and never gets to step 2. I would like the process to go to step 3 if the skip limit has been exceeded.
<job id="jobA" incrementer="runIdIncrementer" >
<step id="step1" next="step2">
<tasklet>
<chunk commit-interval="10" reader="dReader" writer="dWriter" skip-limit="100">
<skippable-exception-classes>
<include class="java.lang.Exception"/>
</skippable-exception-classes>
</chunk>
<listeners>
<listener ref="skipListener"/>
</listeners>
</tasklet>
</step>
<step id="step2" next="step3">
<tasklet>
<chunk commit-interval="10" reader="sReader" writer="sWriter"/>
</tasklet>
</step>
<step id="step3">
<tasklet ref="cleanUpStep"/>
</step>
</job>
Is there a way to do this? I have tried setting "next", but an error is thrown stating that a step can't have both a next attribute and a transition element.
Any help would be great.
You could add a StepExecutionListener to your step. The afterStep(StepExecution stepExecution) method will be executed even if the step failed. In this method, you can get the exit status of the step and change it: stepExecution.setExitStatus(ExitStatus.COMPLETED).
You might want to check whether the error comes from the skip limit being exceeded, for example via stepExecution.getFailureExceptions(), searching for SkipLimitExceededException (or something like that). You could also get the number of skipped items and compare it with your maximum (however, if it's an error on the 100th skip, maybe you should do something else...).
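A rough sketch of such a listener (the class name and the exact check are illustrative, not taken from the question's configuration):

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.core.step.skip.SkipLimitExceededException;

public class ForceCompletionListener implements StepExecutionListener {

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // nothing to do before the step
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        boolean skipLimitExceeded = stepExecution.getFailureExceptions().stream()
                .anyMatch(t -> t instanceof SkipLimitExceededException);
        if (skipLimitExceeded) {
            // Overwrite the exit status so the flow can still transition onwards.
            stepExecution.setExitStatus(ExitStatus.COMPLETED);
            return ExitStatus.COMPLETED;
        }
        return stepExecution.getExitStatus();
    }
}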
Note: skipping a step after having too many exceptions doesn't sound like good design, but as long as you're aware of what you're doing...
I have managed to fix it using "next". I was including the <next> transition element while also having next="step2" set in the step declaration. The fix was to remove the attribute from the step declaration; then you can add the transition element.
<step id="step1">
<tasklet>
<chunk commit-interval="10" reader="dReader" writer="dWriter" skip-limit="100">
<skippable-exception-classes>
<include class="java.lang.Exception"/>
</skippable-exception-classes>
</chunk>
<listeners>
<listener ref="skipListener"/>
</listeners>
</tasklet>
<next on="*" to="step2" />
</step>
The above code will continue to step2 even if the skip limit has been reached. <step id="step1" next="step2"> would cause the job to fail if the limit was reached.

how to read data from multiple tables in db using spring batch

I tried reading data from one table and writing to another table using Spring Batch, but now my requirement is to read data from multiple tables and write to a file. We could achieve this by defining multiple jobs, but I want to do it using a single job, meaning a single reader, single writer and single processor.
Please provide me some references for this scenario.
This is not possible with the classes provided by Spring Batch out of the box, but you can work your way around it.
Just before the chunk processing, add one step with a custom tasklet that assigns a different SQL statement and a different output file each time, and make the steps run in a loop as long as there are SQL statements left to execute.
It might sound difficult, but I have worked on the same situation. Here is some idea of how you can do it:
<flow id="databaseReadWriteJob">
<step id="step1_setReaderWriter">
<tasklet ref="setReaderWriter" />
<next on="FAILED" to="" />
<next on="*" to="dataExtractionStep" />
</step>
<step id="dataExtractionStep">
<tasklet>
<chunk reader="dbReader" writer="flatFileWriter" commit-interval="${commit-interval}" />
</tasklet>
<next on="FAILED" to="" />
<next on="*" to="step3_removeProcessedSql" />
</step>
<step id="step3_removeProcessedSql">
<tasklet ref="removeProcessedSql" />
<next on="NOOP" to="step1_setReaderWriter" />
<next on="*" to="step4_validateNumOfSuccessfulSteps" />
</step>
</flow>
and here is the bean for setReaderWriter
<beans:bean id="setReaderWriter" class="SetReaderWriter">
<beans:property name="reader" ref="dbReader" />
<beans:property name="flatFileWriter" ref="flatFileWriter" />
<beans:property name="fileSqlMap" ref="jobSqlFileMap" />
<beans:property name="fileNameBuilder" ref="localFileNameBuilder" />
<beans:property name="sourceFolder" value="${dataDir}" />
<beans:property name="dateDiff" value="${dateDiff}" />
plus anything else you need to set dynamically on the reader or writer. The fileSqlMap above is a map with the SQL statement as key and the output file as value.
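To give an idea, a stripped-down version of that SetReaderWriter tasklet could look like this (property names roughly follow the bean above; the real class would also handle the file name builder, date diff, setters for injection, and so on):

import java.util.Iterator;
import java.util.Map;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.core.io.FileSystemResource;

// Before each extraction step, take the next SQL/output-file pair from the map
// and configure the shared reader and writer with it. The removeProcessedSql
// step would then drop the processed entry so the loop advances.
public class SetReaderWriter implements Tasklet {

    private JdbcCursorItemReader<?> reader;
    private FlatFileItemWriter<?> flatFileWriter;
    private Map<String, String> fileSqlMap; // SQL statement -> output file name
    private String sourceFolder;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        if (fileSqlMap.isEmpty()) {
            return RepeatStatus.FINISHED; // nothing left to extract
        }
        Iterator<Map.Entry<String, String>> it = fileSqlMap.entrySet().iterator();
        Map.Entry<String, String> next = it.next();
        reader.setSql(next.getKey());
        flatFileWriter.setResource(new FileSystemResource(sourceFolder + "/" + next.getValue()));
        return RepeatStatus.FINISHED;
    }

    // setters for Spring property injection omitted for brevity
}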
I hope it helps.

Spring Batch saturating memory

UPDATE:
I am adding some details because it's very important for me to solve this problem.
I made a batch job which generates PDF documents from data present in some tables and saves the PDFs in a table. The batch works, but the data to process is huge, so I decided to divide the input data into 8 groups and process the 8 groups independently with 8 parallel steps.
Each step has its own reader (named "readerX" for step "X") and shares the same processor and writer with the other steps.
Processing goes well, but my client says that this batch uses too much memory (he looks at the "Working Set" counter in perfmon). In particular, the batch begins with 300 MB of used memory, then the used memory reaches 7 GB, then decreases to 2 GB, and the batch finishes with 1/2 GB of allocated memory.
I paste the code of the job here, hoping someone can help me find the problem (I guess I made some mistake in adapting the job to parallel processing).
I'm new to Spring Batch, so I apologize for the "bad look".
<job id="myJob"
xmlns="http://www.springframework.org/schema/batch">
<step id="step1" next="step2">
<tasklet ref="task1" />
</step>
<step id="step2" next="step3">
<tasklet ref="task2" />
</step>
<step id="step3" next="decider">
<tasklet ref="task3" />
</step>
<decision id="decider" decider="StepExecutionDecider">
<next on="CASE X" to="split1" />
<end on="*"/>
</decision>
<split id="split1" task-executor="taskExecutor" next="endStep">
<flow>
<step id="EXEC1">
<tasklet><chunk reader="reader1" processor="processor" writer="writer" commit-interval="100"/>
<listeners>
<listener ref="Listner" />
</listeners>
</tasklet>
</step>
</flow>
<flow>
<step id="EXEC2">
<tasklet><chunk reader="reader2" processor="processor" writer="writer" commit-interval="100"/>
<listeners>
<listener ref="Listner" />
</listeners>
</tasklet>
</step>
</flow>
<flow>
<step id="EXEC3">
<tasklet><chunk reader="reader3" processor="processor" writer="writer" commit-interval="100"/>
<listeners>
<listener ref="Listner" />
</listeners>
</tasklet>
</step>
</flow>
<flow>
<step id="EXEC4">
<tasklet><chunk reader="reader4" processor="processor" writer="writer" commit-interval="100"/>
<listeners>
<listener ref="Listner" />
</listeners>
</tasklet>
</step>
</flow>
<flow>
<step id="EXEC5">
<tasklet><chunk reader="reader5" processor="processor" writer="writer" commit-interval="100"/>
<listeners>
<listener ref="Listner" />
</listeners>
</tasklet>
</step>
</flow>
<flow>
<step id="EXEC6">
<tasklet><chunk reader="reader6" processor="processor" writer="writer" commit-interval="100"/>
<listeners>
<listener ref="Listner" />
</listeners>
</tasklet>
</step>
</flow>
<flow>
<step id="EXEC7">
<tasklet><chunk reader="reader7" processor="processor" writer="writer" commit-interval="100"/>
<listeners>
<listener ref="Listner" />
</listeners>
</tasklet>
</step>
</flow>
<flow>
<step id="EXEC8">
<tasklet><chunk reader="reader8" processor="processor" writer="writer" commit-interval="100"/>
<listeners>
<listener ref="Listner" />
</listeners>
</tasklet>
</step>
</flow>
</split>
<step id="endStep" next="decider">
<tasklet ref="task4" >
<listeners>
<listener ref="Listner" />
</listeners>
</tasklet>
</step>
</job>
<bean id="taskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor"/>
<bean id="reader1" class="class of the reader">
<property name="idReader" value="1"/> // Different for the 8 readers
<property name="subSet" value="10"/> // Different for the 8 readers
<property name="dao" ref="Dao" />
<property name="bean" ref="Bean" />
[...] <!-- other beans -->
</bean>
Thanks
If you're eventually getting an OOM, start by looking at the heap.
Start the JVM with -XX:+HeapDumpOnOutOfMemoryError to obtain an HPROF dump which you can then look at to see object allocations, sizes, etc. When the JVM exits with an OOM, this file will be generated (it may take some time depending on size).
If you're able to run with a larger memory footprint, such as on your client's machine, take a snapshot of the heap when it's consuming a large amount, such as the 7 GB you mentioned (or any other value considered high - 4, 5, 6 GB, etc.). You should be able to trigger this while running via tools such as jconsole that come as part of the JDK.
You can then inspect the HPROF file with JDK-provided tools such as jhat, or a more GUI-based tool such as the Eclipse Memory Analyzer. This should give you a good (and relatively easy) way of finding out what's holding on to what, and provide a starting point for decreasing the footprint.
Using a profiler and optimizing the code, I successfully limited memory consumption. Thanks to all!
The batch works, but the data to process is huge, so I decided to divide the input data into 8 groups and process the 8 groups independently with 8 parallel steps.
If you are processing in parallel on the same machine, it won't reduce the memory footprint: all the data exists in memory at the same time. If you want to decrease memory use, you have to execute the steps one after the other.

Asynchronous Spring Batch Job multiple steps flow control

I have a Spring Batch job configured to run asynchronously (started from a web service, using annotations to configure the methods to be asynchronous); my first step runs successfully.
My issue is that I have multiple steps configured, and the flow is determined by the status of the step, i.e. on completion it moves to step 2, but on a failure it moves to a failure-handling step which sends a mail. When I remove the annotations, the flow appears to work as expected. However, when I use the annotations to run the job asynchronously, whichever step is configured to execute on completion gets executed.
flow configuration sample:
<batch:job id="batchJob" restartable="true">
<batch:step id="step1">
<batch:tasklet ref="task1">
<batch:listeners>
<batch:listener ref="failureHandler" />
</batch:listeners>
</batch:tasklet>
<batch:next on="HAS_ERRORS" to="reportError" />
<batch:end on="*" />
<batch:next on="COMPLETED" to="step2" />
</batch:step>
<batch:step id="step2">
<batch:tasklet ref="task2">
<batch:listeners>
<batch:listener ref="failureHandler" />
</batch:listeners>
</batch:tasklet>
<batch:next on="HAS_ERRORS" to="reportError" />
<batch:end on="*" />
</batch:step>
<batch:step id="reportError">
<batch:tasklet ref="failError">
<batch:listeners>
<batch:listener ref="failureHandler" />
</batch:listeners>
</batch:tasklet>
<batch:end on="*" />
</batch:step>
</batch:job>
I have attempted to return an ExitStatus and a BatchStatus, both of which have been ignored.
I have implemented a step execution listener, but I have not yet implemented a messaging mechanism to communicate across steps, and I do not see anything in the step execution context which gives me an indication of the outcome of the step.
The question I have is whether there is a method or mechanism that I may have overlooked to get the status of a step once it has completed, or whether a messaging mechanism outside of the batch process is an accepted way of proceeding.
It feels wrong that I cannot see the status of the batch step once it has completed when it's asynchronous (I get the expected results/failures when I remove the @Async annotation). I think there might be something I'm missing in my understanding; I've spent some time looking into it, so a pointer in the right direction would be appreciated.
I do not have access to this particular code any more.
I believe the issue is caused by the annotations overriding the XML configuration which defined the expected flow.
By overriding this, we change the actual flow from what we expected.
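For what it's worth, a common alternative (not what the original setup used) is to keep the flow entirely in the XML and make only the launch asynchronous, for example by giving the JobLauncher an async task executor. A minimal sketch, assuming a wired JobRepository and the XML-defined job:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.support.SimpleJobLauncher;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

// Runs the job on a background thread while the XML-defined transitions
// (next on="...", end on="...") continue to drive the step flow as usual.
public class AsyncJobRunner {

    public JobExecution launch(JobRepository jobRepository, Job batchJob) throws Exception {
        SimpleJobLauncher launcher = new SimpleJobLauncher();
        launcher.setJobRepository(jobRepository);
        launcher.setTaskExecutor(new SimpleAsyncTaskExecutor());
        launcher.afterPropertiesSet();
        // Returns immediately; the JobExecution can be polled for step statuses.
        return launcher.run(batchJob, new JobParametersBuilder()
                .addLong("run.id", System.currentTimeMillis())
                .toJobParameters());
    }
}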
