Fewer threads running in parallel than expected - Spring Batch Remote Partitioning

I am working on a Spring Batch project where I have a file of 2 million records. I do some processing on each record and then save it to the database. The processing is time-consuming, so I am using Spring Batch remote partitioning.
First I manually split the file into 15 files, and then, using a MultiResourcePartitioner, I assign each file to a single thread. But what I noticed is that at the start only 4 threads run in parallel, and after a while the number of threads running in parallel keeps decreasing.
This is the configuration:
<batch:job id="GhanshyamESCatalogUpdater">
    <batch:step id="GhanshyamCatalogUpdater2">
        <batch:partition step="slave" partitioner="rangePartitioner">
            <batch:handler grid-size="15" task-executor="taskExecutor" />
        </batch:partition>
    </batch:step>
    <batch:listeners>
        <batch:listener ref="jobFailureListener" />
    </batch:listeners>
</batch:job>

<bean id="rangePartitioner" class="org.springframework.batch.core.partition.support.MultiResourcePartitioner" scope="step">
    <property name="resources" value="file:#{jobParameters['job.partitionDir']}/x*" />
</bean>

<step id="slave" xmlns="http://www.springframework.org/schema/batch">
    <tasklet>
        <chunk reader="gsbmyntraXmlReader" writer="gsbmyntraESWriter" commit-interval="1000" />
    </tasklet>
</step>
This is the Task Executor:
<bean id="taskExecutor" class="org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor">
    <property name="corePoolSize" value="100" />
    <property name="allowCoreThreadTimeOut" value="true" />
    <!-- property names must start lower-case to match the setter -->
    <property name="waitForTasksToCompleteOnShutdown" value="true" />
</bean>

Related

Spring Batch: is this a tasklet or chunk?

I'm a little bit confused!
Spring Batch provides two different ways of implementing a step: tasklets and chunks.
So, when I have this:
<tasklet>
    <chunk reader="itemReader" processor="itemProcessor" writer="itemWriter" />
</tasklet>
What kind of implementation is this? Tasklet? Chunk?
That's a chunk-oriented step, because inside the <tasklet> element there is a <chunk> element that defines a reader, a writer, and/or a processor.
Below is an example of a job executing a chunk-oriented step first and then a tasklet step:
<job id="readMultiFileJob" xmlns="http://www.springframework.org/schema/batch">
    <step id="step1" next="deleteDir">
        <tasklet>
            <chunk reader="multiResourceReader" writer="flatFileItemWriter" commit-interval="1" />
        </tasklet>
    </step>
    <step id="deleteDir">
        <tasklet ref="fileDeletingTasklet" />
    </step>
</job>

<bean id="fileDeletingTasklet" class="com.mkyong.tasklet.FileDeletingTasklet">
    <property name="directory" value="file:csv/inputs/" />
</bean>

<bean id="multiResourceReader" class="org.springframework.batch.item.file.MultiResourceItemReader">
    <property name="resources" value="file:csv/inputs/domain-*.csv" />
    <property name="delegate" ref="flatFileItemReader" />
</bean>
Thus you can see that the distinction is made at the level of individual steps, not of the entire job.
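To make the chunk-oriented model concrete, here is a plain-Java sketch of what a chunk step does per transaction; this is an illustration, not Spring Batch's actual implementation. It reads until commit-interval items are buffered, processes each item, then "writes" the whole chunk at once, whereas a tasklet is just a single execute() call.

```java
import java.util.*;
import java.util.function.Function;

public class ChunkLoop {
    // Minimal sketch of chunk-oriented processing: buffer up to
    // commitInterval processed items, then write them as one chunk
    // (in real Spring Batch, one chunk = one transaction commit).
    public static <I, O> List<List<O>> run(Iterator<I> reader,
                                           Function<I, O> processor,
                                           int commitInterval) {
        List<List<O>> writtenChunks = new ArrayList<>();
        List<O> chunk = new ArrayList<>();
        while (reader.hasNext()) {
            chunk.add(processor.apply(reader.next()));
            if (chunk.size() == commitInterval) {
                writtenChunks.add(chunk); // "write" the full chunk
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) writtenChunks.add(chunk); // final partial chunk
        return writtenChunks;
    }
}
```

With five items and commit-interval 2, this produces chunks of sizes 2, 2, and 1, which is exactly why commit-interval controls transaction granularity in a chunk step.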

Reading one file and writing to two different files doesn't work when commit-interval is more than 1

I am reading one file and, based on some business logic, writing to two different files. I am using ClassifierCompositeItemWriter to write to the two files.
It works only when the commit-interval is 1; otherwise some records get written to the wrong output file.
Below is the code snippet,
<batch:job id="interestJob">
    <batch:step id="verifyFile" parent="VerifyFile">
        <batch:fail on="FAILED" />
        <batch:next on="*" to="processInterest" />
    </batch:step>
    <batch:step id="processInterest">
        <batch:tasklet>
            <batch:chunk reader="itemReader" processor="itemProcessor" writer="itemWriter" commit-interval="50" skip-limit="1000000">
                <batch:streams>
                    <batch:stream ref="masterCarditemWriter" />
                    <batch:stream ref="visaitemWriter" />
                </batch:streams>
            </batch:chunk>
        </batch:tasklet>
    </batch:step>
</batch:job>
<bean id="itemWriter" class="org.springframework.batch.item.support.ClassifierCompositeItemWriter">
    <property name="classifier" ref="classifier" />
</bean>

<bean id="classifier" class="org.springframework.batch.classify.BackToBackPatternClassifier">
    <property name="routerDelegate">
        <bean class="com.scotiabank.sco.report.batch.dda.interest.MyClassifier" />
    </property>
    <property name="matcherMap">
        <map>
            <entry key="visa" value-ref="visaitemWriter" />
            <entry key="master" value-ref="masterCarditemWriter" />
        </map>
    </property>
</bean>
public class MyClassifier {

    @Classifier
    public String classify(Interest dda) {
        return dda.getCardType();
    }
}
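For reference, the routing step itself is simple: ClassifierCompositeItemWriter asks the classifier for a key per item and hands each resulting group to the delegate writer registered for that key. Here is a plain-Java sketch of that grouping (an illustration of the idea, not the library's internals; the "visa"/"master" keys are the question's own):

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ChunkRouter {
    // Sketch of what happens to one chunk: classify every item, then each
    // sublist would be passed to the matching delegate writer
    // ("visa" -> visaitemWriter, "master" -> masterCarditemWriter).
    public static <T> Map<String, List<T>> route(List<T> chunk,
                                                 Function<T, String> classifier) {
        return chunk.stream().collect(Collectors.groupingBy(classifier));
    }
}
```

Note that with a commit-interval of 50, each delegate writer receives its sublist of the 50-item chunk in one call, so any misrouting points at the classifier key or at writer state rather than at the grouping itself.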

Spring batch ItemProcessor order of processing the items

Here is my spring configuration file.
<batch:job id="empTxnJob">
    <batch:step id="stepOne">
        <batch:partition partitioner="partitioner" step="worker" handler="partitionHandler" />
    </batch:step>
</batch:job>

<bean id="asyncTaskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor" />

<bean id="partitionHandler" class="org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler" scope="step">
    <property name="taskExecutor" ref="asyncTaskExecutor" />
    <property name="step" ref="worker" />
    <property name="gridSize" value="${batch.gridsize}" />
</bean>

<bean id="partitioner" class="com.spring.mybatch.EmpTxnRangePartitioner">
    <property name="empTxnDAO" ref="empTxnDAO" />
</bean>

<batch:step id="worker">
    <batch:tasklet transaction-manager="transactionManager">
        <batch:chunk reader="databaseReader" processor="itemProcessor" writer="databaseWriter" commit-interval="25" />
    </batch:tasklet>
</batch:step>
<bean name="databaseReader" class="org.springframework.batch.item.database.JdbcCursorItemReader" scope="step">
    <property name="dataSource" ref="dataSource" />
    <property name="sql">
        <value>
            <![CDATA[
            select *
            from emp_txn
            where emp_txn_id >= #{stepExecutionContext['minValue']}
              and emp_txn_id <= #{stepExecutionContext['maxValue']}
            ]]>
        </value>
    </property>
    <property name="rowMapper">
        <bean class="com.spring.mybatch.EmpTxnRowMapper" />
    </property>
    <property name="verifyCursorPosition" value="false" />
</bean>

<bean id="databaseWriter" class="org.springframework.batch.item.database.JdbcBatchItemWriter">
    <property name="dataSource" ref="dataSource" />
    <property name="sql">
        <value><![CDATA[update emp_txn set txn_status=:txnStatus where emp_txn_id=:empTxnId]]></value>
    </property>
    <property name="itemSqlParameterSourceProvider">
        <bean class="org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider" />
    </property>
</bean>

<bean id="itemProcessor" class="org.springframework.batch.item.support.CompositeItemProcessor" scope="step">
    <property name="delegates">
        <list>
            <ref bean="processor1" />
            <ref bean="processor2" />
        </list>
    </property>
</bean>
My custom range partitioner splits the data based on the primary key of the emp_txn records.
Assume that an emp (primary key emp_id) can have multiple emp_txn rows (primary key emp_txn_id) to be processed. With my current setup, it is possible for two threads to process emp_txn rows for the same employee (i.e., the same emp_id) in the ItemProcessor (either processor1 or processor2).
Unfortunately, the back-end logic (in processor2) that processes the emp_txn is not capable of handling transactions for the same emp in parallel. Is there a way in Spring Batch to control the order of such processing?
With the use case you are describing, I think you're partitioning by the wrong thing. I'd partition by emp instead of emp_txn. That would group the emp_txns per employee, and you could order them within each partition. It would also remove the risk of emp_txns being processed out of order based on which thread gets to them first.
To answer your direct question: no. There is no way to order items going through processors in separate threads. Once you break the step up into partitions, each partition works independently.
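A minimal sketch of the range-splitting arithmetic such a partitioner could use if it partitioned on emp_id instead of emp_txn_id; the class and method names here are hypothetical, since the real EmpTxnRangePartitioner is not shown in the question. Each partition gets a contiguous [min, max] slice of ids, so every emp_txn row for one employee lands in the same partition and is processed by a single thread.

```java
public class EmpRangeSplitter {
    // Split [minId, maxId] into gridSize contiguous, non-overlapping ranges.
    // Keying the split on emp_id keeps all transactions of one employee
    // inside one partition (and therefore on one thread).
    public static long[][] split(long minId, long maxId, int gridSize) {
        long targetSize = (maxId - minId) / gridSize + 1;
        long[][] ranges = new long[gridSize][2];
        long start = minId;
        for (int i = 0; i < gridSize; i++) {
            long end = Math.min(start + targetSize - 1, maxId);
            ranges[i] = new long[] { start, end };
            start = end + 1;
        }
        return ranges;
    }
}
```

In a real Partitioner implementation, each range's min and max would go into a partition's ExecutionContext as the `minValue`/`maxValue` keys that the step-scoped reader's SQL already references.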

Spring Batch - Disappearing threads while using a partitioner

I am using Spring Batch to process voluminous data daily, so we are ready to go with the Spring Batch partitioning concept.
Below is my configuration:
<job id="test" xmlns="http://www.springframework.org/schema/batch">
    <step id="masterStep">
        <partition step="step2" partitioner="multiPartioner">
            <handler grid-size="3" task-executor="taskExecutor" />
        </partition>
    </step>
</job>

<bean id="multiPartioner" class="org.springframework.batch.core.partition.support.MultiResourcePartitioner" scope="step">
    <property name="resources" value="file:#{jobParameters[fileDirectory]}/*" />
</bean>

<bean id="taskExecutor" class="org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor">
    <property name="corePoolSize" value="10" />
    <property name="maxPoolSize" value="10" />
</bean>

<step id="step2" xmlns="http://www.springframework.org/schema/batch">
    <tasklet transaction-manager="transactionManager">
        <chunk reader="multiResourceItemReader" writer="testWriter" commit-interval="20000" />
    </tasklet>
</step>
When I specify corePoolSize as 4, it works fine without any issues. But if I increase corePoolSize to 10, it executes for a while, and then after about 20 minutes nothing more is executed: no logs, no errors, no status about what is happening. It just sits idle with no further execution.
Please help me resolve this issue.

Multiple input file Spring Batch

I'm trying to develop a batch job which can process a directory containing files with Spring Batch.
I looked at the MultiResourcePartitioner and tried something like:
<job parent="loggerParent" id="importContractESTD" xmlns="http://www.springframework.org/schema/batch">
    <step id="multiImportContractESTD">
        <batch:partition step="partitionImportContractESTD" partitioner="partitioner">
            <batch:handler grid-size="5" task-executor="taskExecutor" />
        </batch:partition>
    </step>
</job>

<bean id="partitioner" class="org.springframework.batch.core.partition.support.MultiResourcePartitioner">
    <property name="keyName" value="inputfile" />
    <property name="resources" value="file:${import.contract.filePattern}" />
</bean>

<step id="partitionImportContractESTD" xmlns="http://www.springframework.org/schema/batch">
    <batch:job ref="importOneContractESTD" job-parameters-extractor="defaultJobParametersExtractor" />
</step>

<bean id="defaultJobParametersExtractor" class="org.springframework.batch.core.step.job.DefaultJobParametersExtractor" scope="step" />
<!-- Job importOneContractESTD definition -->
<job parent="loggerParent" id="importOneContractESTD" xmlns="http://www.springframework.org/schema/batch">
    <step parent="baseStep" id="initStep" next="calculateMD5">
        <tasklet ref="initTasklet" />
    </step>
    <step id="calculateMD5" next="importContract">
        <tasklet ref="md5Tasklet">
            <batch:listeners>
                <batch:listener ref="md5Tasklet" />
            </batch:listeners>
        </tasklet>
    </step>
    <step id="importContract">
        <tasklet>
            <chunk reader="contractReader" processor="contractProcessor" writer="contractWriter" commit-interval="${commit.interval}" />
            <batch:listeners>
                <batch:listener ref="contractProcessor" />
            </batch:listeners>
        </tasklet>
    </step>
</job>
<!-- Chunk definition: Contract ItemReader -->
<bean id="contractReader" class="com.sopra.banking.cirbe.acquisition.batch.AcquisitionFileReader" scope="step">
    <property name="resource" value="#{stepExecutionContext[inputfile]}" />
    <property name="lineMapper">
        <bean id="contractLineMapper" class="org.springframework.batch.item.file.mapping.PatternMatchingCompositeLineMapper">
            <property name="tokenizers">
                <map>
                    <entry key="1*" value-ref="headerTokenizer" />
                    <entry key="2*" value-ref="contractTokenizer" />
                </map>
            </property>
            <property name="fieldSetMappers">
                <map>
                    <entry key="1*" value-ref="headerMapper" />
                    <entry key="2*" value-ref="contractMapper" />
                </map>
            </property>
        </bean>
    </property>
</bean>
<!-- MD5 Tasklet -->
<bean id="md5Tasklet" class="com.sopra.banking.cirbe.acquisition.batch.AcquisitionMD5Tasklet">
    <property name="file" value="#{stepExecutionContext[inputfile]}" />
</bean>
But what I get is :
Caused by: org.springframework.expression.spel.SpelEvaluationException: EL1008E:(pos 0): Field or property 'stepExecutionContext' cannot be found on object of type 'org.springframework.beans.factory.config.BeanExpressionContext'
What I'm looking for is a way to launch my job importOneContractESTD for each file contained in file:${import.contract.filePattern}. Each file is shared between the step calculateMD5 (which puts the processed file's MD5 into my job context) and the step importContract (which reads that MD5 from the job context and adds it as data to each line processed by the contractProcessor).
If I call importOneContractESTD with a single file given as a parameter (e.g. replacing #{stepExecutionContext[inputfile]} with ${my.file}), it works... But I want Spring Batch to manage my directory rather than the shell script that calls it...
Thanks for your ideas!
Add scope="step" to the bean when you need to access stepExecutionContext, like here:
<bean id="md5Tasklet" class="com.sopra.banking.cirbe.acquisition.batch.AcquisitionMD5Tasklet" scope="step">
    <property name="file" value="#{stepExecutionContext[inputfile]}" />
</bean>