Processing huge data with Spring Batch partitioning

I am implementing a Spring Batch job for processing millions of records in a DB table using the partitioning approach, as follows:
Fetch the unique partitioning codes from the table in a partitioner and set each one in an execution context.
Create a chunk step with a reader, processor and writer to process records for a particular partition code.
Is this approach proper, or is there a better approach for a situation like this? Some partition codes will have more records than others, so partitions with more records will take longer to process than those with fewer.
Is it possible to create partitions/threads such that, for example, thread 1 processes records 1-1000, thread 2 processes 1001-2000, and so on?
How do I control the number of threads created? There can be around 100 partition codes, but I would like to create only 20 threads and process them in 5 iterations.
What happens if one partition fails? Will all processing stop and be reverted?
Following are the configurations:
<bean id="MyPartitioner" class="com.MyPartitioner" />
<bean id="itemProcessor" class="com.MyProcessor" scope="step" />
<bean id="itemReader" class="org.springframework.batch.item.database.JdbcCursorItemReader" scope="step" >
<property name="dataSource" ref="dataSource"/>
<property name="sql" value="select * from mytable WHERE code = '#{stepExecutionContext[code]}' "/>
<property name="rowMapper">
<bean class="com.MyRowMapper" scope="step"/>
</property>
</bean>
<bean id="taskExecutor" class="org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor" >
<property name="corePoolSize" value="20"/>
<property name="maxPoolSize" value="20"/>
<property name="allowCoreThreadTimeOut" value="true"/>
</bean>
<batch:step id="Step1" xmlns="http://www.springframework.org/schema/batch">
<batch:tasklet transaction-manager="transactionManager">
<batch:chunk reader="itemReader" processor="itemProcessor" writer="itemWriter" commit-interval="200"/>
</batch:tasklet>
</batch:step>
<batch:job id="myjob">
<batch:step id="mystep">
<batch:partition step="Step1" partitioner="MyPartitioner">
<batch:handler grid-size="20" task-executor="taskExecutor"/>
</batch:partition>
</batch:step>
</batch:job>
Partitioner:
public class MyPartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitionMap = new HashMap<String, ExecutionContext>();
        List<String> codes = getCodes();
        for (String code : codes) {
            ExecutionContext context = new ExecutionContext();
            context.put("code", code);
            partitionMap.put(code, context);
        }
        return partitionMap;
    }
}
Thanks

I would say it is the right approach. I do not see why you need one thread per 1000 items; if you partition per unique partitioning code and use a chunk of 1000 items, you will have transactions of 1000 items per thread, which is IMO fine.
In addition to saving the unique partitioning codes, you can count how many records you have for each partition code and partition even further, by creating a new sub-context for every 1000 records of the same partition code. That way, a partition code with, say, 2200 records will be handled by 3 threads with the context parameters: 1 => partition_key=key1, skip=0, count=1000; 2 => partition_key=key1, skip=1000, count=1000; and 3 => partition_key=key1, skip=2000, count=1000. That is what you asked about, but I would still go without it.
The number of threads is controlled by the ThreadPoolTaskExecutor that is passed to the partition step. Its corePoolSize (set to 20 here) means you will get at most 20 concurrent threads; with around 100 partition codes, the remaining partitions simply wait in the executor's queue until a thread becomes free. The next fine-grained setting is grid-size, which tells the partitioner how many partitions it should create out of the full partition map. So partitioning is about dividing the work; after that, your threading configuration defines the concurrency of the actual processing.
If one partition fails, the whole partitioned step fails, with information about which partition failed. Successful partitions are done and will not be invoked again; when the job restarts, it picks up where it left off by redoing the failed and unprocessed partitions.
Hope I covered all the questions you had, since there were many.
Example of case 1 (there may be mistakes, but just to give the idea):
public class MyPartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitionMap = new HashMap<String, ExecutionContext>();
        Map<String, Integer> codesWithCounts = getCodesWithCounts();
        for (Map.Entry<String, Integer> codeWithCount : codesWithCounts.entrySet()) {
            String code = codeWithCount.getKey();
            for (int i = 0; i < codeWithCount.getValue(); i += 1000) {
                ExecutionContext context = new ExecutionContext();
                context.put("code", code);
                context.put("skip", i);
                context.put("count", 1000);
                // the key must be unique per sub-partition, otherwise entries overwrite each other
                partitionMap.put(code + ":" + i, context);
            }
        }
        return partitionMap;
    }
}
And then you page by 1000, reading from the context how many rows to skip, which in the example of 2200 records will be: 0, 1000 and 2000.
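For illustration only, a step-scoped reader could pick those skip/count values up through late binding. The following is a rough Java-config sketch rather than the poster's actual setup: the OFFSET/FETCH pagination syntax, the id ordering column and the use of ColumnMapRowMapper (instead of the question's MyRowMapper) are all assumptions.

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
public class SubPartitionReaderConfig {

    @Bean
    @StepScope
    public JdbcCursorItemReader<Map<String, Object>> subPartitionReader(
            DataSource dataSource,
            @Value("#{stepExecutionContext['code']}") String code,
            @Value("#{stepExecutionContext['skip']}") int skip,
            @Value("#{stepExecutionContext['count']}") int count) {
        JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        // Read only this sub-partition's slice of the partition code.
        // OFFSET/FETCH is not supported by every database; adjust to your SQL dialect.
        reader.setSql("SELECT * FROM mytable WHERE code = ? ORDER BY id "
                + "OFFSET " + skip + " ROWS FETCH FIRST " + count + " ROWS ONLY");
        reader.setPreparedStatementSetter(ps -> ps.setString(1, code));
        reader.setRowMapper(new ColumnMapRowMapper());
        return reader;
    }
}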

Related

Get jobExecutionContext in Spring Batch XML config from a previous step

I am defining my MultiResourceItemReader this way:
<bean id="multiDataItemReader" class="org.springframework.batch.item.file.MultiResourceItemReader" scope="step">
<property name="resources" value="#{jobExecutionContext['filesResource']}"/>
<property name="delegate" ref="dataItemReader"/>
</bean>
As you can see, I want to read the "filesResource" value from the jobExecutionContext.
Note: I changed some names to keep the code private. This runs; if somebody wants more info, please tell me.
I am saving this value in my first step and using the reader in the second step. Should I have access to it?
I am saving it in the final lines of my step1 tasklet:
ExecutionContext jobContext = context.getStepContext().getStepExecution().getJobExecution().getExecutionContext();
jobContext.put("filesResource", resourceString);
<batch:job id="myJob">
<batch:step id="step1" next="step2">
<batch:tasklet ref="moveFilesFromTasklet" />
</batch:step>
<batch:step id="step2">
<tasklet>
<chunk commit-interval="500"
reader="multiDataItemReader"
processor="dataItemProcessor"
writer="dataItemWriter" />
</tasklet>
</batch:step>
</batch:job>
I am not really sure what I am missing in order to get the value. The error I am getting is:
20190714 19:49:08.120 WARN org.springframework.batch.item.file.MultiResourceItemReader [[ # ]] - No resources to read. Set strict=true if this should be an error condition.
I see nothing wrong with your config. The value of resourceString should be an array of org.springframework.core.io.Resource, as this is the parameter type of the resources attribute of MultiResourceItemReader.
You can pass an array or a list of String with the absolute path to each resource and it should work. Here is a quick example:
class MyTasklet implements Tasklet {

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        List<String> resources = Arrays.asList(
                "/full/path/to/resource1",
                "/full/path/to/resource2");
        chunkContext.getStepContext().getStepExecution().getJobExecution().getExecutionContext()
                .put("filesResource", resources);
        return RepeatStatus.FINISHED;
    }
}

How to make execution contexts in a Spring Batch Partitioner run in sequence

I have a requirement where I first have to select a number of master records from a table, and then for each master record fetch a number of child rows, processing and writing the child rows chunk-wise.
To do this I used a Partitioner in Spring Batch and created master and slave steps. The code works fine as long as I do not need to run the slave step in the same sequence in which the execution contexts were added.
But my requirement is to run the slave step for each execution context in the same sequence in which it was added in the partitioner, because I cannot process child records until the parent record has been processed.
With the partitioner, the slave step does not run in that sequence. Please help me understand how to maintain the same sequence for the slave step runs.
Is there any other way to achieve this using Spring Batch? Any help is welcome.
<job id="EPICSDBJob" xmlns="http://www.springframework.org/schema/batch">
<!-- Create Order Master Start -->
<step id="populateNewOrdersMasterStep" allow-start-if-complete="false"
next="populateLineItemMasterStep">
<partition step="populateNewOrders" partitioner="pdcReadPartitioner">
<handler grid-size="1" task-executor="taskExecutor" />
</partition>
<batch:listeners>
<batch:listener ref="partitionerStepListner" />
</batch:listeners>
</step>
<!-- Create Order Master End -->
<listeners>
<listener ref="epicsPimsJobListner" />
</listeners>
</job>
<step id="populateNewOrders" xmlns="http://www.springframework.org/schema/batch">
<tasklet allow-start-if-complete="true">
<chunk reader="epicsDBReader" processor="epicsPimsProcessor"
writer="pimsWriter" commit-interval="10">
</chunk>
</tasklet>
<batch:listeners>
<batch:listener ref="stepJobListner" />
</batch:listeners>
</step>
<bean id="epicsDBReader" class="com.cat.epics.sf.batch.reader.EPICSDBReader" scope="step" >
<property name="sfObjName" value="#{stepExecutionContext[sfParentObjNm]}" />
<property name="readChunkCount" value="10" />
<property name="readerDao" ref="readerDao" />
<property name="configDao" ref="configDao" />
<property name="dBReaderService" ref="dBReaderService" />
</bean>
Partitioner Method:
@Override
public Map<String, ExecutionContext> partition(int arg0) {
    Map<String, ExecutionContext> result = new LinkedHashMap<String, ExecutionContext>();
    List<String> sfMappingObjectNames = configDao.getSFMappingObjNames();
    int i = 1;
    for (String sfMappingObjectName : sfMappingObjectNames) {
        ExecutionContext value = new ExecutionContext();
        value.putString("sfParentObjNm", sfMappingObjectName);
        result.put("partition:" + i, value);
        i++;
    }
    return result;
}
There isn't a way to guarantee order within Spring Batch's partitioning model. The fact that the partitions are executed in parallel means that, by definition, there will be no ordering to the records processed. I think this is a case where restructuring the job a bit may help.
If your requirement is to execute the parent and then execute the children, using a driving query pattern along with the partitioning would work. You'd partition along the parent records (which it looks like you're doing), then in the worker step you'd use the parent record to drive the queries and processing for the child records. That would guarantee that the child records are processed after their parent.
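For illustration only (this is not code from the original answer), the reader in the worker step could use the parent key that the partitioner put into the step execution context to select just that parent's children. The CHILD_TABLE name, the PARENT_ID column and the use of ColumnMapRowMapper below are assumptions:

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.ColumnMapRowMapper;

public class ChildRecordReaderFactory {

    // In a step-scoped bean, parentId would be injected via #{stepExecutionContext['parentId']}.
    public JdbcCursorItemReader<Map<String, Object>> childReader(DataSource dataSource, String parentId) {
        JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        // The parent record drives the query: each worker only reads the children
        // of the single parent key it was handed by the partitioner.
        reader.setSql("SELECT * FROM CHILD_TABLE WHERE PARENT_ID = ?");
        reader.setPreparedStatementSetter(ps -> ps.setString(1, parentId));
        reader.setRowMapper(new ColumnMapRowMapper());
        return reader;
    }
}

The parent record itself can then be handled first inside the same worker step, for example in a StepExecutionListener's beforeStep, so the parent-before-children ordering holds within each partition even though the partitions themselves run in parallel.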

Spring Batch BeanWrapperFieldExtractor for large number of fields

I am in the process of writing a Spring Batch application that reads a CSV file, does some transformation, and writes a modified CSV to be sent to another batch process.
My writer configuration looks like this:
<beans:property name="lineAggregator">
<beans:bean class="org.springframework.batch.item.file.transform.FormatterLineAggregator">
<beans:property name="fieldExtractor">
<beans:bean class="org.springframework.batch.item.file.transform.BeanWrapperFieldExtractor">
<beans:property name="names" value="column1, column2, column3, column4 ------ 322 fields " />
</beans:bean>
</beans:property>
<beans:property name="format" value="%-8s%-12s%-11s%-16s" ----322 fields />
</beans:bean>
</beans:property>
I have to write around 322 fields, and I am unable to get the FormatterLineAggregator to do the work cleanly. If I write the format like
<property name="format" value="%s;%s;%s;%s;%s;%s;%s;%f;%f;%s;%f;%f;%td.%tm.%​tY;%td.%tm.%<‌​tY;%s;%td.%tm.%&‌​lt;tY;%s;%s;%s;%s;%t‌​d.%tm.%tY" /> ,
it gets really messy and it is tough to make sure all the fields are correct.
I thought of 3 different solutions:
Either go with the approach above.
Write a custom fields extractor, but I don't know what to write in the class or how to format the fields (this is my preferred option).
Use a "non-standard" BeanIO framework jar, but I fear my client won't agree to this solution.
Can someone please provide some input? I appreciate your help!
You can proceed with solution #2 in this way:
Externalize how to format every property of your bean class (e.g., in an XML or text file)
Write a custom LineAggregator and make it work with the directives from point 1
public class Aggregator<T> implements LineAggregator<T> {

    // property name -> format directive, loaded from the externalized configuration (point 1);
    // use an insertion-ordered map (e.g. LinkedHashMap) so the columns are written in order
    private Map<String, String> propertyFormat;

    @Override
    public String aggregate(T item) {
        // BeanWrapperImpl (org.springframework.beans) reads the named property off the item
        final BeanWrapper wrapper = new BeanWrapperImpl(item);
        final StringBuilder sb = new StringBuilder();
        for (final String property : propertyFormat.keySet()) {
            final String format = propertyFormat.get(property);
            final Object propertyValue = wrapper.getPropertyValue(property);
            sb.append(String.format(format, propertyValue));
        }
        return sb.toString();
    }
}
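As a hedged illustration of point 1, the per-field format directives could live in a simple text file (one property=format line per field, in output order) and be loaded into an insertion-ordered map before wiring the aggregator. The file layout and the class below are assumptions, not part of the original answer:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class PropertyFormatLoader {

    // Reads lines like "column1=%-8s" and keeps them in file order.
    public static Map<String, String> load(String path) throws IOException {
        Map<String, String> formats = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                int eq = line.indexOf('=');
                formats.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return formats;
    }
}

A plain java.util.Properties load is deliberately avoided in this sketch because it does not preserve field order, which matters when 322 columns must come out in a fixed sequence.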

Spring Batch JDBCPagingItemReader not partitioning equally for each thread

This is my first question here. I am working on a Spring Batch job and I am using step partitioning to process 70K records. For testing I am using 1021 records, and I found that the partitioning is not happening equally for each thread. I am using JdbcPagingItemReader with 5 threads. The distribution should be:
Thread 1 - 205
Thread 2 - 205
Thread 3 - 205
Thread 4 - 205
Thread 5 - 201
But unfortunately this is not happening and I am getting the below record distribution among threads
Thread 1 - 100
Thread 2 - 111
Thread 3 - 100
Thread 4 - 205
Thread 5 - 200
That is 716 records in total, and 305 records were skipped during partitioning. I really don't have any clue what is happening. Could you please look at the configurations below and let me know if I am missing anything? Thanks in advance for your help.
<import resource="../config/batch-context.xml" />
<import resource="../config/database.xml" />
<job id="partitionJob" xmlns="http://www.springframework.org/schema/batch">
<step id="masterStep" parent="abstractPartitionerStagedStep">
<partition step="slave" partitioner="rangePartitioner">
<handler grid-size="5" task-executor="taskExecutor"/>
</partition>
</step>
</job>
<bean id="abstractPartitionerStagedStep" abstract="true">
<property name="listeners">
<list>
<ref bean="updatelistener" />
</list>
</property>
</bean>
<bean id="updatelistener"
class="com.test.springbatch.model.UpdateFileCopyStatus" >
</bean>
<!-- Jobs to run -->
<step id="slave" xmlns="http://www.springframework.org/schema/batch">
<tasklet>
<chunk reader="pagingItemReader" writer="flatFileItemWriter"
processor="itemProcessor" commit-interval="1" retry-limit="0" skip-limit="100">
<skippable-exception-classes>
<include class="java.lang.Exception"/>
</skippable-exception-classes>
</chunk>
</tasklet>
</step>
<bean id="rangePartitioner" class="com.test.springbatch.partition.RangePartitioner">
<property name="dataSource" ref="dataSource" />
</bean>
<bean id="taskExecutor" class="org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor" >
<property name="corePoolSize" value="5"/>
<property name="maxPoolSize" value="5"/>
<property name="queueCapacity" value="100" />
<property name="allowCoreThreadTimeOut" value="true"/>
<property name="keepAliveSeconds" value="60" />
</bean>
<bean id="itemProcessor" class="com.test.springbatch.processor.CaseProcessor" scope="step">
<property name="threadName" value="#{stepExecutionContext[name]}" />
</bean>
<bean id="pagingItemReader"
class="org.springframework.batch.item.database.JdbcPagingItemReader"
scope="step">
<property name="dataSource" ref="dataSource" />
<property name="queryProvider">
<bean
class="org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean">
<property name="dataSource" ref="dataSource" />
<property name="selectClause" value="SELECT *" />
<property name="fromClause" value="FROM ( SELECT CASE_NUM ,CASE_STTS_CD, UPDT_TS,SBMT_OFC_CD,
SBMT_OFC_NUM,DSTR_CHNL_CD,APRV_OFC_CD,APRV_OFC_NUM,SBMT_TYP_CD, ROW_NUMBER()
OVER(ORDER BY CASE_NUM) AS rownumber FROM TSMCASE WHERE PROC_IND ='N' ) AS data" />
<property name="whereClause" value="WHERE rownumber BETWEEN :fromRow AND :toRow " />
<property name="sortKey" value="CASE_NUM" />
</bean>
</property>
<!-- Inject via the ExecutionContext in rangePartitioner -->
<property name="parameterValues">
<map>
<entry key="fromRow" value="#{stepExecutionContext[fromRow]}" />
<entry key="toRow" value="#{stepExecutionContext[toRow]}" />
</map>
</property>
<property name="pageSize" value="100" />
<property name="rowMapper">
<bean class="com.test.springbatch.model.CaseRowMapper" />
</property>
</bean>
<bean id="flatFileItemWriter" class="com.test.springbatch.writer.FNWriter" scope="step" >
</bean>
Here is the partitioner code:
public class OffRangePartitioner implements Partitioner {
private String officeLst;
private double splitvalue;
private DataSource dataSource;
private static Logger LOGGER = Log4JFactory.getLogger(OffRangePartitioner.class);
private static final int INDENT_LEVEL = 6;
public String getOfficeLst() {
return officeLst;
}
public void setOfficeLst(final String officeLst) {
this.officeLst = officeLst;
}
public void setDataSource(DataSource dataSource) {
this.dataSource = dataSource;
}
public OffRangePartitioner() {
super();
final GlobalProperties globalProperties = GlobalProperties.getInstance();
splitvalue = Double.parseDouble(globalProperties.getProperty("springbatch.part.splitvalue"));
}
@Override
public Map<String, ExecutionContext> partition(int threadSize) {
FormattedTraceHelper.formattedTrace(LOGGER,"Partition method in OffRangePartitioner class Start",INDENT_LEVEL, Level.INFO_INT);
final Session currentSession = HibernateUtil.getSessionFactory(HibernateConstants.DB2_DATABASE_NAME).getCurrentSession();
Query queryObj;
double count = 0.0;
final Transaction transaction = currentSession.beginTransaction();
queryObj = currentSession.createQuery(BatchConstants.PARTITION_CNT_QRY);
if (queryObj.iterate().hasNext()) {
count = Double.parseDouble(queryObj.iterate().next().toString());
}
int fromRow = 0;
int toRow = 0;
ExecutionContext context;
FormattedTraceHelper.formattedTrace(LOGGER,"Count of total records submitted for processing >> " + count, INDENT_LEVEL, Level.DEBUG_INT);
int gridSize = (int) Math.ceil(count / splitvalue);
FormattedTraceHelper.formattedTrace(LOGGER,"Total Grid size based on the count >> " + gridSize, INDENT_LEVEL, Level.DEBUG_INT);
Map<String, ExecutionContext> result = new HashMap<String, ExecutionContext>();
for (int threadCount = 1; threadCount <= gridSize; threadCount++) {
fromRow = toRow + 1;
if (threadCount == gridSize || gridSize == 1) {
toRow = (int) count;
} else {
toRow += splitvalue;
}
context = new ExecutionContext();
context.putInt("fromRow", fromRow);
context.putInt("toRow", toRow);
context.putString("name", "Processing Thread" + threadCount);
result.put("partition" + threadCount, context);
FormattedTraceHelper.formattedTrace(LOGGER, "Partition number >> "
+ threadCount + " from Row#: " + fromRow + " to Row#: "
+ toRow, INDENT_LEVEL, Level.DEBUG_INT);
}
if (transaction != null) {
transaction.commit();
}
FormattedTraceHelper.formattedTrace(LOGGER,
"Partition method in OffRangePartitioner class End",
INDENT_LEVEL, Level.INFO_INT);
return result;
}
}
Today I tested the same batch with 1056 records, with Spring Framework debug logging turned on.
PAGE SIZE 100
SELECT * FROM (
SELECT CASE_NUM, CASE_STTS_CD, UPDT_TS,SBMT_OFC_CD, SBMT_OFC_NUM, DSTR_CHNL_CD,
APRV_OFC_CD, APRV_OFC_NUM,SBMT_TYP_CD, ROW_NUMBER() OVER(ORDER BY CASE_NUM) AS rownumber
FROM TCASE
WHERE SECARCH_PROC_IND = 'P'
) AS data
WHERE
rownumber BETWEEN :fromRow AND :toRow
ORDER BY
rownumber ASC
FETCH FIRST 100 ROWS ONLY
We update the SECARCH_PROC_IND = 'P' flag to 'C' once each record is processed. We use ROW_NUMBER() in the main query to partition the records based on SECARCH_PROC_IND = 'P', and the row numbers shift as soon as any thread updates the flag from 'P' to 'C'.
It looks like this is the issue.
Spring Batch fires the query below to fetch the data from the database:
SELECT * FROM ( SELECT CASE_NUM, CASE_STTS_CD, UPDT_TS,SBMT_OFC_CD, SBMT_OFC_NUM, DSTR_CHNL_CD, APRV_OFC_CD, APRV_OFC_NUM,SBMT_TYP_CD, ROW_NUMBER() OVER(ORDER BY CASE_NUM) AS rownumber FROM TCASE WHERE SECARCH_PROC_IND ='P' ) AS data WHERE rownumber BETWEEN :fromRow AND :toRow ORDER BY rownumber ASC FETCH FIRST 100 ROWS ONLY
After processing each row, the flag SECARCH_PROC_IND = 'P' is updated to SECARCH_PROC_IND = 'C'. Since SECARCH_PROC_IND is used in the WHERE clause, this shrinks the ROW_NUMBER range in the subsequent queries fired by Spring Batch, so the later partitions' row ranges no longer match. This is the root cause of the issue.
We introduced another column, SECARCH_PROC_TMP_IND, which we set to 'P' before batch processing in the beforeJob() method, and we use that column in the WHERE clause of the query instead of the SECARCH_PROC_IND column.
Once the batch has been processed, we reset SECARCH_PROC_TMP_IND to NULL in afterJob().
This resolved the partitioning issue.
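For illustration, that flag handling could live in a JobExecutionListener. This is a minimal sketch: the table and column names are taken from the question, but the exact UPDATE statements are assumed:

import javax.sql.DataSource;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

public class ProcessFlagListener implements JobExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    public ProcessFlagListener(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // Freeze the working set: copy the 'to process' flag into the temporary column so
        // that updates to SECARCH_PROC_IND during the run do not shift the row numbers.
        jdbcTemplate.update(
                "UPDATE TCASE SET SECARCH_PROC_TMP_IND = 'P' WHERE SECARCH_PROC_IND = 'P'");
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // Clear the temporary flag once the batch run has finished.
        jdbcTemplate.update(
                "UPDATE TCASE SET SECARCH_PROC_TMP_IND = NULL WHERE SECARCH_PROC_TMP_IND IS NOT NULL");
    }
}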

Camel: Aggregator doesn't persist Exchange properties

I'm using camel:aggregate backed by JDBC and it seems it doesn't save Exchange properties. For instance, if I configure the following route and execution is stopped once aggregation has completed but just before camel:to(log) executes, forcing the aggregation to retrieve its data from the database when restarted, then camel:to(log) won't print the property myProperty.
<camel:route id="myRoute">
<camel:from uri="direct:in"/>
<camel:setProperty propertyName="myProperty">
<camel:constant>myPropertyValue</camel:constant>
</camel:setProperty>
<camel:aggregate strategyRef="myStrategy" aggregationRepositoryRef="myAggregationRepo" discardOnCompletionTimeout="true" completionTimeout="86400000" >
<camel:correlationExpression>
<camel:simple>${property.partlastcorrelationkey}</camel:simple>
</camel:correlationExpression>
<camel:completionPredicate>
<camel:simple>${property.partlastcorrelationwaitmore} == false</camel:simple>
</camel:completionPredicate>
<camel:to uri="log:com.test?showAll=true"/>
</camel:aggregate>
</camel:route>
My aggregation repository is configured this way:
<bean id="myAggregationRepo" class="org.apache.camel.processor.aggregate.jdbc.JdbcAggregationRepository" init-method="start" destroy-method="stop">
<property name="transactionManager" ref="transactionManager"/>
<property name="repositoryName" value="PROC_AGG"/>
<property name="dataSource" ref="oracle-ds"/>
<property name="lobHandler">
<bean class="org.springframework.jdbc.support.lob.OracleLobHandler">
<property name="nativeJdbcExtractor">
<bean class="org.springframework.jdbc.support.nativejdbc.CommonsDbcpNativeJdbcExtractor"/>
</property>
</bean>
</property>
</bean>
How can I save properties when using the Aggregator?
I'll answer myself. As seen in the code, JdbcCamelCodec does not allow saving properties when backing the Aggregator with a database:
public final class JdbcCamelCodec {
public byte[] marshallExchange(CamelContext camelContext, Exchange exchange) throws IOException {
// use DefaultExchangeHolder to marshal to a serialized object
DefaultExchangeHolder pe = DefaultExchangeHolder.marshal(exchange, false);
// add the aggregated size property as the only property we want to retain
DefaultExchangeHolder.addProperty(pe, Exchange.AGGREGATED_SIZE, exchange.getProperty(Exchange.AGGREGATED_SIZE, Integer.class));
// add the aggregated completed by property to retain
DefaultExchangeHolder.addProperty(pe, Exchange.AGGREGATED_COMPLETED_BY, exchange.getProperty(Exchange.AGGREGATED_COMPLETED_BY, String.class));
// add the aggregated correlation key property to retain
DefaultExchangeHolder.addProperty(pe, Exchange.AGGREGATED_CORRELATION_KEY, exchange.getProperty(Exchange.AGGREGATED_CORRELATION_KEY, String.class));
// persist the from endpoint as well
if (exchange.getFromEndpoint() != null) {
DefaultExchangeHolder.addProperty(pe, "CamelAggregatedFromEndpoint", exchange.getFromEndpoint().getEndpointUri());
}
return encode(pe);
}
Basically, the problem lies in this line, where false means: do not marshal the Exchange properties.
DefaultExchangeHolder pe = DefaultExchangeHolder.marshal(exchange, false);
The headers and the body are the only parts of the Exchange stored in the database.
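A possible workaround (not part of the original answer): copy any value that must survive a JDBC-backed aggregation into a header before the aggregator, since headers are persisted. A rough Java DSL sketch of that idea, with the endpoint names assumed and the aggregator itself omitted:

import org.apache.camel.builder.RouteBuilder;

public class CopyPropertyToHeaderRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("direct:in")
            .setProperty("myProperty", constant("myPropertyValue"))
            // JdbcAggregationRepository only persists the body and headers,
            // so mirror the property into a header before aggregating.
            .setHeader("myProperty", simple("${property.myProperty}"))
            .to("direct:aggregate"); // the camel:aggregate part of the route is omitted here
    }
}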
