How to use a LIMIT and OFFSET clause with JdbcPagingItemReader in Spring Batch?

The table has more than 200 million records, but I need to restrict the select to the top 5 million records. I tried JdbcCursorItemReader, which takes around 2-3 hours to select the rows and write them to a CSV file in a single chunk-oriented step, so I chose the parallel processing that Spring Batch offers:
i.e., a taskExecutor with JdbcPagingItemReader, producing 5 individual files of a million records each. The problem is that I am not able to specify the LIMIT and OFFSET clause in the query parameters. Please help me with this; an approach better than this would also be appreciated.
<bean id="itemReader" class="org.springframework.batch.item.database.JdbcPagingItemReader" scope="step">
<property name="dataSource" ref="dataSource" />
<property name="rowMapper">
<bean class="MyRowMapper" />
</property>
<property name="queryProvider">
<bean class="org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean">
<property name="dataSource" ref="dataSource" />
<property name="sortKeys">
<map>
<entry key="esmeaddr" value="ASCENDING"/>
</map>
</property>
<property name="selectClause" value="elect cust_send,dest,msg,stime,dtime,dn_status,mid,rp,operator,circle,cust_mid,first_attempt,second_attempt,third_attempt,fourth_attempt,fifth_attempt,term_operator,term_circle,bindata,reason,tag1,tag2,tag3,tag4,tag5"
/>
<property name="fromClause" value="FROM bill_log " />
<property name="whereClause" value="where esmeaddr = '70897600000000' and country='India' and apptype='SMS' Limit 0,1000000" />
</bean>
</property>
<property name="pageSize" value="1000000" />
<property name="parameterValues">
<map>
<entry key="param1" value="#{jobExecutionContext[param1]}" />
<entry key="param2" value="#{jobExecutionContext[param2]}" />
</map>
</property>
</bean>

You can't use a SQL LIMIT clause within that reader, since paging is what the reader itself does. Instead, Spring Batch has this functionality built into the JdbcPagingItemReader. To set the max number of items to process, configure the reader with JdbcPagingItemReader#setMaxItemCount(5000000), and if there is an offset, set JdbcPagingItemReader#setCurrentItemCount(offset). That being said, the offset will be overridden on a restart with any value it finds in the ExecutionContext. You can read more about this in the javadoc here: https://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/item/database/JdbcPagingItemReader.html
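A rough sketch of that suggestion in Java-based configuration. The setters are the Spring Batch API mentioned above, but MyRow, MyRowMapper, the injected dataSource/queryProvider, the 'offset' key and the 1-million window are placeholders matching the question's scenario, with the offset assumed to be supplied by a partitioner:

@Bean
@StepScope
public JdbcPagingItemReader<MyRow> itemReader(
        @Value("#{stepExecutionContext['offset']}") Integer offset) {
    JdbcPagingItemReader<MyRow> reader = new JdbcPagingItemReader<>();
    reader.setDataSource(dataSource);
    reader.setQueryProvider(queryProvider); // no LIMIT in the where clause
    reader.setRowMapper(new MyRowMapper());
    reader.setPageSize(10000);
    // start this partition at 'offset' (overridden from the ExecutionContext on restart)
    reader.setCurrentItemCount(offset);
    // maxItemCount is an absolute count, so this partition's window ends at offset + 1 million
    reader.setMaxItemCount(offset + 1_000_000);
    return reader;
}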

Related

Spring batch job to update different tables

I am reading the article http://spring.io/guides/gs/batch-processing/ which explains reading a CSV and writing it back to a DB. I want to know how I can read multiple CSV files, say A.csv, B.csv etc., and write the content back to the respective tables table_A, table_B etc. Please note the content of each CSV file should go into a different table.
The basic use case here would be to create as many steps as you have CSV files (since there is no default MultiResourceItemReader implementation for this).
Each of your steps would read a CSV (with a FlatFileItemReader) and write to your database (using JdbcBatchItemWriter or another writer of the same kind). Although you will have multiple steps, if your CSV files have the same format (columns, separators), you can factor out the configuration using an abstract parent step. See the documentation: http://docs.spring.io/spring-batch/trunk/reference/html/configureStep.html
If not, then you can at least share the common attributes such as the LineMapper, ItemPreparedStatementSetter and DataSource.
UPDATE
Here are examples for your readers and writers:
<bean id="reader" class="org.springframework.batch.item.file.FlatFileItemReader">
<property name="resource" value="yourFile.csv" />
<property name="lineMapper">
<bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
<property name="lineTokenizer">
<bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
<property name="names" value="column1,column2,column3..." />
</bean>
</property>
<property name="fieldSetMapper">
<bean class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
<property name="prototypeBeanName" value="yourBeanClass" />
</bean>
</property>
</bean>
</property>
</bean>
<bean id="writer" class="org.springframework.batch.item.database.JdbcBatchItemWriter">
<property name="dataSource" ref="dataSource" />
<property name="sql">
<value>
<![CDATA[
insert into YOUR_TABLE(column1,column2,column3...)
values (:beanField1, :beanField2, :beanField3...)
]]>
</value>
</property>
<property name="itemSqlParameterSourceProvider">
<bean class="org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider" />
</property>
</bean>
UPDATE 2
Here's an example of chaining the steps in the job (with Java-based configuration):
@Bean
public Job job() {
    return jobBuilderFactory().get("job").incrementer(new RunIdIncrementer())
            .start(step1()).next(step2()).build();
}
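The step1() and step2() referenced above are not shown in the original answer; a minimal sketch of one of them, assuming the reader and writer beans from the first UPDATE, a StepBuilderFactory, and a placeholder item class MyBean:

@Bean
public Step step1() {
    // chunk size 100 is an arbitrary commit interval for the sketch
    return stepBuilderFactory().get("step1")
            .<MyBean, MyBean>chunk(100)
            .reader(reader())   // the FlatFileItemReader shown above
            .writer(writer())   // the JdbcBatchItemWriter shown above
            .build();
}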

Spring batch ItemProcessor order of processing the items

Here is my Spring configuration file.
<batch:job id="empTxnJob">
<batch:step id="stepOne">
<batch:partition partitioner="partitioner" step="worker" handler="partitionHandler" />
</batch:step>
</batch:job>
<bean id="asyncTaskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor" />
<bean id="partitionHandler" class="org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler" scope="step">
<property name="taskExecutor" ref="asyncTaskExecutor" />
<property name="step" ref="worker" />
<property name="gridSize" value="${batch.gridsize}" />
</bean>
<bean id="partitioner" class="com.spring.mybatch.EmpTxnRangePartitioner">
<property name="empTxnDAO" ref="empTxnDAO" />
</bean>
<batch:step id="worker">
<batch:tasklet transaction-manager="transactionManager">
<batch:chunk reader="databaseReader" writer="databaseWriter" commit-interval="25" processor="itemProcessor">
</batch:chunk>
</batch:tasklet>
</batch:step>
<bean name="databaseReader" class="org.springframework.batch.item.database.JdbcCursorItemReader" scope="step">
<property name="dataSource" ref="dataSource" />
<property name="sql">
<value>
<![CDATA[
select *
from
emp_txn
where
emp_txn_id >= #{stepExecutionContext['minValue']}
and
emp_txn_id <= #{stepExecutionContext['maxValue']}
]]>
</value>
</property>
<property name="rowMapper">
<bean class="com.spring.mybatch.EmpTxnRowMapper" />
</property>
<property name="verifyCursorPosition" value="false" />
</bean>
<bean id="databaseWriter" class="org.springframework.batch.item.database.JdbcBatchItemWriter">
<property name="dataSource" ref="dataSource" />
<property name="sql">
<value><![CDATA[update emp_txn set txn_status=:txnStatus where emp_txn_id=:empTxnId]]></value>
</property>
<property name="itemSqlParameterSourceProvider">
<bean class="org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider" />
</property>
</bean>
<bean id="itemProcessor" class="org.springframework.batch.item.support.CompositeItemProcessor" scope="step">
<property name="delegates">
<list>
<ref bean="processor1" />
<ref bean="processor2" />
</list>
</property>
</bean>
My custom range partitioner splits the data based on the primary key of the emp_txn records.
Assume that an emp (primary key: emp_id) can have multiple emp_txn rows (primary key: emp_txn_id) to be processed. With my current setup, it is possible that in the ItemProcessor (either processor1 or processor2) two threads process emp_txn rows for the same employee (i.e., for the same emp_id).
Unfortunately, the back-end logic that processes the emp_txn (in processor2) is not capable of handling transactions for the same emp in parallel. Is there a way in Spring Batch to control the order of such processing?
With the use case you are describing, I think you're partitioning by the wrong key. I'd partition by emp instead of emp_txn. That would group the emp_txns together, and you could order them within each partition. It would also remove the risk of emp_txns being processed out of order based on which thread gets to them first.
To answer your direct question: no. There is no way to order items going through processors in separate threads. Once you break the step up with partitioning, each partition works independently.
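A minimal sketch of partitioning by emp instead (class and property names are assumptions; in practice the min/max emp_id would come from something like the question's empTxnDAO):

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Splits the emp_id key space into gridSize ranges, so every emp_txn row
// for a given employee falls into exactly one partition (and one thread).
public class EmpRangePartitioner implements Partitioner {

    private long minEmpId; // e.g. select min(emp_id) from emp_txn
    private long maxEmpId; // e.g. select max(emp_id) from emp_txn

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long rangeSize = (maxEmpId - minEmpId) / gridSize + 1;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minValue", minEmpId + i * rangeSize);
            context.putLong("maxValue", Math.min(minEmpId + (i + 1) * rangeSize - 1, maxEmpId));
            partitions.put("partition" + i, context);
        }
        return partitions;
    }

    public void setMinEmpId(long minEmpId) { this.minEmpId = minEmpId; }
    public void setMaxEmpId(long maxEmpId) { this.maxEmpId = maxEmpId; }
}

The reader's query would then select emp_txn rows by emp_id between minValue and maxValue, ordered by emp_txn_id, so each employee's transactions stay on one thread in order.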

JdbcPagingItemReader endless loop

Why does it get into an endless loop? It always returns the same ten records and doesn't continue with the next ones.
<bean id="pagingItemReader1" class="org.springframework.batch.item.database.JdbcPagingItemReader" scope="step">
<property name="dataSource" ref="dataSource1" />
<property name="queryProvider">
<bean class="org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean">
<property name="dataSource" ref="dataSource1" />
<property name="selectClause" value="select id, image" />
<property name="fromClause" value="from squares" />
<property name="whereClause" value="where image like :value1 or image like :value2" />
<property name="sortKey" value="id" />
</bean>
</property>
<property name="parameterValues">
<map>
<entry key="value1" value="%/images/%" />
<entry key="value2" value="%/temp/%" />
</map>
</property>
<property name="pageSize" value="10" />
<property name="rowMapper">
<bean class="test.batch.ImagesRowMapper" />
</property>
</bean>
Using MySQL 5.1
Parentheses are missing.
Please change
"whereClause" value="where image like :value1 or image like :value2"
to
"whereClause" value="where (image like :value1 or image like :value2)"
For every page after the first, the reader appends the sort-key condition to your where clause, giving roughly: where image like :value1 or image like :value2 and (id > :lastId). Since AND binds tighter than OR, any row matching :value1 satisfies the predicate regardless of its id, so the same first page is returned forever. The parentheses restore the intended precedence.
I don't know why, but I can share my solution with you:
increase your pageSize beyond the total number of rows (of the current query, of course).
This appears when I use an 'OR' in the whereClause.
Can someone investigate?

Spring Batch - Issue with PageSize in JdbcPagingItemReader

Hi, we are working on a Spring Batch job which processes all the SKUs in the SKU table and sends a request to the inventory system to get the inventory details. We need to send 100 SKU ids at a time, so we have set the pageSize to 100.
In the reader log we see:
SELECT * FROM (SELECT S_ID, S_PRNT_PRD, SQ, ROWNUM as TMP_ROW_NUM FROM XXX_SKU WHERE SQ >= :min and SQ <= :max ORDER BY SQ ASC) WHERE ROWNUM <= 100
But we observe in the writer that for certain requests 100 SKUs are sent, and for other requests only 1 SKU is sent.
public void write(List<? extends XXXPagingBean> pItems) throws XXXSkipItemException {
    if (mLogger.isLoggingDebug()) {
        mLogger.logDebug("XXXInventoryServiceWriter.write() method STARTING, ItemsList size:{0}" + pItems.size());
    }
    ....
    ....
}
pageSize and commitInterval are both set to 100 (do these have to be the same?).
Should the sortKey (SEQ_ID) be the same column as the one used in the partitioner?
Bean configurations in XML:
<!-- InventoryService Writer configuration -->
<bean id="inventoryGridService" class="atg.nucleus.spring.NucleusResolverUtil" factory-method="resolveName">
<constructor-arg value="/com/XXX/gigaspaces/inventorygrid/service/InventoryGridService" />
</bean>
<bean id="inventoryWriter" class="com.XXX.batch.integrations.XXXXXX.XXXXInventoryServiceWriter" scope="step">
<property name="jdbcTemplate" ref="batchDsTemplate"></property>
<property name="inventoryGridService" ref="inventoryGridService" />
</bean>
<bean id="pagingReader" class="org.springframework.batch.item.database.JdbcPagingItemReader" xmlns="http://www.springframework.org/schema/beans" scope="step">
<property name="dataSource" ref="dataSource" />
<property name="queryProvider">
<bean id=" productQueryProvider" class="org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean">
<property name="dataSource" ref="dataSource" />
<property name="selectClause" value="select S_ID ,S_PRNT_PRD" />
<property name="fromClause" value="from XXX_SKU" />
<property name="sortKey" value="SEQ_ID" />
<property name="whereClause" value="SEQ_ID>=:min and SEQ_ID <=:max"></property>
</bean>
</property>
<property name="parameterValues">
<map>
<entry key="min" value="#{stepExecutionContext[minValue]}"></entry>
<entry key="max" value="#{stepExecutionContext[maxValue]}"></entry>
</map>
</property>
<property name="pageSize" value="100" />
<property name="rowMapper">
<bean class="com.XXX.batch.integrations.endeca.XXXPagingRowMapper"></bean>
</property>
</bean>
Please suggest.
Remove your whereClause from the productQueryProvider bean definition and get rid of your parameterValues, and it should work. The PagingQueryProvider takes care of paging automatically for you; there's no need to do that manually yourself. With the manual SEQ_ID range, a range rarely contains an exact multiple of 100 rows, so the last page read from a range (and therefore the chunk handed to the writer) can hold as little as a single item, which may explain what you observe. (pageSize and the commit interval do not have to be the same, but keeping them equal makes each write correspond to one page.)
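A rough Java sketch of the reader as the answer suggests, without the manual range (dataSource, XXXPagingBean and XXXPagingRowMapper are taken from the question; SqlPagingQueryProviderFactoryBean.getObject() can throw, hence the throws clause):

@Bean
@StepScope
public JdbcPagingItemReader<XXXPagingBean> pagingReader() throws Exception {
    SqlPagingQueryProviderFactoryBean provider = new SqlPagingQueryProviderFactoryBean();
    provider.setDataSource(dataSource);
    provider.setSelectClause("select S_ID, S_PRNT_PRD");
    provider.setFromClause("from XXX_SKU");
    provider.setSortKey("SEQ_ID");
    // no whereClause and no parameterValues: the provider generates the
    // paging predicate (sort key > last value, ROWNUM <= pageSize) itself

    JdbcPagingItemReader<XXXPagingBean> reader = new JdbcPagingItemReader<>();
    reader.setDataSource(dataSource);
    reader.setQueryProvider(provider.getObject());
    reader.setPageSize(100);
    reader.setRowMapper(new XXXPagingRowMapper());
    return reader;
}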

Spring Batch: Reading a File : if field is empty setting the default value

I am very new to Spring Batch. I have a requirement in which I have to read a file having a header (field names) record and data records.
I have to validate the first record (check the field names against a set of predefined names); note that this record needs to be skipped, i.e., it should not be part of the items in the processor.
Then I read and store the rest of the field values in a POJO.
If the field 'date' is empty, I need to set the default value 'xxxx-yy-zz'.
I am unable to do the 1st and 3rd requirements with Spring Batch.
Here is the sample reader XML. Please help.
<bean id="reader" class="org.springframework.batch.item.file.FlatFileItemReader">
<property name="resource" value="classpath:input/import" />
<property name="encoding" value="UTF-8" />
<property name="linesToSkip" value="1" />
<property name="lineMapper" ref="line.mapper"/>
</bean>
<bean id="line.mapper" class="org.springframework.batch.item.file.mapping .DefaultLineMapper">
<property name="lineTokenizer" ref="line.tokenizer"/>
<property name="fieldSetMapper" ref="fieldSet.enity.mapper"/>
</bean>
<bean id="line.tokenizer" class="org.springframework.batch.item.file.transfo rm.DelimitedLineTokenizer">
<property name="delimiter">
<util:constant static-field="org.springframework.batch.item.file.transform.DelimitedLineTokenizer.DELIMITER_TAB"/>
</property>
<property name="names" value="id,date,age " />
<property name="strict" value="false"/>
</bean>
<bean id="fieldSet.enity.mapper" class="org.springframework.batch.item.file.mapping .BeanWrapperFieldSetMapper">
<property name="targetType" value="a.b.myPOJO"/>
<property name="customEditors">
<map>
<entry key="java.util.Date">
<bean class="org.springframework.beans.propertyeditors.C ustomDateEditor">
<constructor-arg>
<bean class="java.text.SimpleDateFormat">
<constructor-arg value="yyyy-mm-dd" />
</bean>
</constructor-arg>
<constructor-arg value="true" />
</bean>
</entry>
</map>
</property>
</bean>
Create your own custom FieldSetMapper like below:
public class CustomFieldSetMapper implements FieldSetMapper<a.b.myPOJO> {
    @Override
    public a.b.myPOJO mapFieldSet(FieldSet fs) {
        a.b.myPOJO myPOJO = new a.b.myPOJO();
        myPOJO.setId(fs.readString("id"));
        myPOJO.setAge(fs.readString("age"));
        if (fs.readString("date").isEmpty()) {
            myPOJO.setDate("xxxx-yy-zz"); // default when 'date' is empty
        } else {
            myPOJO.setDate(fs.readString("date"));
        }
        return myPOJO;
    }
}
Alternatively, you could set the date in an ItemProcessor.
Also, if <property name="linesToSkip" value="1" /> does not fulfil your requirement, you can extend FlatFileItemReader and validate the first line manually in it (the callback shown below achieves this without subclassing).
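For the header-validation part, a minimal sketch using the reader's skippedLinesCallback, which is invoked for every line skipped via linesToSkip (the expected tab-separated header "id", "date", "age" is an assumption based on the question's column names):

FlatFileItemReader<a.b.myPOJO> reader = new FlatFileItemReader<>();
reader.setResource(new ClassPathResource("input/import"));
reader.setLinesToSkip(1); // skip the header record so it never reaches the processor
reader.setSkippedLinesCallback(new LineCallbackHandler() {
    @Override
    public void handleLine(String line) {
        // assumed expected header; fail the step if the field names don't match
        if (!"id\tdate\tage".equals(line)) {
            throw new IllegalStateException("Unexpected header record: " + line);
        }
    }
});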
