I am new to Spring Batch and I have a couple of questions.
Question 1: I am using a MultiResourceItemReader to read a bunch of CSV files and a JDBC item writer to update the DB in batches. The commit interval is set to 1000. If there is a file with 10k records and I encounter a DB error on the 7th chunk, is there any way I can roll back all the previously committed chunks?
Question 2: If there are two files, each having 100 records, and the commit interval is set to 1000, then the MultiResourceItemReader reads both files and sends them to the writer in a single chunk. Is there any way to write just one file at a time, ignoring the commit interval in this case, essentially creating a loop in the writer alone?
Posting the solution that worked for me in case someone needs it for reference.
For Question 1, I was able to achieve it by extending StepListenerSupport in the writer and overriding beforeStep and afterStep. A sample snippet is below:
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.listener.StepListenerSupport;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.item.ItemWriter;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;

public class JDBCWriter extends StepListenerSupport implements ItemWriter<MyDomain> {

    private static final Logger LOG = LoggerFactory.getLogger(JDBCWriter.class);

    private boolean errorFlag;
    private Connection connection;
    private String sql = "{ CALL STORED_PROC(?, ?, ?, ?, ?) }";

    @Autowired
    private JdbcTemplate jdbcTemplate;

    @Override
    public void beforeStep(StepExecution stepExecution) {
        try {
            // Take manual control of the transaction for the whole step.
            connection = jdbcTemplate.getDataSource().getConnection();
            connection.setAutoCommit(false);
        }
        catch (SQLException ex) {
            setErrorFlag(Boolean.TRUE);
        }
    }

    @Override
    public void write(List<? extends MyDomain> items) throws Exception {
        if (!items.isEmpty()) {
            CallableStatement callableStatement = connection.prepareCall(sql);
            // IN parameters (placeholder values; populate these from the item fields)
            callableStatement.setString(1, "FirstName");
            callableStatement.setString(2, "LastName");
            callableStatement.setString(3, "Date of Birth");
            callableStatement.setInt(4, 1990); // "Year"
            // OUT parameter holding the error count reported by the procedure
            callableStatement.registerOutParameter(5, Types.INTEGER);
            callableStatement.execute();
            int errors = callableStatement.getInt(5);
            if (errors != 0) {
                this.setErrorFlag(Boolean.TRUE);
            }
        }
        else {
            this.setErrorFlag(Boolean.TRUE);
        }
    }

    @Override
    public void afterChunk(ChunkContext context) {
        if (errorFlag) {
            context.getStepContext().getStepExecution().setExitStatus(ExitStatus.FAILED); // Fail the step
            context.getStepContext().getStepExecution().setStatus(BatchStatus.FAILED);    // Fail the batch
        }
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        try {
            if (!errorFlag) {
                connection.commit();
            }
            else {
                connection.rollback();
                stepExecution.setExitStatus(ExitStatus.FAILED);
            }
        }
        catch (SQLException ex) {
            LOG.error("Commit failed!", ex);
        }
        return stepExecution.getExitStatus();
    }

    public void setErrorFlag(boolean errorFlag) {
        this.errorFlag = errorFlag;
    }
}
XML Config:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
....
http://www.springframework.org/schema/batch/spring-batch-3.0.xsd">
<job id="fileLoadJob" xmlns="http://www.springframework.org/schema/batch">
<step id="batchFileUpload" >
<tasklet>
<chunk reader="fileReader"
commit-interval="1000"
writer="JDBCWriter"
/>
</tasklet>
</step>
</job>
<bean id="fileReader" class="...com.FileReader" />
<bean id="JDBCWriter" class="...com.JDBCWriter" />
</beans>
Question 1: The only way to accomplish this is via some form of compensating logic. You can do that via a listener (ChunkListener#afterChunkError for example), but the implementation is up to you. There is nothing within Spring Batch that knows what the overall state of the output is and how to roll it back beyond the current transaction.
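For illustration only, here is a minimal sketch of that compensating-logic idea using ChunkListener#afterChunkError. It assumes your writer stamps every row it inserts with the job execution id in a hypothetical STAGING_TABLE/JOB_EXECUTION_ID column, so a failed chunk can trigger deletion of everything this run has written so far; the table and column names are placeholders, not anything Spring Batch provides.
import org.springframework.batch.core.ChunkListener;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.jdbc.core.JdbcTemplate;

public class CompensatingChunkListener implements ChunkListener {

    private final JdbcTemplate jdbcTemplate;

    public CompensatingChunkListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeChunk(ChunkContext context) {
        // nothing to do before the chunk
    }

    @Override
    public void afterChunk(ChunkContext context) {
        // chunk committed successfully; nothing to compensate
    }

    @Override
    public void afterChunkError(ChunkContext context) {
        // Undo everything this job execution has written so far (compensating logic).
        long jobExecutionId = context.getStepContext()
                .getStepExecution().getJobExecutionId();
        jdbcTemplate.update("DELETE FROM STAGING_TABLE WHERE JOB_EXECUTION_ID = ?",
                jobExecutionId);
    }
}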
Question 2: Assuming you're looking for one output file per input file: because most Resource implementations are non-transactional, the writers associated with them do special work to buffer up to the commit point and then flush. The problem is that, because of that buffering, there is no real opportunity to divide the buffer across multiple resources. To be clear, it can be done, you'll just need a custom ItemWriter to do it.
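As a rough sketch of such a custom ItemWriter: if MyDomain implements ResourceAware, the MultiResourceItemReader stamps each item with the file it came from, and the writer can then split the buffered chunk by resource and delegate one group at a time. The getResource() accessor on MyDomain is an assumption here (ResourceAware itself only defines a setter), and the delegate writer is whatever real writer you already use.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.item.ItemWriter;
import org.springframework.core.io.Resource;

public class PerResourceItemWriter implements ItemWriter<MyDomain> {

    private final ItemWriter<MyDomain> delegate;

    public PerResourceItemWriter(ItemWriter<MyDomain> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends MyDomain> items) throws Exception {
        // Group the buffered chunk by the file each item came from
        // (assumes MyDomain implements ResourceAware and exposes getResource()).
        Map<Resource, List<MyDomain>> byResource = new LinkedHashMap<>();
        for (MyDomain item : items) {
            byResource.computeIfAbsent(item.getResource(), r -> new ArrayList<>()).add(item);
        }
        // Write one file's worth of items at a time.
        for (List<MyDomain> group : byResource.values()) {
            delegate.write(group);
        }
    }
}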
Related
I am trying to do change data capture from an Oracle DB using Spring Cloud Data Flow with Kafka as the broker. I am using a polling mechanism for this: I poll the database with a basic select query at regular intervals to capture any updated data. To make the system more failure-proof, I have persisted my last poll time in the Oracle DB and use it to get the data that was updated after the last poll.
public MessageSource<Object> jdbcMessageSource() {
JdbcPollingChannelAdapter jdbcPollingChannelAdapter =
new JdbcPollingChannelAdapter(this.dataSource, this.properties.getQuery());
jdbcPollingChannelAdapter.setUpdateSql(this.properties.getUpdate());
return jdbcPollingChannelAdapter;
}
@Bean
public IntegrationFlow pollingFlow() {
    IntegrationFlowBuilder flowBuilder = IntegrationFlows
            .from(jdbcMessageSource(), spec -> spec.poller(Pollers.fixedDelay(3000)));
    flowBuilder.channel(this.source.output());
    flowBuilder.transform(trans, "transform");
    return flowBuilder.get();
}
My queries in application properties are as below:
query: select * from kafka_test where LAST_UPDATE_TIME >(select LAST_POLL_TIME from poll_time)
update : UPDATE poll_time SET LAST_POLL_TIME = CURRENT_TIMESTAMP
This is working perfectly for me; I am able to get the CDC from the DB with this approach.
The problem I am looking at now is this:
Creating a table just to maintain the poll time is an overhead. I would rather maintain this last poll time in a Kafka topic and retrieve it from that topic when making the next poll.
I have modified the jdbcMessageSource method as below to try that:
public MessageSource<Object> jdbcMessageSource() {
String query = "select * from kafka_test where LAST_UPDATE_TIME > '"+<Last poll time value read from kafka comes here>+"'";
JdbcPollingChannelAdapter jdbcPollingChannelAdapter =
new JdbcPollingChannelAdapter(this.dataSource, query);
return jdbcPollingChannelAdapter;
}
But Spring Cloud Data Flow instantiates the pollingFlow() bean (please see the code above) only once. Hence whatever query is built first remains the same. I want to update the query with the new poll time for each poll.
Is there a way to write a custom IntegrationFlow so that this query is updated every time I make a poll?
I have tried out IntegrationFlowContext for that but wasn't successful.
Thanks in advance !!!
With the help of both the answers above, I was able to figure out the approach.
Wrap a JdbcTemplate query in a bean and use it for the integration flow.
@EnableBinding(Source.class)
@AllArgsConstructor
public class StockSource {

    private DataSource dataSource;

    @Autowired
    private JdbcTemplate jdbcTemplate;

    private MessageChannelFactory messageChannelFactory; // You can use the normal message channel available in Spring Cloud Data Flow as well.

    private List<String> findAll() {
        jdbcTemplate = new JdbcTemplate(dataSource);
        String time = "10/24/60"; // this means 10 seconds for an Oracle DB
        String query = << your query here, e.g. select * from test where (last_updated_time > time) >>;
        return jdbcTemplate.query(query, new RowMapper<String>() {
            @Override
            public String mapRow(ResultSet rs, int rowNum) throws SQLException {
                // ... any row mapper operations that you want to do with your result after the poll ...
                // Change the time here for the next poll to the DB.
                return result;
            }
        });
    }

    @Bean
    public IntegrationFlow supplyPollingFlow() {
        IntegrationFlowBuilder flowBuilder = IntegrationFlows
                .from(this::findAll, spec -> spec.poller(Pollers.fixedDelay(5000)));
        flowBuilder.channel(<<your message channel>>);
        return flowBuilder.get();
    }
}
In our use case, we were persisting the last poll time in a Kafka topic. This was to keep the application stateless. Every new poll to the DB will now have a new time in the where condition.
P.S.: your messaging broker (Kafka/RabbitMQ) should be running locally, or connect to it if it is hosted on a different platform.
God Speed !!!
See Artem's answer for the mechanism for a dynamic query in the standard adapter; an alternative, however, would be to simply wrap a JdbcTemplate in a Bean and invoke it with
IntegrationFlows.from(myPojo(), "runQuery", e -> ...)
...
or even a simple lambda
.from(() -> jdbcTemplate...)
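As a rough sketch of that lambda approach (not the exact code from the answer): the Supplier lambda runs on every poll, so the bind parameter can be re-read each time. lastPollTimeFromKafka() is a hypothetical helper standing in for however you read the previous poll time back from the Kafka topic; the table and column names come from the question.
import java.sql.Timestamp;
import java.time.Instant;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
public class DynamicQueryPollingConfig {

    @Bean
    public IntegrationFlow pollingFlow(JdbcTemplate jdbcTemplate) {
        return IntegrationFlows
                // The Supplier lambda is invoked on every poll, so the bind
                // parameter is re-evaluated each time.
                .from(() -> jdbcTemplate.queryForList(
                                "select * from kafka_test where LAST_UPDATE_TIME > ?",
                                lastPollTimeFromKafka()),
                        e -> e.poller(Pollers.fixedDelay(3000)))
                .channel("output")
                .get();
    }

    // Hypothetical helper: read the previous poll time back from the Kafka topic.
    private Timestamp lastPollTimeFromKafka() {
        return Timestamp.from(Instant.EPOCH); // placeholder implementation
    }
}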
We have this test configuration (sorry, it is XML):
<inbound-channel-adapter query="select * from item where status=:status" channel="target"
data-source="dataSource" select-sql-parameter-source="parameterSource"
update="delete from item"/>
<beans:bean id="parameterSource" factory-bean="parameterSourceFactory"
factory-method="createParameterSourceNoCache">
<beans:constructor-arg value=""/>
</beans:bean>
<beans:bean id="parameterSourceFactory"
class="org.springframework.integration.jdbc.ExpressionEvaluatingSqlParameterSourceFactory">
<beans:property name="parameterExpressions">
<beans:map>
<beans:entry key="status" value="#statusBean.which()"/>
</beans:map>
</beans:property>
<beans:property name="sqlParameterTypes">
<beans:map>
<beans:entry key="status" value="#{ T(java.sql.Types).INTEGER}"/>
</beans:map>
</beans:property>
</beans:bean>
<beans:bean id="statusBean"
class="org.springframework.integration.jdbc.config.JdbcPollingChannelAdapterParserTests$Status"/>
Pay attention to the ExpressionEvaluatingSqlParameterSourceFactory and its createParameterSourceNoCache() factory method. This result can be used for the select-sql-parameter-source.
The JdbcPollingChannelAdapter has a setSelectSqlParameterSource on the matter.
So, you configure an ExpressionEvaluatingSqlParameterSourceFactory to be able to resolve some query parameter as an expression for some bean method invocation to get the desired value from Kafka. Then createParameterSourceNoCache() will help you to obtain the expected SqlParameterSource.
There is some info in docs as well: https://docs.spring.io/spring-integration/docs/current/reference/html/#jdbc-inbound-channel-adapter
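A rough Java-config equivalent of the same idea, under the assumption that a bean (here called lastPollTimeBean with a fromKafka() method, both hypothetical names) returns the last poll time read from the Kafka topic:
import java.util.Collections;

import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.integration.jdbc.ExpressionEvaluatingSqlParameterSourceFactory;
import org.springframework.integration.jdbc.JdbcPollingChannelAdapter;

@Bean
public ExpressionEvaluatingSqlParameterSourceFactory parameterSourceFactory() {
    ExpressionEvaluatingSqlParameterSourceFactory factory =
            new ExpressionEvaluatingSqlParameterSourceFactory();
    // "lastPollTime" is resolved by invoking the (hypothetical) lastPollTimeBean on each poll.
    factory.setParameterExpressions(
            Collections.singletonMap("lastPollTime", "@lastPollTimeBean.fromKafka()"));
    return factory;
}

@Bean
public JdbcPollingChannelAdapter jdbcMessageSource(DataSource dataSource,
        ExpressionEvaluatingSqlParameterSourceFactory parameterSourceFactory) {
    JdbcPollingChannelAdapter adapter = new JdbcPollingChannelAdapter(dataSource,
            "select * from kafka_test where LAST_UPDATE_TIME > :lastPollTime");
    // The no-cache parameter source re-evaluates the expression on every poll.
    adapter.setSelectSqlParameterSource(
            parameterSourceFactory.createParameterSourceNoCache(""));
    return adapter;
}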
In my Spring Batch application I have written a custom ItemWriter which internally writes items to DynamoDB using the DynamoDB async client; this client returns a Future object. I have an input file with millions of records. Since the writer gets the Future objects back immediately, my batch job exits within 5 seconds with status COMPLETED, but in reality it takes 3-4 minutes to write all the items to the DB. I want the batch job to finish only after all items have been written to the database. How can I do that?
The job is defined as below:
<bean id="report" class="com.solution.model.Report" scope="prototype" />
<batch:job id="job" restartable="true">
<batch:step id="step1">
<batch:tasklet>
<batch:chunk reader="cvsFileItemReader" processor="filterReportProcessor" writer="customItemWriter"
commit-interval="20">
</batch:chunk>
</batch:tasklet>
</batch:step>
</batch:job>
<bean id="customItemWriter" class="com.solution.writer.CustomeWriter"></bean>
CustomeWriter is defined as below:
public class CustomeWriter implements ItemWriter<Report> {

    public void write(List<? extends Report> items) throws Exception {
        List<Future<PutItemResult>> list = new LinkedList<>();
        AmazonDynamoDBAsyncClient client = new AmazonDynamoDBAsyncClient();
        for (Report report : items) {
            PutItemRequest req = new PutItemRequest();
            req.setTableName("MyTable");
            req.setReturnValues(ReturnValue.ALL_OLD);
            req.addItemEntry("customerId", new AttributeValue(report.getCustomeId()));
            Future<PutItemResult> res = client.putItemAsync(req);
            list.add(res);
        }
    }
}
Main class contains
JobExecution execution = jobLauncher.run(job, new JobParameters());
System.out.println("Exit Status : " + execution.getStatus());
Since the ItemWriter only collects the Future objects, it doesn't wait for the operation to complete. And in the main method, since all items have been submitted for writing, the batch status shows COMPLETED and the job terminates.
I want this job to terminate only after the actual writes have been performed in DynamoDB.
Can we add some other step to wait on this, or is there a listener available?
Here is one approach. Since ItemWriter::write doesn't return anything, you can make use of the listener feature.
@Component
@JobScope
public class YourWriteListener implements ItemWriteListener<WhatEverYourTypeIs> {

    @Value("#{jobExecution.executionContext}")
    private ExecutionContext executionContext;

    @Override
    public void afterWrite(final List<? extends WhatEverYourTypeIs> paramList) {
        Future<?> future = (Future<?>) this.executionContext.get("FutureKey");
        try {
            // wait until the asynchronous write is done by blocking on the future
            future.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Asynchronous write failed", e);
        }
    }

    @Override
    public void beforeWrite(final List<? extends WhatEverYourTypeIs> paramList) {
    }

    @Override
    public void onWriteError(final Exception paramException, final List<? extends WhatEverYourTypeIs> paramList) {
    }
}
In your writer class, everything remains the same except for adding the future object to the ExecutionContext.
public class YourItemWriter implements ItemWriter<WhatEverYourTypeIs> {

    @Value("#{jobExecution.executionContext}")
    private ExecutionContext executionContext;

    @Override
    public void write(final List<? extends WhatEverYourTypeIs> yourItems) throws Exception {
        // write to DynamoDB and get the Future object
        executionContext.put("FutureKey", future);
    }
}
And you can register the listener in your configuration. Here is the Java config; you need to do the same in your XML:
@Bean
public Step initStep() {
    return this.stepBuilders.get("someStepName").<YourTypeX, YourTypeY>chunk(10)
            .reader(yourReader).processor(yourProcessor)
            .writer(yourWriter).listener(yourWriteListener)
            .build();
}
We have some data coming in the flat file. e.g.
EmpCode,Salary,EmpName,...
100,1000,...,...
200,2000,...,...
200,2000,...,...
100,1000,...,...
300,3000,...,...
400,4000,...,...
We would like to aggregate the salary based on the EmpCode and write to the database as
Emp_Code Emp_Salary Updated_Time Updated_User
100 2000 ... ...
200 4000 ... ...
300 3000 ... ...
400 4000 ... ...
I have written classes as per Spring Batch as follows:
ItemReader - to read the employee data into an Employee object
A sample EmployeeItemProcessor:
public class EmployeeProcessor implements ItemProcessor<Employee, Employee> {

    @Override
    public Employee process(Employee employee) throws Exception {
        employee.setUpdatedTime(new Date());
        employee.setUpdatedUser("someuser");
        return employee;
    }
}
EmployeeItemWriter:
@Repository
public class EmployeeItemWriter implements ItemWriter<Employee> {

    @Autowired
    private SessionFactory sf;

    @Override
    public void write(List<? extends Employee> employeeList) throws Exception {
        List<Employee> aggEmployeeList = aggregateEmpData(employeeList);
        // write to db using the session factory
    }

    private List<Employee> aggregateEmpData(List<? extends Employee> employeeList) {
        Map<String, Employee> map = new HashMap<String, Employee>();
        for (Employee e : employeeList) {
            String empCode = e.getEmpCode();
            if (map.containsKey(empCode)) {
                // get employee salary and add it up
            } else {
                map.put(empCode, e);
            }
        }
        return new ArrayList<Employee>(map.values());
    }
}
XML Configuration
...
<batch:job id="employeeJob">
<batch:step id="step1">
<batch:tasklet>
<batch:chunk reader="employeeItemReader"
writer="employeeItemWriter" processor="employeeItemProcessor"
commit-interval="100">
</batch:chunk>
</batch:tasklet>
</batch:step>
</batch:job>
...
It is working and serving my purpose. However, I have a couple of questions.
1) When I look at the logs, it shows the following (commit-interval=100):
status=COMPLETED, exitStatus=COMPLETED, readCount=2652, filterCount=0, writeCount=2652 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=27, rollbackCount=0
But after aggregation, only 2515 records were written to the database, while the write count is 2652. Is this because the number of items reaching the ItemWriter is still 2652? How can this be corrected?
2) We are iterating through the list twice: once in the ItemProcessor and then in the ItemWriter for aggregation. This could be a performance problem if the number of records is higher. Is there a better way to achieve this?
If each line of the input file is an employee object, then your readCount is the number of lines in the input file. The writeCount is the sum of the sizes of all the lists passed to the item writer. So your aggregateEmpData function merges some records into one, and hence your DB count is not the same as the writeCount.
If you want the writeCount to be exactly the number of records in the DB, you should do your aggregation in the processor.
Why do the aggregation in the ItemWriter? I'd do it in an ItemProcessor. This would allow the write count to be accurate and separate that component from the act of actually writing. If you provide some insight into your configuration, we could elaborate more.
I managed to write it. I did it as follows.
public class EmployeeProcessor implements ItemProcessor<Employee, Employee> {

    Map<String, Employee> map;

    @Override
    public Employee process(Employee employee) throws Exception {
        employee.setUpdatedTime(new Date());
        employee.setUpdatedUser("someuser");
        String empCode = employee.getEmpCode();
        if (map.containsKey(empCode)) {
            // get employee salary and add it up
            return null; // filter out the duplicate record
        }
        map.put(empCode, employee);
        return employee;
    }

    @BeforeStep
    public void beforeStep(StepExecution stepExecution) {
        map = new HashMap<String, Employee>();
    }
}
The write count is appearing correctly now.
I have a requirement to use Spring Batch to reuse existing logic that retrieves data from the database: the existing target object method returns a list of objects after querying the database.
So I have a task to read this in chunks. The list size from the existing code is around 15000, but when implementing Spring Batch I wanted to read in chunks of 100, and this was not happening through the ItemReaderAdapter.
The code snippets below should give you an idea of the issue I am mentioning. Would this be possible with Spring Batch? I noticed the Delegating Job sample Spring example, but the service there returns an object on every call and not the whole list.
Please advise.
Job.xml
<step id="firststep">
<tasklet>
<chunk reader="myreader" writer="mywriter" commit-interval="100" />
</tasklet>
</step>
<job id="firstjob" incrementer="idIncrementer">
<step id="step1" parent="firststep" />
</job>
<beans:bean id="myreader" class="org.springframework.batch.item.adapter.ItemReaderAdapter">
<beans:property name="targetObject" ref="readerService" />
<beans:property name="targetMethod" value="getCustomer" />
</beans:bean>
<beans:bean id="readerService" class="com.sh.java.ReaderService">
</beans:bean>
ReaderService.java
public class ReaderService {
public List<CustomItem> getCustomer() throws Exception {
/*
* code to get database instances
*/
List<CustomItem> customList = dao.getCustomers(date);
System.out.println("Customer List Size: " + customList.size()); //Here it is 15K
return (List<CustomItem>) customList;
}
}
Before all: reading a 15K List<> of objects might negatively impact performance; check if you can write a custom SQL query and use a JDBC/Hibernate cursor item reader instead (see the sketch below).
What you are trying to do is not possible using ItemReaderAdapter (it wasn't designed to read a chunk of objects), but you can achieve the same result by writing a custom ItemReader that extends AbstractItemCountingItemStreamItemReader to inherit ItemStream capabilities and overrides the abstract or no-op methods; especially:
in doOpen(), call your readerService.getCustomers() and save the List<> in a class variable;
in doRead(), read the next item - from the List<> loaded in doOpen() - using the built-in index stored in the ExecutionContext.
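As a rough sketch of the cursor-based alternative (assuming the customers can be fetched with a single SQL statement; the query, the date parameter, and the bean-property mapping onto CustomItem are placeholders, not the asker's real code):
import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

// A cursor reader streams rows one at a time, so the 15K rows are never held
// in memory as a single List.
public JdbcCursorItemReader<CustomItem> customerReader(DataSource dataSource) {
    JdbcCursorItemReader<CustomItem> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    reader.setSql("select * from customer where created_date > ?");           // placeholder query
    reader.setPreparedStatementSetter(ps -> ps.setDate(1, java.sql.Date.valueOf("2016-01-01"))); // placeholder date
    reader.setRowMapper(new BeanPropertyRowMapper<>(CustomItem.class));
    return reader;
}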
@Bellabax,
Doing it the way you suggested still reads all of the database records in doOpen; however, from the list retrieved in doOpen, the reader reads in chunks. Please advise.
CustomerReader.java
public class CustomerReader extends AbstractItemCountingItemStreamItemReader<Customer> {

    List<Customer> customerList;

    public CustomerReader() {
    }

    @Override
    protected void doClose() throws Exception {
        customerList.clear();
        setMaxItemCount(0);
        setCurrentItemCount(0);
    }

    @Override
    protected void doOpen() throws Exception {
        customerList = dao.getCustomers(date);
        System.out.println("Customer List Size: " + customerList.size()); // This still prints 15k
        setMaxItemCount(customerList.size());
    }

    @Override
    protected Customer doRead() throws Exception {
        // Here the 15K items are handed out in chunks!
        Customer customer = customerList.get(getCurrentItemCount() - 1);
        return customer;
    }
}
How do I set the batch size in a Spring JDBC batch update to improve performance?
Listed below is my code snippet.
public void insertListOfPojos(final List<Student> myPojoList) {
String sql = "INSERT INTO " + "Student " + "(age,name) " + "VALUES "
+ "(?,?)";
try {
jdbcTemplateObject.batchUpdate(sql,
new BatchPreparedStatementSetter() {
@Override
public void setValues(PreparedStatement ps, int i)
throws SQLException {
Student myPojo = myPojoList.get(i);
ps.setString(2, myPojo.getName());
ps.setInt(1, myPojo.getAge());
}
@Override
public int getBatchSize() {
return myPojoList.size();
}
});
} catch (Exception e) {
System.out.println("Exception");
}
}
I read that with Hibernate you can provide your batch size in the configuration XML.
For example,
<property name="hibernate.jdbc.batch_size" value="100"/>.
Is there something similar in Spring's jdbc?
There is no JDBC option that looks like Hibernate's; I think you have to look at the specific RDBMS vendor's driver options when preparing the connection string.
Regarding your code, you have to use
BatchPreparedStatementSetter.getBatchSize()
or
JdbcTemplate.batchUpdate(String sql, final Collection<T> batchArgs, final int batchSize, final ParameterizedPreparedStatementSetter<T> pss)
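For example, a minimal sketch using that batchSize overload (the value 100 simply mirrors the hibernate.jdbc.batch_size example from the question; Student's getName()/getAge() come from the question's own code):
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;

public void insertListOfPojos(JdbcTemplate jdbcTemplate, List<Student> myPojoList) {
    String sql = "INSERT INTO Student (age, name) VALUES (?, ?)";
    // Spring splits myPojoList into sub-batches of 100 and sends each as one JDBC batch.
    jdbcTemplate.batchUpdate(sql, myPojoList, 100, (ps, myPojo) -> {
        ps.setInt(1, myPojo.getAge());
        ps.setString(2, myPojo.getName());
    });
}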
If you use JDBC directly, you decide yourself how many statements are used in one commit, while with one of the provided JDBC writers you decide the batch* size via the configured commit interval.
*As far as I know, the current Spring version uses the prepared statement batch methods under the hood; see https://github.com/SpringSource/spring-framework/blob/master/spring-jdbc/src/main/java/org/springframework/jdbc/core/JdbcTemplate.java#L549
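To illustrate the "JDBC directly" case, here is a minimal sketch (not from the original answer) where addBatch()/executeBatch() plus a manual commit determine how many statements go per round trip and per transaction; the every-100-rows flush is an arbitrary choice, and Student is taken from the question:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

import javax.sql.DataSource;

public void insertWithPlainJdbc(DataSource dataSource, List<Student> students) throws Exception {
    try (Connection con = dataSource.getConnection();
         PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO Student (age, name) VALUES (?, ?)")) {
        con.setAutoCommit(false);
        int count = 0;
        for (Student s : students) {
            ps.setInt(1, s.getAge());
            ps.setString(2, s.getName());
            ps.addBatch();
            if (++count % 100 == 0) {   // flush every 100 rows
                ps.executeBatch();
                con.commit();
            }
        }
        ps.executeBatch();              // flush the remainder
        con.commit();
    }
}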