Spring Batch - use JpaPagingItemReader to read lists instead of individual items - spring

Spring Batch is designed to read and process one item at a time, then write the list of all items processed in a chunk. I want each item to be a List<T> instead, so that a whole list is read and processed as one unit and the writer receives a List<List<T>>. My data source is a standard Spring JpaRepository<T, ID>.
My question is whether there are some standard solutions for this "aggregated" approach. I see that there are some, but they don't read from a JpaRepository, like:
https://github.com/spring-projects/spring-batch/blob/main/spring-batch-samples/src/main/java/org/springframework/batch/sample/domain/multiline/AggregateItemReader.java
Spring Batch - Item Reader and ItemProcessor with a list
Spring Batch- how to pass list of multiple items from input to ItemReader, ItemProcessor and ItemWriter
Update:
I'm looking for a solution that would work for a rapidly changing dataset and in a multithreading environment.

I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>.
Spring Batch is not (and should not be) aware of what an "item" is. It is up to you to design what an "item" is and how it is implemented (a single value, a list, a stream, etc.). In your case, you can encapsulate the List<T> in a type that can be used as an item, and process the data as needed. You would need a custom item reader though.

The solution we found is to use a custom aggregate reader as suggested here, which accumulates the read data into a list of a given size then passes it along. For our specific use case, we read data using a JpaPagingItemReader. The relevant part is:
public List<T> read() throws Exception {
    ResultHolder holder = new ResultHolder();
    // read until no more results available or aggregated size is reached
    while (!itemReaderExhausted && holder.getResults().size() < aggregationSize) {
        process(itemReader.read(), holder);
    }
    if (CollectionUtils.isEmpty(holder.getResults())) {
        return null;
    }
    return holder.getResults();
}

private void process(T readValue, ResultHolder resultHolder) {
    if (readValue == null) {
        itemReaderExhausted = true;
        return;
    }
    resultHolder.addResult(readValue);
}
To account for the volatility of the dataset, we extended the JPA reader and overrode the getPage() method to always return 0. The processor and writer then update the data so that the next fresh rows to be fetched are always on the first page. The hint was given here and in some other SO answers.
@Override
public int getPage() {
    return 0;
}

Related

Spring Batch - MongoItemReader not reading all records

I created a Spring Batch job which reads orders from MongoDB and makes a REST call to upload them. However, the batch job completes even though not all records have been read by the MongoItemReader.
I maintain a field batchProcessed:boolean on the Orders collection. The MongoItemReader reads records matching {batchProcessed:{$ne:true}}, because I need to run the batch job multiple times without processing the same documents again and again.
In my OrderWriter I set batchProcessed to true.
@Bean
@StepScope
public MongoItemReader<Order> orderReader() {
    MongoItemReader<Order> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTempate);
    HashMap<String,Sort.Direction> sortMap = new HashMap<>();
    sortMap.put("_id",Direction.ASC);
    reader.setSort(sortMap);
    reader.setTargetType(Order.class);
    reader.setQuery("{batchProcessed:{$ne:true}}");
    return reader;
}

@Bean
public Step uploadOrdersStep(OrderItemProcessor processor) {
    return stepBuilderFactory.get("step1").<Order, Order>chunk(1)
            .reader(orderReader()).processor(processor).writer(orderWriter).build();
}

@Bean
public Job orderUploadBatchJob(JobBuilderFactory factory, OrderItemProcessor processor) {
    return factory.get("uploadOrder").flow(uploadOrdersStep(processor)).end().build();
}
The MongoItemReader is a paging item reader. When items are read in pages and the job changes items that the query might return (i.e. it updates a field used in the query's "where" clause), the page boundaries shift between reads and some items can be skipped. There's a similar problem with the JPA paging item reader that is explained in detail here: Spring batch jpaPagingItemReader why some rows are not read?
Common techniques to work around this issue are to use a cursor-based reader, use a staging table/collection, use a partitioned step with a partition per page, etc.
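A small plain-Java simulation may make the skipping concrete. It is only a sketch and does not use Spring Batch: PagingSkipDemo, fetchPage and run are hypothetical names, the boolean array stands in for the batchProcessed flag, and fetchPage stands in for a paged query that filters on that flag. It compares a reader that advances its page number with one that always asks for page 0.

```java
import java.util.ArrayList;
import java.util.List;

public class PagingSkipDemo {

    /** Returns one page of ids that are still unprocessed, like a paged query filtering on a flag. */
    public static List<Integer> fetchPage(boolean[] processed, int page, int pageSize) {
        List<Integer> unprocessed = new ArrayList<>();
        for (int id = 0; id < processed.length; id++) {
            if (!processed[id]) {
                unprocessed.add(id);
            }
        }
        int from = Math.min(page * pageSize, unprocessed.size());
        int to = Math.min(from + pageSize, unprocessed.size());
        return new ArrayList<>(unprocessed.subList(from, to));
    }

    /** Reads page by page, marking every item it reads as processed; returns how many items were handled. */
    public static int run(int total, int pageSize, boolean alwaysFirstPage) {
        boolean[] processed = new boolean[total];
        int handled = 0;
        int page = 0;
        while (true) {
            List<Integer> items = fetchPage(processed, alwaysFirstPage ? 0 : page, pageSize);
            if (items.isEmpty()) {
                return handled;
            }
            for (int id : items) {
                processed[id] = true; // the "writer" flips the flag that the query filters on
                handled++;
            }
            page++;
        }
    }

    public static void main(String[] args) {
        System.out.println("advancing page: " + run(10, 2, false)); // prints 6
        System.out.println("always page 0:  " + run(10, 2, true));  // prints 10
    }
}
```

With 10 items and a page size of 2, the advancing reader handles only 6 items, because every processed page shifts the remaining rows forward while the page number keeps growing. Always reading page 0 handles all 10, but only as long as every item that is read really drops out of the query result afterwards.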

Spring Batch multiple readers for different DB's

I have an existing Spring Batch project which reads data from MySQL or ArangoDB (a NoSQL database), depending on a feature toggle evaluated at startup, does some processing and writes the result back to MySQL/ArangoDB.
The reader configuration for MySQL looks something like this:
@Bean
@Primary
@StepScope
public HibernatePagingItemReader reader(
        @Value("#{jobParameters[oldMetadataDefinitionId]}") Long oldMetadataDefinitionId) {
    Map<String, Object> queryParameters = new HashMap<>();
    queryParameters.put(Constants.OLD_METADATA_DEFINITION_ID, oldMetadataDefinitionId);
    HibernatePagingItemReader<Long> reader = new HibernatePagingItemReader<>();
    reader.setUseStatelessSession(false);
    reader.setPageSize(250);
    reader.setParameterValues(queryParameters);
    reader.setSessionFactory(((HibernateEntityManagerFactory) entityManagerFactory.getObject()).getSessionFactory());
    return reader;
}
I also have another reader for Arango, shown below:
@Bean
@StepScope
public ListItemReader arangoReader(
        @Value("#{jobParameters[oldMetadataDefinitionId]}") Long oldMetadataDefinitionId) {
    List<InstanceDTO> instanceList = new ArrayList<InstanceDTO>();
    PersistenceService arangoPersistence = arangoConfiguration
            .getPersistenceService();
    List<Long> instanceIds = arangoPersistence.getDefinitionInstanceIds(oldMetadataDefinitionId);
    instanceIds.forEach((instanceId) ->
    {
        InstanceDTO instanceDto = new InstanceDTO();
        instanceDto.setDefinitionID(oldMetadataDefinitionId);
        instanceDto.setInstanceID(instanceId);
        instanceList.add(instanceDto);
    });
    return new ListItemReader(instanceList);
}
My step configuration is:
@Bean
@SuppressWarnings("unchecked")
public Step InstanceMergeStep(ListItemReader arangoReader, ItemWriter<MetadataInstanceDTO> arangoWriter,
        ItemReader<Long> mysqlReader, ItemWriter<Long> mysqlWriter) {
    Step step = null;
    if (arangoUsage) {
        step = steps.get("arangoInstanceMergeStep")
                .<Long, Long>chunk(1)
                .reader(arangoReader)
                .writer(arangoWriter)
                .faultTolerant()
                .skip(Exception.class)
                .skipLimit(10)
                .taskExecutor(stepTaskExecutor())
                .build();
        ((TaskletStep) step).registerChunkListener(chunkListener);
    }
    else {
        step = steps.get("mysqlInstanceMergeStep")
                .<Long, Long>chunk(1)
                .reader(mysqlReader)
                .writer(mysqlWriter)
                .faultTolerant()
                .skip(Exception.class)
                .skipLimit(failedSkipLimit)
                .taskExecutor(stepTaskExecutor())
                .build();
        ((TaskletStep) step).registerChunkListener(chunkListener);
    }
    return step;
}
The MySQL reader has pagination support through HibernatePagingItemReader, so it can handle millions of items without any memory issues.
I want to implement the same pagination support for the Arango reader, so that it fetches only 250 documents per iteration. How can I modify the Arango reader code to achieve this?
First of all, the documentation of ListItemReader says it is "useful for testing", so don't use it in production. Also, return ItemReader from all your reader beans instead of the actual concrete types.
That said, neither the Spring Batch API nor Spring Data seems to support ArangoDB. The closest I could find is this
(I have not worked with ArangoDB before).
So in my opinion, you have to write your own custom Arango reader that implements paging, possibly by extending the abstract class org.springframework.batch.item.database.AbstractPagingItemReader.
If that is not doable by extending this class, you might have to implement everything from scratch. All the paging readers in the Spring Batch API extend this abstract class, including HibernatePagingItemReader.
Also, remember that the Arango record set needs some kind of ordering to implement pagination, so that page 0 can be distinguished from page 1 and so on (similar to the ORDER BY clause, the BETWEEN operator and the less-than and greater-than operators in SQL; something like a FETCH FIRST XXX ROWS or LIMIT clause would be needed too).
Implementing it yourself is not a very tough task: you have to calculate the total number of items, order them, divide them into pages and fetch only one page at a time.
Look at the API of implementations such as HibernatePagingItemReader to get ideas.
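The page-at-a-time idea can be sketched in plain Java as follows. This is only an illustration under assumptions, not an ArangoDB or Spring Batch implementation: SimplePagingReader and PageFetcher are hypothetical names, and the fetcher stands in for whatever ordered query with an offset and a limit the Arango persistence layer can run. In a real job the same logic would live in a subclass of AbstractPagingItemReader, in its doReadPage() method.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class SimplePagingReader<T> {

    /** Hypothetical callback: returns up to `limit` items starting at `offset`, in a stable order. */
    public interface PageFetcher<T> {
        List<T> fetch(int offset, int limit);
    }

    private final PageFetcher<T> fetcher;
    private final int pageSize;
    private int page = 0;
    private Iterator<T> current = Collections.emptyIterator();
    private boolean exhausted = false;

    public SimplePagingReader(PageFetcher<T> fetcher, int pageSize) {
        this.fetcher = fetcher;
        this.pageSize = pageSize;
    }

    /** Returns the next item, or null once the data source has no more pages. */
    public T read() {
        if (exhausted) {
            return null;
        }
        if (!current.hasNext()) {
            // Only one page is held in memory at any time.
            List<T> items = fetcher.fetch(page * pageSize, pageSize);
            page++;
            if (items.isEmpty()) {
                exhausted = true;
                return null;
            }
            current = items.iterator();
        }
        return current.next();
    }

    public static void main(String[] args) {
        // In-memory stand-in for the ordered instance ids stored in the database.
        List<Long> ids = Arrays.asList(11L, 12L, 13L, 14L, 15L, 16L, 17L);
        SimplePagingReader<Long> reader = new SimplePagingReader<Long>(
                (offset, limit) -> ids.subList(
                        Math.min(offset, ids.size()),
                        Math.min(offset + limit, ids.size())),
                3);
        Long id;
        while ((id = reader.read()) != null) {
            System.out.println(id);
        }
    }
}
```

The sketch leaves out restartability and thread safety, which the Spring Batch base class is meant to help with.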
Hope it helps !!

Spring batch reader - How to avoid returning a list of objects

I have a Spring Batch app that gets a list of ids and then, in read(), fetches one to many results for each id. The issue is that I have no control over how many results come back for each id, which means my chunking is spotty at best. Is there a suggested way to avoid spikes in memory/CPU? An example is below:
@Before
public void getIds() {
    *getListOfIds* //Usually around 10,000 or so
}

@Override
public AccountObject read() {
    if(list of ids havent all been used) {
        List<AccountObject> myAccounts = myService.getAccounts(id);
        return myAccounts; //This could be anywhere from 1 result to 100,000 results.
    } else {
        return null;
    }
}
So the myAccounts object above could be small or huge. This causes chunking to basically be useless because at the moment I am chunking by List. I'd really rather chunk by straight AccountObject but don't see an easy way to do this.
Is there a class, strategy, etc. that I am missing here?
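One possible approach, sketched here in plain Java rather than taken from an answer in this thread, is to buffer the results of one id inside the reader and hand them out one at a time, so that the chunk size counts single AccountObject items again. FlatteningReader and its lookup function are hypothetical names; the function stands in for myService.getAccounts(id), and read() follows the Spring Batch convention of returning null when the input is exhausted.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class FlatteningReader<K, T> {

    private final Iterator<K> keys;
    private final Function<K, List<T>> lookup; // hypothetical stand-in for myService.getAccounts(id)
    private final Deque<T> buffer = new ArrayDeque<>();

    public FlatteningReader(List<K> keys, Function<K, List<T>> lookup) {
        this.keys = keys.iterator();
        this.lookup = lookup;
    }

    /** Returns one item per call, or null when every key has been used and the buffer is empty. */
    public T read() {
        // Refill from the next key only when the previous key's results are used up.
        while (buffer.isEmpty() && keys.hasNext()) {
            buffer.addAll(lookup.apply(keys.next()));
        }
        return buffer.poll();
    }

    public static void main(String[] args) {
        FlatteningReader<Integer, String> reader = new FlatteningReader<>(
                Arrays.asList(1, 2, 3),
                id -> id == 2
                        ? Collections.<String>emptyList()
                        : Arrays.asList("account-" + id + "a", "account-" + id + "b"));
        String account;
        while ((account = reader.read()) != null) {
            System.out.println(account);
        }
    }
}
```

This evens out the chunks, but all results for a single id are still loaded at once. Bounding memory for an id with 100,000 results would additionally require the lookup itself to be paged.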

New Output file for each Item passed into FlatFileItemWriter

I have the following domain object. This is the object being passed from my processor to my writer.
public class DivisionIdPromoCompStartDtEndDtGrouping {
private int divisionId;
private Date rpmPromoCompDetailStartDate;
private Date rpmPromoCompDetailEndDate;
private List<MasterList> detailRecords = new ArrayList<MasterList>();
I would like a new file per DivisionIdPromoCompStartDtEndDtGrouping. Each file would have one line for each of the detailRecords in the list. The output files would all have the same format, just logically separated based on the data (divisionId, rpmPromoCompDetailStartDate and rpmPromoCompDetailEndDate).
How can I create a FlatFileItemWriter that outputs a new file for each DivisionIdPromoCompStartDtEndDtGrouping, with the detailRecords as its content?
I think the answer might be a CompositeItemWriter. Is that right? Could someone help me with an example of this?
Thanks in advance.
You're close. Instead of just a CompositeItemWriter, use a ClassifierCompositeItemWriter. Coupled with a Classifier implementation that chooses a writer per group, this will allow you to have one file per group. You can read more about this ItemWriter in the javadoc here: http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/item/support/ClassifierCompositeItemWriter.html
No, the answer is not a composite writer. A composite writer simply forwards all items it receives to all of its child writers.
The problem with FlatFileItemWriter is that it has to be opened and closed, which is normally handled by the framework itself.
A simple approach would be to implement your own writer and use a FlatFileItemWriter in its write method.
public class MyWriter implements ItemWriter<..>{
    public void write(List<..> items) {
        for (.. item:items) {
            // one writer, and therefore one file, per item
            FlatFileItemWriter fileWriter = new FlatFileItemWriter();
            fileWriter.setResource(...); // unique FileName
            fileWriter.setLineAggregator(...);
            fileWriter.... ; // do other settings if necessary
            fileWriter.afterPropertiesSet();
            fileWriter.open(new ExecutionContext());
            fileWriter.write(Collections.singletonList(item));
            fileWriter.close();
        }
    }
}
The lineAggregator has to create an appropriate String including all the line breaks, so that every detail record is written on its own line in the file.
Of course, you don't have to use a FlatFileItemWriter at all: you can just open a file, use the lineAggregator to create the lines and write them to the file.
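The last variant, writing the files directly, could look roughly like the plain-Java sketch below. It makes several assumptions that are not in the question: PerGroupFileWriter is a hypothetical class, each group is represented by a String key that doubles as the file name, and the detail records have already been turned into lines (the job a LineAggregator would do).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerGroupFileWriter {

    /** Writes one file per map entry: the key becomes the file name, each list element becomes a line. */
    public static void write(Path dir, Map<String, List<String>> linesByGroup) throws IOException {
        Files.createDirectories(dir);
        for (Map.Entry<String, List<String>> group : linesByGroup.entrySet()) {
            Path file = dir.resolve(group.getKey() + ".txt");
            // Files.write terminates every element with the platform line separator.
            Files.write(file, group.getValue(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical group keys built from divisionId, start date and end date.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("division-1_2020-01-01_2020-01-31", Arrays.asList("detail A", "detail B"));
        groups.put("division-2_2020-02-01_2020-02-29", Arrays.asList("detail C"));

        Path dir = Files.createTempDirectory("groups");
        write(dir, groups);
        System.out.println("files written to " + dir);
    }
}
```

Inside a custom ItemWriter, the write method would build such a key and line list for each grouping object in the chunk and call a helper like this one.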

What's the best way to pass a huge collection to a Spring Batch Step?

Use case:
A one-time read of data set X (from database) into a Collection C. [Collection size could be say 5000]
Use Collection C to process/enrich items in a Spring Batch Step (say enrichStep)
If C is much greater than what can be passed via ExecutionContext, how can we make it available in the ItemProcessor of the enrichStep?
In your enrichStep, add a StepExecutionListener and, in its beforeStep method, load your huge collection into a HugeCollectionBeanHolder bean.
This way the collection is loaded only once (when the step starts or restarts) and without persisting it into the execution context.
In your enrich processor, wire in the HugeCollectionBeanHolder to access the huge collection.
class HugeCollectionBeanHolder {
    Collection<Item> hugeCollection;
    void setHugeCollection(Collection<Item> c) { this.hugeCollection = c;}
    Collection<Item> getHugeCollection() { return this.hugeCollection;}
}

class MyProcessor implements ItemProcessor<Input,Output> {
    HugeCollectionBeanHolder hcbh;
    void setHugeCollectionBeanHolder(HugeCollectionBeanHolder bean) { this.hcbh = bean;}
    // other methods...
}
You can also look at Spring Batch: what is the best way to use, the data retrieved in one TaskletStep, in the processing of another step
