I created a Spring Batch job which reads orders from MongoDB and makes a REST call to upload them. However, the batch job completes before all records have been read by the MongoItemReader.
I am maintaining a boolean field batchProcessed on the Orders collection. The MongoItemReader reads records matching {batchProcessed:{$ne:true}}, as I need to run the batch job multiple times without processing the same documents again and again.
In my OrderWriter I set batchProcessed to true.
@Bean
@StepScope
public MongoItemReader<Order> orderReader() {
    MongoItemReader<Order> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    HashMap<String, Sort.Direction> sortMap = new HashMap<>();
    sortMap.put("_id", Direction.ASC);
    reader.setSort(sortMap);
    reader.setTargetType(Order.class);
    reader.setQuery("{batchProcessed:{$ne:true}}");
    return reader;
}
@Bean
public Step uploadOrdersStep(OrderItemProcessor processor) {
    return stepBuilderFactory.get("step1").<Order, Order>chunk(1)
            .reader(orderReader()).processor(processor).writer(orderWriter).build();
}

@Bean
public Job orderUploadBatchJob(JobBuilderFactory factory, OrderItemProcessor processor) {
    return factory.get("uploadOrder").flow(uploadOrdersStep(processor)).end().build();
}
The MongoItemReader is a paging item reader. When items are read in pages while the query's results are being changed (i.e. a field used in the query's "where" clause is updated), the paging logic can be thrown off and some items can be skipped. There is a similar problem with the JPA paging item reader, explained in detail here: Spring batch jpaPagingItemReader why some rows are not read?
Common techniques to work around this issue are to use a cursor-based reader, a staging table/collection, a partitioned step with a partition per page, etc.
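For the cursor-based option, here is a minimal sketch in the spirit of the question's setup. The class name is made up, and it assumes a Spring Data MongoDB version where MongoTemplate.stream(...) returns a java.util.stream.Stream (older versions return a CloseableIterator instead):

import java.util.Iterator;
import org.springframework.batch.item.ItemReader;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

public class OrderCursorItemReader implements ItemReader<Order> {

    private final MongoTemplate mongoTemplate;
    private Iterator<Order> cursor;

    public OrderCursorItemReader(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    @Override
    public Order read() {
        if (cursor == null) {
            Query query = new Query(Criteria.where("batchProcessed").ne(true))
                    .with(Sort.by(Sort.Direction.ASC, "_id"));
            // A single server-side cursor is used instead of one query per page,
            // so flipping batchProcessed in the writer does not shift later pages.
            cursor = mongoTemplate.stream(query, Order.class).iterator();
        }
        return cursor.hasNext() ? cursor.next() : null;
    }
}

Registering something like this in place of the page-based orderReader() bean would leave the rest of the step unchanged.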
Spring Batch is designed to read and process one item at a time, then write the list of all items processed in a chunk. I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>. My data source is a standard Spring JpaRepository<T, ID>.
My question is whether there are some standard solutions for this "aggregated" approach. I see that there are some, but they don't read from a JpaRepository, like:
https://github.com/spring-projects/spring-batch/blob/main/spring-batch-samples/src/main/java/org/springframework/batch/sample/domain/multiline/AggregateItemReader.java
Spring Batch - Item Reader and ItemProcessor with a list
Spring Batch- how to pass list of multiple items from input to ItemReader, ItemProcessor and ItemWriter
Update:
I'm looking for a solution that would work for a rapidly changing dataset and in a multithreading environment.
I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>.
Spring Batch is not (and should not be) aware of what an "item" is. It is up to you to design what an "item" is and how it is implemented (a single value, a list, a stream, etc). In your case, you can encapsulate the List<T> in a type that can be used as an item, and process data as needed. You would need a custom item reader though.
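For illustration only, a minimal sketch of that idea; the Bundle and BundleItemReader names are made up, and the delegate can be any ItemReader backed by the JpaRepository (for example a RepositoryItemReader):

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemReader;

// Hypothetical wrapper type: the "item" Spring Batch sees is a bundle of T.
class Bundle<T> {
    private final List<T> contents = new ArrayList<>();

    public void add(T value) {
        contents.add(value);
    }

    public List<T> getContents() {
        return contents;
    }

    public boolean isEmpty() {
        return contents.isEmpty();
    }
}

// Custom reader that fills a Bundle from a delegate reader.
public class BundleItemReader<T> implements ItemReader<Bundle<T>> {

    private final ItemReader<T> delegate;
    private final int bundleSize;

    public BundleItemReader(ItemReader<T> delegate, int bundleSize) {
        this.delegate = delegate;
        this.bundleSize = bundleSize;
    }

    @Override
    public Bundle<T> read() throws Exception {
        Bundle<T> bundle = new Bundle<>();
        for (int i = 0; i < bundleSize; i++) {
            T next = delegate.read();
            if (next == null) {
                break; // delegate exhausted
            }
            bundle.add(next);
        }
        return bundle.isEmpty() ? null : bundle; // returning null ends the step
    }
}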
The solution we found is to use a custom aggregate reader as suggested here, which accumulates the read data into a list of a given size then passes it along. For our specific use case, we read data using a JpaPagingItemReader. The relevant part is:
// Fields used below: a delegate JpaPagingItemReader<T> itemReader, an int
// aggregationSize, and a boolean itemReaderExhausted flag.

public List<T> read() throws Exception {
    ResultHolder holder = new ResultHolder();
    // read until no more results are available or the aggregated size is reached
    while (!itemReaderExhausted && holder.getResults().size() < aggregationSize) {
        process(itemReader.read(), holder);
    }
    if (CollectionUtils.isEmpty(holder.getResults())) {
        return null;
    }
    return holder.getResults();
}

private void process(T readValue, ResultHolder resultHolder) {
    if (readValue == null) {
        itemReaderExhausted = true;
        return;
    }
    resultHolder.addResult(readValue);
}
In order to account for the volatility of the dataset, we extended the JPA reader and overrode the getPage() method to always return 0, and controlled the dataset through the processor and writer so that the next batch of fresh data is always fetched on the first page. The hint was given here and in some other SO answers.
@Override
public int getPage() {
    return 0;
}
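For completeness, a sketch of what that extension might look like (the class name is made up; it relies on JpaPagingItemReader computing its offset from getPage(), and on processed rows no longer matching the query so that page 0 always holds fresh data):

import org.springframework.batch.item.database.JpaPagingItemReader;

public class FirstPageOnlyJpaPagingItemReader<T> extends JpaPagingItemReader<T> {

    // Always re-read the first page: the processor/writer removes items from the
    // query's result set, so the next unprocessed data is always on page 0.
    @Override
    public int getPage() {
        return 0;
    }
}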
I have a Spring Batch reader with the following configuration.
This reader reads from the database one page of records at a time.
@Autowired
private SomeCreditRepot someCreditRepo;

public RepositoryItemReader<SomeCreditModel> reader() {
    RepositoryItemReader<SomeCreditModel> reader = new RepositoryItemReader<>();
    reader.setRepository(someCreditRepo);
    reader.setMethodName("someCreditTransfer");
    // ...
    return reader;
}
I want to call a utility method,
refValue = BatchProcessingUtil.generateSomeRefValue();
before the processor step, so that all the records fetched by the reader get the same value, returned by the above call, set on them in the processor.
This refValue will then be written to another table, StoreRefValue.
What is the right way to do this in Spring Batch?
Should I fire the query that writes the refValue to the StoreRefValue table in the processor?
You can let your processor implement the interface StepExecutionListener. You'll then have to implement the methods afterStep and beforeStep. The first should simply return null, and in beforeStep you can call the utility method and save its return value.
Alternatively, you can use the annotation @BeforeStep. If you use the usual Java DSL, it's not required to explicitly add the processor as a listener to the step. Adding it as a processor should suffice.
There are more details in the reference documentation:
https://docs.spring.io/spring-batch/docs/current/reference/html/step.html#interceptingStepExecution
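A minimal sketch of the annotation-based variant, assuming the question's SomeCreditModel receives the value through a setter (the setRefValue name and the String type of refValue are assumptions):

import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.stereotype.Component;

@Component
public class SomeCreditItemProcessor implements ItemProcessor<SomeCreditModel, SomeCreditModel> {

    private String refValue;

    @BeforeStep
    public void beforeStep(StepExecution stepExecution) {
        // generated once before the step starts, shared by every item in the step
        refValue = BatchProcessingUtil.generateSomeRefValue();
    }

    @Override
    public SomeCreditModel process(SomeCreditModel item) {
        item.setRefValue(refValue); // assumed setter on the question's model
        return item;
    }
}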
When operating on large data sets, Spring Data presents two abstractions: Stream and Page. We've been using Stream for a while and had no issues, but recently I wanted to try a paginated approach and ran into a reliability issue.
Consider the following:
@Entity
public class MyData {
}
public interface MyDataRepository extends JpaRepository<MyData, UUID> {
}
@Component
public class MyDataService {

    private MyDataRepository repository;

    // Bridge between a Reactive service and a transactional / non-reactive database call
    @Transactional
    public void getAllMyData(final FluxSink<MyData> sink) {
        final Pageable firstPage = PageRequest.of(0, 500);
        Page<MyData> page = repository.findAll(firstPage);
        while (page != null && page.hasContent()) {
            page.getContent().forEach(sink::next);
            if (page.hasNext()) {
                page = repository.findAll(page.nextPageable());
            } else {
                page = null;
            }
        }
        sink.complete();
    }
}
Using two Postgres 9.5 databases, the source database had close to 100,000 rows while the destination was empty. The example code was then used to copy from the source to the destination. At the end I would find that my destination database had a far smaller row count than the source.
Run as a Spring Boot app
The flux doing the copy was using 4-6 threads in parallel (for speed)
Total run time of at least an hour (max was 2 hours)
As it turns out, I was eventually processing the same rows multiple times (and missing other rows as a result). This led me to discover a fix that others had already run into, where you should provide a Sort.by("") argument.
After changing the service to use:
// Make our pages sorted by the PKEY
final Pageable firstPage = PageRequest.of(0, 500, Sort.by("id"));
I found that while it GREATLY helped, I would still process some rows multiple times (going from losing about half the rows to seeing only ~12 duplicates). When I use a Stream instead, I have no issues.
Does anyone have any explanation for what is going on? I don't seem to have any duplicates come through until the test has been running for at least 10-15min, which almost leads me to believe that there is some kind of session or other timeout happening (either in the client, or on the database) that causes the hiccups. But I'm really far out of my knowledge area for troubleshooting it further heh.
I have an existing Spring Batch project which reads data from MySQL or ArangoDB (a NoSQL database) based on a feature toggle decision during startup, does some processing, and writes back to MySQL/ArangoDB.
Now the reader configuration for MySQL is something like below,
@Bean
@Primary
@StepScope
public HibernatePagingItemReader reader(
        @Value("#{jobParameters[oldMetadataDefinitionId]}") Long oldMetadataDefinitionId) {
    Map<String, Object> queryParameters = new HashMap<>();
    queryParameters.put(Constants.OLD_METADATA_DEFINITION_ID, oldMetadataDefinitionId);
    HibernatePagingItemReader<Long> reader = new HibernatePagingItemReader<>();
    reader.setUseStatelessSession(false);
    reader.setPageSize(250);
    reader.setParameterValues(queryParameters);
    reader.setSessionFactory(((HibernateEntityManagerFactory) entityManagerFactory.getObject()).getSessionFactory());
    return reader;
}
and I have another Arango reader like below,
@Bean
@StepScope
public ListItemReader<InstanceDTO> arangoReader(
        @Value("#{jobParameters[oldMetadataDefinitionId]}") Long oldMetadataDefinitionId) {
    List<InstanceDTO> instanceList = new ArrayList<>();
    PersistenceService arangoPersistence = arangoConfiguration.getPersistenceService();
    List<Long> instanceIds = arangoPersistence.getDefinitionInstanceIds(oldMetadataDefinitionId);
    instanceIds.forEach((instanceId) -> {
        InstanceDTO instanceDto = new InstanceDTO();
        instanceDto.setDefinitionID(oldMetadataDefinitionId);
        instanceDto.setInstanceID(instanceId);
        instanceList.add(instanceDto);
    });
    return new ListItemReader<>(instanceList);
}
and my step configuration is below,
@Bean
@SuppressWarnings("unchecked")
public Step InstanceMergeStep(ListItemReader arangoReader, ItemWriter<MetadataInstanceDTO> arangoWriter,
        ItemReader<Long> mysqlReader, ItemWriter<Long> mysqlWriter) {
    Step step = null;
    if (arangoUsage) {
        step = steps.get("arangoInstanceMergeStep")
                .<Long, Long>chunk(1)
                .reader(arangoReader)
                .writer(arangoWriter)
                .faultTolerant()
                .skip(Exception.class)
                .skipLimit(10)
                .taskExecutor(stepTaskExecutor())
                .build();
        ((TaskletStep) step).registerChunkListener(chunkListener);
    } else {
        step = steps.get("mysqlInstanceMergeStep")
                .<Long, Long>chunk(1)
                .reader(mysqlReader)
                .writer(mysqlWriter)
                .faultTolerant()
                .skip(Exception.class)
                .skipLimit(failedSkipLimit)
                .taskExecutor(stepTaskExecutor())
                .build();
        ((TaskletStep) step).registerChunkListener(chunkListener);
    }
    return step;
}
The MySQL reader has pagination support through HibernatePagingItemReader, so it can handle millions of items without any memory issues.
I want to implement the same pagination support for the Arango reader so that it fetches only 250 documents per iteration. How can I modify the Arango reader code to achieve this?
First of all, the documentation of ListItemReader says that it is "useful for testing", so don't use it in production. Also, return ItemReader from all your reader beans instead of the actual concrete types.
Having said that, neither the Spring Batch API nor Spring Data seems to support ArangoDB. The closest that I could find is this (I have not worked with ArangoDB before).
So in my opinion, you have to write your own custom Arango reader that implements paging, possibly by extending the abstract class org.springframework.batch.item.database.AbstractPagingItemReader.
If it's not doable by extending the above class, you might have to implement everything from scratch. All of the paging readers in the Spring Batch API extend this abstract class, including HibernatePagingItemReader.
Also, remember that the Arango record set should have some kind of ordering to implement pagination, so we can distinguish between page 0, page 1, etc. (similar to an ORDER BY clause with BETWEEN, less-than, and greater-than operators in SQL; something like a FETCH FIRST XXX ROWS or LIMIT clause would be needed too).
Implementing it on your own is not a very tough task: you calculate the total number of possible items, order them, divide them into pages, and fetch only one page at a time.
Look at implementations in the API such as HibernatePagingItemReader to get ideas.
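For illustration, a rough sketch of such a reader built on AbstractPagingItemReader; it assumes your PersistenceService can return an ordered slice of instance ids by offset and limit (the getDefinitionInstanceIds(definitionId, offset, limit) overload used below is hypothetical):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import org.springframework.batch.item.database.AbstractPagingItemReader;

public class ArangoPagingItemReader extends AbstractPagingItemReader<InstanceDTO> {

    private final PersistenceService arangoPersistence;
    private final Long definitionId;

    public ArangoPagingItemReader(PersistenceService arangoPersistence, Long definitionId) {
        this.arangoPersistence = arangoPersistence;
        this.definitionId = definitionId;
        setPageSize(250);
        setName(ArangoPagingItemReader.class.getSimpleName()); // key for ExecutionContext state
    }

    @Override
    protected void doReadPage() {
        if (results == null) {
            results = new CopyOnWriteArrayList<>();
        } else {
            results.clear();
        }
        // ordered offset/limit fetch so that page boundaries stay stable between calls
        List<Long> instanceIds = arangoPersistence.getDefinitionInstanceIds(
                definitionId, getPage() * getPageSize(), getPageSize());
        for (Long instanceId : instanceIds) {
            InstanceDTO dto = new InstanceDTO();
            dto.setDefinitionID(definitionId);
            dto.setInstanceID(instanceId);
            results.add(dto);
        }
    }

    @Override
    protected void doJumpToPage(int itemIndex) {
        // nothing to do here: doReadPage() derives the offset from getPage()
    }
}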
Hope it helps !!
I currently use the Spring Data Solr library and implement its repository interfaces. I'm trying to add functionality to one of my custom queries that uses a SolrTemplate with a SimpleQuery. It currently uses paging, which appears to be working well; however, I want to use a group field so that sibling products are only counted once, at their first occurrence. I have set the group field on the query and it works well, but it still seems to use the un-grouped number of documents when constructing the page attributes.
Is there a known workaround for this?
The query syntax provides the following parameter for this purpose, but it would seem that Spring Data Solr isn't taking advantage of it: &group.ngroups=true should return the number of groups in the result and thus give correct page numbering.
Any other info would be appreciated.
There are actually two ways to add this parameter.
Queries are converted to the Solr format using QueryParsers, so it is possible to register a modified one.
QueryParser modifiedParser = new DefaultQueryParser() {
    @Override
    protected void appendGroupByFields(SolrQuery solrQuery, List<Field> fields) {
        super.appendGroupByFields(solrQuery, fields);
        solrQuery.set(GroupParams.GROUP_TOTAL_COUNT, true);
    }
};
solrTemplate.registerQueryParser(Query.class, modifiedParser);
Using a SolrCallback would be a less intrusive option:
final Query query = // ...whatever query you have.
List<DomainType> result = solrTemplate.execute(new SolrCallback<List<DomainType>>() {
    @Override
    public List<DomainType> doInSolr(SolrServer solrServer) throws SolrServerException, IOException {
        SolrQuery solrQuery = new QueryParsers().getForClass(query.getClass()).constructSolrQuery(query);
        // add missing params
        solrQuery.set(GroupParams.GROUP_TOTAL_COUNT, true);
        return solrTemplate.convertQueryResponseToBeans(solrServer.query(solrQuery), DomainType.class);
    }
});
Please feel free to open an issue.