I need to read and process data fetched from one REST API (call it restApi1) and write the results to a different REST API (restApi2).
For this I am using the chunk-oriented approach.
The problem is that restApi1 is currently not paginated.
The endpoint returns a large number of records, approximately 10,000.
So if my step fails, then on restart I have to read and process all the data again.
I cannot resume from where it failed.
Is this understanding correct with respect to Spring Batch processing?
Please suggest some possible approaches.
ItemReader
public class MyItemReader extends ItemStreamSupport implements ItemReader<Data> {
    private int curIndex = 0;
    @Override
    public void open(ExecutionContext executionContext) {
        this.curIndex = 0;
    }
    @Override
    public void update(ExecutionContext executionContext) {
    }
}
ItemProcessor
public class MyItemProcessor extends ItemStreamSupport implements ItemProcessor<Data1, Data2> {
    @Override
    public Data2 process(Data1 data1) throws Exception {
    }
}
ItemWriter
public class MyItemWriter extends ItemStreamSupport implements ItemWriter<Data2> {
    @Override
    public void write(List<? extends Data2> listOfData) throws Exception {
    }
}
You can download the data to a file or a staging table in a tasklet step, then make your item reader read items from there.
If the chunk-oriented step fails, the tasklet step has already completed and is not re-run on restart, whereas the chunk-oriented step resumes from where it left off in the file or staging table.
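The resume behaviour can be sketched without Spring Batch on the classpath. In the sketch below, `StagedItemReader`, the key name, and the `Map<String, Object>` standing in for Spring Batch's `ExecutionContext` are all hypothetical names chosen for illustration. In a real job the reader would presumably implement `ItemStreamReader`, and the framework would call `open` and `update` around each chunk.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a reader over a staging file or table.
// The Map plays the role of Spring Batch's ExecutionContext.
class StagedItemReader {
    private static final String INDEX_KEY = "staged.reader.index"; // assumed key name
    private final List<String> stagedItems; // data already downloaded by the tasklet step
    private int curIndex;

    StagedItemReader(List<String> stagedItems) {
        this.stagedItems = stagedItems;
    }

    // On (re)start, restore the saved position instead of resetting to 0.
    void open(Map<String, Object> context) {
        Object saved = context.get(INDEX_KEY);
        this.curIndex = (saved instanceof Integer) ? (Integer) saved : 0;
    }

    // Called at each chunk boundary to persist the current position.
    void update(Map<String, Object> context) {
        context.put(INDEX_KEY, curIndex);
    }

    // Returns the next item, or null when the staged data is exhausted.
    String read() {
        return curIndex < stagedItems.size() ? stagedItems.get(curIndex++) : null;
    }
}

class RestartDemo {
    public static void main(String[] args) {
        List<String> staged = List.of("a", "b", "c", "d");
        Map<String, Object> context = new HashMap<>();

        StagedItemReader firstRun = new StagedItemReader(staged);
        firstRun.open(context);
        firstRun.read();
        firstRun.read();
        firstRun.update(context); // a chunk of 2 is committed, then the job fails

        StagedItemReader restart = new StagedItemReader(staged);
        restart.open(context);
        System.out.println(restart.read()); // resumes at the third item: c
    }
}
```

The reader in the question resets `curIndex` to 0 in `open` and leaves `update` empty, which is why it cannot resume.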
I have two transaction managers for two databases. I need to persist the same data into both databases. If one transaction fails, the other one needs to roll back. I have done the following:
public interface DataService {
    void saveData();
}
@Service
public class DataServiceImpl implements DataService {
    @Autowired
    private DataRepository dataRepository;
    @Autowired
    private OrDataRepository orDataRepository;
    @Autowired
    @Qualifier("orService")
    private OrService orDataServiceImpl;
    @Override
    @Transactional(transactionManager = "transactionManager", rollbackFor = {RuntimeException.class})
    public void saveData() {
        Data data = new Data();
        data.setCompKey(UUID.randomUUID().toString().substring(1,5));
        data.setName("data");
        dataRepository.save(data);
        orDataServiceImpl.save();
        //throw new RuntimeException("");
    }
}
public interface OrService {
    void save();
}
@Service("orService")
public class OrDataServiceImpl implements OrService {
    @Autowired
    private OrDataRepository orDataRepository;
    @Override
    @Transactional(rollbackFor = {RuntimeException.class})
    public void save() {
        OrData data = new OrData();
        data.setCompKey(UUID.randomUUID().toString().substring(1,5));
        data.setName("ordata");
        orDataRepository.save(data);
    }
}
I have two transaction managers (entityManager and orEntityManager) for the two different databases.
If an exception occurs in the OrDataServiceImpl save method, the data is not persisted in either database. But if an exception occurs in the DataServiceImpl saveData method, the data is still persisted to the OrData table.
I want to roll back the data in both databases on any exception.
ChainedTransactionManager is deprecated, so I can't use it. I also can't use Atomikos or Bitronix due to some restrictions. Please suggest a better way to achieve a distributed transaction.
The code needs to be refactored. Edit the DataServiceImpl.saveData() method and comment out the orDataServiceImpl.save() line:
public void saveData() {
    Data data = new Data();
    data.setCompKey(UUID.randomUUID().toString().substring(1,5));
    data.setName("data");
    dataRepository.save(data);
    //orDataServiceImpl.save();
    //throw new RuntimeException("");
}
Refactor the OrDataService interface:
public interface OrDataService {
    void save(String uuid);
    void delete(String uuid); // will be used for the compensating transaction
}
Update the OrDataServiceImpl class to implement the above interface.
Write a new orchestration method and use a compensating transaction to roll back.
Pseudocode:
step 1: call OrDataServiceImpl.save()
step 2: if step 1 succeeded
    -> call DataServiceImpl.saveData()
if an exception occurs at step 2
    -> call OrDataServiceImpl.delete() // to roll back step 1
else if an exception occurs at step 1
    // do nothing
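The pseudocode can be sketched in plain Java as follows. This is only a sketch under assumptions: `DualSaveOrchestrator` is a hypothetical class, and `DataService.saveData(String uuid)` is given a parameter here for illustration (the original method takes none). Each service call is assumed to commit in its own transaction, so the orchestration method itself should not be wrapped in a transaction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical interfaces mirroring the refactored services.
interface OrDataService {
    void save(String uuid);
    void delete(String uuid); // compensating action
}

interface DataService {
    void saveData(String uuid); // parameter added for this sketch (assumption)
}

// Hypothetical orchestrator, not part of the original code.
class DualSaveOrchestrator {
    private final OrDataService orDataService;
    private final DataService dataService;

    DualSaveOrchestrator(OrDataService orDataService, DataService dataService) {
        this.orDataService = orDataService;
        this.dataService = dataService;
    }

    void saveToBoth() {
        String uuid = UUID.randomUUID().toString();
        // Step 1: if this throws, nothing has been written yet, so there is nothing to undo.
        orDataService.save(uuid);
        try {
            // Step 2: write to the other database.
            dataService.saveData(uuid);
        } catch (RuntimeException e) {
            // Compensate step 1, then rethrow so the caller sees the failure.
            orDataService.delete(uuid);
            throw e;
        }
    }
}

class CompensationDemo {
    public static void main(String[] args) {
        List<String> orStore = new ArrayList<>(); // in-memory fake of the second database
        OrDataService orService = new OrDataService() {
            @Override public void save(String uuid) { orStore.add(uuid); }
            @Override public void delete(String uuid) { orStore.remove(uuid); }
        };
        DataService failing = uuid -> { throw new RuntimeException("simulated failure"); };
        try {
            new DualSaveOrchestrator(orService, failing).saveToBoth();
        } catch (RuntimeException e) {
            System.out.println("rows left in second store: " + orStore.size());
        }
    }
}
```

Note that a compensating delete is not atomic with the original save: if the delete itself fails, the two databases can still diverge, so the failure should at least be logged for manual repair.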
In my Spring Boot and Spring Batch application, I have a step like this:
@Bean
public Step step1() {
    return stepBuilderFactory.get("step1")
            .<FileInfo, FileInfo>chunk(10)
            .reader(FileInfoItemReader)
            .processor(processor())
            .writer(writer())
            .build();
}
My writer is empty, as shown below:
public class BlankWriter<T> implements ItemWriter<T> {
    @Override
    public void write(List<? extends T> items) throws Exception {
    }
}
Now, in my processor I have this:
public class FileInfoItemProcessor implements ItemProcessor<FileInfo, FileInfo> {
    .....
    @Override
    public FileInfo process(final FileInfo fileInfo) throws Exception {
        myCustomStuff();
        ......
    }
    public static void myCustomStuff() {
        ......
        ......
    }
}
Question: Since every item is passed to the processor, I can handle everything in the processor itself rather than doing any transformation there and delegating to a writer. This solves my problem, but is it good practice? Or must I use a writer (or a custom writer) to get the job done?
I think doing the REST POST call in the writer is more appropriate than doing it in the processor. A REST POST call is a kind of write operation to a remote location.
So you can omit the processor (since it is optional) and move that code to the item writer (instead of using a NoOp item writer with an empty write method).
The ItemReader reads data from DB2 and returns a ClaimDto Java object. The ClaimProcessor then takes the ClaimDto and returns a CompositeClaimRecord, which comprises claimRecord1 and claimRecord2, to be sent to two different Kafka topics. How do I write claimRecord1 and claimRecord2 to topic1 and topic2 respectively?
Just write a custom ItemWriter that does exactly that.
public class YourItemWriter implements ItemWriter<CompositeClaimRecord> {
    private final ItemWriter<ClaimRecord1> writer1;
    private final ItemWriter<ClaimRecord2> writer2;
    public YourItemWriter(ItemWriter<ClaimRecord1> writer1, ItemWriter<ClaimRecord2> writer2) {
        this.writer1 = writer1;
        this.writer2 = writer2;
    }
    @Override
    public void write(List<? extends CompositeClaimRecord> items) throws Exception {
        for (CompositeClaimRecord record : items) {
            // Send each half of the composite to its own delegate (one per topic).
            writer1.write(Collections.singletonList(record.claimRecord1));
            writer2.write(Collections.singletonList(record.claimRecord2));
        }
    }
}
Or, instead of writing one record at a time, convert the single list into two lists and pass those along. Error handling might be a bit more of a challenge that way, though.
public class YourItemWriter implements ItemWriter<CompositeClaimRecord> {
    private final ItemWriter<ClaimRecord1> writer1;
    private final ItemWriter<ClaimRecord2> writer2;
    public YourItemWriter(ItemWriter<ClaimRecord1> writer1, ItemWriter<ClaimRecord2> writer2) {
        this.writer1 = writer1;
        this.writer2 = writer2;
    }
    @Override
    public void write(List<? extends CompositeClaimRecord> items) throws Exception {
        // Split the chunk into one list per topic, then write each list in a single call.
        List<ClaimRecord1> record1List = items.stream().map(it -> it.claimRecord1).collect(Collectors.toList());
        List<ClaimRecord2> record2List = items.stream().map(it -> it.claimRecord2).collect(Collectors.toList());
        writer1.write(record1List);
        writer2.write(record2List);
    }
}
You can use a ClassifierCompositeItemWriter with two KafkaItemWriters as delegates (one for each topic).
The Classifier would classify items according to their type (claimRecord1 or claimRecord2) and route them to the corresponding kafka item writer (topic1 or topic2).
I'm writing a Spring Boot application that starts up, gathers and converts millions of database entries into a new streamlined JSON format, and then sends them all to a GCP PubSub topic. I'm attempting to use Spring Batch for this, but I'm running into trouble implementing fault tolerance for my process. The database is rife with data quality issues, and sometimes my conversions to JSON will fail. When failures occur, I don't want the job to immediately quit. I want it to continue processing as many records as it can and, before completion, to report exactly which records failed so that my team and I can examine these problematic database entries.
To achieve this, I've attempted to use Spring Batch's SkipListener interface. But I'm also using an AsyncItemProcessor and an AsyncItemWriter in my process, and even though the exceptions are occurring during the processing, the SkipListener's onSkipInWrite() method is catching them - rather than the onSkipInProcess() method. And unfortunately, the onSkipInWrite() method doesn't have access to the original database entity, so I can't store its ID in my list of problematic DB entries.
Have I misconfigured something? Is there any other way to gain access to the objects from the reader that failed the processing step of an AsyncItemProcessor?
Here's what I've tried...
I have a singleton Spring Component where I store how many DB entries I've successfully processed along with up to 20 problematic database entries.
@Component
@Getter //lombok
public class ProcessStatus {
    private int processed;
    private int failureCount;
    private final List<UnexpectedFailure> unexpectedFailures = new ArrayList<>();
    public void incrementProgress() { processed++; }
    public void logUnexpectedFailure(UnexpectedFailure failure) {
        failureCount++;
        unexpectedFailures.add(failure);
    }
    @Getter
    @AllArgsConstructor
    public static class UnexpectedFailure {
        private Throwable error;
        private DbProjection dbData;
    }
}
I have a Spring batch Skip Listener that's supposed to catch failures and update my status component accordingly:
@AllArgsConstructor
public class ConversionSkipListener implements SkipListener<DbProjection, Future<JsonMessage>> {
    private ProcessStatus processStatus;
    @Override
    public void onSkipInRead(Throwable error) {}
    @Override
    public void onSkipInProcess(DbProjection dbData, Throwable error) {
        processStatus.logUnexpectedFailure(new ProcessStatus.UnexpectedFailure(error, dbData));
    }
    @Override
    public void onSkipInWrite(Future<JsonMessage> messageFuture, Throwable error) {
        //This is getting called instead!! Even though the exception happened during processing :(
        //But I have no access to the original DbProjection data here, and messageFuture.get() gives me null.
    }
}
And then I've configured my job like this:
@Configuration
public class ConversionBatchJobConfig {
    @Autowired
    private JobBuilderFactory jobBuilderFactory;
    @Autowired
    private StepBuilderFactory stepBuilderFactory;
    @Autowired
    private TaskExecutor processThreadPool;
    @Bean
    public SimpleCompletionPolicy processChunkSize(@Value("${commit.chunk.size:100}") Integer chunkSize) {
        return new SimpleCompletionPolicy(chunkSize);
    }
    @Bean
    @StepScope
    public ItemStreamReader<DbProjection> dbReader(
            MyDomainRepository myDomainRepository,
            @Value("#{jobParameters[pageSize]}") Integer pageSize,
            @Value("#{jobParameters[limit]}") Integer limit) {
        RepositoryItemReader<DbProjection> myDomainRepositoryReader = new RepositoryItemReader<>();
        myDomainRepositoryReader.setRepository(myDomainRepository);
        myDomainRepositoryReader.setMethodName("findActiveDbDomains"); //A native query
        myDomainRepositoryReader.setArguments(new ArrayList<Object>() {{
            add("ACTIVE");
        }});
        myDomainRepositoryReader.setSort(new HashMap<String, Sort.Direction>() {{
            put("update_date", Sort.Direction.ASC);
        }});
        myDomainRepositoryReader.setPageSize(pageSize);
        myDomainRepositoryReader.setMaxItemCount(limit);
        // myDomainRepositoryReader.setSaveState(false); <== haven't figured out what this does yet
        return myDomainRepositoryReader;
    }
    @Bean
    @StepScope
    public ItemProcessor<DbProjection, JsonMessage> dataConverter(DataRetrievalSerivice dataRetrievalService) {
        //Sometimes throws exceptions when DB data is exceptionally weird, bad, or missing
        return new DbProjectionToJsonMessageConverter(dataRetrievalService);
    }
    @Bean
    @StepScope
    public AsyncItemProcessor<DbProjection, JsonMessage> asyncDataConverter(
            ItemProcessor<DbProjection, JsonMessage> dataConverter) throws Exception {
        AsyncItemProcessor<DbProjection, JsonMessage> asyncDataConverter = new AsyncItemProcessor<>();
        asyncDataConverter.setDelegate(dataConverter);
        asyncDataConverter.setTaskExecutor(processThreadPool);
        asyncDataConverter.afterPropertiesSet();
        return asyncDataConverter;
    }
    @Bean
    @StepScope
    public ItemWriter<JsonMessage> jsonPublisher(GcpPubsubPublisherService publisherService) {
        return new JsonMessageWriter(publisherService);
    }
    @Bean
    @StepScope
    public AsyncItemWriter<JsonMessage> asyncJsonPublisher(ItemWriter<JsonMessage> jsonPublisher) throws Exception {
        AsyncItemWriter<JsonMessage> asyncJsonPublisher = new AsyncItemWriter<>();
        asyncJsonPublisher.setDelegate(jsonPublisher);
        asyncJsonPublisher.afterPropertiesSet();
        return asyncJsonPublisher;
    }
    @Bean
    public Step conversionProcess(SimpleCompletionPolicy processChunkSize,
                                  ItemStreamReader<DbProjection> dbReader,
                                  AsyncItemProcessor<DbProjection, JsonMessage> asyncDataConverter,
                                  AsyncItemWriter<JsonMessage> asyncJsonPublisher,
                                  ProcessStatus processStatus,
                                  @Value("${conversion.failure.limit:20}") int maximumFailures) {
        return stepBuilderFactory.get("conversionProcess")
                .<DbProjection, Future<JsonMessage>>chunk(processChunkSize)
                .reader(dbReader)
                .processor(asyncDataConverter)
                .writer(asyncJsonPublisher)
                .faultTolerant()
                .skipPolicy(new MyCustomConversionSkipPolicy(maximumFailures))
                // ^ for now this returns true for everything until 20 failures
                .listener(new ConversionSkipListener(processStatus))
                .build();
    }
    @Bean
    public Job conversionJob(Step conversionProcess) {
        return jobBuilderFactory.get("conversionJob")
                .start(conversionProcess)
                .build();
    }
}
This is because the future wrapped by the AsyncItemProcessor is only unwrapped in the AsyncItemWriter, so any exception that might occur at that time is seen as a write exception instead of a processing exception. That's why onSkipInWrite is called instead of onSkipInProcess.
This is actually a known limitation of this pattern which is documented in the Javadoc of the AsyncItemProcessor, here is an excerpt:
Because the Future is typically unwrapped in the ItemWriter,
there are lifecycle and stats limitations (since the framework doesn't know
what the result of the processor is).
While not an exhaustive list, things like StepExecution.filterCount will not
reflect the number of filtered items and
itemProcessListener.onProcessError(Object, Exception) will not be called.
The Javadoc states that the list is not exhaustive, and the side effect on the SkipListener that you are experiencing is one of these limitations.
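Given that limitation, one possible workaround (not the only one) is to catch the exception inside the delegate processor itself, where the input item is still in scope, record it, and return null so that the item is filtered rather than written. The sketch below uses `java.util.function.Function` as a stand-in for `ItemProcessor`, and `FailureRecordingProcessor` is a hypothetical name. It assumes that a null result is treated as "filtered" by the async writer when it unwraps the future, which is worth verifying against your Spring Batch version.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Function;

// Hypothetical wrapper; Function<I, O> stands in for ItemProcessor<I, O>.
class FailureRecordingProcessor<I, O> {
    // Pairs the failing input with its error, like UnexpectedFailure in the question.
    static final class Failure<I> {
        final I item;
        final Throwable error;

        Failure(I item, Throwable error) {
            this.item = item;
            this.error = error;
        }
    }

    private final Function<I, O> delegate;
    // Thread-safe because the delegate runs on task-executor threads.
    private final List<Failure<I>> failures = new CopyOnWriteArrayList<>();

    FailureRecordingProcessor(Function<I, O> delegate) {
        this.delegate = delegate;
    }

    // Returns the converted item, or null (meaning "filter this item") on failure.
    O process(I item) {
        try {
            return delegate.apply(item);
        } catch (RuntimeException e) {
            failures.add(new Failure<>(item, e));
            return null;
        }
    }

    List<Failure<I>> failures() {
        return failures;
    }
}

class FailureRecordingDemo {
    public static void main(String[] args) {
        FailureRecordingProcessor<String, Integer> processor =
                new FailureRecordingProcessor<>(Integer::parseInt);
        System.out.println(processor.process("42"));           // converted normally
        System.out.println(processor.process("not-a-number")); // null: recorded and filtered
        System.out.println(processor.failures().get(0).item);  // the original input is available
    }
}
```

With this approach the failures no longer go through the skip machinery, so a failure limit like the one in your skip policy would have to be enforced by the wrapper itself.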
We have approximately 20 different Spring Batch jobs (some running as microservices, some lumped together in one Spring Boot app). What I need to do is gather all the errors encountered by ALL the jobs, as well as the number of records processed, and summarize it all in an email.
I have implemented ItemListenerSupport as a start:
public class BatchItemListener extends ItemListenerSupport<BaseDomainDataObject, BaseDomainDataObject> {
    private final static Log logger = LogFactory.getLog(BatchItemListener.class);
    private final static Map<String, Integer> numProcessedMap = new HashMap<>();
    // Keyed by the failing item, with the stack trace as the value
    private final static Map<BaseDomainDataObject, String> errorMap = new HashMap<>();
    @Override
    public void onReadError(Exception ex) {
        logger.error("Encountered error on read", ex);
    }
    @Override
    public void onProcessError(BaseDomainDataObject item, Exception ex) {
        String msgBody = ExceptionUtils.getStackTrace(ex);
        errorMap.put(item, msgBody);
    }
    @Override
    public void onWriteError(Exception ex, List<? extends BaseDomainDataObject> items) {
        logger.error("Encountered error on write", ex);
        numProcessedMap.computeIfAbsent("numErrors", val -> items.size());
    }
    @Override
    public void afterWrite(List<? extends BaseDomainDataObject> items) {
        logger.info("Logging successful number of items written...");
        numProcessedMap.computeIfAbsent("numSuccess", val -> items.size());
    }
}
But how do I access the errors I accumulate in the listener when my batch jobs are finally finished? Right now I don't even have a good way to know when they are all finished. Any suggestions? Does Spring Batch provide something better for summarizing jobs?
Spring Batch does not provide a way to orchestrate jobs. The closest you can get out of the box is a "master" job with multiple steps of type JobStep that delegate to your sub-jobs. With this approach, you can do the aggregation in a JobExecutionListener#afterJob configured on the master job.
Otherwise, you can use Spring Cloud Data Flow and create a composed task of all your jobs.
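For the single-JVM case, the aggregation itself could look something like the sketch below. `JobSummary` and the key names are hypothetical. The idea is that item listeners call `add` and an `afterJob` callback on the master job calls `render` to build the email body. Jobs running as separate microservices would need a shared store instead of in-memory state.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory summary shared by listeners within one JVM.
class JobSummary {
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    // Adds to the running total for a key such as "jobA.numSuccess".
    void add(String key, int amount) {
        counts.merge(key, amount, Integer::sum);
    }

    // Returns the current total, or 0 if nothing was recorded for the key.
    int get(String key) {
        return counts.getOrDefault(key, 0);
    }

    // Renders a plain-text body, sorted by key, suitable for an email.
    String render() {
        StringBuilder body = new StringBuilder();
        new TreeMap<>(counts).forEach((key, value) ->
                body.append(key).append(" = ").append(value).append('\n'));
        return body.toString();
    }
}

class JobSummaryDemo {
    public static void main(String[] args) {
        JobSummary summary = new JobSummary();
        summary.add("jobA.numSuccess", 10); // e.g. from afterWrite
        summary.add("jobA.numSuccess", 10);
        summary.add("jobA.numErrors", 3);   // e.g. from onWriteError
        System.out.print(summary.render()); // e.g. called from afterJob
    }
}
```

Note that `merge` accumulates across chunks, whereas the `computeIfAbsent` calls in the question's listener only store the size of the first chunk they see.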