Can we get data processed in Spring Batch after the batch job is completed?

I am using Spring Batch to read data from the DB, process it, and do some processing in the writer.
If the chunk size is smaller than the number of records read by the reader, Spring Batch runs in multiple chunks. I want to do the processing in the writer only once, at the end of all chunk processing, or, if this is not possible, I will remove the writer and process the data obtained in the processor after the batch job is completed. Is this possible?
Below is the code that triggers my Spring Batch job:
private void triggerSpringBatchJob() {
    loggerConfig.logDebug(log, " : Triggering product catalog scheduler ");
    JobParametersBuilder builder = new JobParametersBuilder();
    try {
        // Adding date in buildJobParameters because if not added we will get
        // "A job instance already exists": JobInstanceAlreadyCompleteException
        builder.addDate("date", new Date());
        jobLauncher.run(processProductCatalog, builder.toJobParameters());
    } catch (JobExecutionAlreadyRunningException | JobRestartException | JobInstanceAlreadyCompleteException
            | JobParametersInvalidException e) {
        e.printStackTrace();
    }
}
Below is my Spring Batch configuration:
@Configuration
@EnableBatchProcessing
public class BatchJobProcessConfiguration {

    @Bean
    @StepScope
    RepositoryItemReader<Tuple> reader(SkuRepository skuRepository,
            ProductCatalogConfiguration productCatalogConfiguration) {
        RepositoryItemReader<Tuple> reader = new RepositoryItemReader<>();
        reader.setRepository(skuRepository);
        // query parameters
        List<Object> queryMethodArguments = new ArrayList<>();
        if (productCatalogConfiguration.getSkuId().isEmpty()) {
            reader.setMethodName("findByWebEligibleAndDiscontinued");
            queryMethodArguments.add(productCatalogConfiguration.getWebEligible()); // for web eligible
            queryMethodArguments.add(productCatalogConfiguration.getDiscontinued()); // for discontinued
            queryMethodArguments.add(productCatalogConfiguration.getCbdProductId()); // for cbd products
        } else {
            reader.setMethodName("findBySkuIds");
            queryMethodArguments.add(productCatalogConfiguration.getSkuId()); // for sku ids
        }
        reader.setArguments(queryMethodArguments);
        reader.setPageSize(1000);
        Map<String, Direction> sorts = new HashMap<>();
        sorts.put("sku_id", Direction.ASC);
        reader.setSort(sorts);
        return reader;
    }

    @Bean
    @StepScope
    ItemWriter<ProductCatalogWriterData> writer() {
        return new ProductCatalogWriter();
    }

    @Bean
    ProductCatalogProcessor processor() {
        return new ProductCatalogProcessor();
    }

    @Bean
    SkipPolicy readerSkipper() {
        return new ReaderSkipper();
    }

    @Bean
    Step productCatalogDataStep(ItemReader<Tuple> itemReader, ProductCatalogWriter writer,
            HttpServletRequest request, StepBuilderFactory stepBuilderFactory, BatchConfiguration batchConfiguration) {
        return stepBuilderFactory.get("processProductCatalog")
                .<Tuple, ProductCatalogWriterData>chunk(batchConfiguration.getBatchChunkSize())
                .reader(itemReader)
                .faultTolerant()
                .skipPolicy(readerSkipper())
                .processor(processor())
                .writer(writer)
                .build();
    }

    @Bean
    Job productCatalogData(Step productCatalogDataStep, HttpServletRequest request,
            JobBuilderFactory jobBuilderFactory) {
        return jobBuilderFactory.get("processProductCatalog")
                .incrementer(new RunIdIncrementer())
                .flow(productCatalogDataStep)
                .end()
                .build();
    }
}

I want to do the processing in the writer only once "at the end of all batch process completion", or, if this is not possible, I will remove the writer and process the data obtained in the processor after the batch job is completed. Is this possible?
"at the end of all batch process completion" is the key here. If the requirement is to do some processing after all chunks have been "pre-processed", I would keep it simple and use two steps for that:
Step 1: (pre)processes the data as needed and writes it to a temporary storage
Step 2: does whatever you want with the processed data prepared in the temporary storage
A final step can clean up the temporary storage if it is persistent (a file, a staging table, etc.); otherwise, i.e. if it is in memory, this is optional.
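A minimal sketch of that two-step layout, reusing the builder style from the configuration above (the finalProcessingStep name is hypothetical; step 1 would be the existing chunk-oriented step, except that its writer saves the pre-processed items to a staging table instead of doing the final work):

@Bean
Job productCatalogData(Step productCatalogDataStep, Step finalProcessingStep, JobBuilderFactory jobBuilderFactory) {
    // step 2 runs only after step 1 has written all pre-processed items to the staging table
    return jobBuilderFactory.get("processProductCatalog")
            .incrementer(new RunIdIncrementer())
            .start(productCatalogDataStep)
            .next(finalProcessingStep)
            .build();
}

@Bean
Step finalProcessingStep(StepBuilderFactory stepBuilderFactory) {
    // a simple tasklet step that processes the whole staged data set in one shot
    return stepBuilderFactory.get("finalProcessingStep")
            .tasklet((contribution, chunkContext) -> {
                // read everything from the staging table here, do the one-time processing,
                // then optionally clean the staging table up
                return RepeatStatus.FINISHED;
            })
            .build();
}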

Related

Spring Batch - delete

How can I do the deletion of the entities that I just persisted?
@Bean
public Job job() {
    return this.jobBuilderFactory.get("job")
            .start(this.syncStep())
            .build();
}

@Bean
public Step syncStep() {
    // read
    RepositoryItemReader<Element1> reader = new RepositoryItemReader<>();
    reader.setRepository(repository);
    reader.setMethodName("findElements");
    reader.setArguments(new ArrayList<>(Arrays.asList(ZonedDateTime.now())));
    final HashMap<String, Sort.Direction> sorts = new HashMap<>();
    sorts.put("uid", Sort.Direction.ASC);
    reader.setSort(sorts);
    // write
    RepositoryItemWriter<Element1> writer = new RepositoryItemWriter<>();
    writer.setRepository(otherrepository);
    writer.setMethodName("save");
    return stepBuilderFactory.get("syncStep")
            .<Element1, Element2> chunk(10)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .build();
}
It is a process of dumping elements. We pass the elements from one table to another.
You can do that in two steps. The first step copies items from one table to another. The second step deletes the items from the source table. The second step should be executed only if the first step succeeds.
There are a few options:
Using a CompositeItemWriter
You could create a second ItemWriter that does the delete logic, for example:
RepositoryItemWriter<Element1> deleteWriter = new RepositoryItemWriter<>();
deleteWriter.setRepository(repository);
deleteWriter.setMethodName("delete");
To execute both writers you can use a CompositeItemWriter:
CompositeItemWriter<Element1> writer = new CompositeItemWriter<>();
// 'saveWriter' would be the writer you currently have
writer.setDelegates(List.of(saveWriter, deleteWriter));
This, however, won't work if your ItemProcessor transforms the original entity into something completely new. In that case I suggest using a PropertyExtractingDelegatingItemWriter.
(Note: the delegate writers should run sequentially and the second writer should not be executed if the first one fails, but I'm not 100% sure on that.)
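For reference, a rough sketch of the PropertyExtractingDelegatingItemWriter approach (the repository type, the deleteById target method and the originalId property on the processed item are assumptions about your domain):

@Bean
public PropertyExtractingDelegatingItemWriter<Element2> deleteWriter(ElementRepository repository) {
    PropertyExtractingDelegatingItemWriter<Element2> deleteWriter = new PropertyExtractingDelegatingItemWriter<>();
    deleteWriter.setTargetObject(repository);          // the repository that owns the delete method
    deleteWriter.setTargetMethod("deleteById");        // assumed delete method on that repository
    // assumed property on the processed item that still holds the id of the original Element1
    deleteWriter.setFieldsUsedAsTargetMethodArguments(new String[] { "originalId" });
    return deleteWriter;
}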
Using a separate Step
Alternatively, you could put the new writer in an entirely separate Step:
@Bean
public Step cleanupStep() {
    // Same reader as before (might want to put this in a separate @Bean)
    RepositoryItemReader<Element1> reader = new RepositoryItemReader<>();
    // ...
    // The 'deleteWriter' from before
    RepositoryItemWriter<Element1> deleteWriter = new RepositoryItemWriter<>();
    // ...
    return stepBuilderFactory.get("cleanupStep")
            .<Element1, Element1> chunk(10)
            .reader(reader)
            .writer(deleteWriter)
            .build();
}
Now you can schedule the two steps individually:
@Bean
public Job job() {
    return this.jobBuilderFactory.get("job")
            .start(this.syncStep())
            .next(this.cleanupStep())
            .build();
}
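Note that .next() only runs the second step when the first one completes successfully; if you want to make that explicit, the same job can be declared with flow transitions (a sketch using the same bean names):

@Bean
public Job job() {
    return this.jobBuilderFactory.get("job")
            .start(this.syncStep())
            .on("COMPLETED").to(this.cleanupStep())
            .end()
            .build();
}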
Using a Tasklet
If you're using a separate step then, depending on the amount of data, it might be more interesting to offload the work entirely to the database and execute a single delete ... where ... query.
public class CleanupRepositoryTasklet implements Tasklet {

    private final Repository repository;

    public CleanupRepositoryTasklet(Repository repository) {
        this.repository = repository;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        repository.customDeleteMethod();
        return RepeatStatus.FINISHED;
    }
}
This Tasklet can then be registered in the same way as before, by declaring a new Step in your configuration:
return this.stepBuilderFactory.get("cleanupStep")
        .tasklet(myTasklet())
        .build();
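For completeness, here is what customDeleteMethod() could look like on a Spring Data JPA repository (a sketch; the entity name Element1 and the 'processed' flag are assumptions, any condition identifying the rows to delete will do):

public interface ElementRepository extends JpaRepository<Element1, Long> {

    // pushes the delete down to the database as a single bulk statement
    @Modifying
    @Transactional
    @Query("delete from Element1 e where e.processed = true")
    void customDeleteMethod();
}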

Spring-batch step not re-executed (cache)

I'm working on a project that includes Spring Batch. Before copying the code snippets, let me summarize how the job works with a cron:
The cron calls a REST API on my project (@PostMapping("/jobs/external/{jobName}")).
In the POST method, I get the job and execute it.
On each execution, I'm supposed to run a step.
The step contains a reader (an external REST call to the Elastic API to fetch documents) and a processor.
Now my problem: in catalina.out, I can see the REST call from the cron every 10 minutes, as configured. BUT the step doesn't seem to make that call to Elastic every 10 minutes; the batch process always has the same set of data, which is fetched once when the batch first runs after the Tomcat restart.
Job REST API:
@PostMapping("/jobs/external/{jobName}")
@Timed
public ResponseEntity start(@PathVariable String jobName) throws BatchException {
    log.info("LAUNCHING JOB FROM EXTERNAL : {}, timestamp : {}", jobName, Instant.now().toString());
    try {
        Job job = jobRegistry.getJob(jobName);
        JobParametersBuilder builder = new JobParametersBuilder();
        builder.addDate("date", new Date());
        return Optional.of(jobLauncher.run(job, builder.toJobParameters()))
                .map(BatchExecutionVM::new)
                .map(exec -> ResponseEntity
                        .ok()
                        .headers(HeaderUtil.createAlert("jobManagement.started", jobName))
                        .body(exec))
                .orElseGet(() -> ResponseEntity.badRequest().build());
    } catch (NoSuchJobException aEx) {
        log.warn(JOB_NOT_FOUND, aEx);
        throw new BatchException();
    } catch (JobInstanceAlreadyCompleteException | JobExecutionAlreadyRunningException | JobRestartException aEx) {
        log.warn("Job execution error.", aEx);
        throw new BatchException();
    } catch (JobParametersInvalidException aEx) {
        log.warn("Job parameters are invalid.", aEx);
        throw new BatchException();
    }
}
Job configuration:
@Bean
public Job usualJob() {
    return jobBuilderFactory
            .get("usualJob")
            .incrementer(new SimpleJobIncrementer())
            .flow(readUsualStep())
            .end()
            .build();
}

@Bean
public Step readUsualStep() {
    // TODO: simplify, we don't need a chunk here
    return stepBuilderFactory.get("readUsualStep")
            .allowStartIfComplete(true)
            .<AlertDocument, Void>chunk(25)
            .readerIsTransactionalQueue()
            .reader(rowItemReader())
            .processor(rowItemProcessor())
            .build();
}

@Bean
public ItemReader<AlertDocument> rowItemReader() {
    return new UsualItemReader(usualService.getLastAlerts());
}

@Bean
public UsualMapRowProcessor rowItemProcessor() {
    return new UsualMapRowProcessor();
}
I don't know why usualService.getLastAlerts() is called just once and not every 10 minutes.
Thanks to M. Deinum, this is basically the solution:
@Bean
@StepScope
public ItemReader<AlertDocument> rowItemReader() {
    return new UsualItemReader(usualService.getLastAlerts());
}
Annotating the reader bean with the @StepScope annotation makes it re-instantiated for every step execution, so usualService.getLastAlerts() is called again on each run.

Configure ItemWriter, ItemReader, Step and Job dynamically in Spring Batch

I'm working on a process that uses Spring Integration and Spring Batch:
1) Using Spring Integration, I poll a remote SFTP directory to get different CSV files as a Message.
2) The Message, which carries a CSV file as its payload, is sent downstream to a Transformer that transforms it into a JobLaunchRequest.
3) Spring Batch reads the CSV file and dumps it into the DB.
Question:
For each CSV file I need to configure an ItemReader, ItemWriter, Step, and Job.
So, with that in mind, if I have to deal with 10 different CSV files, do I have to configure all 4 beans listed above for each CSV?
The CSVs differ in header names and header count, and each CSV has a different JPA entity.
Eventually I would have 40 @Bean configurations, which I think is bad.
Can anyone suggest whether this is how Spring Batch is meant to work, or whether there is a way to make one common dynamic bean for different CSVs?
Here is the code:
IntegrationFlow:
@Bean
public IntegrationFlow integrationFlow(JobLaunchingGateway jobLaunchingGateway) {
    return IntegrationFlows.from(Sftp.inboundAdapter(sftpSessionFactory)
                    .remoteDirectory("/uploads")
                    .localDirectory(new File("C:\\Users\\DELL\\Desktop\\local"))
                    .patternFilter("*.csv")
                    .autoCreateLocalDirectory(true),
                    c -> c.poller(Pollers.fixedRate(1000).taskExecutor(taskExecutor()).maxMessagesPerPoll(1)))
            .transform(fileMessageToJobRequest())
            .handle(jobLaunchingGateway)
            .log(LoggingHandler.Level.WARN, "headers.id + ': ' + payload")
            .route(JobExecution.class, j -> j.getStatus().isUnsuccessful() ? "jobFailedChannel" : "jobSuccessfulChannel")
            .get();
}
Transformer:
@Transformer
public JobLaunchRequest toRequest(Message<File> message) {
    JobParametersBuilder jobParametersBuilder = new JobParametersBuilder();
    jobParametersBuilder.addString(fileParameterName, message.getPayload().getAbsolutePath());
    jobParametersBuilder.addLong("key.id", System.currentTimeMillis());
    return new JobLaunchRequest(job, jobParametersBuilder.toJobParameters());
}
Batch Job:
@Bean
public Job vendorMasterBatchJob(Step vendorMasterStep) {
    return jobBuilderFactory.get("vendorMasterBatchJob")
            .incrementer(new RunIdIncrementer())
            .start(vendorMasterStep)
            .listener(deleteInputFileJobListener)
            .build();
}
Batch Step:
@Bean
public Step vendorMasterStep(FlatFileItemReader<ERPVendorMaster> vendorMasterReader,
        JpaItemWriter<ERPVendorMaster> vendorMasterWriter) {
    return stepBuilderFactory.get("vendorMasterStep")
            .<ERPVendorMaster, ERPVendorMaster>chunk(chunkSize)
            .reader(vendorMasterReader)
            .writer(vendorMasterWriter)
            .faultTolerant()
            .skipLimit(Integer.MAX_VALUE)
            .skip(RuntimeException.class)
            .listener(skipListener)
            .build();
}
ItemWriter:
@Bean
public JpaItemWriter<ERPVendorMaster> vendorMasterWriter() {
    return new JpaItemWriterBuilder<ERPVendorMaster>()
            .entityManagerFactory(entityManagerFactory)
            .build();
}
ItemReader:
@Bean
@StepScope
public FlatFileItemReader<ERPVendorMaster> vendorMasterReader(@Value("#{jobParameters['input.file.name']}") String fileName) {
    return new FlatFileItemReaderBuilder<ERPVendorMaster>()
            .name("vendorMasterItemReader")
            .resource(new FileSystemResource(fileName))
            .linesToSkip(1)
            .delimited()
            .names(commaSeparatedVendorMasterHeaderValues.split(","))
            .fieldSetMapper(new BeanWrapperFieldSetMapper<ERPVendorMaster>() {{
                setConversionService(stringToDateConversionService());
                setTargetType(ERPVendorMaster.class);
            }})
            .build();
}
I'm very new to Spring Boot; any help will be appreciated.
Thanks

Spring Batch Partitioning At Runtime

I want to read a large file using Spring Batch, split it into multiple files, and process each of them in a different thread using partitioning. I am using the code below:
@Bean
@StepScope
public MultiResourcePartitioner partitioner() {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file");
    partitioner.setResources(splitFiles());
    return partitioner;
}

private Resource[] splitFiles() {
    // Read the large file available in the specified folder,
    // split it into smaller files and return them as a resource list
}

@Bean
public TaskExecutorPartitionHandler partitionHandler() {
    TaskExecutorPartitionHandler partitionHandler = new TaskExecutorPartitionHandler();
    partitionHandler.setStep(step1());
    partitionHandler.setTaskExecutor(new SimpleAsyncTaskExecutor());
    return partitionHandler;
}

@Bean
public Step partitionedMaster() {
    return this.stepBuilderFactory.get("step1")
            .partitioner(step1().getName(), partitioner(null))
            .partitionHandler(partitionHandler())
            .build();
}

@Bean
public Job partitionedJob() {
    return this.jobBuilderFactory.get("partitionedJob")
            .start(partitionedMaster())
            .build();
}

@Bean
@StepScope
public FlatFileItemReader<Transaction> fileTransactionReader(@Value("#{stepExecutionContext['file']}") Resource resource) {
    return new FlatFileItemReaderBuilder<Transaction>()
            .name("flatFileTransactionReader")
            .resource(resource)
            .fieldSetMapper(fsm)
            .build();
}
My issue is that the partitioner only partitions the files that are available in the folder when the application starts. Once the application is up and running, if a new file arrives in the same folder, the job can't read or partition it.
I used @StepScope, but I still have the issue.
How do I read and partition the files dynamically at runtime?
Edit after the first answer:
Thanks for the input. I can modify the code as below to pass the files as parameters and invoke the job, but the control still does not go inside the partitioner method, so I could not leverage the partitioning.
Any input on this?
public JobParameters getJobParameters() {
    Resource[] resources = //getFileToProcessResource
    return new JobParametersBuilder()
            .addLong(TIME, System.currentTimeMillis())
            .addString("inputFiles", resources)
            .toJobParameters();
}

JobParameters jobParameters = getJobParameters();
jobLauncher.run(partitionedJob(), jobParameters);

@Bean
@StepScope
public MultiResourcePartitioner partitioner(@Value("#{jobParameters['inputFiles']}") Resource[] resources) {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file");
    partitioner.setResources(resources);
    return partitioner;
}
Once the application is up and running, if a new file arrives in the same folder, the job can't read or partition it
Batch processing is about fixed data sets. In your case, you start a job but its input data changes in the meantime, so that's not going to work as you expect. A fixed data set is required for restartability in order to work on the same data set in case of failure.
Since the input of your job is a file, you can use the file as a job parameter and configure a watch service (or similar mechanism) to launch a new job instance for each new file in the folder.
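For example, a minimal sketch of that launching mechanism using a scheduled poll (the folder path, the file filter and the bookkeeping are assumptions; a java.nio.file.WatchService would work just as well):

@Scheduled(fixedDelay = 600_000) // align with whatever polling interval you need
public void launchJobForNewFiles() throws Exception {
    // hypothetical input folder; list the files that have not been processed yet
    File[] newFiles = new File("/data/input").listFiles((dir, name) -> name.endsWith(".csv"));
    if (newFiles == null) {
        return;
    }
    for (File file : newFiles) {
        JobParameters jobParameters = new JobParametersBuilder()
                .addString("fileName", file.getAbsolutePath()) // identifying parameter -> one job instance per file
                .toJobParameters();
        jobLauncher.run(partitionedJob(), jobParameters);
        // move or rename the file here so it is not picked up again on the next poll
    }
}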
EDIT: Add example to make the partitioner aware of the job parameter
@Bean
@StepScope
public MultiResourcePartitioner partitioner(@Value("#{jobParameters['fileName']}") String fileName) {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file");
    partitioner.setResources(splitFiles(fileName));
    return partitioner;
}

private Resource[] splitFiles(String fileName) {
    // Read the large file available in the specified folder,
    // split it into smaller files and return them as a resource list
    return null;
}

How to make writer initiated for every job instance in spring batch

I am writing a Spring Batch job and implementing a custom writer using KafkaClientWriter extends AbstractItemStreamItemWriter<ProducerMessage>.
I have fields that need to be unique for each instance, but I can see this class is instantiated only once; the remaining jobs get the same instance of the writer class, whereas my custom readers and processors are instantiated for each job.
Below is my job configuration. How can I achieve the same behavior for the writer as well?
@Bean
@Scope("job")
public ZipMultiResourceItemReader reader(@Value("#{jobParameters[fileName]}") String fileName,
        @Value("#{jobParameters[s3SourceFolderPrefix]}") String s3SourceFolderPrefix,
        @Value("#{jobParameters[timeStamp]}") long timeStamp,
        com.fastretailing.catalogPlatformSCMProducer.service.ConfigurationService confService) {
    FlatFileItemReader flatFileReader = new FlatFileItemReader();
    ZipMultiResourceItemReader zipReader = new ZipMultiResourceItemReader();
    Resource[] resArray = new Resource[1];
    resArray[0] = new FileSystemResource(new File(fileName));
    zipReader.setArchives(resArray);
    DefaultLineMapper<ProducerMessage> lineMapper = new DefaultLineMapper<ProducerMessage>();
    lineMapper.setLineTokenizer(new DelimitedLineTokenizer());
    CSVFieldMapper csvFieldMapper = new CSVFieldMapper(fileName, s3SourceFolderPrefix, timeStamp, confService);
    lineMapper.setFieldSetMapper(csvFieldMapper);
    flatFileReader.setLineMapper(lineMapper);
    zipReader.setDelegate(flatFileReader);
    return zipReader;
}

@Bean
@Scope("job")
public ItemProcessor<ProducerMessage, ProducerMessage> processor(@Value("#{jobParameters[timeStamp]}") long timeStamp) {
    ProducerProcessor processor = new ProducerProcessor();
    processor.setS3FileTimeStamp(timeStamp);
    return processor;
}

@Bean
@ConfigurationProperties
public ItemWriter<ProducerMessage> writer() {
    return new KafkaClientWriter();
}

@Bean
public Step step1(StepBuilderFactory stepBuilderFactory,
        ItemReader reader, ItemWriter writer,
        ItemProcessor processor, @Value("${reader.chunkSize}") int chunkSize) {
    LOGGER.info("Step configuration loaded with chunk size {}", chunkSize);
    return stepBuilderFactory.get("step1")
            .chunk(chunkSize).reader(reader)
            .processor(processor).writer(writer)
            .build();
}

@Bean
public StepScope stepScope() {
    final StepScope stepScope = new StepScope();
    stepScope.setAutoProxy(true);
    return stepScope;
}

@Bean
public JobScope jobScope() {
    final JobScope jobScope = new JobScope();
    return jobScope;
}

@Bean
public Configuration configuration() {
    return new Configuration();
}
I tried making the writer job scoped, but in that case open() is not being called, and that is where I do some initialization.
When using Java-based configuration and a scoped proxy, what happens is that the return type of the method is detected and a proxy is created for it. So when you return ItemWriter, you get a JDK proxy implementing only ItemWriter, whereas your open method is on the ItemStream interface. Because that interface isn't included on the proxy, there is no way to call the method.
Either change the return type to KafkaClientWriter or ItemStreamWriter<ProducerMessage> (assuming KafkaClientWriter implements that interface). Next add @Scope("job") and you should have your open method called again on a properly scoped writer.
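Applied to the configuration above, a minimal sketch of the corrected writer bean could look like this (assuming KafkaClientWriter extends AbstractItemStreamItemWriter and therefore implements ItemStreamWriter):

@Bean
@Scope("job")
public ItemStreamWriter<ProducerMessage> writer() {
    // the stream-aware return type keeps open()/update()/close() visible on the scoped proxy
    return new KafkaClientWriter();
}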
