Spring Batch Partitioning At Runtime

I want to read a large file using Spring Batch. I want to split it into multiple smaller files and process each of them in a different thread using partitioning. I am using the code below:
@Bean
@StepScope
public MultiResourcePartitioner partitioner() {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file");
    partitioner.setResources(splitFiles());
    return partitioner;
}

private Resource[] splitFiles() {
    // Read the large file available in the specified folder,
    // split it into smaller files and return them as a resource list
}

@Bean
public TaskExecutorPartitionHandler partitionHandler() {
    TaskExecutorPartitionHandler partitionHandler = new TaskExecutorPartitionHandler();
    partitionHandler.setStep(step1());
    partitionHandler.setTaskExecutor(new SimpleAsyncTaskExecutor());
    return partitionHandler;
}

@Bean
public Step partitionedMaster() {
    return this.stepBuilderFactory.get("step1")
            .partitioner(step1().getName(), partitioner(null))
            .partitionHandler(partitionHandler())
            .build();
}

@Bean
public Job partitionedJob() {
    return this.jobBuilderFactory.get("partitionedJob")
            .start(partitionedMaster())
            .build();
}

@Bean
@StepScope
public FlatFileItemReader<Transaction> fileTransactionReader(@Value("#{stepExecutionContext['file']}") Resource resource) {
    return new FlatFileItemReaderBuilder<Transaction>()
            .name("flatFileTransactionReader")
            .resource(resource)
            .fieldSetMapper(fsm)
            .build();
}
My issue is that the partitioner only partitions the files that are available in the folder when the application starts. Once the application is up and running, if a new file arrives in the same folder, the job cannot read or partition it.
I used @StepScope, but I still have the issue.
How do I read and partition the files dynamically at runtime?
Edit after the first answer:
Thanks for the inputs.
I can modify the code as below to send the files as parameters and invoke the job, but control still never reaches the partitioner method, so I cannot leverage partitioning.
Any inputs on this?
public JobParameters getJobParameters() {
    Resource[] resources = // getFileToProcessResource
    return new JobParametersBuilder()
            .addLong(TIME, System.currentTimeMillis())
            .addString("inputFiles", resources) // note: addString expects a String, so the Resource[] would need to be serialized (e.g. comma-separated paths)
            .toJobParameters();
}

JobParameters jobParameters = getJobParameters();
jobLauncher.run(partitionedJob(), jobParameters);
@Bean
@StepScope
public MultiResourcePartitioner partitioner(@Value("#{jobParameters['inputFiles']}") Resource[] resources) {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file");
    partitioner.setResources(resources);
    return partitioner;
}

Once the application is up and running, if a new file is available in the same folder, the job cannot read or partition it
Batch processing is about fixed data sets. In your case, you start a job but its input data changes in the meantime, so it will not work as you expect. A fixed data set is required for restartability, so that a restart operates on the same data set in case of failure.
Since the input of your job is a file, you can use the file as a job parameter and configure a watch service (or a similar mechanism) to launch a new job instance for each new file that arrives in the folder.
EDIT: Add example to make the partitioner aware of the job parameter
@Bean
@StepScope
public MultiResourcePartitioner partitioner(@Value("#{jobParameters['fileName']}") String fileName) {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file");
    partitioner.setResources(splitFiles(fileName));
    return partitioner;
}

private Resource[] splitFiles(String fileName) {
    // Read the large file passed as a job parameter,
    // split it into smaller files and return them as a resource list
    return null;
}
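To illustrate the "watch service (or similar mechanism)" suggestion, here is a minimal sketch of a scheduled poller that launches a new job instance per incoming file. It is an illustration, not the answer's own code: the @Scheduled poller, the /data/input folder and the in-memory "already launched" set are assumptions, and the "fileName" parameter matches the partitioner shown above. Imports are omitted, as in the rest of the post.

@Component
public class NewFileJobLauncher {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job partitionedJob;

    // hypothetical folder to watch; adjust to your environment
    private final Path inputFolder = Paths.get("/data/input");

    // remember files we already launched a job instance for
    private final Set<Path> alreadyLaunched = ConcurrentHashMap.newKeySet();

    @Scheduled(fixedDelay = 10000)
    public void pollForNewFiles() throws Exception {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputFolder, "*.csv")) {
            for (Path file : files) {
                if (alreadyLaunched.add(file)) {
                    // one job instance per file: the file name is an identifying job parameter
                    JobParameters params = new JobParametersBuilder()
                            .addString("fileName", file.toAbsolutePath().toString())
                            .toJobParameters();
                    jobLauncher.run(partitionedJob, params);
                }
            }
        }
    }
}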

Related

Can we get data processed in Spring Batch after batch job is completed?

I am using Spring Batch to read data from a DB, process it, and do some processing in the writer.
If the batch size is less than the number of records read by the reader, Spring Batch runs in multiple batches. I want to do the processing in the writer only once at the end of all batch processing, or if this is not possible then I will remove the writer and process the data obtained in the processor after the batch job is completed. Is this possible?
Below is the code that triggers my Spring Batch job:
private void triggerSpringBatchJob() {
    loggerConfig.logDebug(log, " : Triggering product catalog scheduler ");
    JobParametersBuilder builder = new JobParametersBuilder();
    try {
        // Add the date to the job parameters; without it we would get
        // "A job instance already exists": JobInstanceAlreadyCompleteException
        builder.addDate("date", new Date());
        jobLauncher.run(processProductCatalog, builder.toJobParameters());
    } catch (JobExecutionAlreadyRunningException | JobRestartException | JobInstanceAlreadyCompleteException
            | JobParametersInvalidException e) {
        e.printStackTrace();
    }
}
Below is my spring batch configuration
@Configuration
@EnableBatchProcessing
public class BatchJobProcessConfiguration {

    @Bean
    @StepScope
    RepositoryItemReader<Tuple> reader(SkuRepository skuRepository,
            ProductCatalogConfiguration productCatalogConfiguration) {
        RepositoryItemReader<Tuple> reader = new RepositoryItemReader<>();
        reader.setRepository(skuRepository);
        // query parameters
        List<Object> queryMethodArguments = new ArrayList<>();
        if (productCatalogConfiguration.getSkuId().isEmpty()) {
            reader.setMethodName("findByWebEligibleAndDiscontinued");
            queryMethodArguments.add(productCatalogConfiguration.getWebEligible());   // for web eligible
            queryMethodArguments.add(productCatalogConfiguration.getDiscontinued());  // for discontinued
            queryMethodArguments.add(productCatalogConfiguration.getCbdProductId());  // for cbd products
        } else {
            reader.setMethodName("findBySkuIds");
            queryMethodArguments.add(productCatalogConfiguration.getSkuId());         // for sku ids
        }
        reader.setArguments(queryMethodArguments);
        reader.setPageSize(1000);
        Map<String, Direction> sorts = new HashMap<>();
        sorts.put("sku_id", Direction.ASC);
        reader.setSort(sorts);
        return reader;
    }

    @Bean
    @StepScope
    ItemWriter<ProductCatalogWriterData> writer() {
        return new ProductCatalogWriter();
    }

    @Bean
    ProductCatalogProcessor processor() {
        return new ProductCatalogProcessor();
    }

    @Bean
    SkipPolicy readerSkipper() {
        return new ReaderSkipper();
    }

    @Bean
    Step productCatalogDataStep(ItemReader<Tuple> itemReader, ProductCatalogWriter writer,
            HttpServletRequest request, StepBuilderFactory stepBuilderFactory, BatchConfiguration batchConfiguration) {
        return stepBuilderFactory.get("processProductCatalog")
                .<Tuple, ProductCatalogWriterData>chunk(batchConfiguration.getBatchChunkSize())
                .reader(itemReader)
                .faultTolerant()
                .skipPolicy(readerSkipper())
                .processor(processor())
                .writer(writer)
                .build();
    }

    @Bean
    Job productCatalogData(Step productCatalogDataStep, HttpServletRequest request,
            JobBuilderFactory jobBuilderFactory) {
        return jobBuilderFactory.get("processProductCatalog")
                .incrementer(new RunIdIncrementer())
                .flow(productCatalogDataStep)
                .end()
                .build();
    }
}
I want to do the processing in the writer only once at the end of all batch processing, or if this is not possible then I will remove the writer and process the data obtained in the processor after the batch job is completed. Is this possible?
"at the end of all batch process completion" is key here. If the requirement is to do some processing after all chunks have been "pre-processed", I would keep it simple and use two steps for that:
Step 1: (pre)processes the data as needed and writes it to a temporary storage
Step 2: Here you do whatever you want with the processed data prepared in the temporary storage
A final step would clean up the temporary storage if it is persistent (file, staging table, etc). Otherwise, ie if it is in memory, this is optional.
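A minimal sketch of that two-step layout, reusing the reader/processor types from the question. The stagingTableWriter bean and the StagingPostProcessingTasklet are illustrative assumptions, not code from the original post:

@Bean
Step preProcessStep(StepBuilderFactory stepBuilderFactory,
        ItemReader<Tuple> reader,
        ItemProcessor<Tuple, ProductCatalogWriterData> processor,
        ItemWriter<ProductCatalogWriterData> stagingTableWriter) {
    // Step 1: (pre)process the data chunk by chunk and write it to temporary storage
    return stepBuilderFactory.get("preProcessStep")
            .<Tuple, ProductCatalogWriterData>chunk(1000)
            .reader(reader)
            .processor(processor)
            .writer(stagingTableWriter)
            .build();
}

@Bean
Step postProcessStep(StepBuilderFactory stepBuilderFactory,
        StagingPostProcessingTasklet stagingPostProcessingTasklet) {
    // Step 2: a tasklet that runs exactly once, after all chunks of step 1 are done
    return stepBuilderFactory.get("postProcessStep")
            .tasklet(stagingPostProcessingTasklet)
            .build();
}

@Bean
Job productCatalogJob(JobBuilderFactory jobBuilderFactory, Step preProcessStep, Step postProcessStep) {
    return jobBuilderFactory.get("productCatalogJob")
            .start(preProcessStep)
            .next(postProcessStep)
            .build();
}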

How to specify the taskexecutor to spawn thread after reading a single csv by using MultiResourceItemReader in spring batch

I am trying to read multiple files in Spring Batch using a MultiResourceItemReader, and I also have a taskExecutor so that the records in each file are read in multiple threads. Suppose there are 3 CSV files in a folder; MultiResourceItemReader should process them one by one, but because I have a taskExecutor, different threads pick up the CSV files, e.g. two CSV files from the same folder are taken up by different threads and start executing.
Expectation:
MultiResourceItemReader should read the first file, then the taskExecutor should spawn different threads and execute it. Then another file should be picked up and taken by the taskExecutor for execution.
Code snippet / batch configuration:
@Bean
public Step Step1() {
    return stepBuilderFactory.get("Step1")
            .<POJO, POJO>chunk(5)
            .reader(multiResourceItemReader())
            .writer(writer())
            .taskExecutor(taskExecutor()).throttleLimit(throttleLimit)
            .build();
}

@Bean
public MultiResourceItemReader<POJO> multiResourceItemReader() {
    MultiResourceItemReader<POJO> resourceItemReader = new MultiResourceItemReader<POJO>();
    ClassLoader cl = this.getClass().getClassLoader();
    ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver(cl);
    Resource[] resources;
    try {
        resources = resolver.getResources("file:/temp/*.csv");
        resourceItemReader.setResources(resources);
        resourceItemReader.setDelegate(itemReader());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return resourceItemReader;
}
Maybe you should check the Partitioner (https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#partitioning) and implement something like the following:
@Bean
public Step mainStep(StepBuilderFactory stepBuilderFactory,
        FlatFileItemReader itemReader,
        ListDelegateWriter listDelegateWriter,
        BatchProperties batchProperties) {
    return stepBuilderFactory.get(Steps.MAIN)
            .<POJO, POJO>chunk(pageSize)
            .reader(itemReader)
            .writer(listDelegateWriter)
            .build();
}

@Bean
public TaskExecutor jobTaskExecutor(@Value("${batch.config.core-pool-size}") Integer corePoolSize,
        @Value("${batch.config.max-pool-size}") Integer maxPoolSize) {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setCorePoolSize(corePoolSize);
    taskExecutor.setMaxPoolSize(maxPoolSize);
    taskExecutor.afterPropertiesSet();
    return taskExecutor;
}

@Bean
@StepScope
public Partitioner partitioner(@Value("#{jobExecutionContext[ResourcesToRead]}") String[] resourcePaths,
        @Value("${batch.config.grid-size}") Integer gridSize) {
    Resource[] resourceList = Arrays.stream(resourcePaths)
            .map(FileSystemResource::new)
            .toArray(Resource[]::new);
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setResources(resourceList);
    partitioner.partition(gridSize);
    return partitioner;
}

@Bean
public Step masterStep(StepBuilderFactory stepBuilderFactory, Partitioner partitioner, Step mainStep, TaskExecutor jobTaskExecutor) {
    return stepBuilderFactory.get(BatchConstants.MASTER)
            .partitioner(mainStep)
            .partitioner(Steps.MAIN, partitioner)
            .taskExecutor(jobTaskExecutor)
            .build();
}
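The partitioner above expects a ResourcesToRead entry in the job execution context. The answer does not show how it gets there; one possible way (a sketch, with the file pattern taken from the question) is a small tasklet step that runs before the master step, resolves the files and stores their paths in the job execution context:

@Bean
public Step resolveResourcesStep(StepBuilderFactory stepBuilderFactory,
        ResourcePatternResolver resourcePatternResolver) {
    return stepBuilderFactory.get("resolveResourcesStep")
            .tasklet((contribution, chunkContext) -> {
                // resolve the input files and expose their absolute paths
                // under the key the partitioner reads ("ResourcesToRead")
                Resource[] resources = resourcePatternResolver.getResources("file:/temp/*.csv");
                String[] paths = new String[resources.length];
                for (int i = 0; i < resources.length; i++) {
                    paths[i] = resources[i].getFile().getAbsolutePath();
                }
                chunkContext.getStepContext().getStepExecution()
                        .getJobExecution().getExecutionContext()
                        .put("ResourcesToRead", paths);
                return RepeatStatus.FINISHED;
            })
            .build();
}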

Ability to read multiple files then write content per source file

Reading the files works fine, but writing them does not.
I would like to read multiple files and then use MultiResourceItemWriter to write them out separately, like this:
Read Files:
source/abc.csv
source/cbd.csv
source/efg.csv
Should write files separate like:
target/abc.csv
target/cbd.csv
target/efg.csv
But currently it's putting all data in one file.
@Bean
public MultiResourceItemWriter<FooCsv> multipleCsvWriter(@Value("${directory.destination}") Resource folder) throws Exception {
    MultiResourceItemWriter<FooCsv> writer = new MultiResourceItemWriter<>();
    writer.setResource(folder);
    writer.setDelegate(csvWriter(file));
    return writer;
}
Note this is like copy and paste from source folder to target folder.
There are two ways to do this:
Write a custom reader by extending MultiResourceItemReader; in its read() method you can get the current source file name. Then set this file name on your FooCsv item and pass the source file name dynamically to your writer (see the sketch after this list).
Write one extra step and tasklet which lists all your available source files, then add them to the job parameters and pass both source and destination file names dynamically to the next step, calling this step in a loop until all listed files are processed.
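A minimal sketch of option 1. It relies on MultiResourceItemReader#getCurrentResource(), which is part of the Spring Batch API; the setSourceFilename property on FooCsv is a hypothetical addition for illustration, not something from the original post:

public class FileNameAwareMultiResourceItemReader extends MultiResourceItemReader<FooCsv> {

    @Override
    public FooCsv read() throws Exception {
        FooCsv item = super.read();
        if (item != null && getCurrentResource() != null) {
            // remember which source file the item came from so the writer
            // can route it to the matching target file
            item.setSourceFilename(getCurrentResource().getFilename());
        }
        return item;
    }
}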
You can use MultiResourcePartitioner to achieve the same. Here is a sample batch config:
@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Autowired
    private ResourcePatternResolver resourcePatternResolver;
    @Autowired
    private JobBuilderFactory jobBuilderFactory;
    @Autowired
    private StepBuilderFactory stepBuilderFactory;
    @Autowired
    public DataSource dataSource;
    @Autowired
    ApplicationContext context;

    @Bean
    @JobScope
    public MultiResourcePartitioner partitioner(@Value("#{jobParameters[srcDir]}") String src) throws IOException {
        Resource[] resources = resourcePatternResolver.getResources(src);
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(resources);
        partitioner.partition(1);
        return partitioner;
    }

    @Bean
    @StepScope
    public FlatFileItemReader<String> reader(@Value("#{stepExecutionContext[fileName]}") Resource file) {
        FlatFileItemReader<String> reader = new FlatFileItemReader<String>();
        reader.setResource(file);
        reader.setLineMapper(new PassThroughLineMapper());
        return reader;
    }

    @Bean
    @StepScope
    public FlatFileItemWriter<String> writer(@Value("#{jobParameters[destDir]}") String dest,
            @Value("#{stepExecutionContext[fileName]}") Resource file) {
        String destFile = dest + file.getFilename();
        System.out.println(destFile);
        FlatFileItemWriter<String> writer = new FlatFileItemWriter<String>();
        writer.setLineAggregator(new PassThroughLineAggregator<>());
        writer.setResource(resourcePatternResolver.getResource(destFile));
        return writer;
    }

    @Bean
    public Job kpJob() {
        return jobBuilderFactory.get("kpJob").incrementer(new RunIdIncrementer()).flow(step1()).end().build();
    }

    @Bean
    public Step step1() {
        Partitioner partitioner = context.getBean(MultiResourcePartitioner.class);
        return stepBuilderFactory.get("step1").partitioner(slaveStep()).partitioner("step1.slave", partitioner).build();
    }

    @Bean
    public Step slaveStep() {
        ItemReader<String> reader = context.getBean(FlatFileItemReader.class);
        ItemWriter<String> writer = context.getBean(FlatFileItemWriter.class);
        return stepBuilderFactory.get("step1.slave").<String, String>chunk(10).reader(reader).writer(writer).build();
    }
}
And pass srcDir and destDir as job parameters.
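For example, launching the job with those parameters could look like this (a sketch; the JobLauncher and Job wiring and the concrete paths are assumptions, not part of the original answer):

JobParameters jobParameters = new JobParametersBuilder()
        .addString("srcDir", "file:/data/source/*.csv")  // pattern resolved by the partitioner
        .addString("destDir", "file:/data/target/")      // prefix used by the step-scoped writer
        .addLong("time", System.currentTimeMillis())     // makes each run a new job instance
        .toJobParameters();
jobLauncher.run(kpJob, jobParameters);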
A simple solution to your problem: why not just create a Step for each of the source files? Given your scenario, source and target files map one to one. MultiResourceItemWriter is just a way to create a new output file once it reaches the limit you set with setItemCountLimitPerResource(int) (which I don't see in your example). You can even run the steps in parallel if you have performance concerns, provided there are no dependencies or required ordering between the files.
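A rough sketch of that "one step per file, run in parallel" idea, using Spring Batch's standard flow-split API. The stepFor(...) factory method is hypothetical (it would build one reader/writer step per source/target pair); the file names are taken from the question:

@Bean
public Job perFileJob(JobBuilderFactory jobBuilderFactory) {
    // one step per source/target pair, wrapped in flows so they can run in parallel
    Flow abcFlow = new FlowBuilder<Flow>("abcFlow").start(stepFor("source/abc.csv", "target/abc.csv")).build();
    Flow cbdFlow = new FlowBuilder<Flow>("cbdFlow").start(stepFor("source/cbd.csv", "target/cbd.csv")).build();
    Flow efgFlow = new FlowBuilder<Flow>("efgFlow").start(stepFor("source/efg.csv", "target/efg.csv")).build();

    Flow parallelFlow = new FlowBuilder<Flow>("parallelFlow")
            .split(new SimpleAsyncTaskExecutor())
            .add(abcFlow, cbdFlow, efgFlow)
            .build();

    return jobBuilderFactory.get("perFileJob").start(parallelFlow).end().build();
}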

MultiResourcePartitioner - Multiple Partitions Loading Same Resource

I'm using local partitioning in Spring Batch to write XML files to the database. I have already split the original file into smaller files, and I have used MultiResourcePartitioner to process each one of them, so that each file is processed by one thread. I'm getting a violation of primary key constraint error and I don't know how to deal with this issue.
List of files
The partitioner:
@Bean
public Partitioner partitioner1() {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    Resource[] resources;
    try {
        resources = resourcePatternResolver.getResources("file:src/main/resources/data/*.xml");
    } catch (IOException e) {
        throw new RuntimeException("I/O problems when resolving the input file pattern.", e);
    }
    partitioner.setResources(resources);
    return partitioner;
}
The StaxEventItemReader, using an XML file as input for the reader:
@Bean
@StepScope
public StaxEventItemReader<Customer> customerItemReader() {
    XStreamMarshaller unmarshaller = new XStreamMarshaller();
    Map<String, Class> aliases = new HashMap<>();
    aliases.put("customer", Customer.class);
    unmarshaller.setAliases(aliases);
    StaxEventItemReader<Customer> reader = new StaxEventItemReader<>();
    reader.setResource(new ClassPathResource("data/customerOutput1-25000.xml"));
    reader.setFragmentRootElementName("customer");
    reader.setUnmarshaller(unmarshaller);
    return reader;
}
The JdbcBatchItemWriter (writing to the database)
@Bean
@StepScope
public JdbcBatchItemWriter<Customer> customerItemWriter() {
    JdbcBatchItemWriter<Customer> itemWriter = new JdbcBatchItemWriter<>();
    itemWriter.setDataSource(this.dataSource);
    itemWriter.setSql("INSERT INTO NEW_CUSTOMER VALUES (:id, :firstName, :lastName, :birthdate)");
    itemWriter.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider());
    itemWriter.afterPropertiesSet();
    return itemWriter;
}
Thanks for any help
Your reader has this line, which causes all the partitions to load the same file:
reader.setResource(new ClassPathResource("data/customerOutput1-25000.xml"));
It should instead take the resource from the step execution context. You can access the execution context either in the open() method of the ItemStream interface or in the beforeStep() method of the StepExecutionListener interface. A bit of personal preference here, but I generally think using ItemStream is the "better" solution.
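The answer recommends ItemStream#open() or beforeStep(); an alternative, shown elsewhere on this page, is a step-scoped reader that late-binds the resource from the step execution context. A sketch of that step-scoped variant, assuming the MultiResourcePartitioner's default key name "fileName":

@Bean
@StepScope
public StaxEventItemReader<Customer> customerItemReader(
        @Value("#{stepExecutionContext['fileName']}") Resource file) {
    XStreamMarshaller unmarshaller = new XStreamMarshaller();
    Map<String, Class> aliases = new HashMap<>();
    aliases.put("customer", Customer.class);
    unmarshaller.setAliases(aliases);

    StaxEventItemReader<Customer> reader = new StaxEventItemReader<>();
    // each partition gets its own file from the step execution context
    // instead of the hard-coded ClassPathResource
    reader.setResource(file);
    reader.setFragmentRootElementName("customer");
    reader.setUnmarshaller(unmarshaller);
    return reader;
}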

How to change my job configuration to add file name dynamically

I have a spring batch job which reads from a db then outputs to a multiple csv's. Inside my db I have a special column named divisionId. A CSV file should exist for every distinct value of divisionId. I split out the data using a ClassifierCompositeItemWriter.
At the moment I have an ItemWriter bean defined for every distinct value of divisionId. The beans are the same, it's only the file name that is different.
How can I change the configuration below to create a file with the divisionId automatically prepended to the file name, without having to register a new ItemWriter for each divisionId?
I've been playing around with the @JobScope and @StepScope annotations but can't get it right.
Thanks in advance.
@Bean
public Step readStgDbAndExportMasterListStep() {
    return commonJobConfig.stepBuilderFactory
            .get("readStgDbAndExportMasterListStep")
            .<MasterList, MasterList>chunk(commonJobConfig.chunkSize)
            .reader(commonJobConfig.queryStagingDbReader())
            .processor(masterListOutputProcessor())
            .writer(masterListFileWriter())
            .stream((ItemStream) divisionMasterListFileWriter45())
            .stream((ItemStream) divisionMasterListFileWriter90())
            .build();
}

@Bean
public ItemWriter<MasterList> masterListFileWriter() {
    BackToBackPatternClassifier classifier = new BackToBackPatternClassifier();
    classifier.setRouterDelegate(new DivisionClassifier());
    classifier.setMatcherMap(new HashMap<String, ItemWriter<? extends MasterList>>() {{
        put("45", divisionMasterListFileWriter45());
        put("90", divisionMasterListFileWriter90());
    }});
    ClassifierCompositeItemWriter<MasterList> writer = new ClassifierCompositeItemWriter<MasterList>();
    writer.setClassifier(classifier);
    return writer;
}

@Bean
public ItemWriter<MasterList> divisionMasterListFileWriter45() {
    FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
    writer.setResource(new FileSystemResource(new File(commonJobConfig.outDir, "45_masterList.csv")));
    writer.setHeaderCallback(masterListFlatFileHeaderCallback());
    writer.setLineAggregator(masterListFormatterLineAggregator());
    return writer;
}

@Bean
public ItemWriter<MasterList> divisionMasterListFileWriter90() {
    FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
    writer.setResource(new FileSystemResource(new File(commonJobConfig.outDir, "90_masterList.csv")));
    writer.setHeaderCallback(masterListFlatFileHeaderCallback());
    writer.setLineAggregator(masterListFormatterLineAggregator());
    return writer;
}
I came up with a pretty complex way of doing this. I followed a tutorial at https://github.com/langmi/spring-batch-examples/wiki/Rename-Files.
The premise is to use the step execution context to place the file name in it, and have the writer pick it up from there.
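To sketch that premise (this illustrates the approach the answer describes, not its actual code): a single step-scoped writer whose file name is late-bound from a divisionId value in the step execution context, replacing the per-division writer beans. The 'divisionId' key and the way it gets into the context (e.g. a partitioner over the distinct division IDs, or a listener as in the linked tutorial) are assumptions:

@Bean
@StepScope
public FlatFileItemWriter<MasterList> divisionMasterListFileWriter(
        @Value("#{stepExecutionContext['divisionId']}") String divisionId) {
    FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
    // the division id decides the file name at runtime, so one bean covers all divisions
    writer.setResource(new FileSystemResource(new File(commonJobConfig.outDir, divisionId + "_masterList.csv")));
    writer.setHeaderCallback(masterListFlatFileHeaderCallback());
    writer.setLineAggregator(masterListFormatterLineAggregator());
    return writer;
}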
