MultiResourcePartitioner - Multiple Partitions Loading Same Resource - spring

I'm using local partitioning in Spring Batch to write XML files to the database. I have already split the original file into smaller files, and I use a MultiResourcePartitioner so that each file is processed by one thread. I'm getting a violation of primary key constraint error and I don't know how to deal with this issue.
List of files
The partitioner:
@Bean
public Partitioner partitioner1() {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    Resource[] resources;
    try {
        resources = resourcePatternResolver.getResources("file:src/main/resources/data/*.xml");
    } catch (IOException e) {
        throw new RuntimeException("I/O problems when resolving the input file pattern.", e);
    }
    partitioner.setResources(resources);
    return partitioner;
}
The StaxEventItemReader, using an XML file as input for the reader:
@Bean
@StepScope
public StaxEventItemReader<Customer> CustomerItemReader() {
    XStreamMarshaller unmarshaller = new XStreamMarshaller();
    Map<String, Class> aliases = new HashMap<>();
    aliases.put("customer", Customer.class);
    unmarshaller.setAliases(aliases);
    StaxEventItemReader<Customer> reader = new StaxEventItemReader<>();
    reader.setResource(new ClassPathResource("data/customerOutput1-25000.xml"));
    reader.setFragmentRootElementName("customer");
    reader.setUnmarshaller(unmarshaller);
    return reader;
}
The JdbcBatchItemWriter (writing to the database)
@Bean
@StepScope
public JdbcBatchItemWriter<Customer> customerItemWriter() {
    JdbcBatchItemWriter<Customer> itemWriter = new JdbcBatchItemWriter<>();
    itemWriter.setDataSource(this.dataSource);
    itemWriter.setSql("INSERT INTO NEW_CUSTOMER VALUES (:id, :firstName, :lastName, :birthdate)");
    itemWriter.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider());
    itemWriter.afterPropertiesSet();
    return itemWriter;
}
Thanks for any help

Your reader has this line, which causes all the partitions to load the same file:
reader.setResource(new ClassPathResource("data/customerOutput1-25000.xml"));
It should instead take the resource from the step execution context. You can access the execution context either in the open() method via the ItemStream interface or in the beforeStep() method of the StepExecutionListener interface. A bit of personal preference here, but I generally think using ItemStream is the "better" solution.
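For illustration, here is a minimal sketch of the step-scoped variant of that fix: the resource that MultiResourcePartitioner stores in each partition's step execution context (under its default key "fileName") is late-bound into the reader, so every partition reads its own file. The bean body mirrors the reader from the question; treat the names and the key as assumptions to adapt, and note that the ItemStream/open() route mentioned above reads the same key programmatically instead.
@Bean
@StepScope
public StaxEventItemReader<Customer> customerItemReader(
        @Value("#{stepExecutionContext['fileName']}") Resource resource) {
    XStreamMarshaller unmarshaller = new XStreamMarshaller();
    Map<String, Class> aliases = new HashMap<>();
    aliases.put("customer", Customer.class);
    unmarshaller.setAliases(aliases);
    StaxEventItemReader<Customer> reader = new StaxEventItemReader<>();
    // Each partition gets its own resource instead of the hard-coded ClassPathResource,
    // which is what caused every partition to insert the same rows (hence the PK violation).
    reader.setResource(resource);
    reader.setFragmentRootElementName("customer");
    reader.setUnmarshaller(unmarshaller);
    return reader;
}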

Related

Spring Batch Partitioning At Runtime

I want to read a large file using Spring Batch. I want to split it into multiple files and process each of them in a different thread using partitions. I am using the code below:
@Bean
@StepScope
public MultiResourcePartitioner partitioner() {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file");
    partitioner.setResources(splitFiles());
    return partitioner;
}

private Resource[] splitFiles() {
    // Read the large file available in the specified folder,
    // split it into smaller files and return them as a resource list.
}

@Bean
public TaskExecutorPartitionHandler partitionHandler() {
    TaskExecutorPartitionHandler partitionHandler = new TaskExecutorPartitionHandler();
    partitionHandler.setStep(step1());
    partitionHandler.setTaskExecutor(new SimpleAsyncTaskExecutor());
    return partitionHandler;
}

@Bean
public Step partitionedMaster() {
    return this.stepBuilderFactory.get("step1")
            .partitioner(step1().getName(), partitioner(null))
            .partitionHandler(partitionHandler())
            .build();
}

@Bean
public Job partitionedJob() {
    return this.jobBuilderFactory.get("partitionedJob")
            .start(partitionedMaster())
            .build();
}
@Bean
@StepScope
public FlatFileItemReader<Transaction> fileTransactionReader(@Value("#{stepExecutionContext['file']}") Resource resource) {
    // fsm is a FieldSetMapper<Transaction> defined elsewhere; the tokenizer configuration is omitted here.
    return new FlatFileItemReaderBuilder<Transaction>()
            .name("flatFileTransactionReader")
            .resource(resource)
            .fieldSetMapper(fsm)
            .build();
}
My issue is that the partitioner only partitions the files that are available in the folder when the application starts. Once the application is up and running, if a new file arrives in the same folder, the job can't read or partition it.
I used @StepScope, but I'm still having the issue.
How do I read and partition the files dynamically at runtime?
Edit after the first answer:
Thanks for the input.
I can modify the code as below to pass the files as parameters and invoke the job, but the control still does not go inside the partitioner method, so I cannot leverage partitioning.
Any input on this?
public JobParameters getJobParameters() {
    Resource[] resources = //getFileToProcessResource
    return new JobParametersBuilder()
            .addLong(TIME, System.currentTimeMillis())
            .addString("inputFiles", resources)
            .toJobParameters();
}

JobParameters jobParameters = getJobParameters();
jobLauncher.run(partitionedJob(), jobParameters);

@Bean
@StepScope
public MultiResourcePartitioner partitioner(@Value("#{jobParameters['inputFiles']}") Resource[] resources) {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file");
    partitioner.setResources(resources);
    return partitioner;
}
Once the application is up and running, if a new file is available in the same folder, the job couldn't read them/partition them
Batch processing is about fixed data sets. In your case, you start a job but its input data changes in the meantime, so that's not going to work as you expect. A fixed data set is required for restartability in order to work on the same data set in case of failure.
Since the input of your job is a file, you can use the file as a job parameter and configure a watch service (or similar mechanism) to launch a new job instance for each new file in the folder.
EDIT: Add example to make the partitioner aware of the job parameter
@Bean
@StepScope
public MultiResourcePartitioner partitioner(@Value("#{jobParameters['fileName']}") String fileName) {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file");
    partitioner.setResources(splitFiles(fileName));
    return partitioner;
}

private Resource[] splitFiles(String fileName) {
    // Read the large file available in the specified folder,
    // split it into smaller files and return them as a resource list.
    return null;
}
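To make the "watch service (or similar mechanism)" suggestion above concrete, here is a minimal sketch, assuming a JobLauncher and the partitionedJob bean are available and that the job binds jobParameters['fileName'] as in the edited partitioner above; the watched directory and parameter names are illustrative placeholders, not part of the original answer.
public void watchFolderAndLaunchJobs(JobLauncher jobLauncher, Job partitionedJob) throws Exception {
    Path dir = Paths.get("/data/input"); // illustrative folder to watch
    try (WatchService watchService = dir.getFileSystem().newWatchService()) {
        dir.register(watchService, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watchService.take(); // blocks until a file system event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path newFile = dir.resolve((Path) event.context());
                JobParameters jobParameters = new JobParametersBuilder()
                        .addString("fileName", newFile.toAbsolutePath().toString())
                        .addLong("time", System.currentTimeMillis()) // keeps parameters unique across launches
                        .toJobParameters();
                jobLauncher.run(partitionedJob, jobParameters); // one job instance per new file
            }
            key.reset();
        }
    }
}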

How to call appropriate Item Processor for different records?

I have a flat file containing different types of records (header, record and footer):
HR,...
RD,...
FR,...
ItemReader
@Bean
@StepScope
public FlatFileItemReader reader(@Value("#{jobParameters['inputFileName']}") String inputFileName) {
    FlatFileItemReader reader = new FlatFileItemReader();
    reader.setResource(new FileSystemResource(inputFileName));
    reader.setLineMapper(patternLineMapper());
    return reader;
}
@Bean
public LineMapper patternLineMapper() {
    PatternMatchingCompositeLineMapper patternLineMapper = new PatternMatchingCompositeLineMapper<>();
    Map<String, LineTokenizer> tokenizers = new HashMap<String, LineTokenizer>();
    try {
        tokenizers.put("HR*", headerLineTokenizer());
        tokenizers.put("RD*", recordLineTokenizer());
        tokenizers.put("FR*", footerLineTokenizer());
    } catch (Exception e) {
        e.printStackTrace();
    }
    Map<String, FieldSetMapper> fieldSetMappers = new HashMap<String, FieldSetMapper>();
    fieldSetMappers.put("HR*", new HeaderFieldSetMapper());
    fieldSetMappers.put("RD*", new RecordFieldSetMapper());
    fieldSetMappers.put("FR*", new FooterFieldSetMapper());
    patternLineMapper.setTokenizers(tokenizers);
    patternLineMapper.setFieldSetMappers(fieldSetMappers);
    return patternLineMapper;
}
They are working fine and Spring Batch calls the appropriate mapper for each record. The problem is that when I use the same approach for the item processor, I get a java.lang.ClassCastException because Spring Batch tries to map the domain object [returned from the reader] to java.lang.String.
ItemProcessor
@Bean
@StepScope
public ItemProcessor processor() {
    ClassifierCompositeItemProcessor processor = new ClassifierCompositeItemProcessor();
    PatternMatchingClassifier<ItemProcessor> classifier = new PatternMatchingClassifier<>();
    Map<String, ItemProcessor> patternMap = new HashMap<>();
    patternMap.put("HR*", new HeaderItemProcessor());
    patternMap.put("RD*", new RecordItemProcessor());
    patternMap.put("FR*", new FooterItemProcessor());
    classifier.setPatternMap(patternMap);
    processor.setClassifier(classifier);
    return processor;
}
I also used BackToBackPatternClassifier, but it turns out it has a bug, and when I use generics like ItemWriter<Object> I get a "Couldn't Open File" exception. The question is:
How can I make an ItemProcessor that handles the different record types returned from the reader?
Your issue is that the classifier you use in the ClassifierCompositeItemProcessor is based on a String pattern and not a type. What really should happen is something like:
The reader returns a specific type of items based on the input pattern, something like:
HR* -> HRType
RD* -> RDType
FR* -> FRType
This is what you have basically done on the reader side. Now on the processing side, the processor will receive objects of type HRType, RDType and FRType. So the classifier should not be based on String as input type, but on the item type, something like:
Map<Object, ItemProcessor> patternMap = new HashMap<>();
patternMap.put(HRType.class, new HeaderItemProcessor());
patternMap.put(RDType.class, new RecordItemProcessor());
patternMap.put(FRType.class, new FooterItemProcessor());
This classifier uses the Object type because your ItemReader returns a raw type. I would not recommend using raw types and the Object type in the classifier. What you should do is:
create a base class for your items and a specific class for each record type
make the reader return items of type <? extends BaseClass>
use a org.springframework.classify.SubclassClassifier in your ClassifierCompositeItemProcessor, as sketched below
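A minimal sketch of that recommendation, assuming hypothetical HRType/RDType/FRType subclasses of a common BaseRecord returned by the reader, and reusing the delegate processors from the question; it is illustrative rather than a drop-in configuration.
@Bean
@StepScope
public ItemProcessor<BaseRecord, BaseRecord> processor() {
    // Route each item to a delegate processor based on its concrete class.
    Map<Class<? extends BaseRecord>, ItemProcessor<?, ? extends BaseRecord>> delegates = new HashMap<>();
    delegates.put(HRType.class, new HeaderItemProcessor());
    delegates.put(RDType.class, new RecordItemProcessor());
    delegates.put(FRType.class, new FooterItemProcessor());

    SubclassClassifier<BaseRecord, ItemProcessor<?, ? extends BaseRecord>> classifier = new SubclassClassifier<>();
    classifier.setTypeMap(delegates);

    ClassifierCompositeItemProcessor<BaseRecord, BaseRecord> processor = new ClassifierCompositeItemProcessor<>();
    processor.setClassifier(classifier);
    return processor;
}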

ItemWriter not outputting rows how I would like

I've written a Spring Batch job to read from a database and then write to a CSV.
The job works, but unfortunately my output CSV file just contains whatever is in the toString method of my domain object.
What I am really after is all the values in the bean separated by a comma, which is why in my ItemWriter below I put in a DelimitedLineAggregator.
But I think my understanding of the DelimitedLineAggregator is wrong: I thought the LineAggregator was used for the output, but now I think it's used for the input data.
@Bean
@StepScope
public ItemWriter<MasterList> masterListFileWriter(
        FileSystemResource masterListFile,
        @Value("#{stepExecutionContext}") Map<String, Object> executionContext) {
    FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
    writer.setResource(masterListFile);
    DelimitedLineAggregator<MasterList> lineAggregator = new DelimitedLineAggregator<>();
    lineAggregator.setDelimiter(";");
    writer.setLineAggregator(lineAggregator);
    writer.setForceSync(true);
    writer.open(new ExecutionContext(executionContext));
    return writer;
}
Two things.
What can I change to output all the values of my MasterList domain object separated by a comma? Is changing the toString method the only way?
Also, can someone clarify the use of the LineAggregator in the writer? I'm now thinking it's used to specify how you want to aggregate lines coming from your reader. Is that right?
Thanks in advance
I worked this out by adding a BeanWrapperFieldExtractor to the writer.
@Bean
@StepScope
public ItemWriter<MasterList> masterListFileWriter(
        FileSystemResource masterListFile,
        @Value("#{stepExecutionContext}") Map<String, Object> executionContext) {
    FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
    writer.setResource(masterListFile);
    DelimitedLineAggregator<MasterList> lineAggregator = new DelimitedLineAggregator<>();
    lineAggregator.setDelimiter(",");
    BeanWrapperFieldExtractor<MasterList> extractor = new BeanWrapperFieldExtractor<MasterList>();
    extractor.setNames(new String[] { "l2", "l2Name" });
    lineAggregator.setFieldExtractor(extractor);
    writer.setLineAggregator(lineAggregator);
    writer.setForceSync(true);
    writer.open(new ExecutionContext(executionContext));
    return writer;
}

How to change my job configuration to add file name dynamically

I have a spring batch job which reads from a db then outputs to a multiple csv's. Inside my db I have a special column named divisionId. A CSV file should exist for every distinct value of divisionId. I split out the data using a ClassifierCompositeItemWriter.
At the moment I have an ItemWriter bean defined for every distinct value of divisionId. The beans are the same, it's only the file name that is different.
How can I change the configuration below to create a file with the divisionId automatically prepended to the file name, without having to register a new ItemWriter for each divisionId?
I've been playing around with the @JobScope and @StepScope annotations but can't get it right.
Thanks in advance.
@Bean
public Step readStgDbAndExportMasterListStep() {
    return commonJobConfig.stepBuilderFactory
            .get("readStgDbAndExportMasterListStep")
            .<MasterList, MasterList>chunk(commonJobConfig.chunkSize)
            .reader(commonJobConfig.queryStagingDbReader())
            .processor(masterListOutputProcessor())
            .writer(masterListFileWriter())
            .stream((ItemStream) divisionMasterListFileWriter45())
            .stream((ItemStream) divisionMasterListFileWriter90())
            .build();
}

@Bean
public ItemWriter<MasterList> masterListFileWriter() {
    BackToBackPatternClassifier classifier = new BackToBackPatternClassifier();
    classifier.setRouterDelegate(new DivisionClassifier());
    classifier.setMatcherMap(new HashMap<String, ItemWriter<? extends MasterList>>() {{
        put("45", divisionMasterListFileWriter45());
        put("90", divisionMasterListFileWriter90());
    }});
    ClassifierCompositeItemWriter<MasterList> writer = new ClassifierCompositeItemWriter<MasterList>();
    writer.setClassifier(classifier);
    return writer;
}

@Bean
public ItemWriter<MasterList> divisionMasterListFileWriter45() {
    FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
    writer.setResource(new FileSystemResource(new File(commonJobConfig.outDir, "45_masterList.csv")));
    writer.setHeaderCallback(masterListFlatFileHeaderCallback());
    writer.setLineAggregator(masterListFormatterLineAggregator());
    return writer;
}

@Bean
public ItemWriter<MasterList> divisionMasterListFileWriter90() {
    FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
    writer.setResource(new FileSystemResource(new File(commonJobConfig.outDir, "90_masterList.csv")));
    writer.setHeaderCallback(masterListFlatFileHeaderCallback());
    writer.setLineAggregator(masterListFormatterLineAggregator());
    return writer;
}
I came up with a pretty complex way of doing this. I followed the tutorial at https://github.com/langmi/spring-batch-examples/wiki/Rename-Files.
The premise is to place the file name in the step execution context; a sketch of that idea follows.
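A minimal sketch of one way to apply that, assuming some component (for example a StepExecutionListener or the classifier's router delegate) has stored the division id in the step execution context under a hypothetical "divisionId" key; the bean and helper names are borrowed from the configuration above and are illustrative only.
@Bean
@StepScope
public FlatFileItemWriter<MasterList> divisionMasterListFileWriter(
        @Value("#{stepExecutionContext['divisionId']}") String divisionId) {
    FlatFileItemWriter<MasterList> writer = new FlatFileItemWriter<>();
    // The division id becomes part of the file name, so a single bean definition
    // can cover every division instead of one hand-written writer per divisionId.
    writer.setResource(new FileSystemResource(new File(commonJobConfig.outDir, divisionId + "_masterList.csv")));
    writer.setHeaderCallback(masterListFlatFileHeaderCallback());
    writer.setLineAggregator(masterListFormatterLineAggregator());
    return writer;
}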

How to make writer initiated for every job instance in spring batch

I am writing a Spring Batch job. I am implementing a custom writer using KafkaClientWriter extends AbstractItemStreamItemWriter<ProducerMessage>.
I have fields which need to be unique for each instance, but I can see this class is initiated only once; the rest of the jobs get the same instance of the writer class.
Whereas my custom readers and processors are initiated for each job.
Below is my job configuration. How can I achieve the same behavior for the writer as well?
@Bean
@Scope("job")
public ZipMultiResourceItemReader reader(@Value("#{jobParameters[fileName]}") String fileName, @Value("#{jobParameters[s3SourceFolderPrefix]}") String s3SourceFolderPrefix, @Value("#{jobParameters[timeStamp]}") long timeStamp, com.fastretailing.catalogPlatformSCMProducer.service.ConfigurationService confService) {
    FlatFileItemReader flatFileReader = new FlatFileItemReader();
    ZipMultiResourceItemReader zipReader = new ZipMultiResourceItemReader();
    Resource[] resArray = new Resource[1];
    resArray[0] = new FileSystemResource(new File(fileName));
    zipReader.setArchives(resArray);
    DefaultLineMapper<ProducerMessage> lineMapper = new DefaultLineMapper<ProducerMessage>();
    lineMapper.setLineTokenizer(new DelimitedLineTokenizer());
    CSVFieldMapper csvFieldMapper = new CSVFieldMapper(fileName, s3SourceFolderPrefix, timeStamp, confService);
    lineMapper.setFieldSetMapper(csvFieldMapper);
    flatFileReader.setLineMapper(lineMapper);
    zipReader.setDelegate(flatFileReader);
    return zipReader;
}

@Bean
@Scope("job")
public ItemProcessor<ProducerMessage, ProducerMessage> processor(@Value("#{jobParameters[timeStamp]}") long timeStamp) {
    ProducerProcessor processor = new ProducerProcessor();
    processor.setS3FileTimeStamp(timeStamp);
    return processor;
}

@Bean
@ConfigurationProperties
public ItemWriter<ProducerMessage> writer() {
    return new KafkaClientWriter();
}

@Bean
public Step step1(StepBuilderFactory stepBuilderFactory,
        ItemReader reader, ItemWriter writer,
        ItemProcessor processor, @Value("${reader.chunkSize}") int chunkSize) {
    LOGGER.info("Step configuration loaded with chunk size {}", chunkSize);
    return stepBuilderFactory.get("step1")
            .chunk(chunkSize).reader(reader)
            .processor(processor).writer(writer)
            .build();
}

@Bean
public StepScope stepScope() {
    final StepScope stepScope = new StepScope();
    stepScope.setAutoProxy(true);
    return stepScope;
}

@Bean
public JobScope jobScope() {
    final JobScope jobScope = new JobScope();
    return jobScope;
}

@Bean
public Configuration configuration() {
    return new Configuration();
}
I tried making the writer job scoped, but in that case open() is not getting called, and that is where I do some initialization.
When using Java-based configuration and a scoped proxy, what happens is that the return type of the method is detected and a proxy is created for that type. So when you return ItemWriter you will get a JDK proxy implementing only ItemWriter, whereas your open method is on the ItemStream interface. Because that interface isn't included on the proxy, there is no way to call the method.
Either change the return type to KafkaClientWriter or ItemStreamWriter<ProducerMessage> (assuming the KafkaClientWriter implements that interface). Then add @Scope("job") and you should have your open method called again with a properly scoped writer.
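A minimal sketch of the suggested change, assuming KafkaClientWriter extends AbstractItemStreamItemWriter<ProducerMessage> as in the question and therefore already is an ItemStreamWriter:
@Bean
@Scope("job")
public ItemStreamWriter<ProducerMessage> writer() {
    // Returning ItemStreamWriter (or the concrete KafkaClientWriter) keeps the
    // ItemStream interface on the job-scoped proxy, so open()/update()/close()
    // are invoked for each job instance and the per-instance initialization runs.
    return new KafkaClientWriter();
}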
