Reading and writing multiple files simultaneously using Spring Batch

We are developing an application which will read multiple files and write multiple files, i.e. one output file for each input file (the name of the output file must be the same as the input file).
MultiResourceItemReader can read multiple files, but not simultaneously, which is a performance bottleneck for us. Spring Batch provides multithreading support for this, but then many threads may read the same file and try to write it. Since the output file name must be the same as the input file name, we can't use that option either.
Now I am looking at another possibility: creating 'n' threads to read and write 'n' files. But I am not sure how to integrate this logic with the Spring Batch framework.
Thanks in advance for any help.

Since MultiResourceItemReader doesn't meet your performance needs, you may take a closer look at parallel processing, which you already mentioned is a desirable option. If configured correctly, multiple threads will not read the same file and try to write it.
Rather than taking the typical chunk-oriented approach, you could create a tasklet-oriented step that is partitioned (multi-threaded). The tasklet class would be the main driver, delegating calls to a reader and a writer.
The general flow would be something like this:
Retrieve the names of all the files that need to be read in/written out (via some service class) and save them to the execution context within an implementation of Partitioner.
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class FilePartitioner implements Partitioner {

    private ServiceClass service;
    private String directory; // the directory you'll be targeting, maybe injected into this class

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // this is just pseudo-ish code
        Map<String, Path> filesToProcess = this.service.getFilesToProcess(directory);
        Map<String, ExecutionContext> execCtxs = new HashMap<>();
        for (Entry<String, Path> entry : filesToProcess.entrySet()) {
            ExecutionContext execCtx = new ExecutionContext();
            execCtx.put("file", entry.getValue());
            execCtxs.put(entry.getKey(), execCtx);
        }
        return execCtxs;
    }

    // injected
    public void setServiceClass(ServiceClass service) {
        this.service = service;
    }
}
a. For the .getFilesToProcess() method you just need something that returns all of the files in the designated directory because you need to eventually know what is to be read and the name of the file that is to be written. Obviously there are several ways to go about this, such as...
public Map<String, Path> getFilesToProcess(String directory) {
    Map<String, Path> filesToProcess = new HashMap<String, Path>();
    File directoryFile = new File(directory); // where directory is where you intend to read from
    this.generateFileList(filesToProcess, directoryFile, directory);
    return filesToProcess;
}

private void generateFileList(Map<String, Path> fileList, File node, String directory) {
    // traverse the directory tree and collect files, adding them to the file list
    if (node.isFile()) {
        // key is the file name relative to the root directory, value is its path
        String file = node.getPath().substring(directory.length() + 1);
        fileList.put(file, node.toPath());
    }
    if (node.isDirectory()) {
        String[] files = node.list();
        for (String filename : files) {
            this.generateFileList(fileList, new File(node, filename), directory);
        }
    }
}
You'll need to create a tasklet, which will pull file names from the execution context and pass them to some injected class that will read in the file and write it out (custom ItemReaders and ItemWriters may be necessary).
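A minimal sketch of such a tasklet, assuming the Partitioner above stored the file Path under the key "file" and using a plain file copy as a stand-in for your real read/transform/write logic:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class FileCopyTasklet implements Tasklet {

    private final Path outputDirectory; // hypothetical target directory, injected

    public FileCopyTasklet(Path outputDirectory) {
        this.outputDirectory = outputDirectory;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // "file" is the key the Partitioner put into each partition's execution context
        Path input = (Path) chunkContext.getStepContext()
                .getStepExecution().getExecutionContext().get("file");
        // the output file keeps the input file's name; swap the copy for your delegated reader/writer
        Files.copy(input, outputDirectory.resolve(input.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        return RepeatStatus.FINISHED;
    }
}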
The rest of the work would be in configuration, which should be fairly straightforward. It is in the configuration of the partitioned step where you set the grid size (passed to the Partitioner), which could even be done dynamically using SpEL if you really intend to create n threads for n files. I would bet that a fixed number of threads running across n files would show a significant improvement in performance, but you'll be able to determine that for yourself.
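As a rough sketch of that configuration (bean names and the delegate tasklet step are assumptions; this would sit in a @Configuration class):
@Bean
public Step partitionedFileStep(StepBuilderFactory stepBuilderFactory,
                                Partitioner filePartitioner,
                                Step copyFileStep) {
    return stepBuilderFactory.get("partitionedFileStep")
            .partitioner("copyFileStep", filePartitioner) // one partition per file from the Partitioner above
            .step(copyFileStep)                           // the tasklet step doing the actual read/write
            .taskExecutor(new SimpleAsyncTaskExecutor())  // swap in a pooled executor to cap the thread count
            .gridSize(4)                                  // hint passed to partition(gridSize); tune as needed
            .build();
}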
Hope this helps.

Related

Spring Batch - use JpaPagingItemReader to read lists instead of individual items

Spring Batch is designed to read and process one item at a time, then write the list of all items processed in a chunk. I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>. My data source is a standard Spring JpaRepository<T, ID>.
My question is whether there are some standard solutions for this "aggregated" approach. I see that there are some, but they don't read from a JpaRepository, like:
https://github.com/spring-projects/spring-batch/blob/main/spring-batch-samples/src/main/java/org/springframework/batch/sample/domain/multiline/AggregateItemReader.java
Spring Batch - Item Reader and ItemProcessor with a list
Spring Batch- how to pass list of multiple items from input to ItemReader, ItemProcessor and ItemWriter
Update:
I'm looking for a solution that would work for a rapidly changing dataset and in a multithreading environment.
I want my item to be a List as well, to be thus read and processed, and then write a List<List>.
Spring Batch is not (and should not be) aware of what an "item" is. It is up to you to design what an "item" is and how it is implemented (a single value, a list, a stream, etc.). In your case, you can encapsulate the List<T> in a type that could be used as an item, and process data as needed. You would need a custom item reader, though.
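For illustration only, a step whose chunk item type is itself a list could be declared like this (Person and the reader/processor/writer bean names are hypothetical):
@Bean
public Step aggregateStep(StepBuilderFactory stepBuilderFactory,
                          ItemReader<List<Person>> aggregateReader,
                          ItemProcessor<List<Person>, List<Person>> listProcessor,
                          ItemWriter<List<Person>> listWriter) {
    // each "item" flowing through the chunk is a List<Person>, so the writer receives a List<List<Person>> per chunk
    return stepBuilderFactory.get("aggregateStep")
            .<List<Person>, List<Person>>chunk(10)
            .reader(aggregateReader)
            .processor(listProcessor)
            .writer(listWriter)
            .build();
}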
The solution we found is to use a custom aggregate reader as suggested here, which accumulates the read data into a list of a given size then passes it along. For our specific use case, we read data using a JpaPagingItemReader. The relevant part is:
public List<T> read() throws Exception {
    ResultHolder holder = new ResultHolder();
    // read until no more results are available or the aggregated size is reached
    while (!itemReaderExhausted && holder.getResults().size() < aggregationSize) {
        process(itemReader.read(), holder);
    }
    if (CollectionUtils.isEmpty(holder.getResults())) {
        return null;
    }
    return holder.getResults();
}

private void process(T readValue, ResultHolder resultHolder) {
    if (readValue == null) {
        itemReaderExhausted = true;
        return;
    }
    resultHolder.addResult(readValue);
}
In order to account for the volatility of the dataset, we extended the JPA reader and overrode the getPage() method to always return 0, and controlled the dataset through the processor and writer so that the next batch of fresh data is always fetched from the first page. The hint was given here and in some other SO answers.
public class FirstPageJpaPagingItemReader<T> extends JpaPagingItemReader<T> {
    @Override
    public int getPage() {
        return 0; // always re-read the first page, since processed rows no longer match the query
    }
}

Getting a FileNotFoundException in VSCode, but not in JGrasp

Ok, so this is what's going on. I'm trying to learn how to use VSCode (switching over from jGRASP). I'm trying to run an old school assignment that requires the use of outside .txt files. The .txt files, as well as the other classes that I have written, are in the same folder. When I try to run this program in jGRASP, it works fine. However, in VSCode, I get an exception. Not sure what is going wrong here. Thanks. Here is an example:
import java.io.*;
import java.util.*;

public class HangmanMain {
    public static final String DICTIONARY_FILE = "dictionary.txt";
    public static final boolean SHOW_COUNT = true; // show # of choices left

    public static void main(String[] args) throws FileNotFoundException {
        System.out.println("Welcome to the cse143 hangman game.");
        System.out.println();

        // open the dictionary file and read dictionary into an ArrayList
        Scanner input = new Scanner(new File(DICTIONARY_FILE));
        List<String> dictionary = new ArrayList<String>();
        while (input.hasNext()) {
            dictionary.add(input.next().toLowerCase());
        }

        // set basic parameters
        Scanner console = new Scanner(System.in);
        System.out.print("What length word do you want to use? ");
        int length = console.nextInt();
        System.out.print("How many wrong answers allowed? ");
        int max = console.nextInt();
        System.out.println();

        // The rest of the program is not shown. This was included just so you guys could see a little bit of it.
If you're not using a project, jGRASP makes the working directory for your program the same one that contains the source file. You are creating the file with a relative path, so it is assumed to be in the working directory. You can print new File(DICTIONARY_FILE).getAbsolutePath() to see where VSCode is looking (probably a separate "classes" directory) and move your data file there, or use an absolute path.
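For example, temporarily adding these two lines at the top of main shows both the working directory and where the relative path resolves to:
// where is the JVM resolving relative paths from?
System.out.println("working dir = " + System.getProperty("user.dir"));
System.out.println("dictionary  = " + new File(DICTIONARY_FILE).getAbsolutePath());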

Spring Batch - MongoItemReader not reading all records

I created a Spring Batch job which reads orders from MongoDB and makes a REST call to upload them. However, the batch job completes even though not all records have been read by the MongoItemReader.
I am maintaining a field batchProcessed:boolean on the Orders collection. The MongoItemReader reads records for which {batchProcessed:{$ne:true}}, as I need to run the batch job multiple times without processing the same documents again and again.
In my OrderWriter I set batchProcessed to true.
@Bean
@StepScope
public MongoItemReader<Order> orderReader() {
    MongoItemReader<Order> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    HashMap<String, Sort.Direction> sortMap = new HashMap<>();
    sortMap.put("_id", Direction.ASC);
    reader.setSort(sortMap);
    reader.setTargetType(Order.class);
    reader.setQuery("{batchProcessed:{$ne:true}}");
    return reader;
}

@Bean
public Step uploadOrdersStep(OrderItemProcessor processor) {
    return stepBuilderFactory.get("step1").<Order, Order>chunk(1)
            .reader(orderReader()).processor(processor).writer(orderWriter).build();
}

@Bean
public Job orderUploadBatchJob(JobBuilderFactory factory, OrderItemProcessor processor) {
    return factory.get("uploadOrder").flow(uploadOrdersStep(processor)).end().build();
}
The MongoItemReader is a paging item reader. When reading items in pages while also changing the items returned by the query (i.e. a field that is used in the query's "where" clause), the paging offsets shift and some items are skipped. There is a similar problem with the JPA paging item reader that is explained in detail here: Spring batch jpaPagingItemReader why some rows are not read?
Common techniques to work around this issue are to use a cursor-based reader (sketched below), a staging table/collection, a partitioned step with one partition per page, etc.
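For illustration (not the original poster's code), a cursor-based reader built directly on the MongoDB Java driver could look roughly like this; the database name is an assumption, and mapping the raw Document back to Order is left out:
import org.bson.Document;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.ItemStreamReader;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;

/* Iterates the collection once with a cursor, so documents updated by the
   writer do not shift any page offsets. */
public class OrderCursorItemReader implements ItemStreamReader<Document> {

    private final MongoClient client; // configured elsewhere
    private MongoCursor<Document> cursor;

    public OrderCursorItemReader(MongoClient client) {
        this.client = client;
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        MongoCollection<Document> orders = client.getDatabase("shop") // hypothetical database name
                .getCollection("orders");
        cursor = orders.find(Filters.ne("batchProcessed", true)).iterator();
    }

    @Override
    public Document read() {
        return cursor.hasNext() ? cursor.next() : null; // null signals end of data to Spring Batch
    }

    @Override
    public void update(ExecutionContext executionContext) {
        // no restart state kept in this sketch
    }

    @Override
    public void close() throws ItemStreamException {
        if (cursor != null) {
            cursor.close();
        }
    }
}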

Integrate key-value database with Spark

I'm having trouble understanding how Spark interacts with storage.
I would like to make a Spark cluster that fetches data from a RocksDB database (or any other key-value store). However, at this moment, the best I can do is fetch the whole dataset from the database into memory in each of the cluster nodes (into a map for example) and build an RDD from that object.
What do I have to do to fetch only the necessary data (like Spark does with HDFS)? I've read about Hadoop Input Format and Record Readers, but I'm not completely grasping what I should implement.
I know this is kind of a broad question, but I would really appreciate some help to get me started. Thank you in advance.
Here is one possible solution. I assume you have a client library for the key-value store (RocksDB in your case) that you want to access.
KeyValuePair is a bean class representing one key-value pair from your key-value store.
Classes
/* Lazy iterator to read from the key-value store */
class KeyValueIterator implements Iterator<KeyValuePair> {

    public KeyValueIterator() {
        // TODO initialize your custom reader using the store's Java client library
    }

    @Override
    public boolean hasNext() {
        // TODO return true while the store cursor has more entries
        return false;
    }

    @Override
    public KeyValuePair next() {
        // TODO read and return the next key-value pair from the store
        return null;
    }
}

class KeyValueReader implements FlatMapFunction<KeyValuePair, KeyValuePair> {

    @Override
    public Iterator<KeyValuePair> call(KeyValuePair keyValuePair) throws Exception {
        // ignore the empty 'keyValuePair' object; it only triggers the read
        return new KeyValueIterator();
    }
}
Create KeyValue RDD
/*list with a dummy KeyValuePair instance*/
ArrayList<KeyValuePair> keyValuePairs = new ArrayList<>();
keyValuePairs.add(new KeyValuePair());
JavaRDD<KeyValuePair> keyValuePairRDD = javaSparkContext.parallelize(keyValuePairs);
/*Read one key-value pair at a time lazily*/
keyValuePairRDD = keyValuePairRDD.flatMap(new KeyValueReader());
Note:
The above solution creates an RDD with two partitions by default (one of them will be empty). Increase the number of partitions before applying any transformation on keyValuePairRDD to distribute the processing across executors.
Different ways to increase partitions:
keyValuePairRDD.repartition(partitionCounts)
//OR
keyValuePairRDD.partitionBy(...)

New Output file for each Item passed into FlatFileItemWriter

I have the following domain object. This is the object being passed from my processor to my writer.
public class DivisionIdPromoCompStartDtEndDtGrouping {
    private int divisionId;
    private Date rpmPromoCompDetailStartDate;
    private Date rpmPromoCompDetailEndDate;
    private List<MasterList> detailRecords = new ArrayList<MasterList>();
I would like a new file per DivisionIdPromoCompStartDtEndDtGrouping. Each file would have a line for each of the detailRecords in the list. The output files would be of the same format, just logically separated based on the data (divisionId, rpmPromoCompDetailStartDate and rpmPromoCompDetailEndDate).
How can I create a FlatFileItemWriter that outputs a new file for each DivisionIdPromoCompStartDtEndDtGrouping with the content of detailRecords?
I think the answer might be a CompositeItemWriter. Is that right? Could someone help me with an example of this?
Thanks in advance.
You're close. Instead of just a CompositeItemWriter, use a ClassifierCompositeItemWriter. This, coupled with a Classifier implementation that chooses a writer by grouping, will allow you to have one file per group. You can read more about this ItemWriter in the javadoc here: http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/item/support/ClassifierCompositeItemWriter.html
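A rough sketch of that approach (not from the original answer; the getter names, file naming, and output location are assumptions, and the per-group writers are opened manually here rather than being registered as step streams):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.batch.item.support.ClassifierCompositeItemWriter;
import org.springframework.core.io.FileSystemResource;

public class GroupingWriterConfig {

    // one lazily created file writer per grouping key
    private final Map<String, FlatFileItemWriter<DivisionIdPromoCompStartDtEndDtGrouping>> writers =
            new ConcurrentHashMap<>();

    public ClassifierCompositeItemWriter<DivisionIdPromoCompStartDtEndDtGrouping> classifierWriter() {
        ClassifierCompositeItemWriter<DivisionIdPromoCompStartDtEndDtGrouping> writer =
                new ClassifierCompositeItemWriter<>();
        writer.setClassifier(this::writerForGroup); // route each item to its group's file writer
        return writer;
    }

    private ItemWriter<? super DivisionIdPromoCompStartDtEndDtGrouping> writerForGroup(
            DivisionIdPromoCompStartDtEndDtGrouping group) {
        // hypothetical getters on the domain object
        String key = group.getDivisionId() + "_" + group.getRpmPromoCompDetailStartDate()
                + "_" + group.getRpmPromoCompDetailEndDate();
        return writers.computeIfAbsent(key, k -> {
            FlatFileItemWriter<DivisionIdPromoCompStartDtEndDtGrouping> fileWriter = new FlatFileItemWriter<>();
            fileWriter.setResource(new FileSystemResource("output/" + k + ".txt")); // hypothetical location
            fileWriter.setLineAggregator(new PassThroughLineAggregator<>());        // replace with your own aggregator
            fileWriter.open(new ExecutionContext()); // opened manually; remember to close the writers when done
            return fileWriter;
        });
    }
}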
No, the answer is not a composite writer. A composite writer simply forwards all items it receives to all of its defined child writers.
The problem with FlatFileItemWriter is that it has to be opened and closed, which is normally handled by the framework itself.
A simple approach would be to implement your own writer and use a FlatFileItemWriter inside its write method.
public class MyWriter implements ItemWriter<DivisionIdPromoCompStartDtEndDtGrouping> {

    @Override
    public void write(List<? extends DivisionIdPromoCompStartDtEndDtGrouping> items) throws Exception {
        for (DivisionIdPromoCompStartDtEndDtGrouping item : items) {
            FlatFileItemWriter<DivisionIdPromoCompStartDtEndDtGrouping> fileWriter = new FlatFileItemWriter<>();
            fileWriter.setResource(...);     // unique file name per item
            fileWriter.setLineAggregator(...);
            fileWriter....;                  // do other settings if necessary
            fileWriter.afterPropertiesSet();
            fileWriter.open(new ExecutionContext());
            fileWriter.write(Collections.singletonList(item));
            fileWriter.close();
        }
    }
}
The LineAggregator has to create an appropriate String including all the line breaks, so that every detail record is written on its own line in the file.
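For example, a line aggregator along these lines could flatten the detail records (a sketch; the getter name on the domain object is assumed):
import org.springframework.batch.item.file.transform.LineAggregator;

// hypothetical aggregator: writes each detail record of a grouping on its own line
public class DetailRecordsLineAggregator implements LineAggregator<DivisionIdPromoCompStartDtEndDtGrouping> {

    @Override
    public String aggregate(DivisionIdPromoCompStartDtEndDtGrouping item) {
        StringBuilder lines = new StringBuilder();
        for (MasterList detail : item.getDetailRecords()) { // getter assumed
            lines.append(detail.toString())                 // or format the fields you need
                 .append(System.lineSeparator());
        }
        // trim the trailing separator; FlatFileItemWriter appends one after each aggregated item
        return lines.toString().trim();
    }
}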
Of course, you don't have to use a FlatFileItemWriter at all; you could simply open a file yourself, use the LineAggregator to create the lines, and write them to the file.
