Merge multiple CSV files into a single CSV using Spring Batch

I have a business case of merging multiple CSV files (around 1,000+ files, each containing 1,000 records) into a single CSV using Spring Batch.
Please share your guidance and suggested solutions, both in terms of approach and performance.
So far, I have tried two approaches.
Approach 1:
A chunk-oriented step with MultiResourceItemReader to read the files from a directory and FlatFileItemWriter as the item writer.
The issue here is that it is very slow since it is single-threaded, but the approach works as expected.
Approach 2:
Using MultiResourcePartitioner as the partitioner and AsyncTaskExecutor as the task executor.
The issue here is that, since it is asynchronous and multi-threaded, data gets overwritten/corrupted while merging into the final single file.

You can wrap your FlatFileItemWriter in an AsyncItemWriter and use it along with an AsyncItemProcessor. This will not corrupt your data and will increase performance, as processing and writing are spread across several threads.
@Bean
public AsyncItemWriter<Customer> asyncItemWriter() throws Exception {
    AsyncItemWriter<Customer> asyncItemWriter = new AsyncItemWriter<>();
    // Delegate the actual writing to the existing FlatFileItemWriter
    asyncItemWriter.setDelegate(flatFileItemWriter);
    asyncItemWriter.afterPropertiesSet();
    return asyncItemWriter;
}

@Bean
public AsyncItemProcessor<Customer, Customer> asyncItemProcessor() throws Exception {
    AsyncItemProcessor<Customer, Customer> asyncItemProcessor = new AsyncItemProcessor<>();
    asyncItemProcessor.setDelegate(itemProcessor());
    asyncItemProcessor.setTaskExecutor(threadPoolTaskExecutor());
    asyncItemProcessor.afterPropertiesSet();
    return asyncItemProcessor;
}

@Bean
public TaskExecutor threadPoolTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);
    executor.setMaxPoolSize(10);
    executor.setThreadNamePrefix("default_task_executor_thread");
    executor.initialize();
    return executor;
}
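Wired into a chunk-oriented step, this could look roughly like the sketch below. The step name, chunk size and reader bean are illustrative assumptions, not from the original post; note the java.util.concurrent.Future<Customer> output type required by the async processor/writer pair.
@Bean
public Step mergeStep() throws Exception {
    return stepBuilderFactory.get("mergeStep")
            .<Customer, Future<Customer>>chunk(1000)
            .reader(multiResourceItemReader())   // assumed reader bean from approach 1
            .processor(asyncItemProcessor())
            .writer(asyncItemWriter())
            .build();
}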

Since the headers are common between your source and destination files, I wouldn't recommend using the Spring Batch provided readers to convert lines into specific beans: column-level information is not needed, and since CSV is a text format you can work with whole lines without breaking them down to the field level.
Also, partitioning per file is going to be very slow if you have that many files. You should instead fix the number of partitions first (say 10 or 20) and group your files into that many partitions. Secondly, file writing is a disk-bound operation rather than a CPU-bound one, so multi-threading won't help much there.
What I suggest instead is to write a custom reader and writer in plain Java, along the lines suggested in this answer, where your reader returns a List<String> and your writer receives a List<List<String>> that it appends to the output file; see the sketch after this paragraph.
If you have enough memory to hold the lines from all files at once, you can read everything in one go and keep returning chunk_size items; otherwise, reading a small set of files at a time until you reach the chunk size limit should be good enough. Your reader returns null when there are no more files to read.
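A minimal sketch of such a reader/writer pair might look like the following (class names, the injected file list, and the pre-5.0 ItemWriter signature are assumptions; header de-duplication is left out for brevity):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Iterator;
import java.util.List;

import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

// Reads one whole file per read() call and returns its lines.
class FileLinesReader implements ItemReader<List<String>> {

    private final Iterator<Path> files;

    FileLinesReader(List<Path> csvFiles) {
        this.files = csvFiles.iterator();
    }

    @Override
    public List<String> read() throws IOException {
        // Returning null tells Spring Batch there is nothing left to read
        return files.hasNext() ? Files.readAllLines(files.next()) : null;
    }
}

// Appends every chunk of file contents to the single target file.
class FileLinesWriter implements ItemWriter<List<String>> {

    private final Path target;

    FileLinesWriter(Path target) {
        this.target = target;
    }

    @Override
    public void write(List<? extends List<String>> items) throws IOException {
        for (List<String> lines : items) {
            Files.write(target, lines, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}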

Related

Update Reading Source after Writing Step

We have a huge table "clients" in a PostgreSQL database that has a lot of duplicated content, so I have created a Spring Batch job to merge duplicated clients into one based on their email.
I have a reading step that reads from this table (with a custom query that targets only clients with at least one duplicate), then in the processing step I gather all the information, and in the writing step I delete the duplicated clients and keep only the one into which I merged all the information.
@Bean
public RepositoryItemReader<Client> clientFusionReader(
        ClientRepository clientRepository
) {
    return new RepositoryItemReaderBuilder<Client>()
            .methodName("findClientsWithPhoneAndEmailNotNullAndNonDuplicated")
            .sorts(Collections.singletonMap("update_date", Sort.Direction.ASC))
            .repository(clientRepository)
            .pageSize(100)
            .name("BATCH_MERGE_CLIENT_READER")
            .build();
}
The problem I currently have is that Spring Batch does not re-read the table after I delete the rows, so the old items still get processed, and in the delete step an exception is raised because the items no longer exist.
Is there a way to refresh the reading source after the writing step, so that I avoid processing deleted items that Spring Batch keeps in memory after reading?
UPDATE 1:
Here is my job configuration:
@Bean
public Step clientFusionStep(RepositoryItemReader<Client> clientFusionReader) {
    return stepBuilderFactory.get("fusionClientStep")
            .<Client, ClientOutput>chunk(100)
            .reader(clientFusionReader)
            .processor(clientFusionProcessor)
            .writer(clientFusionWriter)
            .faultTolerant()
            .build();
}

@Bean
public Job job(Step clientFusionStep) {
    return jobs.get("fusionClientJob")
            .incrementer(new RunIdIncrementer())
            .start(clientFusionStep)
            .preventRestart()
            .build();
}
In the first step I read from the database all records that are duplicated at least once; in the processor I look up the tables related to each client (so that when I delete a client I also delete everything related to it); and in the final writing step I delete the duplicated clients.
Once I reach the writing step, I want to refresh the data source so that the next chunk takes my updates into account, because at the moment the second chunk still contains deleted clients and the batch keeps processing them.

Improve Spring Batch job performance

I am in the process of implementing a Spring Batch job for our file upload process. My requirement is to read a flat file, apply business logic, store the result in a DB, and then post a Kafka message.
I have a single chunk-based step that uses a custom reader, processor, and writer. The process works fine but takes a lot of time to process a big file.
It takes 15 minutes to process a file with 60K records. I need to reduce that to less than 5 minutes, as we will be consuming much bigger files than this.
As per https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html I understand that making the step multi-threaded would give a performance boost, at the cost of restartability. However, I am using FlatFileItemReader, ItemProcessor, ItemWriter and none of them is thread-safe.
Any suggestions on how to improve performance here?
Here is the writer code:
public void write(List<? extends Message> items) {
    items.forEach(this::process);
}

private void process(Message message) {
    if (message == null) {
        return;
    }
    try {
        // message is a DTO that carries info about success or failure
        if (message.isSuccess()) { // assuming the DTO exposes its success flag
            // post a Kafka message using Spring Cloud Stream
            // insert a record in the DB using a Spring Data JpaRepository
        } else {
            // insert a record in the DB using a Spring Data JpaRepository
        }
    } catch (Exception e) {
        // throw exception
    }
}
Please refer to the SO threads below, and the GitHub source code they reference, for parallel processing:
Spring Batch multiple process for heavy load with multiple thread under every process
Spring batch to process huge data
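As a rough illustration of the multi-threaded step approach from the scalability guide (bean names, chunk size, pool size and the Message types are assumptions, not from the original post), the stateful FlatFileItemReader can be wrapped in a SynchronizedItemStreamReader and the step given a TaskExecutor:
@Bean
public SynchronizedItemStreamReader<Message> synchronizedReader(FlatFileItemReader<Message> flatFileItemReader) {
    // Serializes access to the non-thread-safe FlatFileItemReader
    return new SynchronizedItemStreamReaderBuilder<Message>()
            .delegate(flatFileItemReader)
            .build();
}

@Bean
public Step fileUploadStep(SynchronizedItemStreamReader<Message> synchronizedReader,
                           ItemProcessor<Message, Message> processor,
                           ItemWriter<Message> writer) {
    return stepBuilderFactory.get("fileUploadStep")
            .<Message, Message>chunk(500)
            .reader(synchronizedReader)
            .processor(processor)
            .writer(writer)
            // Chunks are now processed and written concurrently
            .taskExecutor(new SimpleAsyncTaskExecutor("upload-"))
            .throttleLimit(8)
            .build();
}
The processor and writer must then be stateless or thread-safe themselves (for example, the JPA writes and Kafka sends in the writer above).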

Spring Batch - How to output Thread and Grid number to console or log

In my Spring Batch configuration I have this:
@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor("myJob");
    taskExecutor.setConcurrencyLimit(15);
    taskExecutor.setThreadNamePrefix("SrcToDest");
    return taskExecutor;
}
And I also have a "master-step" where I set the grid size as follows:
@Bean
@Qualifier("masterStep")
public Step masterStep() {
    return stepBuilderFactory.get("masterStep")
            .partitioner("step1", partitioner())
            .step(step1())
            .taskExecutor(taskExecutor())
            .gridSize(10)
            .build();
}
In my case, I only see "Thread-x" at the very end, when "myjob" finishes with COMPLETED status.
Questions
How can I print the thread name/number to the console/log throughout the execution, i.e. from the start of "myjob" to its finish, in order to monitor it?
Is there some way I can get output to the console/log to see the grid action too?
I could not find any example for these, in the Spring Guides or elsewhere.
I am still looking for how to display the grid numbers in the console.
This depends on your partitioner. You can add a log statement in your partitioner to show the grid size, so at partitioning time it is on your side; see the sketch below.
At partition handling time, Spring Batch logs a statement at debug level for each execution of the worker step.
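For example, a partitioner along these lines (illustrative only; the partition key names and the ExecutionContext contents are assumptions) would print the grid size and each partition it creates:
import java.util.HashMap;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class LoggingPartitioner implements Partitioner {

    private static final Logger log = LoggerFactory.getLogger(LoggingPartitioner.class);

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        log.info("Partitioning with gridSize={}", gridSize);
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putInt("partitionNumber", i);
            partitions.put("partition" + i, context);
            log.info("Created partition{}", i);
        }
        return partitions;
    }
}
Enabling debug logging for org.springframework.batch (e.g. logging.level.org.springframework.batch=DEBUG) then shows the worker step executions, and the "SrcToDest" thread name prefix will appear in each log line if your logging pattern includes the thread name.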

Apache Spark move/rename successfully processed files

I would like to use Spark Streaming (1.1.0-rc2, Java API) to process some files and move/rename them once processing has finished successfully, in order to push them to other jobs.
I thought about using the file path included in the name of the generated RDDs (newAPIHadoopFile), but how can we determine that processing of a file has finished successfully?
I am also not sure this is the right way to achieve it, so any ideas are welcome.
EDIT:
Here is some pseudo code to be clearer:
logs.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
    @Override
    public Void call(JavaRDD<String> log, Time time) throws Exception {
        String fileName = log.name();
        String newlog = Process(log);
        SaveResultToFile(newlog, time);
        // are we done with the file so we can move it ????
        return null;
    }
});
You aren't guaranteed that the input is backed by an HDFS file. But it doesn't seem like you need that given your question. You create a new file and write something to it. When the write completes, you're done. Move it with other HDFS APIs.
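For the move itself, a minimal sketch using the Hadoop FileSystem API could go right after SaveResultToFile(...) returns; the destination directory and the assumption that fileName points into HDFS are mine, not the original answer's:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// ... inside call(), after SaveResultToFile(newlog, time);
FileSystem fs = FileSystem.get(new Configuration());
Path source = new Path(fileName);                          // path recovered from the RDD name
Path target = new Path("/processed/" + source.getName());  // illustrative destination directory
if (!fs.rename(source, target)) {
    throw new IOException("Could not move " + source + " to " + target);
}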

Hadoop jobs using same reducer output to same file

I ran into an interesting situation and now want to know how to do it intentionally. On my local single-node setup, I ran two jobs simultaneously from the terminal. Both of my jobs use the same reducer; they only differ in the map function (the aggregation key, i.e. the group by). The output of both jobs was written to the output of the first job (the second job did create its own folder, but it was empty). What I am working on is providing rollup aggregations across various levels, and this behavior is fascinating to me: the aggregation output from two different levels is available in one single file (and perfectly sorted, too).
My question is how to achieve the same thing on a real Hadoop cluster with multiple data nodes, i.e. how do I programmatically initiate multiple jobs, all accessing the same input file and mapping the data differently but using the same reducer, so that the output is available in one single file rather than in five different output files?
Please advise.
I took a look at "merge output files after reduce phase" before I decided to ask my question.
When different mappers consume the same input file, in other words the same data structure, the source code for all these different mappers can be placed into separate methods of a single Mapper implementation, and a parameter from the context can be used to decide which map functions to invoke. On the plus side, you then only need to start one MapReduce job. Example in pseudo code:
class ComplexMapper extends Mapper {

    protected BitSet mappingBitmap = new BitSet();

    protected void setup(Context context) ... {
        String params = context.getConfiguration().get("params");
        // analyze params and set bits in the mappingBitmap
    }

    protected void mapA(Object key, Object value, Context context) {
        .....
        context.write(keyA, value);
    }

    protected void mapB(Object key, Object value, Context context) {
        .....
        context.write(keyA, value);
    }

    protected void mapC(Object key, Object value, Context context) {
        .....
        context.write(keyB, value);
    }

    public void map(Object key, Object value, Context context) ..... {
        if (mappingBitmap.get(1)) {
            mapA(key, value, context);
        }
        if (mappingBitmap.get(2)) {
            mapB(key, value, context);
        }
        if (mappingBitmap.get(3)) {
            mapC(key, value, context);
        }
    }
}
Of course it can be implemented more elegantly with interfaces, etc.
In the job setup, just add:
Configuration conf = new Configuration();
conf.set("params", "AB");
Job job = new Job(conf);
As Praveen Sripati mentioned, having a single output file forces you into having just one reducer, which might be bad for performance. You can always concatenate the part-* files when you download them from HDFS. Example:
hadoop fs -text /output_dir/part* > wholefile.txt
Usually each reducer task produces a separate file in HDFS so that the reduce tasks can operate in parallel. If the requirement is to have one output file from the reduce phase, then configure the job to have a single reducer task. The number of reducers can be configured using the mapred.reduce.tasks property, which defaults to 1. The con of this approach is that there is only one reducer, which might be a bottleneck for the job to complete.
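For completeness, the single-reducer option boils down to one call in the driver (a short sketch against the new-API Job object from the snippet above; mapreduce.job.reduces is the newer name of the same property):
// Force a single reduce task so that all output lands in one part file
// (equivalent to setting mapred.reduce.tasks / mapreduce.job.reduces to 1).
job.setNumReduceTasks(1);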
Another option is to use an output format that allows multiple reducers to write to the same sink simultaneously, like DBOutputFormat. Once the job has completed, the results can be exported from the DB into a flat file. This approach lets multiple reduce tasks run in parallel.
Another option is to merge the output files as mentioned in the OP. So, based on the pros and cons of each approach and the volume of data to be processed, one of the approaches can be chosen.
