Spring Batch: multi-file processing using multiple threads

I have a requirement where I have to deal with multiple files (say 300 CSV files).
I need to read --> process --> write each individual file, as I need to apply some transformation logic to the data.
For each input file there would be a corresponding transformed file, so for 300 input files we would have 300 output files.
At the end, all 300 output files need to be merged into a single file, which would be compressed and then transferred to a remote location over FTP/SFTP.
Every hour we would have to deal with a new set of 300 files and apply the above processing to them, so we would schedule the job to run hourly.
How do I handle multi-file processing in the above scenario using Spring Batch?
How do I make the above processing happen in multiple threads?
Please suggest.
Thanks in advance.

You can use Spring's task execution and scheduling support and then use a Java ThreadPoolExecutor.
Check this answer here at SO for a very simple example.
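If you want the parallelism handled by Spring Batch itself rather than by a raw ThreadPoolExecutor, a common fit for this "one output file per input file" shape is a locally partitioned step: one partition per input file, executed on a thread pool. Below is a minimal sketch in Spring Batch 5 style; the file:/data/input/*.csv pattern, the pool size of 10, the placeholder transformation, and all bean names are illustrative assumptions, not details from the question.

    import java.io.IOException;

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepScope;
    import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.core.repository.JobRepository;
    import org.springframework.batch.core.step.builder.StepBuilder;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
    import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.io.Resource;
    import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
    import org.springframework.core.task.TaskExecutor;
    import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
    import org.springframework.transaction.PlatformTransactionManager;

    @Configuration
    public class MultiFileJobConfig {

        // One partition per input file; each partition's step execution
        // context carries the file URL under the key "fileName".
        @Bean
        public Partitioner filePartitioner() throws IOException {
            MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
            partitioner.setResources(new PathMatchingResourcePatternResolver()
                    .getResources("file:/data/input/*.csv")); // hypothetical location
            return partitioner;
        }

        // Step-scoped reader: each worker thread reads only its own file.
        @Bean
        @StepScope
        public FlatFileItemReader<String> fileReader(
                @Value("#{stepExecutionContext['fileName']}") Resource file) {
            return new FlatFileItemReaderBuilder<String>()
                    .name("fileReader")
                    .resource(file)
                    .lineMapper(new PassThroughLineMapper())
                    .build();
        }

        // The usual chunk-oriented read -> process -> write step, run once
        // per partition (i.e. once per file).
        @Bean
        public Step workerStep(JobRepository jobRepository,
                               PlatformTransactionManager txManager,
                               FlatFileItemReader<String> fileReader) {
            ItemProcessor<String, String> transform = String::toUpperCase; // placeholder
            return new StepBuilder("workerStep", jobRepository)
                    .<String, String>chunk(100, txManager)
                    .reader(fileReader)
                    .processor(transform)
                    .writer(items -> { /* write the transformed file (omitted) */ })
                    .build();
        }

        // The pool size caps how many files are processed concurrently.
        @Bean
        public TaskExecutor partitionTaskExecutor() {
            ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
            executor.setCorePoolSize(10); // tune to your hardware
            executor.setMaxPoolSize(10);
            executor.initialize();
            return executor;
        }

        // The master step fans out one workerStep execution per file.
        @Bean
        public Step masterStep(JobRepository jobRepository, Partitioner filePartitioner,
                               Step workerStep, TaskExecutor partitionTaskExecutor) {
            return new StepBuilder("masterStep", jobRepository)
                    .partitioner("workerStep", filePartitioner)
                    .step(workerStep)
                    .taskExecutor(partitionTaskExecutor)
                    .build();
        }
    }

The merge, compress, and FTP/SFTP transfer would then be ordinary single-threaded steps (or tasklets) placed after the master step in the job definition, and the whole job can be launched once an hour by a scheduler.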

Related

Run multiple batch jobs in parallel Mule 4

I have a batch job that reads hundreds of images from an SFTP location, encodes them into Base64, and uploads them via API using the HTTP connector.
I would like to make the process run quicker and am therefore trying to split the payload in two via scatter-gather, sending payload1 to one batch job in a subflow and payload2 to another batch job in another subflow.
Is this the right approach?
Or is it possible to split the load within just one batch process, i.e. for one half of the payload to be processed by batch step 1 and the second half by batch step 2 at the same time?
Thank you
No, it is not a good approach. Batch jobs are always executed asynchronously (i.e. using different threads), so there is no benefit in using scatter-gather, and it has the downside of increased resource usage.
Splitting the payload across different batch steps doesn't make sense either. You should not try to scale by adding steps.
Batch jobs are naturally meant to work in parallel by iterating over an input. The job may be able to handle the splitting itself, or you can split the input payload manually beforehand. Then let it handle the concurrency automatically. There are some configurations you can use to tune it, such as the block size; see the sketch below.
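For reference, a minimal sketch of what that tuning looks like on a Mule 4 batch job definition; the job name and the specific values here are illustrative assumptions, not recommendations:

    <!-- One batch job iterating over the whole list of images.
         blockSize sets how many records a thread takes at a time;
         maxConcurrency caps the number of threads used. -->
    <batch:job jobName="encodeAndUploadImages" blockSize="50" maxConcurrency="4">
        <batch:process-records>
            <batch:step name="encodeAndUpload">
                <!-- Base64-encode the record and POST it via the HTTP connector -->
            </batch:step>
        </batch:process-records>
    </batch:job>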

Spring Integration with Spring Batch

Please let me know: when I put, say, 5 files in a directory, 5 messages get generated by the poller. I want the Spring Batch job to get triggered only one time, not five times, if the files arrive together, say within a 1-minute window. Is that possible?
You may consider using an Aggregator for this kind of task. That way you will collect several files together, by expected group size or within some time window. You need to use a static correlation key to let the component group the files.
When the group is ready, a single message is emitted and you are good to trigger a Batch job for this set of files. A sketch follows below.
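A minimal sketch of that arrangement using the Spring Integration 6 Java DSL; the /data/inbox directory, the group size of 5, the one-minute timeout, and the names here are assumptions based on the numbers in the question.

    import java.io.File;
    import java.time.Duration;
    import java.util.List;

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.integration.dsl.IntegrationFlow;
    import org.springframework.integration.dsl.Pollers;
    import org.springframework.integration.file.dsl.Files;

    @Configuration
    public class FileBatchFlowConfig {

        @Bean
        public IntegrationFlow fileBatchFlow() {
            return IntegrationFlow
                    .from(Files.inboundAdapter(new File("/data/inbox")), // hypothetical dir
                          e -> e.poller(Pollers.fixedDelay(Duration.ofSeconds(10))))
                    .aggregate(a -> a
                            .correlationExpression("'allFiles'")  // static correlation key
                            .releaseStrategy(g -> g.size() >= 5)  // release once 5 files...
                            .groupTimeout(60_000)                 // ...or after 1 minute
                            .sendPartialResultOnExpiry(true)
                            .expireGroupsUponCompletion(true))
                    .handle(List.class, (files, headers) -> {
                        // 'files' is the collected List<File>; launch the Batch job
                        // once for the whole set (launcher code omitted for brevity)
                        return null;
                    })
                    .get();
        }
    }

The size-based release covers the "5 files arriving together" case, while the group timeout with sendPartialResultOnExpiry covers a smaller set arriving within the time window.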

Does Apache NiFi support batch processing?

I need to know if Apache NiFi supports running processors until completion, i.e.
"the execution of a series of processors in a process group waits for another process group's execution to be complete".
For example:
Suppose there are three processors in the NiFi UI.
P1-->P2-->P3
P-->Processor
Now I need to run P1; once it has run completely, run P2, and so on. It runs as a sequence, but each processor waits for the previous one to complete.
EDIT-1:
Just as an example: I have data at a web URL. I can download that data using the GetHTTP processor and store it with PutFile. Once the file is saved in the PutFile directory, run FetchFile to load that file into my database, as in the workflow below.
GetHTTP-->PutFile-->FetchFile-->DB
Is this possible?
NiFi itself is not really a batch processing system; it is a data flow system geared more towards continuous processing. Having said that, there are some techniques you can use to do batch-like operations, depending on which processors you're using.
The Split processors (SplitText, SplitJson, etc.) write attributes to the flow files that include a "fragment.identifier", which is unique for all splits created from an incoming flow file, and a "fragment.count", which is the total number of those splits. Processors like MergeContent use those attributes to process a whole batch (aka fragment), so the output from those kinds of processors would occur only after an entire batch/fragment has been processed.
Another technique is to write an empty file to a temp directory when the job is complete; a ListFile processor (pointing at that temp directory) would then issue a flow file when the file is detected.
Can you describe more about the processors in your flow, and how you would know when a batch is complete?

Grinder - how to distribute invocation of URLs from a file

We have a huge file of different URLs (~500K to ~1M URLs).
We want to use Grinder 3 to distribute these URLs to the workers in such a way that every worker invokes a single, different URL.
In the Jython script we could:
Read the file once per agent
Allocate line-number ranges per agent
Every worker would get a line/URL according to its run ID from its agent's line-number range
This still means loading a huge file into memory and writing code for a problem that might be common to many.
Any ideas for a simpler/ready-made solution?
I used Grinder in a similar fashion a while back, and wrote a utility for multi-threaded, one-time ingestion of URLs from a large file.
See https://bitbucket.org/travis_bear/file_util -- in particular, the sequential reader.
I'd recommend using the split command-line utility (or similar) to give separate chunks of the master file to each agent prior to executing your Grinder run.
I would take a different approach, since it's a huge file.
How many threads are you planning to spawn? I believe you already know that you can use Grinder.ThreadNo to get the currently executing thread.
You can actually divide the file using a pre-processor into as many files as there are threads, with an equal number of records in each, naming them 0, 1, 2, etc. to match the thread numbers.
Why I am suggesting this is that processing the file looks like a pre-task; what is important is its contents. File processing should not interfere while the threads are executing.
So now each thread will have its own file and there are no collisions.
For example, 20 threads, 20 files. However, your number of threads should be chosen carefully, maybe peak + 50%. A minimal sketch of the per-thread consumption follows below.
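Here is a minimal Jython sketch of that per-thread-file scheme, assuming the pre-processor has already produced one chunk file per thread, named 0, 1, 2, ... in the worker process's working directory:

    # Grinder 3 script: each thread reads only the URL file matching its
    # own thread number, so threads never collide on input.
    from net.grinder.script.Grinder import grinder
    from net.grinder.script import Test
    from net.grinder.plugin.http import HTTPRequest

    test = Test(1, "Fetch URL")
    request = HTTPRequest()
    test.record(request)

    class TestRunner:
        def __init__(self):
            # File "0" for thread 0, "1" for thread 1, etc., as suggested above.
            f = open(str(grinder.threadNumber))
            try:
                self.urls = [line.strip() for line in f if line.strip()]
            finally:
                f.close()

        def __call__(self):
            # Each run hits the next URL from this thread's private list.
            request.GET(self.urls[grinder.runNumber % len(self.urls)])

The pre-splitting itself can happen before the run starts, e.g. with the split command-line utility mentioned in the other answer.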

How to architect file processing in Laravel

I have a task to observe a folder where files are arriving from SFTP. The files are big and processing one file is relatively time-consuming. I am looking for the best approach to do it. Here are some ideas on how to do it, but I am not sure what the best way is.
Run a scheduler every 5 minutes to check for new files.
For each new file, trigger an event that there is a new file.
Create a listener which will listen for this event and which will use queues. In the listener, copy the new file into the processing folder and process it. When processing of a new file starts, insert a record into the DB with status "processing". When processing is done, change the record status and copy the file to the processed folder.
In this solution I have 2 copy operations for each file. This is because it is possible, if the second scheduler run executes before all files are processed, that some files could overlap across 2 processing jobs.
What is the best way to do it? Should I use another approach to avoid the 2 copy operations? Something like a database check during scheduler execution to see whether the file is already in the processing state?
You should use ->withoutOverlapping(), as stated in the Task Scheduler manual here.
Using this you will make sure that only one instance of the task runs at any given time.
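As a minimal sketch (the files:process command name and the 5-minute cadence are assumptions matching the question):

    // In app/Console/Kernel.php: check every five minutes, but skip a run
    // entirely if the previous one is still busy processing files.
    protected function schedule(Schedule $schedule)
    {
        $schedule->command('files:process')   // hypothetical Artisan command
                 ->everyFiveMinutes()
                 ->withoutOverlapping();
    }

That removes the overlap scenario that motivated the second copy operation in the first place.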
