Please let me know: when I put, say, 5 files in a directory, 5 messages get generated by the poller. I want the Spring Batch job to be triggered only once, not five times, if the files arrive together within, say, a 1-minute window. Is this possible?
You may consider using an Aggregator for this kind of task. It lets you collect several files together, up to an expected group size or within some time window. You need to use a static correlation key to let the component group the files.
When the group is ready, a single message is emitted and you are good to trigger a Batch job for this set of files.
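For illustration, here is a minimal sketch using the Spring Integration Java DSL (5.x style). The directory, the 10-second poll, the group size of 5 and the jobLauncher/job beans are all assumptions chosen to match the question; adjust the release strategy and timeout to your own window.

```java
import java.io.File;
import java.util.List;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecutionException;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.file.dsl.Files;

@Configuration
public class FileBatchingConfig {

    @Bean
    public IntegrationFlow batchedFileFlow(JobLauncher jobLauncher, Job job) {
        return IntegrationFlows
                .from(Files.inboundAdapter(new File("/tmp/in")),
                      e -> e.poller(Pollers.fixedDelay(10_000)))
                .aggregate(a -> a
                        // static correlation key: every polled file joins the same group
                        .correlationExpression("'batch-input'")
                        // release as soon as 5 files have arrived...
                        .releaseStrategy(group -> group.size() >= 5)
                        // ...or when the 1-minute window closes
                        .groupTimeout(60_000)
                        .sendPartialResultOnExpiry(true)
                        .expireGroupsUponCompletion(true))
                // one message carrying a List<File> -> one job launch
                .handle(List.class, (files, headers) -> {
                    JobParameters params = new JobParametersBuilder()
                            .addString("files", files.toString())
                            .addLong("runId", System.currentTimeMillis())
                            .toJobParameters();
                    try {
                        jobLauncher.run(job, params);
                    } catch (JobExecutionException ex) {
                        throw new IllegalStateException("Job launch failed", ex);
                    }
                    return null; // terminate the flow
                })
                .get();
    }
}
```

The static correlation expression puts every polled file into the same group, and sendPartialResultOnExpiry ensures that even fewer than 5 files are still released as a single message once the minute is up.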
I have a batch job that reads hundreds of images from an SFTP location, encodes them in Base64, and uploads them via an API using the HTTP connector.
I would like to make the process run quicker, so I am trying to split the payload in two via scatter-gather and then send payload1 to one batch job in a subflow and payload2 to another batch job in another subflow.
Is this the right approach?
Or is it possible to split the load within just one batch process, i.e. have one half of the payload processed by batch step 1 and the other half processed by batch step 2 at the same time?
Thank you
No, it is not a good approach. Batch jobs are always executed asynchronously (i.e. using different threads), so there is no benefit in using scatter-gather, and it has the downside of increasing resource usage.
Splitting the payload across different batch steps doesn't make sense either. You should not try to scale by adding steps.
Batch jobs are meant to work in parallel naturally, by iterating over an input. The job may be able to handle the splitting itself, or you can split the input payload manually beforehand; then let the job handle the concurrency automatically. There are some configurations you can use to tune this, like the block size.
I have configured Spring Boot Batch to process a fixed-length flat file. I read and split the columns using FlatFileItemReader and FixedLengthTokenizer, and I write the data into the database using an ItemWriter and a JPA repository.
I have a scenario where the server crashes or is stopped while a file is being processed. At that point half of the file has been processed (i.e. half of the data has been written to the DB). When the next job runs (once the server is back up), the file has to resume from where it stopped.
For example, given a file with 1000 lines, if the server was shut down after processing 500 rows, the next job has to start from row 501.
I googled for a solution but found nothing relevant. Any help appreciated.
As far as I know, what you are asking (restart at the chunk level) doesn't exist automatically in the Spring Batch API and is something the programmer has to implement on his/her own.
Spring Batch provides a job restart feature via JobOperator.restart. This is a job-level restart: a new execution id is created for the next run and the whole job reruns. There are other concerns too, e.g. somebody may have put in a new file, or renamed an existing file to be processed in place of the old one; how would the batch know that the input file is the same content-wise, or that the DB hasn't changed since the last run? Due to these concerns, it is imperative that the programmer handles these situations with custom code.
The second concern is that after a server failure the job status will still be STARTED, not FAILED, since the crash happens all of a sudden and the framework couldn't update the status correctly.
You need to implement the following steps:
1. Implement custom logic to decide whether the last job execution was successful or a restart is needed.
2. If a restart is needed, mark the previous job execution as FAILED and then use JobOperator.restart(long executionId). For a non-partitioned job, the only useful effect is correcting the job status to FAILED; the whole job will still restart from the beginning. There are many scenarios to handle, like:
a) the job status is STARTED but all steps are marked COMPLETED, etc.
b) for a partitioned job, a few steps are completed, a few failed and a few still started, etc.
3. If a restart is not needed, launch a new job using JobLauncher.run.
So with the above steps you can see that a real chunk-level job restart is not achieved, but they are the primary things you first need to understand and implement.
Next would be changing your input on job restart: devise a mechanism to mark input records as processed for the processed chunks (i.e. read, processed and written) and a way to know which input records have not been processed; then in the next job run, feed in the modified input that is still unprocessed. So it is all going to be custom logic specific to your use case.
I am not aware of any built-in mechanism in the framework itself to achieve this. To me, a job restart is a brand new job execution with a modified/reduced input.
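As a starting point for steps 1 and 2 above, here is a hypothetical recovery routine you might run at startup. The job name "flatFileJob" is an assumption, and the date types are Spring Batch 4.x style (5.x uses LocalDateTime).

```java
import java.util.Date;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobOperator;
import org.springframework.batch.core.repository.JobRepository;

public class StuckJobRecoverer {

    public void recover(JobExplorer explorer, JobRepository repository,
                        JobOperator operator) throws Exception {
        // find executions the crash left behind in STARTED state
        for (JobExecution execution : explorer.findRunningJobExecutions("flatFileJob")) {
            // steps may also be stuck in STARTED; correct them first
            for (StepExecution step : execution.getStepExecutions()) {
                if (step.getStatus().isRunning()) {
                    step.setStatus(BatchStatus.FAILED);
                    step.setEndTime(new Date());
                    repository.update(step);
                }
            }
            execution.setStatus(BatchStatus.FAILED);
            execution.setEndTime(new Date());
            repository.update(execution);
            // restart creates a NEW execution id for the same job instance
            operator.restart(execution.getId());
        }
    }
}
```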
I have a task to observe a folder where files arrive from SFTP. The files are big and processing one file is relatively time-consuming. I am looking for the best approach. Here are some ideas, but I am not sure which is the best way:
Run a scheduler every 5 minutes to check for new files.
For each new file, trigger an event that a new file has arrived.
Create a listener which listens for this event and uses queues. In the listener, copy each new file into a processing folder and process it. When processing of a new file starts, insert a record into the DB with status 'processing'. When processing is done, change the record's status and copy the file to a processed folder.
In this solution I have two copy operations for each file. This is because, if the second scheduler run executes before all files from the first run are processed, some files could overlap across two processing jobs.
What is the best way to do this? Should I use another approach to avoid the two copy operations, such as a database check during scheduler execution to see whether a file is already in the processing state?
You should use ->withoutOverlapping(), as stated in the task scheduler manual here.
Using this you will make sure that only one instance of the task runs at any given time.
I have a requirement where I have to deal with multiple files (say 300 CSV files).
I need to read --> process --> write each individual file, as I need to apply some transformation logic to the data.
For each input file there would be a corresponding transformed file, so for 300 input files we would have 300 output files.
At the end, all 300 output files need to be merged into a single file, which would be compressed and then transferred to a remote location over FTP/SFTP.
Say every hour we would have to deal with a new set of 300 files requiring the above processing, so we would schedule the job to run hourly.
How to handle multi-file processing in the above scenario using Spring Batch?
How to make the above processing happen in multiple threads?
Please suggest.
Thanks in advance.
You can use Spring's task execution and scheduling support and then use a Java ThreadPoolExecutor.
Check this answer here at SO for a very simple example.
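A minimal sketch of that combination, assuming @EnableScheduling is active on a configuration class; the directory, pool size and helper methods are hypothetical placeholders for your transform and merge logic:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class HourlyFileProcessor {

    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    @Scheduled(cron = "0 0 * * * *") // top of every hour
    public void processBatch() throws Exception {
        File[] inputs = new File("/data/in").listFiles((dir, name) -> name.endsWith(".csv"));
        if (inputs == null || inputs.length == 0) {
            return;
        }
        List<Callable<File>> tasks = new ArrayList<>();
        for (File input : inputs) {
            tasks.add(() -> transform(input)); // read -> process -> write one file
        }
        // invokeAll blocks until every per-file task has finished...
        List<Future<File>> outputs = pool.invokeAll(tasks);
        // ...then the merge/compress/SFTP step runs exactly once per hour
        mergeCompressAndUpload(outputs);
    }

    private File transform(File input) {
        // apply the transformation logic and write the output file (placeholder)
        return input;
    }

    private void mergeCompressAndUpload(List<Future<File>> outputs) {
        // concatenate the 300 outputs, compress, and push over FTP/SFTP (placeholder)
    }
}
```

invokeAll gives you a natural barrier: the merge step only starts once all per-file tasks have completed, which matches the "300 outputs, then one merged file" requirement.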
I'm trying to run around 15,000 SOAP requests through JMeter. I have 15,000 individual SOAP files in a folder.
I know that the WebService(SOAP) Request component has the option to point to a folder.
But the problem is that the files in the folder get picked up and run randomly, and a file can be run multiple times.
This is not ideal, because each request has a unique correlation id, and if a file gets run twice, the second run will fail due to a duplicate correlation id.
Is there any way I could tell JMeter to run each file only once?
Also, as certain SOAP requests depend on other requests having already run, the ability to run the files in a specified order would be desirable. Is this possible?
These seem like common problems that should have already been solved, but I can't find much on Google.
Do you guys have any ideas?
I would use the JSR223 Sampler to run a script (e.g. Groovy) to iterate through the files in the directory and store the text of each file in a String.
See, for example, this other answer about using a Groovy script to iterate a list of values.
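As an illustration, here is a sketch of such a script as a JSR223 PreProcessor attached to the request sampler, written in Java-style syntax that the Groovy engine accepts. It assumes a single-threaded Thread Group whose loop count equals the number of files, a sampler body of ${soapBody}, and the hypothetical path /data/soap:

```groovy
// "vars" is the standard JMeter JSR223 binding for thread variables.
import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;

File[] files = new File("/data/soap").listFiles();
Arrays.sort(files);                  // alphabetical order = a predictable run order

int iteration = vars.getIteration(); // 1-based thread-group iteration
if (iteration <= files.length) {
    String body = new String(Files.readAllBytes(files[iteration - 1].toPath()), "UTF-8");
    vars.put("soapBody", body);      // each file is read exactly once, in order
}
```

Sorting the array gives a deterministic order, which also addresses the sequencing requirement as long as the files are named accordingly.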
You could put the data into a CSV file and read it in using a CSV Data Set Config. If you need unique values across multiple threads, you have to create multiple files, one per thread.
You could also put the data in a database and use a JDBC Config/Sampler to access it, making sure to either (a) delete the data after it is read, or (b) mark it as 'read' using a flag. Either method prevents the same record being read twice by different threads.
If you need to run requests in order, you should structure the test plan accordingly: requests will be made sequentially, top to bottom.