Is there a way to start the same Spring Batch job twice for two different input files? - performance

In an effort to improve the performance of a long running batch job that processes a huge file, we would like to run the same job multiple times concurrently for different partitions of the input file (i.e. for multiple separate input files). When we do that, Spring batch complains that the same job already started and can't be started again.
I read somewhere that we can pass different jobParameters for each instance to differentiate between the job instances. We tried that but it didn't work and it's still failing with the same error. Does anyone have other ideas?

It's a huge file split into smaller files.
Does anyone have other ideas?
You can start a job for each partition file (with the partition file name as a job parameter). This way, you should have different job instances and your problem should be solved.

Related

Run multiple batch jobs in parallel Mule 4

I have a batch job that reads hundreds of images from an SFTP location and then encodes them into base64 and uploads them via API using HTTP connector.
I would like to make the process run quicker and hence trying to split the payload into 2 via scatter-gather and then sending then sending payload1 to one batch job in a subflow and payload2 to another batch job in another subflow.
Is this the right approach?
Or is it possible to split the load in just one batch process, ie for one half of the payload to be processed by batch step 1 and second half will be processed by batch step 2 at the same time?
Thank you
No, it is not a good approach. Batch jobs are always executed asynchronously (ie using different threads) so there is no benefit on using scatter-gather and it has the cons of increasing resource usage.
Splitting the payload in different batch steps doesn't make sense either. You should not try to scale by adding steps.
Batch jobs should be used naturally to work in parallel by iterating on an input. It may be able to handle the splitting itself or you can manually split the input payload before. Then let it handle the concurrency automatically. There are some configurations you can use to tune it, like block sizing.

Spring batch file resume after Server failure

I have configured Spring Boot Batch to process Fixed length flat file. I read and split columns by using FlatFileItemReader, FixedLengthTokenizer and Writing data into Database by using ItemWriter, JPA Repository.
I have a scenario like, My Server was crashed or it was stopped at the time of file processing. At this point half of the file was processed(means half of the data wrote into DB). When it comes to next Job(when server was running up) the file has to start from where it stops.
For Example, A file having 1000 lines, Server was shutdown after processing 500 rows. In the next Job, The file has to start from 501 row.
I googled for solution but nothing relevant. Any help appreciated.
As far as I know, what you are asking ( restart at chunk level ) doesn't automatically exist in Spring Batch API & is something that programmer has to implement on his/her own.
Spring Batch provides Job restart feature via JobOperator.restart . This is a job level restart and a new execution id will be created for next run & whole of the job will rerun as there are other concerns like somebody put in a new file or renamed existing file to process in place of old file , how batch will know that its same input file content wise or db not changed since last run?
Due to these concerns , its imperative that programmer handles these situations via custom code.
Second concern is that when there is a server failure, job status would still be STARTED & not FAILED since it happens all of a sudden and framework couldn't update status correctly.
Following steps you need to implement ,
1.Implement a custom logic to decide if last job execution was successful or restart is needed.
2.If restart is needed, mark previous job execution as FAILED & then use JobOperator.restart(long executionId) - For a non - partitioned job , only useful impact would be the marking of job status to be correct as FAILED but whole job will restart from beginning.
There are many scenarios like,
a)job status is STARTED but all steps are marked COMPELTEDetc
b)For a partitioned job, few steps are completed , few failed & few in started etc
3.If restart is not needed, launch a new job using - JobLauncher.run.
So with above steps, you see that a real chunk level job restart is not achieved but above steps are primary things that you first need to understand & implement.
Next would be to changing your input at job restart i.e. you devise a mechanism to mark input records as processed for processed chunks ( i.e. read , processed & written ) & have a way to know what input records are not processed - then in next job run you feed modified input that is still unprocessed. So its all going to be your use case specific custom logic.
I am not aware of any inbuilt mechanism in the framework itself to achieve this. To me a Job Restart is a brand new job execution with modified / reduced input.

How to architecture file processing in laravel

I have task observe folder where files are coming from SFTP. File are big and processing one file is relatively time consuming. I am looking for best approach to do it. Here are some ideas how to do it, but I am not sure what is the best way.
Run scheduller each 5 min to check for new files
For each new file trigger event that there is new file.
Create listener which will listen for this event and which will using queues. In the listener for new files copy new file in the processing folder and process it. When processing of new files start insert record in the DB with status processing. When processing is done change record status and copy file to processed folder.
I this solution I have 2 copy operations for each file. This is because it is possible if second scheduler executes before all files are processed than some files could overlap in 2 processing jobs.
What is the best way to do it? Should I use another approach to avoid 2 copy operations? Something like to put database check during scheduler execution to see if the file is already in the processing state?
You should use the ->withoutOverlapping(); as stated in the manual of task Scheduler here.
Using this you will make sure that only one instance of the task run at any given time.

How to delete input files after successful mapreduce

We have a system that receives archives on a specified directory and on a regular basis it launches a mapreduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we're hooked into the close() method on our RecordReader to have it deleted after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has been deleted by the record reader from the first one and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
(we're done, delete input files, how?)
return 0;
}
Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also whatever the back end map reduce processing is doing could be idempotent: processing the same record twice should not be any different than processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.
The simplest way would probably be do a multiple input job, read the directory for the files before you run the job and pass those instead of a directory to the job (then delete the files in the list after the job is done).
Based on the situation you are explaining I can suggest the following solution:-
1.The process of data monitoring I.e monitoring the directory into which the archives are landing should be done by a separate process. That separate process can use some metadata table like in mysql to put status entries based on monitoring the directories. The metadata entries can also check for duplicacy.
2. Now based on the metadata entry a separate process can handle the map reduce job triggering part. Some status could be checked in metadata for triggering the jobs.
I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availabilty.

Is it possible to restart a "killed" Hadoop job from where it left off?

I have a Hadoop job that processes log files and reports some statistics. This job died about halfway through the job because it ran out of file handles. I have fixed the issue with the file handles and am wondering if it is possible to restart a "killed" job.
As it turns out, there is not a good way to do this; once a job has been killed there is no way to re-instantiate that job and re-start processing immediately prior to the first failure. There are likely some really good reasons for this but I'm not qualified to speak to this issue.
In my own case, I was processing a large set of log files and loading these files into an index. Additionally I was creating a report on the contents of these files at the same time. In order to make the job more tolerant of failures on the indexing side (a side-effect, this isn't related to Hadoop at all) I altered my job to instead create many smaller jobs, each one of these jobs processing a chunk of these log files. When one of these jobs finishes, it renames the processed log files so that they are not processed again. Each job waits for the previous job to complete before running.
Chaining multiple MapReduce jobs in Hadoop
When one job fails, all of the subsequent jobs quickly fail afterward. Simply fixing whatever the issue was and the re-submitting my job will, roughly, pick up processing where it left off. In the worst-case scenario where a job was 99% complete at the time of it's failure, that one job will be erroneously and wastefully re-processed.

Resources