I am using the Spring Batch module to read a complex file with multi-line records. The first 3 lines in the file always contain a header with a few common fields.
These common fields are used in the processing of subsequent records in the file. The job is restartable.
Suppose the input file has 10 records (note that the number of records may not be the same as the number of lines, since records can span multiple lines).
Suppose the job runs for the first time, starts reading the file from line 1, processes the first 5 records and fails while processing the 6th record.
During this first run, since the job has also parsed the header part (the first 3 lines of the file), the application can successfully process the first 5 records.
Now when the failed job is restarted, it will start from the 6th record and hence will not read the header part this time. Since the application requires certain values
contained in the header record, the job fails. I would like to know possible suggestions so that the restarted job always reads the header part and then starts
from where it left off (the 6th record in the above scenario).
Thanks in advance.
I guess the file in question does not change between runs? Then it's not necessary to re-read it; my solution builds on this assumption.
If you use one step you can:
implement a LineCallbackHandler
give it access to the StepExecution context (it's easy with annotations, but works with interfaces too, just extend StepExecutionListenerSupport)
save the header values into the ExecutionContext
extract them from the context and use them where you want to
It should work for a restart as well, because Spring Batch reads/saves the values from the first run and will provide the complete ExecutionContext for subsequent runs.
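A minimal sketch of that idea, assuming a FlatFileItemReader configured with linesToSkip=3; the class name HeaderLineHandler and the context keys are made up:

```java
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.listener.StepExecutionListenerSupport;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.file.LineCallbackHandler;

// Skipped-lines callback that stores the header lines in the step's
// ExecutionContext, which Spring Batch persists and restores on restart.
public class HeaderLineHandler extends StepExecutionListenerSupport implements LineCallbackHandler {

    private StepExecution stepExecution;

    @Override
    public void beforeStep(StepExecution stepExecution) {
        this.stepExecution = stepExecution;
    }

    @Override
    public void handleLine(String line) {
        // Parse whatever you need from the header line here; storing the raw
        // line is enough for the sketch.
        ExecutionContext context = stepExecution.getExecutionContext();
        long index = context.getLong("header.count", 0);
        context.putString("header.line." + index, line);
        context.putLong("header.count", index + 1);
    }
}
```

You would then register the handler on the reader (setLinesToSkip(3), setSkippedLinesCallback(handler)) and as a step listener, and read the header.* keys back from the ExecutionContext wherever you need them.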
You can make a 2-step job where:
The first step reads the first 3 lines as header information and puts everything you need into the job context (which therefore gets saved in the DB for future executions if the job fails). If this step fails, the header info will be read again on the next run; if it passes, you can be sure the header info will always be available in the job context.
The second step can use the same file for input, but this time you can tell it to skip the first 3 lines and read the rest as is. This way you get restartability on that step, and each time the job fails it will resume where it left off.
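A rough sketch of that two-step layout, assuming Spring Batch 5 style Java configuration; the file path, the processRecordsStep bean and the multi-line record mapping are placeholders you would replace with your own:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class RecordsJobConfig {

    @Bean
    public Job recordsJob(JobRepository jobRepository, Step readHeaderStep, Step processRecordsStep) {
        return new JobBuilder("recordsJob", jobRepository)
                .start(readHeaderStep)       // reruns on restart if it failed
                .next(processRecordsStep)    // your chunk step, resumes where it left off
                .build();
    }

    // Step 1: read the 3 header lines and put them into the job ExecutionContext,
    // which is persisted in the job repository and available to later executions.
    @Bean
    public Step readHeaderStep(JobRepository jobRepository, PlatformTransactionManager txManager) {
        return new StepBuilder("readHeaderStep", jobRepository)
                .tasklet((contribution, chunkContext) -> {
                    List<String> headerLines =
                            Files.readAllLines(Paths.get("input/records.txt")).subList(0, 3);
                    chunkContext.getStepContext().getStepExecution().getJobExecution()
                            .getExecutionContext().put("headerLines", new ArrayList<>(headerLines));
                    return RepeatStatus.FINISHED;
                }, txManager)
                .build();
    }

    // Step 2's reader: same file, but the header is skipped on every run.
    @Bean
    public FlatFileItemReader<String> recordReader() {
        return new FlatFileItemReaderBuilder<String>()
                .name("recordReader")
                .resource(new FileSystemResource("input/records.txt"))
                .linesToSkip(3)                          // always skip the header
                .lineMapper(new PassThroughLineMapper()) // replace with your multi-line record mapping
                .build();
    }
}
```

The second step's components can then pull the header values with, for example, @Value("#{jobExecutionContext['headerLines']}").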
I have a batch job that reads hundreds of images from an SFTP location, encodes them into Base64 and uploads them via an API using the HTTP connector.
I would like to make the process run quicker, so I am trying to split the payload in 2 via scatter-gather and then send payload1 to one batch job in a subflow and payload2 to another batch job in another subflow.
Is this the right approach?
Or is it possible to split the load in just one batch process, i.e. one half of the payload is processed by batch step 1 and the second half by batch step 2 at the same time?
Thank you
No, it is not a good approach. Batch jobs are always executed asynchronously (i.e. using different threads), so there is no benefit in using scatter-gather, and it has the downside of increasing resource usage.
Splitting the payload across different batch steps doesn't make sense either. You should not try to scale by adding steps.
Batch jobs are naturally meant to work in parallel by iterating over an input. The job may be able to handle the splitting itself, or you can split the input payload manually beforehand; then let it handle the concurrency automatically. There are some configurations you can use to tune it, like the block size.
I am having some trouble with the MergeRecord processor in NiFi. You can see the whole NiFi flow below: I get a JSON array from an API, then I split it, apply some filters, and then I want to build the JSON array again.
Nifi workflow
I'm able to build the correct JSON array from all the chunks, but the problem is that the processor generates data indefinitely. When I run the flow step by step (by starting/stopping every processor one by one) everything is fine, but when MergeRecord is running it keeps generating the same data even if I stop the beginning of the flow (so there are no more inputs...).
You can see a screenshot below of the data in the "merged" box that keeps stacking up.
data stacked
I scheduled this processor to run every 10 seconds, and after 30 seconds you can see that it executed 3 times and generated the same file 3 times, while there is no more data upstream. It's weird because when I look at the "original" box of the processor I can see the correct original amount of data (18.43 KB), but the merged part keeps increasing...
Here is the configuration of the MergeRecord:
configuration
I suppose I'm missing something, but I don't know what!
Thank you for your help,
Regards,
Thomas
I have configured Spring Boot Batch to process a fixed-length flat file. I read and split the columns using FlatFileItemReader and FixedLengthTokenizer, and write the data into the database using an ItemWriter with a JPA repository.
I have a scenario where my server crashed or was stopped while the file was being processed. At that point half of the file had been processed (meaning half of the data had been written into the DB). When the next job runs (once the server is back up), the file has to start from where it stopped.
For example, for a file with 1000 lines, the server was shut down after processing 500 rows. In the next job, the file has to start from row 501.
I googled for a solution but found nothing relevant. Any help is appreciated.
As far as I know, what you are asking for (restart at the chunk level) doesn't exist automatically in the Spring Batch API and is something the programmer has to implement on their own.
Spring Batch provides a job restart feature via JobOperator.restart. This is a job-level restart: a new execution id will be created for the next run and the whole job will rerun. There are also other concerns, like somebody putting in a new file or renaming an existing file to be processed in place of the old one; how will the batch know that it's the same input file content-wise, or that the DB hasn't changed since the last run?
Due to these concerns, it's imperative that the programmer handles these situations via custom code.
The second concern is that when there is a server failure, the job status will still be STARTED and not FAILED, since the crash happens all of a sudden and the framework couldn't update the status correctly.
You need to implement the following steps (a rough sketch of the decision logic follows at the end of this answer):
1. Implement custom logic to decide whether the last job execution was successful or a restart is needed.
2. If a restart is needed, mark the previous job execution as FAILED and then use JobOperator.restart(long executionId). For a non-partitioned job, the only useful effect is that the job status is correctly marked as FAILED, but the whole job will restart from the beginning.
There are many scenarios, like:
a) the job status is STARTED but all steps are marked COMPLETED, etc.
b) for a partitioned job, a few steps are completed, a few failed and a few are still STARTED, etc.
3. If a restart is not needed, launch a new job using JobLauncher.run.
So with the above steps you see that a real chunk-level job restart is not achieved, but these are the primary things that you first need to understand and implement.
The next step would be changing your input at job restart, i.e. you devise a mechanism to mark input records as processed for processed chunks (i.e. read, processed and written) and have a way to know which input records have not been processed; then in the next job run you feed in the modified input that is still unprocessed. So it's all going to be custom logic specific to your use case.
I am not aware of any built-in mechanism in the framework itself to achieve this. To me, a job restart is a brand-new job execution with modified/reduced input.
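A hedged sketch of steps 1-3, assuming Spring Batch 4.x (where JobExecution.setEndTime takes a java.util.Date); the class, method and wiring are placeholders, and depending on your version you may also need to mark the crashed step executions as FAILED:

```java
import java.util.Comparator;
import java.util.Date;
import java.util.List;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.launch.JobOperator;
import org.springframework.batch.core.repository.JobRepository;

public class JobStarter {

    // Decide whether the previous run crashed (status stuck in STARTED),
    // mark it FAILED and restart it, or simply launch a new execution.
    public void launchOrRestart(Job job, JobParameters params, JobExplorer jobExplorer,
                                JobRepository jobRepository, JobOperator jobOperator,
                                JobLauncher jobLauncher) throws Exception {

        List<JobInstance> instances = jobExplorer.getJobInstances(job.getName(), 0, 1);
        JobExecution lastExecution = instances.isEmpty() ? null
                : jobExplorer.getJobExecutions(instances.get(0)).stream()
                        .max(Comparator.comparing(JobExecution::getId))
                        .orElse(null);

        if (lastExecution != null && lastExecution.getStatus() == BatchStatus.STARTED) {
            // Server died mid-run: the framework never updated the status,
            // so correct it manually before asking for a restart.
            lastExecution.setStatus(BatchStatus.FAILED);
            lastExecution.setExitStatus(ExitStatus.FAILED);
            lastExecution.setEndTime(new Date());
            jobRepository.update(lastExecution);
            jobOperator.restart(lastExecution.getId());   // job-level restart
        } else {
            jobLauncher.run(job, params);                 // normal new execution
        }
    }
}
```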
After reading this answer (by Michael Minella):
Spring batch chunk processing, how does the reader work if the result set changes?
I assume that with JdbcPagingItemReader the query is run again for each page. In this case, when reading a new page, it is possible that a new record has been inserted at a position before this page starts, causing the last record of the previous page to be processed again.
Does this mean that, in order to prevent a record from being reprocessed, I must always set a "processed already" flag manually in the input data and check it before writing?
Is this a feasible approach?
The same question applies to a JdbcCursorItemReader when the process is interrupted (power outage) and restarted. What happens if a new record has been inserted before the current index that is saved in the ExecutionContext?
Your assumptions are right.
In the case of the JdbcPagingItemReader, this will also depend on the isolation level of your transactions (READ_COMMITTED, READ_UNCOMMITTED, ...).
In the case of the JdbcCursorItemReader, you have to ensure that the query returns the exact same result set (including its order) in the case of a restart. Otherwise the results are unpredictable.
In the batches I write, I often save the result of the selection into a CSV file in the first step and configure that step's reader with saveState=false if I cannot guarantee that the selection will produce the same results in the case of a crash. So if the first step fails, a restart will produce a completely new CSV file. After the first step, all the entries that need to be processed are in a file, and of course this file cannot change; therefore, in the case of a restart, continuing processing from the last successful chunk is possible from the second step onward.
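A sketch of that first step's reader and writer, assumed to be bean definitions inside your batch @Configuration class (imports and the InputRow POJO with id/payload properties are omitted as placeholders):

```java
// Reader over a result set that may change between runs: saveState(false)
// means a restart of this step simply reruns the whole selection.
@Bean
public JdbcCursorItemReader<InputRow> selectionReader(DataSource dataSource) {
    return new JdbcCursorItemReaderBuilder<InputRow>()
            .name("selectionReader")
            .dataSource(dataSource)
            .sql("SELECT id, payload FROM input_table ORDER BY id")
            .rowMapper(new BeanPropertyRowMapper<>(InputRow.class))
            .saveState(false)
            .build();
}

// Writer for the intermediate CSV file; a restart of step 1 produces a fresh file.
@Bean
public FlatFileItemWriter<InputRow> csvWriter() {
    return new FlatFileItemWriterBuilder<InputRow>()
            .name("csvWriter")
            .resource(new FileSystemResource("work/selection.csv"))
            .shouldDeleteIfExists(true)
            .delimited()
            .delimiter(";")
            .names("id", "payload")
            .build();
}
```

The second step then reads work/selection.csv with a restartable FlatFileItemReader (saveState left at its default of true).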
Edited:
Using a "state-column" works well, if you have a single step that does the reading (having the state-column in its where-clause), processing and writing/updating (the state-column to 'processed') the state. You just have to start the job again as a new launch, if such a job fails.
In WebSphere Liberty Java Batch,
is it possible to pass the first step's output to the next step as an input parameter?
E.g. the first step is a batchlet and the second step is a chunk. Once the first step completes its execution, its output should be passed to the second step at runtime.
I'm guessing you are thinking of this in z/OS JCL terms, where a step would write output to a temporary dataset that gets passed to a subsequent step. JSR-352 doesn't get into dataset (or file) allocation; that's up to the application code. So you could certainly have a step that wrote output into a file (or dataset), and a later step could certainly read from that same file (or dataset) if it knew the name. You could make the name a job property that is provided as a property to the batchlet and the reader. You could even externalize the value of the job property as a job parameter.
But nothing is going to delete the file for you at the end of the job (the way a temporary dataset would get deleted). You'll need to clean up the file yourself.
Is that what you were asking?
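A hedged sketch of that file hand-off, using the standard JSR-352 API; the property name handoff.file and both class names are made up, and the two artifacts are shown together only for brevity:

```java
import java.io.BufferedReader;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import javax.batch.api.AbstractBatchlet;
import javax.batch.api.BatchProperty;
import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Inject;
import javax.inject.Named;

// Step 1: the batchlet writes its output to a file named by a job property.
@Named
class ProduceFileBatchlet extends AbstractBatchlet {

    @Inject
    @BatchProperty(name = "handoff.file")
    String handoffFile;   // set in the job XML, possibly substituted from a job parameter

    @Override
    public String process() throws Exception {
        Files.write(Paths.get(handoffFile), Collections.singletonList("value computed in step 1"));
        return "COMPLETED";
    }
}

// Step 2: the chunk step's reader is given the same property and reads the file back.
@Named
class HandoffFileReader extends AbstractItemReader {

    @Inject
    @BatchProperty(name = "handoff.file")
    String handoffFile;

    private BufferedReader reader;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        reader = Files.newBufferedReader(Paths.get(handoffFile));
    }

    @Override
    public Object readItem() throws Exception {
        return reader.readLine();   // returning null ends the chunk step
    }

    @Override
    public void close() throws Exception {
        if (reader != null) {
            reader.close();
        }
    }
}
```

Remember, as noted above, that nothing deletes this file for you; a final cleanup step (or the last step itself) has to remove it.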
You can use the JobContext user data: JobContext.set/getTransientUserData().
This does not, however, allow you to populate a batch property (via @Inject @BatchProperty) in a way parallel to the manner in which you can supply values from XML via substitution with job parameters.
We have raised an issue to consider an enhancement for the next revision of the Batch specification to allow a property value to be set dynamically from an earlier portion of the execution.
In the meantime, there is also the possibility of using CDI bean scopes to share information across steps, but this is also not integrated with batch property injection.
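A minimal sketch of the transient user data hand-off (class names are made up); keep in mind it is held in memory for the current job execution only:

```java
import javax.batch.api.AbstractBatchlet;
import javax.batch.api.chunk.ItemProcessor;
import javax.batch.runtime.context.JobContext;
import javax.inject.Inject;
import javax.inject.Named;

// Step 1: store the value on the JobContext.
@Named
class ComputeBatchlet extends AbstractBatchlet {

    @Inject
    JobContext jobContext;

    @Override
    public String process() {
        jobContext.setTransientUserData("value computed in step 1");
        return "COMPLETED";
    }
}

// Step 2: any later artifact in the same job execution can read it back.
@Named
class ConsumingProcessor implements ItemProcessor {

    @Inject
    JobContext jobContext;

    @Override
    public Object processItem(Object item) {
        String handoff = (String) jobContext.getTransientUserData();
        return handoff + ":" + item;
    }
}
```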