Here is my scenario:
a1. read records from table A
a2. process these records one by one and generate a new temp table B for each record
b1. read records from table B, process the data, and save it to a file
a3. tag the record from table A as finished status
A pseudo code to describe this scenario:
foreach item in items:
    1. select a large amount of data where id = item.id and save the result into a temp table_<id>
    2. process all records in table_<id> and write them to a file
    3. update item status
    4. send message to client
This is my design:
create a Spring Batch job, set a date as its parameter
create a step1 to read records from table A
create a step2 to read records from temporary table B and start it in the processor of step1
I checked the Spring Batch docs but didn't find anything about how to nest a step inside another step's processor. It seems the Step is the minimum unit in Spring Batch and cannot be split.
Update
Here is the pseudocode for what I'm doing now to solve the problem:
(I'm using Spring Boot 2.7.8)
def Job:
    PagingItemReader(id):
        select data from temp_<id>
    FlatFileItemWriter: write the records to a file

application implements CommandLineRunner:
    items = TableARepository.selectAllBetweenDates
    for item : items:
        Service.createTempTableBWithId(item.id)
        Service.loadDataToTempTable(item.id)
        job = createJob(item.id)
        launcher.run(job)
        update item status
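For reference, a minimal Java sketch of that workaround, assuming hypothetical TableARepository, TempTableService, and TableAItem classes derived from the pseudocode (they are placeholders, not real classes). It reuses a single job definition and passes the temp table id as a job parameter instead of building a new job object per item:

import java.util.Date;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.boot.CommandLineRunner;
import org.springframework.stereotype.Component;

@Component
public class TableALoader implements CommandLineRunner {

    private final JobLauncher jobLauncher;
    private final Job exportJob;                       // job whose step-scoped reader pages over temp_<id>
    private final TableARepository tableARepository;   // placeholder, as in the pseudocode
    private final TempTableService service;            // placeholder, as in the pseudocode

    public TableALoader(JobLauncher jobLauncher, Job exportJob,
                        TableARepository tableARepository, TempTableService service) {
        this.jobLauncher = jobLauncher;
        this.exportJob = exportJob;
        this.tableARepository = tableARepository;
        this.service = service;
    }

    @Override
    public void run(String... args) throws Exception {
        for (TableAItem item : tableARepository.selectAllBetweenDates()) {
            service.createTempTableBWithId(item.getId());
            service.loadDataToTempTable(item.getId());

            JobParameters params = new JobParametersBuilder()
                    .addString("tempTableId", String.valueOf(item.getId()))
                    .addDate("launchedAt", new Date())   // makes every launch a new job instance
                    .toJobParameters();
            JobExecution execution = jobLauncher.run(exportJob, params);

            // only mark the item finished if the export job completed successfully
            if (execution.getStatus() == BatchStatus.COMPLETED) {
                tableARepository.updateStatus(item.getId(), "FINISHED");
            }
        }
    }
}

The job's reader can then be declared @StepScope and build its query from #{jobParameters['tempTableId']}, so a single job definition handles every temp table.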
A step is part of a job. It is not possible to start a step from within an item processor.
You need to break your requirement into different steps, without trying to "nest" steps into each other. In your case, you can do something like:
create a Spring Batch job, set a date as its parameter
create a step1 to read records from table A and generate a new temp table B
create a step2 to read records from temporary table B, process these records, save the data to a file, and tag the corresponding record from table A as finished
The writer of step2 would be a composite writer: it writes data to the file AND updates the status of processed records in table A. This way, if there is an issue writing the file, the status of records in table A would be rolled back (in fact, they are not processed correctly as the write operation failed, in which case they need to be reprocessed).
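As a rough sketch (the Record type, bean names, and the choice of a JdbcBatchItemWriter for the status update are assumptions, not from the original post), step2 with a composite writer could be wired like this in Spring Batch 4.x:

import java.util.Arrays;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.batch.item.support.builder.CompositeItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class Step2Configuration {

    @Bean
    public CompositeItemWriter<Record> step2Writer(FlatFileItemWriter<Record> fileWriter,
                                                   JdbcBatchItemWriter<Record> statusUpdateWriter) {
        // both delegates take part in the same chunk transaction:
        // if writing the file fails, the status update is rolled back
        return new CompositeItemWriterBuilder<Record>()
                .delegates(Arrays.asList(fileWriter, statusUpdateWriter))
                .build();
    }

    @Bean
    public Step step2(StepBuilderFactory stepBuilderFactory,
                      JdbcPagingItemReader<Record> tempTableReader,
                      CompositeItemWriter<Record> step2Writer) {
        return stepBuilderFactory.get("step2")
                .<Record, Record>chunk(100)
                .reader(tempTableReader)
                .writer(step2Writer)
                .build();
    }
}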
You should have a single job with two steps, stepA and stepB. Spring Batch provides control over the flow of step execution, so you can execute the two steps sequentially: once stepA reaches its writer and writes its data, stepB will start. You can configure stepB to read the data written by stepA.
You can also pass data between steps using the job execution context: once stepA ends, put data into the job execution context, and it can be accessed in stepB when it starts. This helps in your case because you can take the identifier of the item that stepA picked for processing and pass it to stepB, so that stepB's writer can use this identifier to update the item's final status.
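A hedged sketch of that approach (the "itemId" key, the ProcessedRecord type, and the writer bean are invented for illustration): stepA promotes the identifier from its step execution context to the job execution context, and a step-scoped writer in stepB reads it back.

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.listener.ExecutionContextPromotionListener;
import org.springframework.batch.item.ItemWriter;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ExecutionContextPassingExample {

    @Bean
    public ExecutionContextPromotionListener promotionListener() {
        // register this listener on stepA; after the step ends it copies the
        // listed keys from the step execution context to the job execution context
        ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
        listener.setKeys(new String[] {"itemId"});
        return listener;
    }

    // inside stepA (e.g. its writer or a StepExecutionListener), put the value:
    //   stepExecution.getExecutionContext().put("itemId", currentItemId);

    // stepB can then read the promoted value in a step-scoped bean
    @Bean
    @StepScope
    public ItemWriter<ProcessedRecord> statusUpdatingWriter(
            @Value("#{jobExecutionContext['itemId']}") Long itemId) {
        return items -> {
            // write the items, then update the final status of the row in
            // table A identified by itemId (DAO call omitted in this sketch)
        };
    }
}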
Related
The ItemReader reads from an ever-growing existing table every time the job runs. I am looking for an option within Spring Batch to query only the new records each time the scheduled job runs.
If 50,000 records were read, the next schedule should start from 50,001.
My thought is to persist the ID of the last record read by the ItemReader (the last of the whole reader output, not the last of each chunk) in the DB and use it in the subsequent job schedule. I will return the data sorted by ID from the main table.
How do I know the last record in the writer? Any ideas?
I would make it explicit by passing the ID range of the records (i.e. fromId and toId) that are required to be processed as job parameters when running a batch job. Then, in the ItemReader, you can rely on this ID range to return the data to process.
And somehow persist the latest ID that has already been processed to the DB (e.g. by using a JobExecutionListener when a job finishes). When the next scheduled job triggers, find the next ID that has not been processed and start another job instance with this next ID as the parameter.
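A minimal sketch of that idea, assuming hypothetical CheckpointRepository and MainTableRepository classes and an exportJob bean: the scheduler derives the ID range, passes it as job parameters, and a JobExecutionListener persists the upper bound once the job completes. A step-scoped reader can then inject the range with @Value("#{jobParameters['fromId']}") and "#{jobParameters['toId']}" into its WHERE clause.

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class IdRangeJobScheduler {

    private final JobLauncher jobLauncher;
    private final Job exportJob;                       // the batch job (assumed bean)
    private final CheckpointRepository checkpoints;    // assumed: stores the last processed ID
    private final MainTableRepository mainTable;       // assumed: access to the growing table

    public IdRangeJobScheduler(JobLauncher jobLauncher, Job exportJob,
                               CheckpointRepository checkpoints, MainTableRepository mainTable) {
        this.jobLauncher = jobLauncher;
        this.exportJob = exportJob;
        this.checkpoints = checkpoints;
        this.mainTable = mainTable;
    }

    @Scheduled(cron = "0 0 1 * * *")   // nightly, for example; requires @EnableScheduling
    public void launch() throws Exception {
        long fromId = checkpoints.findLastProcessedId() + 1;   // e.g. 50001
        long toId = mainTable.findMaxId();                     // snapshot the upper bound now

        JobParameters params = new JobParametersBuilder()
                .addLong("fromId", fromId)
                .addLong("toId", toId)
                .toJobParameters();
        jobLauncher.run(exportJob, params);
    }

    // registered on the job: persists the upper bound once the job has completed
    @Component
    public static class CheckpointListener implements JobExecutionListener {

        private final CheckpointRepository checkpoints;

        public CheckpointListener(CheckpointRepository checkpoints) {
            this.checkpoints = checkpoints;
        }

        @Override
        public void beforeJob(JobExecution jobExecution) {
            // nothing to do before the job
        }

        @Override
        public void afterJob(JobExecution jobExecution) {
            if (jobExecution.getStatus() == BatchStatus.COMPLETED) {
                checkpoints.saveLastProcessedId(jobExecution.getJobParameters().getLong("toId"));
            }
        }
    }
}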
We have a flow where GenerateTableFetch takes input from SplitJson, which provides TableName and ColumnName as arguments. Multiple tables are passed to GenerateTableFetch at once, and ExecuteSQL then executes the query.
Now I want to trigger a new process when all the files for a table have been processed by the processors below (at the end there is PutFile).
How can I find out that all the files created for a table have been processed?
You may need NIFI-5601 to accomplish this; there is a patch currently under review at the time of this writing, and I hope to get it into NiFi 1.9.0.
EDIT: Adding potential workarounds in the meantime
If you can use ListDatabaseTables instead of getting your table names from a JSON file, then you can set Include Count to true. Then you will get attributes for the table name and the count of its rows. Then you can divide the count by the value of the Partition Size in GTF and that will give you the number of fetches (let's call it X). Then add an attribute via UpdateAttribute called "parent" or something, and set it to ${UUID()}. Keep these attributes in the flow files going into GTF and ExecuteScript, then you can use Wait/Notify to wait until X flow files are received (setting Target Signal Count to ${X}) and using ${parent} as the Release Signal Identifier.
If you can't use ListDatabaseTables, then you may be able to have ExecuteSQLRecord after your SplitJSON, you can execute something like SELECT COUNT(*) FROM ${table.name}. If using ExecuteSQL, you may need a ConvertAvroToJSON, if using ExecuteSQLRecord use a JSONRecordSetWriter. Then you can extract the count from the flow file contents using EvaluateJsonPath.
Once you have the table name and the row count in attributes, you can continue with the flow I outlined above (i.e. determine the number of flow files that GTF will generate, etc.).
I have a Step in my job that reads from database A and then writes to database B & C.
If the select statement yields no results, I expect it to continue to the processor and writer as usual. However, the writer is not called!
This is a problem because my writer is a composite item writer that includes a writer which updates a control table (database C) to record that the reader returned no results.
I could obviously add a new tasklet step after the step in question, but it's a partitioned step.
Is there a configuration property for the job that allows empty reads to be marked as successful rather than as 'NOOP' or similar?
You should be able to use a StepExecutionListener for this use case instead of an ItemWriter. Within StepExecutionListener#afterStep you can look at the items-read count and, if it's 0, do that DB update. The writer piece is an ItemWriter, meaning it is intended to be used to write items that have been read.
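For illustration, a hedged sketch of such a listener (the control-table repository and its method are placeholders, not from the original post):

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.stereotype.Component;

@Component
public class EmptyReadListener implements StepExecutionListener {

    private final ControlTableRepository controlTableRepository;   // placeholder repository

    public EmptyReadListener(ControlTableRepository controlTableRepository) {
        this.controlTableRepository = controlTableRepository;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // nothing to do before the step
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        if (stepExecution.getReadCount() == 0) {
            // the reader returned no items: record that in the control table (database C)
            controlTableRepository.markNoResults(stepExecution.getStepName());
        }
        return stepExecution.getExitStatus();
    }
}

If the step is partitioned, the listener would typically be registered on the manager step, whose StepExecution should aggregate the read counts of the partitions.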
Create a custom ItemReader that returns a sentinel item if no items are read.
Then add a custom ItemWriter mapped to the sentinel item class, in which you update the control table.
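A rough sketch of that pattern (the InputRecord and SentinelRecord types and the writer beans are invented): the delegating reader emits one sentinel when the delegate returns nothing, and a ClassifierCompositeItemWriter routes it to the control-table writer.

import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.ClassifierCompositeItemWriter;

public class SentinelReader implements ItemReader<InputRecord> {

    private final ItemReader<InputRecord> delegate;   // the real (e.g. JDBC) reader
    private boolean readAnything;
    private boolean sentinelEmitted;

    public SentinelReader(ItemReader<InputRecord> delegate) {
        this.delegate = delegate;
    }

    @Override
    public InputRecord read() throws Exception {
        InputRecord item = delegate.read();
        if (item != null) {
            readAnything = true;
            return item;
        }
        if (!readAnything && !sentinelEmitted) {
            sentinelEmitted = true;
            return new SentinelRecord();   // SentinelRecord extends InputRecord (assumed)
        }
        return null;   // end of data
    }

    // routes sentinel items to the control-table writer, everything else to the
    // normal composite writer (both writers are assumed to be defined elsewhere)
    public static ClassifierCompositeItemWriter<InputRecord> classifierWriter(
            ItemWriter<InputRecord> normalWriter, ItemWriter<InputRecord> controlTableWriter) {
        ClassifierCompositeItemWriter<InputRecord> writer = new ClassifierCompositeItemWriter<>();
        writer.setClassifier(item -> item instanceof SentinelRecord ? controlTableWriter : normalWriter);
        return writer;
    }
}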
I've never used Spring Batch before but it seems like a viable option for what I am attempting to accomplish. I have about 15 CSV files for 10 institutions that I need to process nightly. I am stashing the CSV into staging tables in an Oracle database.
The CSV File may look something like this.
DEPARTMENT_ID,DEPARTMENT_NAME,DEPARTMENT_CODE
100,Computer Science & Engineering,C5321
101,Math,M333
...
However, when I process a row and add it to the database, I need to fill in an institution ID, which is determined by the folder being processed at that time.
The database table would look like this:
INSTITUTION_ID,DEPARTMENT_ID,DEPARTMENT_NAME,DEPARTMENT_CODE
1100,100,Computer Science & Engineering,C5321
There is also validation that needs to be done on each row in the CSV files. Is that something Spring Batch can handle as well?
I've seen references to custom ItemReaders and ItemWriters, but I'm not sure that is what I need. The examples I've seen seem basic, just dumping a CSV exactly as it is into a matching table.
Yes, all the tasks that you have described can be done with Spring Batch:
For the reader, you may use a MultiResourceItemReader with a wildcard pattern matching your file names.
To validate the rows from the file, you can use an item processor and handle the validation there (see the sketch after this answer).
For your case you do not need a custom item writer; you can configure the item writer as a DB item writer in your XML file.
I suggest you use the XML-based approach for the Spring Batch implementation.
The XML will be used to configure the whole architecture of your batch, as in
job -- step -- chunk -- reader -- processor -- writer
and to track errors and exceptions you can implement listeners at each stage:
-- StepExecutionListener
-- ItemReadListener
-- ItemProcessListener
-- ItemWriteListener
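The item processor mentioned above is a plain Java class even when the job is wired in XML. A rough sketch (the DepartmentCsvRow/Department classes, the institutionId job parameter, and the field names are assumptions for illustration):

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.validator.ValidationException;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
@StepScope
public class DepartmentProcessor implements ItemProcessor<DepartmentCsvRow, Department> {

    // the institution ID is derived from the folder currently being processed
    // and passed in as a job parameter (one possible approach)
    @Value("#{jobParameters['institutionId']}")
    private Long institutionId;

    @Override
    public Department process(DepartmentCsvRow row) {
        // basic validation: throwing ValidationException fails (or skips) the item,
        // depending on the step's skip configuration; returning null would filter it out
        if (row.getDepartmentCode() == null || row.getDepartmentCode().isEmpty()) {
            throw new ValidationException("Missing DEPARTMENT_CODE for department " + row.getDepartmentId());
        }
        Department department = new Department();
        department.setInstitutionId(institutionId);
        department.setDepartmentId(row.getDepartmentId());
        department.setDepartmentName(row.getDepartmentName());
        department.setDepartmentCode(row.getDepartmentCode());
        return department;
    }
}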
I am developing a data-loader application which reads data from flat files and inserts it into a temporary staging table. This is done using a MultiResourcePartitioner (thread pool 10-11, queue 151), where a step runs in parallel for each file. There are around 15,000 files, so once the first set of 150 file resources is completed, a JobExecutionDecider runs another 150 files the same way, until all 15,000 files are completed.
The next step is to move data from the staging table to the main table, and after this the files will be compressed and moved to another location. I need your input on designing the staging and archiving steps so that they have moderate throughput and, if the job fails, the next run (i.e. some 3 hours later) cleans up all the stale data and moves COMPLETED data to the main table. When I say COMPLETED, I mean a file that was read completely and inserted successfully into the staging table; only this data is moved to the main table. We also have another table that stores the file name, file path, and completion status, which is updated on successful file completion in the afterStep method.
Note:
We do not want to insert directly into the main table, because the table has a materialized view, so removing stale data from the main table would affect business users viewing reports: one moment they would see records, the next moment the records would be gone, because the job runs hourly and cleans up unsuccessful data.
My current approach is to run the staging step in parallel with the MultiResourcePartitioner step, using a scheduled task executor with a fixed delay of 2 minutes. The tasklet bean polls the staging table, moves the data whose processed flag is Y, and deletes it from the staging table; it also monitors whether the partitioner step is running or stopped. If it is running, the tasklet returns RepeatStatus.CONTINUABLE; if the partitioner step is completed, it returns RepeatStatus.FINISHED. The last step archives the files. When the job starts again, any file whose archiving did not complete is known from another table that stores the file name etc., with a column is_archived (Y = not archived, A = archived). Let me know your comments.
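For discussion, a minimal sketch of the polling tasklet described above (the StagingDao, its methods, and the partitioned step name are assumptions):

import java.util.concurrent.TimeUnit;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class StagingPollingTasklet implements Tasklet {

    private final StagingDao stagingDao;        // assumed DAO for the staging table
    private final String partitionedStepName;   // e.g. "fileLoadStep" (assumed)

    public StagingPollingTasklet(StagingDao stagingDao, String partitionedStepName) {
        this.stagingDao = stagingDao;
        this.partitionedStepName = partitionedStepName;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // move rows whose processed flag is 'Y' to the main table, then remove them from staging
        stagingDao.moveCompletedRowsToMainTable();
        stagingDao.deleteMovedRows();

        // check whether the partitioned load step in the same job execution is still running
        boolean loadStillRunning = chunkContext.getStepContext().getStepExecution()
                .getJobExecution().getStepExecutions().stream()
                .anyMatch(se -> se.getStepName().startsWith(partitionedStepName)
                        && se.getStatus().isRunning());

        if (loadStillRunning) {
            TimeUnit.MINUTES.sleep(2);        // the fixed 2-minute polling delay from the description
            return RepeatStatus.CONTINUABLE;  // Spring Batch will invoke this tasklet again
        }
        return RepeatStatus.FINISHED;         // partitioned step is done; move on to archiving
    }
}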