How to roll back to the previous state if a failure happens while loading data into Snowflake in parallel

While loading data into Snowflake tables in parallel, how can I roll back the tables to their previous states if a failure occurs?
This is the scenario:
I have loaded data into the staging layer and now I want to move it into the modeling layer.
Several batches run sequentially, and within each batch the tables load in parallel.
If one batch passes successfully, execution continues with the next batch (e.g., Batch 1 to Batch 2).
Assume that in Batch 2 the load of "Table 7" fails due to a data type issue or something similar.
Batch 3 will then not execute, and the entire operation halts.
How can I restore the previous states of the tables in Batch 2 (Tables 5 and 6) and Batch 1 (Tables 1-4)? In other words, I want to roll back the entire execution so that the successfully loaded tables do not keep the new changes (inserts/updates).
A programmatic Time Travel procedure, or some other mechanism? I'd appreciate your suggestions.
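
One possible mechanism, offered as a minimal sketch rather than a definitive answer: capture a timestamp just before Batch 1 starts, and if any batch fails, restore every already-loaded table to that timestamp using Snowflake Time Travel (clone the table AT the saved timestamp, then swap it in). The account URL, credentials, and table names below are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.List;
    import java.util.Properties;

    public class TimeTravelRollback {

        // Restore each table to its state at the timestamp captured before Batch 1.
        // Pattern: clone the table AT the saved timestamp, swap it in, drop the clone.
        static void rollback(Connection conn, List<String> tables, String startTs) throws Exception {
            try (Statement stmt = conn.createStatement()) {
                for (String t : tables) {
                    stmt.execute("CREATE OR REPLACE TABLE " + t + "_restore CLONE " + t
                            + " AT (TIMESTAMP => '" + startTs + "'::timestamp_ltz)");
                    stmt.execute("ALTER TABLE " + t + " SWAP WITH " + t + "_restore");
                    stmt.execute("DROP TABLE " + t + "_restore");
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("user", "LOADER");                       // placeholder credentials
            props.put("password", System.getenv("SNOWFLAKE_PWD"));
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:snowflake://myaccount.snowflakecomputing.com/", props)) {
                // Captured with SELECT CURRENT_TIMESTAMP() before Batch 1 started.
                String startTs = "2023-01-01 02:00:00";
                rollback(conn, List.of("MODEL.TABLE_1", "MODEL.TABLE_5", "MODEL.TABLE_6"), startTs);
            }
        }
    }

Since the parallel loads run in separate sessions, a single spanning transaction isn't an option; a post-failure restore like this (or, equivalently, cloning each table before the run and swapping back on failure) is the usual workaround.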

Related

Need approach to run Spring batch continuously

We are using Spring Batch to fetch x records from the database, make an API call, and update the database for those x records. This runs in 2 steps. We don't want to run it on a scheduler; we want it to execute continuously, so that as soon as step 2 is done the job is re-executed. Can someone please advise: is running the batch continuously for a specific batch size of x records a recommended approach, and what is its impact on performance?
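
This one has no answer attached, but here is a minimal sketch of one way to re-execute a job as soon as it completes, assuming a standard JobLauncher; fetchAndUpdateJob is a placeholder name for the two-step job described above.

    import org.springframework.batch.core.BatchStatus;
    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.JobParameters;
    import org.springframework.batch.core.JobParametersBuilder;
    import org.springframework.batch.core.launch.JobLauncher;

    public class ContinuousJobRunner implements Runnable {

        private final JobLauncher jobLauncher;
        private final Job fetchAndUpdateJob; // placeholder: the two-step fetch/call-API/update job

        public ContinuousJobRunner(JobLauncher jobLauncher, Job fetchAndUpdateJob) {
            this.jobLauncher = jobLauncher;
            this.fetchAndUpdateJob = fetchAndUpdateJob;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    // Each launch needs unique parameters to create a new JobInstance.
                    JobParameters params = new JobParametersBuilder()
                            .addLong("launchTime", System.currentTimeMillis())
                            .toJobParameters();
                    // With the default synchronous launcher, run(...) blocks until
                    // step 2 finishes, so the loop re-executes the job immediately.
                    JobExecution execution = jobLauncher.run(fetchAndUpdateJob, params);
                    if (execution.getStatus() != BatchStatus.COMPLETED) {
                        break; // stop on failure so it gets attention instead of spinning
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        }
    }

Performance-wise, back-to-back launches mainly add JobRepository bookkeeping per iteration; if there are often no records to fetch, a short sleep between iterations avoids hammering the database.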

Distribute processing of records of scheduler job

I am working on a use case where I have a cron job scheduled (via Quartz) that reads certain entries from the DB and processes them.
In each scheduled run I can get thousands of records that need to be processed, and processing each record takes time (seconds to minutes). Currently all of those records are processed on a single node (the node elected by Quartz). My challenge is to parallelize this record processing. Please help me with the following concerns:
How can I distribute these records/tasks across a cluster of machines?
If a machine fails after processing a few records, the remaining records should be processed by healthy nodes in the cluster.
How do I get a signal that all record processing is finished?
Create cron jobs that run separately on each host at the desired frequency. You will need some form of lock on each record, or some form of range lock on the record set, to ensure that the servers process mutually exclusive sets of records.
E.g., you can add the following new fields to all records:
Locked By Server:
Locked for Duration (or lock expiration time):
On each run, each cron job picks a set of records whose locks are empty or expired, acquires the lock on a small set of records by populating these two fields, and then proceeds to process them. If it crashes or gets stuck, the lock expires; otherwise it is released on completion. A sketch of the acquisition step follows below.
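
A minimal sketch of that lock-acquisition step in plain JDBC; the records table and the locked_by / lock_expires_at columns are assumed names:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import java.util.ArrayList;
    import java.util.List;

    public class RecordLocker {

        // Claim up to batchSize records whose lock is empty or expired by stamping
        // this server's name and a fresh expiration time on them.
        static List<Long> acquireLocks(Connection conn, String serverName, int batchSize)
                throws SQLException {
            Timestamp now = Timestamp.from(Instant.now());
            Timestamp expiry = Timestamp.from(Instant.now().plus(10, ChronoUnit.MINUTES));

            // Step 1: find candidates with an empty or expired lock.
            List<Long> candidates = new ArrayList<>();
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT id FROM records WHERE locked_by IS NULL OR lock_expires_at < ?")) {
                select.setTimestamp(1, now);
                select.setMaxRows(batchSize);
                try (ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        candidates.add(rs.getLong("id"));
                    }
                }
            }

            // Step 2: claim each candidate with a conditional update. A concurrent
            // server may win the race, in which case the update affects 0 rows
            // and we simply skip that record.
            List<Long> claimed = new ArrayList<>();
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE records SET locked_by = ?, lock_expires_at = ? " +
                    "WHERE id = ? AND (locked_by IS NULL OR lock_expires_at < ?)")) {
                for (Long id : candidates) {
                    update.setString(1, serverName);
                    update.setTimestamp(2, expiry);
                    update.setLong(3, id);
                    update.setTimestamp(4, now);
                    if (update.executeUpdate() == 1) {
                        claimed.add(id);
                    }
                }
            }
            return claimed;
        }
    }

On completion the server would clear locked_by to release the lock; records whose processing crashed are naturally re-picked by a healthy node once lock_expires_at passes.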

Shell script for hourly run to pull data if exists

I am trying to optimize our batch process for pulling data and inserting it into a database. Currently, we pull data from a data source, create a text file, and load it into our reporting database. We have that on a time schedule in Autosys, since most of the time the data is available by a certain hour. Lately, however, the data source has been late, we are not able to pull the data during the scheduled window, and we have to run the shell script manually.
I want a shell script that runs the queries every hour and, if the data exists, spools it to a file to be loaded into the DB. If the data isn't there, it should try again the next hour, so that we can eliminate any manual intervention.
I know I can set up a file trigger in Autosys to run the load into the database if the file exists, but I am having trouble setting up the shell script to pull the data only once it is available and not pull it again the next hour if the file has already been spooled. I am new to UNIX, so I am not sure how to proceed. Any help would be great.
You haven't stated your priority clearly. The priorities could be:
load the data as soon as it is available
load the data at least once every x minutes or hours
eliminate any need for manual intervention (which is clear from your question)
This is what you could do, assuming there is no need to load the data as soon as it is available:
increase the frequency of the Autosys job (instead of hourly, make it run every 30 or 15 minutes)
change the script so that:
it attempts to load only if it has been x minutes since the last successful load; otherwise it does nothing and exits successfully (a sketch of this guard follows below)
it stores the last successful load timestamp in a file (which is touched only upon a successful load)
If data doesn't arrive even after x plus some buffer minutes, it might make more sense for the load job to fail so that it gets the required attention.

Pull Data from Hive to SQL Server without duplicates using Apache Nifi

Sorry I'm new in Apache Nifi. So i made a data flow regarding pulling data from Hive and storing it in SQL. There is no error on my data flow, the only problem is, its pulling data repeatedly.
My Data flow is consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJson
ConvertJsonTOSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so it pulls only 20 rows, i.e., exactly the number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to run only once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt the current execution; it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.).

Spring Batch - Staging table design - transfer data to main table

I am developing a data loader application that reads data from flat files and inserts it into a temporary staging table. This is done using a MultiResourcePartitioner (thread pool of 10-11, queue capacity 151), where a step runs in parallel for each file. There are around 15,000 files, so once the first set of 150 file resources completes, a JobExecutionDecider runs the next 150 files in the same way, until all 15,000 files are done.
The next step is to move the data from the staging table to the main table, after which the files are compressed and moved to another location. I would like your input on designing the staging and archiving steps for moderate throughput. If the job fails, then during the next run (i.e., some 3 hours later) it should clean up all stale data and move COMPLETED data to the main table. By COMPLETED I mean that a file was read fully and successfully inserted into the staging table; only such data is moved to the main table. We also have another table that stores the file name, file path, and completion status; it is updated on successful file completion, in the afterStep method.
Note:
We do not want to insert directly into the main table, because that table has a materialized view on it, and removing stale data from the main table would affect business users viewing reports: one moment they would see records, the next moment the records would have vanished, because the job runs hourly and cleans out unsuccessful data.
My current approach is to run a staging step in parallel with the MultiResourcePartitioner step, using a scheduled task executor with a fixed delay of 2 minutes. The tasklet bean polls the staging table, moves the rows whose processed flag is Y, and deletes them from the staging table. It also monitors the partitioner step: while that step is running, the tasklet returns RepeatStatus.CONTINUABLE; once the partitioner step completes, it returns RepeatStatus.FINISHED (a sketch follows below). A last step archives the files. When the job starts again, any file whose archiving did not complete is known from the other table that stores the file name etc., via an is_archived column (Y = not archived, A = archived). Let me know your comments.
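
For reference, a minimal sketch of the polling tasklet described above, using Spring Batch's Tasklet API; PartitionerStepMonitor and the table/column names are placeholders:

    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class StagingPollerTasklet implements Tasklet {

        private final JdbcTemplate jdbc;
        private final PartitionerStepMonitor monitor; // placeholder: reports whether the partitioned load is still running

        public StagingPollerTasklet(JdbcTemplate jdbc, PartitionerStepMonitor monitor) {
            this.jdbc = jdbc;
            this.monitor = monitor;
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
            // Move fully loaded rows (processed flag = 'Y') to the main table, then
            // remove them from staging. Both statements run inside this poll's
            // transaction; in production you would key both on the same set of file
            // IDs, so rows flagged 'Y' between the two statements aren't lost.
            jdbc.update("INSERT INTO main_table SELECT * FROM staging_table WHERE processed_flag = 'Y'");
            jdbc.update("DELETE FROM staging_table WHERE processed_flag = 'Y'");

            // Poll again while files are still being loaded; finish once the
            // partitioner step completes (the 2-minute delay between polls comes
            // from the scheduled task executor described above).
            return monitor.isPartitionerStepRunning() ? RepeatStatus.CONTINUABLE : RepeatStatus.FINISHED;
        }

        // Placeholder collaborator so the sketch compiles on its own.
        public interface PartitionerStepMonitor {
            boolean isPartitionerStepRunning();
        }
    }
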
