Need approach to run Spring Batch continuously

We are using Spring Batch to fetch x records from a database, make an API call, and then update those x records in the database. The process runs in 2 steps. We don't want to run it from a scheduler; we want it to run as a continuous process, so that the job is re-executed as soon as step 2 is done. Is it a recommended approach to run a batch job continuously for a specific batch size of x records, and what is its impact on performance?

Related

How to roll back to the previous state if a failure happens while loading data into Snowflake in parallel

While loading data into Snowflake tables in parallel, how can I roll back the tables to their previous state if a failure happens?
This is the scenario:
I have loaded data into the staging layer and now I want to move the data into the modeling layer.
Several batches run sequentially, and within each batch the tables are loaded in parallel.
If one batch passes successfully, execution continues with the next batch (e.g. Batch 1 to Batch 2).
Assume that in batch 2 the load of "Table 7" fails due to a data type issue or something similar.
Batch 3 will then not be executed and the entire operation will be halted.
How can I restore the previous state of the tables in batch 2 (Tables 5, 6) and batch 1 (Tables 1, 2, 3, 4)? That is, I want to roll back the entire execution, so that the successfully loaded tables do not keep the new changes (inserts/updates).
Is there a programmatic Time Travel procedure, or any other mechanism? I would appreciate your suggestions.

NiFi job to execute a spark-submit command not giving correct results

I have Spark code that appends data from a Hive table to Parquet files partitioned by date. The code runs correctly when executed from the Spark shell, and the Parquet files show exactly the same number of rows as the Hive table has for the corresponding date.
However, when the same code is packaged in a jar file that is invoked by a spark-submit command, and that command is scheduled to run daily at 9 AM via NiFi, the Parquet partition files end up with fewer rows. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:
• Data in the source Hive table gets updated by approximately 4 AM.
• Initially our NiFi job was scheduled to start at 4:45 AM, but the number of records did not match. When we did a manual update from the Spark shell after 6 AM, the data was an exact match.
• Hence, we scheduled the job to run at 7 AM. After this change, when the number of records was low (approx. 20000 on weekends) compared to weekdays (in the range of 150000 to >200000 records), the data was updated correctly via the NiFi job. Again, a manual run was done to backfill the missing data.
• We then postponed the job to 9 AM. After this, there were 2 days when the number of records matched (between 160000 and 200000). However, since Jul-31 the data hasn't matched at all, irrespective of the number of records on any given day, and we have to do a manual backfill every day.
We are unable to figure out any specific reason why the code runs correctly from the Spark shell at any time but gives incorrect results from NiFi, when NiFi is only scheduled to execute the spark-submit command that runs the jar file containing the same Spark code.
Please help me understand why this is happening and how I can fix it.
P.S.: I have checked the NiFi log files and could not find any of the scheduled jobs reporting an error.

Read data through Spring Batch and return the data outside the job

Everywhere I look, I find how to read data in a Spring Batch ItemReader and write it to a database using an ItemWriter. However, I only want to read the data using Spring Batch and then somehow access the list of items outside the job. I need to perform the remaining processing after the job has finished.
The reason I want to do this is that I need to perform a lot of validations on every item. I have to check whether each item's variable xyz exists in a list (which is not available within the job). After a lot of processing, I have to insert information into different tables using JPA. Please help me out!

Shell script for hourly run to pull data if exists

I am trying to optimize our batch process to pull and insert data into a database. Currently, we have a data source that we pull our data from, create a text file, and load into our reporting database. We have that on a time schedule in Autosys, since most of the time, data is available by a certain time. However, lately, the data source has been late and we are not able to pull the data during the scheduled time and we have to manually run the shell script.
I want to have a shell script that runs the queries every hour and if the data exists, spools to a file to be loaded into the DB. If the data isn't there, then try again next hour so that we can eliminate any manual intervention.
I know I can set up a file trigger in Autosys to run the load into the database if the file exists, but I am having trouble setting up the shell script so that it only pulls the data once it is available and does not repeat the pull the next hour if the file has already been spooled. I am new to UNIX, so I am not sure how to proceed. Any help would be great.
You haven't stated your priority clearly. The priorities could be:
• load the data as soon as it is available
• load the data at least once every x minutes or hours
• eliminate any need for manual intervention (which is clear from your question)
This is what you could do, assuming there is no need to load the data as soon as it is available:
• increase the frequency of the Autosys job (instead of hourly, maybe make it once every 30 or 15 minutes)
• change the script so that:
  • it attempts to load only if it has been x minutes since the last successful load, and otherwise does nothing and ends in success
  • it stores the last successful load timestamp in a file (which would be touched only upon a successful load)
• if the data doesn't arrive even after x + some buffer minutes, it might make more sense for the load job to fail so that it gets the required attention.
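The gating logic described above could look roughly like the sketch below. It is written in Python for brevity, and the same checks translate to a shell script. The names `run_once`, `data_available`, `load`, and the stamp file are hypothetical placeholders, not part of Autosys or of the original script, and the timestamp is stored as file content (rather than relying on the file's modification time) purely to keep the example easy to test.

```python
import time
from pathlib import Path


def read_last_success(stamp: Path):
    """Return the epoch seconds of the last successful load, or None if never loaded."""
    return float(stamp.read_text()) if stamp.exists() else None


def run_once(stamp, interval_s, buffer_s, data_available, load, now=None):
    """One scheduler tick: skip, load, wait, or fail depending on the stamp file.

    data_available and load are caller-supplied callables (assumptions here):
    the first runs the existence query, the second spools and loads the data.
    """
    now = time.time() if now is None else now
    last = read_last_success(stamp)

    # Loaded recently enough: do nothing and end in success.
    if last is not None and now - last < interval_s:
        return "skipped"

    if data_available():
        load()
        # The stamp is written only after a successful load.
        stamp.write_text(str(now))
        return "loaded"

    # Data is overdue by more than the buffer: fail so the job gets attention.
    if last is not None and now - last > interval_s + buffer_s:
        raise RuntimeError("data overdue; failing so the job gets attention")

    return "waiting"
```

When run as a script, an uncaught exception ends the process with a non-zero exit status, which a scheduler would normally treat as a job failure, while the "skipped" and "waiting" outcomes end normally.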

Spring XD Batch Job Incremental

I created a batch job to extract data from a CSV file into a JDBC database using the filejdbc module, and it worked properly. But when I scheduled the batch to run every 5 minutes, it did not load incrementally; it loaded all the data again. Is there any feature to schedule the batch with an incremental load?
Is the solution to run the batch once and then create a stream to do the incremental load? Will the stream load all the data again, or will it just continue from a certain point?
Please explain how I can achieve incremental loading using Spring XD.
Thanks,
Moha.
I suppose what is missing is the concept of 'state': the filejdbc module does not seem to know where the last import stopped. I do something similar, but I use a custom batch job and a meta store to keep track of where the last load stopped; that is where the next incremental load picks up from.
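To illustrate the idea of keeping state between runs, here is a minimal sketch in Python. It is not Spring XD code and not the custom job mentioned above; `load_increment`, the JSON state file, and the `write_rows` callback are hypothetical names. It assumes the CSV file is append-only and has no header row, so "where the last load stopped" can be recorded as a row count.

```python
import csv
import itertools
import json
from pathlib import Path


def load_increment(csv_path: Path, state_path: Path, write_rows) -> int:
    """Load only the rows added since the previous run; return how many were loaded.

    write_rows is a caller-supplied callable (an assumption here) that
    persists a list of rows, e.g. by inserting them into a table.
    """
    # Read the checkpoint left by the previous run, if any.
    done = 0
    if state_path.exists():
        done = json.loads(state_path.read_text())["rows_loaded"]

    # Skip the rows that were already loaded and keep the rest.
    with csv_path.open(newline="") as f:
        new_rows = list(itertools.islice(csv.reader(f), done, None))

    if new_rows:
        write_rows(new_rows)
        # Advance the checkpoint only after the rows were written.
        state_path.write_text(json.dumps({"rows_loaded": done + len(new_rows)}))
    return len(new_rows)
```

Calling it twice on an unchanged file loads nothing the second time; after new lines are appended, only those lines are passed to `write_rows`.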
Since you're using a module that came with Spring XD itself, you may not have this flexibility, but you may have two options:
a- Your destination table can define unique fields that will prevent duplicates. That way, even if it's trying to load ALL the data again, only new rows will get inserted. This assumes that the module is using 'insert ignore' (or something similar) and not just a basic insert (which will throw an error/exception). This, I must say, will become non-optimal pretty quickly.
b- If it's an option, write a module that can delete the file after it's uploaded into the db. You can construct a complex stream that will first do your data load and then the file delete.
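As a rough illustration of option (a), the sketch below uses Python's built-in sqlite3 module, where the equivalent of 'insert ignore' is spelled `INSERT OR IGNORE`. The table `people`, its columns, and the helper names are made up for the example, and whether the filejdbc module can be configured to issue such a statement is not something this sketch establishes.

```python
import sqlite3


def create_table(conn: sqlite3.Connection) -> None:
    # The PRIMARY KEY acts as the unique field that rejects duplicates.
    conn.execute("CREATE TABLE IF NOT EXISTS people (id INTEGER PRIMARY KEY, name TEXT)")


def load_rows(conn: sqlite3.Connection, rows) -> int:
    """Insert rows, silently skipping ones whose id already exists.

    Returns the number of rows that were actually new.
    """
    before = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
    conn.executemany("INSERT OR IGNORE INTO people (id, name) VALUES (?, ?)", rows)
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
    return after - before
```

Reloading the full file this way leaves existing rows untouched, but every run still sends all rows to the database, which is why the approach stops scaling as the file grows.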
