I am working on a use case where I have a cron job scheduled (via Quartz) which reads certain entries from the DB and processes them.
In each schedule I can get thousands of records that need to be processed, and processing each record takes time (seconds to minutes). Currently all of those records are processed on a single node (the node elected by Quartz). My challenge is to parallelize this record processing. Please help me with the following concerns:
How can I distribute these records/tasks across a cluster of machines?
If any machine fails after processing a few records, the remaining records should be processed by the healthy nodes in the cluster.
Get a signal when all record processing is finished.
Create cron jobs that run separately on each host at the desired frequency. You will need some form of lock on each record, or some form of range lock on the record set, to ensure that the servers process mutually exclusive sets of records.
e.g., you can add the following new fields to all records:
Locked By Server:
Locked for Duration (or lock expiration time):
On each run, each cron picks a set of records whose locks are empty or expired, acquires the lock on a small set of them by setting these two fields, and then proceeds to process them. If it crashes or gets stuck, the lock expires; otherwise it is released on completion.
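For illustration, a minimal sketch of that claim step over JDBC (the records table and the locked_by / lock_expires_at columns are assumptions about your schema; the LIMIT syntax is MySQL-style):

```java
import java.sql.*;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class RecordClaimer {

    // Claims up to batchSize records whose lock is empty or expired.
    // Schema assumed: records(id, payload, locked_by, lock_expires_at).
    public static List<Long> claim(Connection con, String serverId,
                                   int batchSize, long lockSeconds) throws SQLException {
        List<Long> claimed = new ArrayList<>();
        Timestamp now = Timestamp.from(Instant.now());
        Timestamp expiry = Timestamp.from(Instant.now().plusSeconds(lockSeconds));
        try (PreparedStatement select = con.prepareStatement(
                 "SELECT id FROM records WHERE locked_by IS NULL OR lock_expires_at < ? LIMIT ?");
             PreparedStatement update = con.prepareStatement(
                 "UPDATE records SET locked_by = ?, lock_expires_at = ? " +
                 "WHERE id = ? AND (locked_by IS NULL OR lock_expires_at < ?)")) {
            select.setTimestamp(1, now);
            select.setInt(2, batchSize);
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong(1);
                    update.setString(1, serverId);
                    update.setTimestamp(2, expiry);
                    update.setLong(3, id);
                    update.setTimestamp(4, now);
                    // The WHERE clause re-checks the lock, so if another node
                    // claimed this row first, this update affects zero rows.
                    if (update.executeUpdate() == 1) {
                        claimed.add(id);
                    }
                }
            }
        }
        return claimed;
    }
}
```

Releasing the lock on completion is then just clearing both fields for the processed ids.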
A Spring scheduler is triggered every hour, and it fires on all of the deployed nodes.
The scheduler reads a list of data from the DB and processes each item in the list.
The processing of an item must not be duplicated across multiple nodes.
Each node should be able to uniquely identify the items in the list that it can process.
Is there any microservice/distributed-architecture pattern to achieve this using a distributed cache?
Please note:
We would not be able to acquire a lock on the DB.
Each item in the list will have a unique id.
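One common shape for this (a sketch only, not taken from any answer here) is an atomic per-item claim in the distributed cache: each node attempts a SET ... NX EX on the item's unique id, and only the node whose set succeeds processes that item. Using Redis via Jedis (the key prefix, nodeId, and TTL are assumptions):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class ItemClaim {

    // Returns true only for the single node whose SET NX succeeds for this id.
    // The TTL lets another node pick the item up if the claimer dies mid-work.
    public static boolean tryClaim(Jedis jedis, String itemId, String nodeId) {
        String result = jedis.set("claim:" + itemId, nodeId,
                                  SetParams.setParams().nx().ex(3600));
        return "OK".equals(result);
    }
}
```

Each node iterates the list, calls tryClaim for each id, and processes only the ids it wins; since the claim is a single atomic cache operation, no DB lock is needed.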
My ItemReader reads from an ever-growing existing table every time the job runs. I am looking for an option within Spring Batch to query only the new records each time the scheduled job runs.
If 50000 records were read, the next schedule should start from 50001.
My thought is to persist the id of the last record read by the ItemReader (the last of the whole reader output, not the last of each chunk) in the DB and use it in the subsequent job schedule. I will return the data from the main table sorted by id.
How do I know the last record in the writer? Any ideas?
I would make it explicit by passing the ID range of the records to be processed (i.e. fromId and toId) as job parameters when running a batch job. Then, in the ItemReader, you can rely on this ID range to select the data to process.
And somehow persist the latest ID that has already been processed to the DB (e.g. via a JobExecutionListener when the job finishes). When the next scheduled job triggers, find the next unprocessed ID and start another job instance with it as the parameter.
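A minimal sketch of that wiring (LastProcessedIdDao, currentMaxId, and the parameter names fromId/toId are assumptions, not fixed Spring Batch API):

```java
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class RangeJobLauncher implements JobExecutionListener {

    // Hypothetical DAO that persists the high-water mark between runs.
    public interface LastProcessedIdDao {
        long loadLastProcessedId();
        void saveLastProcessedId(long id);
    }

    private final LastProcessedIdDao dao;
    private final JobLauncher jobLauncher;
    private final Job job;

    public RangeJobLauncher(LastProcessedIdDao dao, JobLauncher jobLauncher, Job job) {
        this.dao = dao;
        this.jobLauncher = jobLauncher;
        this.job = job;
    }

    // Called on each schedule: derive the next range from the persisted mark.
    // currentMaxId would come from a "SELECT MAX(id)" on the main table; the
    // reader is assumed to filter on "id > :fromId AND id <= :toId ORDER BY id".
    public void launchNextRange(long currentMaxId) throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addLong("fromId", dao.loadLastProcessedId())
                .addLong("toId", currentMaxId)
                .toJobParameters();
        jobLauncher.run(job, params);
    }

    @Override
    public void beforeJob(JobExecution jobExecution) { }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // Persist the high-water mark only if the whole job completed, so a
        // failed run is retried from the same fromId next time.
        if (jobExecution.getStatus() == BatchStatus.COMPLETED) {
            dao.saveLastProcessedId(jobExecution.getJobParameters().getLong("toId"));
        }
    }
}
```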
I am trying to optimize our batch process for pulling data and inserting it into a database. Currently, we pull data from a source, create a text file, and load it into our reporting database. We run that on a time schedule in Autosys since, most of the time, the data is available by a certain time. Lately, however, the data source has been late; we cannot pull the data during the scheduled window and have to run the shell script manually.
I want a shell script that runs the queries every hour and, if the data exists, spools it to a file to be loaded into the DB. If the data isn't there, it should try again the next hour, so that we can eliminate any manual intervention.
I know I can set up a file trigger in Autosys to run the load into the database if the file exists, but I am having trouble setting up the shell script so that it only pulls the data once it is available and doesn't repeat the next hour if the file has already been spooled. I am new to UNIX, so I am not sure how to proceed. Any help would be great.
You haven't stated your priority clearly. The priorities could be:
load the data as soon as it is available
load the data at least once every x minutes or hours
eliminate any need for manual intervention (which is clear from your question)
This is what you could do, assuming there is no need to load the data as soon as it is available:
increase the frequency of the Autosys job (instead of hourly, maybe run it every 30 or 15 minutes)
change the script so that (the decision logic is sketched after this list):
it attempts the load only if it has been x minutes since the last successful load; otherwise it does nothing and exits successfully
it stores the timestamp of the last successful load in a file (which is touched only upon a successful load)
if data doesn't arrive even after x + some buffer minutes, it might make more sense for the load job to fail so that it gets the required attention.
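A sketch of that decision logic, shown in Java for concreteness (the same checks translate directly to a shell script using a touch-ed stamp file; the path and thresholds are assumptions):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;

public class LoadGate {
    static final Path STAMP = Paths.get("/var/run/report_load.last_success");

    public static void main(String[] args) throws IOException {
        Duration minInterval = Duration.ofMinutes(60); // the "x minutes"
        Duration maxSilence  = Duration.ofMinutes(90); // x + buffer

        Instant lastSuccess = Files.exists(STAMP)
                ? Files.getLastModifiedTime(STAMP).toInstant()
                : Instant.EPOCH;
        Duration sinceLast = Duration.between(lastSuccess, Instant.now());

        if (sinceLast.compareTo(minInterval) < 0) {
            return; // loaded recently: do nothing, exit successfully
        }
        if (dataAvailable()) {               // hypothetical: run the source query
            runLoad();                       // hypothetical: spool + load into DB
            Files.write(STAMP, new byte[0]); // "touch" the stamp on success
        } else if (sinceLast.compareTo(maxSilence) > 0) {
            System.exit(1); // fail the Autosys job so it gets attention
        }
    }

    static boolean dataAvailable() { return false; } // placeholder
    static void runLoad() { }                        // placeholder
}
```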
I'm evaluating NiFi for our ETL process.
I want to build the following flow:
Fetch a lot of data from a SQL database -> split into chunks of 1000 records each -> count the error records in each chunk -> count the total number of error records -> if it exceeds a threshold, fail the process -> otherwise save each chunk to the database.
The problem I can't resolve is how to wait until all chunks are validated. If, for example, I have 5 validation tasks working concurrently, I need some kind of barrier to wait until all chunks are processed and only then run the error-count processor, because I don't want to save invalid data and have to delete it if the threshold is reached.
The other question I have is whether it is possible to run this validation processor on multiple nodes in parallel and still be able to wait until they have all completed.
One solution is to use the ExecuteScript processor as a "relief valve" that holds a simple count in memory, started on the first receipt of a flowfile with a specific attribute value (stored in local/cluster state as, essentially, a map from attribute value to count). Once that count reaches a threshold, you can generate a new flowfile containing the finished attribute value and route it to the success relationship. Meanwhile, send the other results (the flowfiles that need to be batched) to a MergeContent processor with the minimum batch size set to whatever you like. The follow-on processor to the valve should have its Scheduling Strategy set to Event Driven so that it only runs when it receives a flowfile from the valve.
Updating the count in a distributed MapCache is not the correct way to do this: the fetch and the update are separate operations, and there is no atomic processor that simply increments a count.
http://apache-nifi-users-list.2361937.n4.nabble.com/How-do-I-atomically-increment-a-variable-in-NiFi-td1084.html
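A plain-Java sketch of the valve's counting logic, not actual NiFi ExecuteScript code (groupId and expectedTotal would come from flowfile attributes, e.g. the fragment count set by the splitter):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FragmentValve {
    // One count per correlation attribute value (e.g. per split "fragment id").
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    /** Returns true exactly once per group: when the last fragment arrives. */
    public boolean onFragment(String groupId, int expectedTotal) {
        // merge is atomic, so concurrent fragments are counted safely.
        int seen = counts.merge(groupId, 1, Integer::sum);
        if (seen == expectedTotal) {
            counts.remove(groupId); // reset state for this group
            return true;            // caller emits the signal flowfile
        }
        return false;
    }
}
```

When onFragment returns true, the script would emit the single signal flowfile that releases the merged batch downstream.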
I would like to know whether Trident batches are executed in parallel, i.e. can multiple batches run at a time?
Apart from this, I have a few questions which are too small to be posted individually. If they turn out to be large enough, feel free to comment and I will post them individually.
What if processing fails for only a particular tuple in a batch?
Then the batch will be replayed, so will tuples that were already successfully processed be reprocessed? Take word count, where every tuple contains a word and only some tuples were successfully counted: say there are three occurrences of the word "man" but the count shows only 2, meaning one tuple failed during processing.
In the tutorial, only the previous txid is stored. What about the transaction ids before that?
For example, suppose batches #1 and #2 have been executed and then batch #1 is replayed. The stored txid will be 2, since the most recently processed batch is #2, and there is no way to recognize whether batch #1 was previously processed or not.
If so, then the batches must be executed in order, meaning batch #2 cannot be executed until batch #1 has successfully finished. If that is the case, where is the parallelism in executing the batches?
What if only a particular function in the topology is not executed properly for a batch?
For example, I have two functions: one persists the message to a database, and the other produces to a Kafka queue. Suppose persisting to the database succeeds but pushing to the Kafka queue fails due to, say, node failures. I would then want only the function that pushes to the Kafka queue to be re-executed for that particular batch. Is there a way to do this in Trident? For that, I would need to store not only the txid but also the list of functions still to be processed for that txid. How could this be done?
As best I understand:
Any failure is considered a failure of the whole batch, and the batch will be replayed by the spout. Transactional state stores the value and the transaction id from the last update. If counting "man" failed, the stored txid for that key would be less than the current txid, so on replay the state adds this batch's data to the stored value. Otherwise, it can ignore the replay because it knows the data from this batch has already been counted for this key.
State updates are processed in strict txid order, but only by the stateful components; functions can already execute on tuples from upcoming transactions.
It sounds like you want States instead of Functions. A State will remember that it has already completed the batch and ignore it when it is replayed.
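For reference, a minimal sketch of the canonical Trident word count (adapted from the Trident tutorial; MemoryMapState is an in-memory stand-in for a real transactional state backend). persistentAggregate stores (count, txid) per word, which is exactly the mechanism described above: on replay, a key whose stored txid equals the current txid is skipped instead of double-counted.

```java
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class WordCountSketch {

    // Splits each sentence tuple into one tuple per word.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static TridentTopology build() {
        // Test spout emitting small fixed batches of sentences.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the man ran"), new Values("a man sat"), new Values("man"));
        TridentTopology topology = new TridentTopology();
        topology.newStream("spout1", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                // Stateful step: counts are committed per batch in txid order.
                .persistentAggregate(new MemoryMapState.Factory(),
                                     new Count(), new Fields("count"));
        return topology;
    }
}
```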