Update operation concurrency on multiple nodes - Oracle

I have a single application, maintained on two different nodes in the cloud. The application has a scheduler that triggers every 5 minutes and performs an update operation in the database. How can I prevent the two instances from causing anomalies in the database? Is there a way for one instance to know that the other has already been triggered, or some sort of inter-node communication available in Cloud Foundry?
Many Thanks

A couple of options come to mind for Cloud Foundry:
Create a distributed "lock" with your database. This could be as simple as a table or record in the DB that the scheduler checks out before it does anything else. Once it has the lock, the scheduler can work. If it fails to obtain the lock, it goes back to sleep. Then, when it's done, it releases the lock.
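As a minimal sketch of that idea in Java/JDBC (the SCHEDULER_LOCK table, its columns, and the Oracle-flavored SQL are all illustrative assumptions, and the row is assumed to be seeded once up front):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DbLock {
    // Atomically claim the lock row; only one instance's UPDATE will match.
    // Assumes a seeded table: SCHEDULER_LOCK(LOCK_NAME VARCHAR2(64) PRIMARY KEY,
    //                                        LOCKED_UNTIL TIMESTAMP)
    // Runs with auto-commit, or the caller commits after each call.
    public static boolean tryAcquire(Connection conn, String lockName,
                                     int holdSeconds) throws SQLException {
        String sql =
            "UPDATE scheduler_lock " +
            "   SET locked_until = SYSTIMESTAMP + NUMTODSINTERVAL(?, 'SECOND') " +
            " WHERE lock_name = ? AND locked_until < SYSTIMESTAMP";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, holdSeconds);
            ps.setString(2, lockName);
            return ps.executeUpdate() == 1; // we hold the lock iff exactly one row changed
        }
    }

    // Releasing just expires the hold, so the next acquire attempt succeeds.
    public static void release(Connection conn, String lockName) throws SQLException {
        String sql = "UPDATE scheduler_lock SET locked_until = SYSTIMESTAMP WHERE lock_name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, lockName);
            ps.executeUpdate();
        }
    }
}
```

The timeout in LOCKED_UNTIL doubles as crash protection: if an instance dies while holding the lock, the hold simply expires.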
If you have lots of work to do, you could divide it into sections and have a lock for each section; that way you could spread the work out across your different instances. This gets more complicated, though, so you'd have to weigh the advantages against the extra complexity to see if it's worth it for your use case.
Only run the scheduler on the first node. You can determine the first node by looking at your application instance index: either the environment variable CF_INSTANCE_INDEX, or VCAP_APPLICATION, which contains JSON with an instance_index property. In both cases the value is 0 for the first instance. If it's 0, the scheduler runs; if it's greater than zero, it doesn't.
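A minimal sketch of that check (the scheduledTask method here is a stand-in for your real scheduler entry point):

```java
public class SchedulerGate {
    // Cloud Foundry sets CF_INSTANCE_INDEX to "0" on the first instance.
    static boolean isFirstInstance() {
        return "0".equals(System.getenv("CF_INSTANCE_INDEX"));
    }

    public void scheduledTask() {
        if (!isFirstInstance()) {
            return; // let instance 0 do the work
        }
        // ... perform the update operation ...
    }
}
```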
Hope that helps!

Related

Spring and scheduled tasks on different Data Centers

I have one Spring scheduler, which I will be deploying in two different data centers.
My data centers will be in active and passive mode. I am looking for a mechanism where the passive data center's scheduler starts working when that data center becomes active.
We could do it by manually changing some configuration to true/false, but I am looking for an automated process.
Initial state:
- Data center A active: Scheduler M is running.
- Data center B passive: Scheduler M is turned off.
After a failover (maybe after 3 days):
- Data center A passive: Scheduler M is turned off.
- Data center B active: Scheduler M is starting.
I don't know your business requirements, but unless you want multiple instances running with only one active, the purpose of a load balancer is usually to spread the load across multiple instances of the same application rather than to stick with only one instance.
Anyway, I think an easy way of doing this without a very sophisticated mechanism (which would bring a lot of complexity depending on where you run your application) would be the following (a rough sketch in Java follows the list):
1. Have a shared location, such as a semaphore table in your database, storing the ID of the application instance that owns the scheduler process.
2. Set a timeout for each task. Say, if the scheduler is supposed to run every two minutes, set the timeout to two minutes.
3. Have your schedulers always kick off on all application instances.
4. When a task kicks off, first check whether this instance owns the processing. If yes, do the work; if not, go to step 7.
5. After doing the work, record the timestamp of the task's completion in the semaphore table.
6. Wait for the time of the next kick-off.
7. If this instance does not own the processing, check in the semaphore table when the task last ran. If the time since the last run is greater than the timeout set for that process, take ownership of the process (by recording your application instance ID in the semaphore table).
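A rough sketch of steps 4-7 in Java/JDBC, assuming a hypothetical SCHEDULER_SEMAPHORE table with columns TASK_NAME, OWNER_ID, and LAST_RUN; every name here is illustrative:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class SemaphoreScheduler {

    private final String instanceId;  // unique per application instance
    private final long timeoutMillis; // step 2: matches the task's own period

    public SemaphoreScheduler(String instanceId, long timeoutMillis) {
        this.instanceId = instanceId;
        this.timeoutMillis = timeoutMillis;
    }

    // Called from every instance's scheduler (step 3).
    public void onKickoff(Connection conn, String task, Runnable work) throws SQLException {
        String owner = null;
        Timestamp lastRun = null;
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT owner_id, last_run FROM scheduler_semaphore WHERE task_name = ?")) {
            ps.setString(1, task);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    owner = rs.getString(1);
                    lastRun = rs.getTimestamp(2);
                }
            }
        }
        if (instanceId.equals(owner)) {
            work.run();                        // step 4: we own it, do the work
            touch(conn, task);                 // step 5: record completion time
        } else if (lastRun == null
                || System.currentTimeMillis() - lastRun.getTime() > timeoutMillis) {
            claimOwnership(conn, task, owner); // step 7: owner looks dead, take over
        }
    }

    private void touch(Connection conn, String task) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE scheduler_semaphore SET last_run = SYSTIMESTAMP WHERE task_name = ?")) {
            ps.setString(1, task);
            ps.executeUpdate();
        }
    }

    // The WHERE clause guards against an ownership battle: only the instance
    // whose UPDATE still sees the old owner wins the takeover.
    private void claimOwnership(Connection conn, String task, String oldOwner) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE scheduler_semaphore SET owner_id = ? " +
                " WHERE task_name = ? AND (owner_id = ? OR owner_id IS NULL)")) {
            ps.setString(1, instanceId);
            ps.setString(2, task);
            ps.setString(3, oldOwner);
            ps.executeUpdate();
        }
    }
}
```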
We applied this and it ran very well with one of our applications. In reality it was much more complex than explained above, as we had a lot of application instances and had to avoid starting an ownership battle between them. To address this we put in place a "permission to process" request concept, so no matter how many instances wanted to take control, only one was ever granted it.
For another application with similar requirements we used a much easier way to achieve this, but the price we paid was some extra learning curve in using ILock from the Hazelcast IMDG framework. It is really easy to use, but keep in mind that the Hazelcast community edition comes with no security, and paying for a Hazelcast license just to achieve this may be a bit of an expense.
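For illustration, a minimal sketch against the Hazelcast 3.x ILock API (newer Hazelcast versions replaced this with FencedLock in the CP subsystem); the lock name is an assumption:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ILock;
import java.util.concurrent.TimeUnit;

public class SchedulerGuard {
    private final HazelcastInstance hz = Hazelcast.newHazelcastInstance();

    public void runIfOwner(Runnable task) {
        ILock lock = hz.getLock("scheduler-m"); // cluster-wide lock (Hazelcast 3.x API)
        try {
            // Try briefly; if another node holds the lock, just skip this cycle.
            if (lock.tryLock(1, TimeUnit.SECONDS)) {
                try {
                    task.run();
                } finally {
                    lock.unlock();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```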
Again, it all depends on your use case. For us the semaphore table was good enough in the first scenario, but it proved bad in the second one, as the multiple processes trying to update the same table at the same time caused a lot of database contention, which drove us to Hazelcast.
Another idea would be a custom health check implementation that could trigger activating one scheduler or the other depending on the response received.
Hope that helps, just ideas from our experience. Good luck.

Using Quartz for a long-running job

I'm planning to use Quartz scheduler to process a one-time job.
My use case is that I need to migrate BLOBs from one storage to another, and a BLOB can be as big as 100 GB, so a particular job can run for a really long time before the work is done.
The reason I'm using Quartz is its clustering support, fault tolerance, and retry capabilities in case a job fails. The only thing I'm concerned about is that I might hit a lot of misfire scenarios, and a lot of database locking that could hamper live production traffic on those database hosts. I will probably be scheduling tens of thousands of jobs in one shot.
A few of the things that I've figured out:
I can set a high value for org.quartz.jobStore.misfireThreshold so that misfires do not happen. I don't really care about the time when the job gets picked up, as it's a background job with no SLA as such. The only thing I care about is that the job eventually gets picked up and the work gets done.
I can also set the batch-mode properties org.quartz.scheduler.batchTriggerAcquisitionMaxCount and org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow. I understand the batch max count property should be roughly equal to the thread pool size, which gives the biggest bang for performance, but what should the value of the fire-ahead time window be?
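For what it's worth, a minimal sketch of wiring those properties up programmatically; the values are illustrative only, and RAMJobStore stands in for the real JobStoreCMT/Oracle configuration so the snippet stays self-contained:

```java
import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class TunedScheduler {
    public static Scheduler build() throws Exception {
        Properties props = new Properties();
        props.put("org.quartz.scheduler.instanceName", "BlobMigrator");
        props.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        props.put("org.quartz.threadPool.threadCount", "20");
        // RAMJobStore keeps this sketch runnable standalone; swap in your
        // JobStoreCMT + Oracle data source settings for the real setup.
        props.put("org.quartz.jobStore.class", "org.quartz.simpl.RAMJobStore");
        // Illustrative values: a one-hour misfireThreshold tolerates very late
        // pickup, and the batch max count matches the thread count.
        props.put("org.quartz.jobStore.misfireThreshold", "3600000");
        props.put("org.quartz.scheduler.batchTriggerAcquisitionMaxCount", "20");
        props.put("org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow", "1000");
        return new StdSchedulerFactory(props).getScheduler();
    }
}
```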
I'm using Quartz with Spring Boot and will be leveraging org.quartz.impl.jdbcjobstore.JobStoreCMT. What I understand is that the execute method of the job gets wrapped in a transaction; will this cause any problem, given that the transaction will stay open for a long time while the job runs for hours? Is this OK? I will be using an Oracle database.
Am I missing something here? Can someone share their experience with a similar use case?
Thanks!

How does Quartz detect node failures?

My production environment runs a Java scheduler job using Quartz 2.1.4 on a WebLogic cluster with 4 machines, with the scheduled job executing on only one cluster node (node 1). This worked normally for a few months, but last night node 2 suddenly decided node 1 had failed and took over the executing job. In fact, node 1 had no error (according to the server, network, database, and application logs). This event caused duplicate messages to be created, because two processes executed concurrently.
What mechanism does Quartz use to detect node failure? A ping scan, a heartbeat via UDP broadcast, database response time, or something else? Is there any configuration for it?
I have read the Quartz configuration guide (http://quartz-scheduler.org/documentation/quartz-2.1.x/configuration/ConfigJDBCJobStoreClustering), but there is no answer there.
I am using JDBCJobStore. After detailed checking, we found that there was a database (Oracle) statement executing abnormally long (from 5 sec to 30 sec). The incident happened during this period of time. Do you think it's related?
My configuration is:

```
org.quartz.threadPool.threadCount=10
org.quartz.threadPool.threadPriority=5
org.quartz.jobStore.misfireThreshold=10000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
```
Anyone have this information? Thanks.
I know the answer is very late, but maybe somebody like the two of us will still need it.
Short version: it is all handled by the DB. The important property here is org.quartz.jobStore.clusterCheckinInterval.
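For reference, a minimal sketch of the clustering-related settings, with illustrative values (the job store and data source configuration are omitted):

```java
import java.util.Properties;

public class ClusterConfig {
    public static Properties clusteringProps() {
        Properties props = new Properties();
        // Each node needs a unique instance id; AUTO generates one.
        props.put("org.quartz.scheduler.instanceId", "AUTO");
        // Enable cluster mode for the JDBC job store.
        props.put("org.quartz.jobStore.isClustered", "true");
        // How often (ms) each node updates LAST_CHECK_TIME in QRTZ_SCHEDULER_STATE;
        // a node that misses its check-in by this margin is presumed dead.
        props.put("org.quartz.jobStore.clusterCheckinInterval", "20000");
        return props;
    }
}
```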
Long version (all credit goes to http://flylib.com/books/en/2.65.1.91/1/):
Detecting Failed Scheduler Nodes
When a Scheduler instance performs the check-in routine, it looks to
see if there are other Scheduler instances that didn't check in when
they were supposed to. It does this by inspecting the SCHEDULER_STATE
table and looking for schedulers that have a value in the
LAST_CHECK_TIME column that is older than the property
org.quartz.jobStore.clusterCheckinInterval (discussed in the next
section). If one or more nodes haven't checked in, the running
Scheduler assumes that the other instance(s) have failed.
Additionally the next paragraph might also be important:
Running Nodes on Separate Machines with Unsynchronized Clocks
As you can ascertain by now, if you run nodes on different machines and the
clocks are not synchronized, you can get unexpected results. This is
because a timestamp is being used to inform other instances of the
last time one node checked in. If that node's clock was set for the
future, a running Scheduler might never realize that a node has gone
down. On the other hand, if a clock on one node is set in the past, a
node might assume that the node has gone down and attempt to take over
and rerun its jobs. In either case, it's not the behavior that you
want. When you're using different machines in a cluster (which is the
normal case), be sure to synchronize the clocks. See the section
"Quartz Clustering Cookbook," later in this chapter for details on how
to do this.

Oracle bulk jobs, how to prevent locks

We have several nightly jobs running inside an Oracle 11g R2 instance, not all of these jobs are under our control. Some of them are external data loads run by third parties. The jobs are implemented as PL/SQL packages and run using DBMS_SCHEDULER facilities.
Some of these jobs operate on the same set of data, a table with user entries: e.g. updating personal data, removing retired users, adding newly joined users. Since the jobs mostly use bulk statements to run the updates, we have run into blocking locks several times now and have had to kill individual jobs to allow others to run through.
What are good ways to prevent jobs from colliding with each other?
I am thinking about things like:
setting up a meta-job which knows about dependencies and coordinates the dependent jobs
creating schedules which keep conflicting jobs as separate as possible
coordinating jobs with the third parties to prevent conflicts between "external" and "internal" jobs
not using bulk statements (updating everything at once with a single MERGE or UPDATE) but instead updating rows one by one and committing the intermediate results
Especially the last option seems like a plausible approach to me for reducing the probability of blocking locks. But I know that performance suffers a lot when I switch our jobs from bulk updates to looping over cursors.
This may be a good use of the DBMS_LOCK package. DBMS_LOCK allows you access to the same enqueue/locking model that Oracle uses internally.
You can establish an enqueue, and then multiple processes may take that enqueue in various lock modes. Locks will show up like any other enqueue, with type 'UL' (for user lock).
For example, suppose you have three processes that can all run concurrently, but then you have a process that needs to wait for all three of those processes to run, and needs to run by itself, and then it's followed by two more processes that can run concurrently once that process completes.
You could have the first three processes take the UL enqueue in 'S' (shared) mode, and they will all be able to run concurrently. Then run the process that needs to run by itself, but at the beginning of the code, have it take the UL enqueue in 'X' (exclusive) mode. That process will wait for the three processes holding enqueue in shared mode to complete. Now, you can also run the last two processes, again, with shared mode. They will queue behind the process that's requesting exclusive mode locks, and everything runs in the order you want.
That's a simple example. With more than one UL type lock, and multiple modes that locks can be held in, your processes and locking strategy may be arbitrarily complex.
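A minimal sketch of taking such a user lock from Java via JDBC, assuming the session has EXECUTE on DBMS_LOCK; the lock name, timeout, and class structure are illustrative:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

public class UserEnqueue {

    // DBMS_LOCK modes: 4 = S (shared), 6 = X (exclusive).
    public static final int S_MODE = 4;
    public static final int X_MODE = 6;

    // Returns true if the lock was granted (status 0) or already held (status 4).
    // Other statuses: 1 = timed out, 2 = deadlock, 3 = parameter error, 5 = illegal handle.
    public static boolean request(Connection conn, String lockName,
                                  int mode, int timeoutSeconds) throws SQLException {
        String plsql =
            "DECLARE\n" +
            "  v_handle VARCHAR2(128);\n" +
            "BEGIN\n" +
            // Map the lock name to a handle; note: ALLOCATE_UNIQUE commits.
            "  DBMS_LOCK.ALLOCATE_UNIQUE(?, v_handle);\n" +
            "  ? := DBMS_LOCK.REQUEST(lockhandle => v_handle,\n" +
            "                         lockmode => ?,\n" +
            "                         timeout => ?,\n" +
            "                         release_on_commit => FALSE);\n" +
            "END;";
        try (CallableStatement cs = conn.prepareCall(plsql)) {
            cs.setString(1, lockName);
            cs.registerOutParameter(2, Types.INTEGER);
            cs.setInt(3, mode);
            cs.setInt(4, timeoutSeconds);
            cs.execute();
            int status = cs.getInt(2);
            return status == 0 || status == 4;
        }
        // Release later with DBMS_LOCK.RELEASE on a handle re-obtained via
        // ALLOCATE_UNIQUE with the same name, or let the session end.
    }
}
```

The concurrent loads would call request(conn, "user_batch", S_MODE, ...) and the exclusive job request(conn, "user_batch", X_MODE, ...), giving the ordering described above.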
Hope that helps.
It is very hard to give any advice without knowing all the details.
Simplest thing would be to schedule jobs not to overlap (if process permits).
If you cannot do that, then probably there is no easy solution, especially if there are jobs you cannot modify.
Smaller transactions make collisions less likely; however, Murphy might (and will) hit you anyway. I would start the jobs in the 'right' order...

How to implement a custom cloud worker

I am designing a cloud app and need a worker process which scours my database looking for work, and then performs it.
Most of the info I seem to find on the subject of background tasks in the cloud involves some kind of scheduler and/or queuing system.
What I have doesn't quite fit the "run this task every 5 minutes" or "add this to the queue to be executed later" models. I think the main difference in my problem is that the workers themselves find work to do, rather than being assigned it by a periodic scheduler or an external process that generates work.
What I have is basically a giant table where each entry has three fields:
job: a small task to be performed; let's say it gets the last message from a Twitter account and stores it in the database
the interval at which to perform that job: say every 5 minutes. N.B. the interval is arbitrary and differs for each entry in the table
the last date when the job was performed
The way I would implement this is to have a worker with an infinite loop. On each pass it scours the database: a) looking for items whose date + interval < currentTime; b) when it finds one, setting date = currentTime; and c) then executing the job. If there is no work at the moment, it sleeps for a few seconds, then tries again.
I will have many parallel workers scouring the database simultaneously, which is why I do b) before c) in the paragraph above. Since there are parallel workers, actions a) and b) are one atomic operation on the database, to prevent work from being duplicated. If a worker crashes after a) and b) but before it manages to finish the work, it's no big deal, and the workers can just do it at the next interval. The reason is that the work is not performed in a time-invariant system, so a backlog of failed jobs has no benefit: the tasks have to be performed at their exact intervals, and it's better to skip one interval than to have uneven intervals between executions.
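A minimal sketch of that loop in Java/JDBC, with a hypothetical JOBS table (columns ID, INTERVAL_SECONDS, LAST_RUN) and Oracle-flavored SQL to match the rest of this page; the atomic claim in b) is an UPDATE that re-checks the due condition, so two workers can never grab the same row:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class Worker {

    // a) find one candidate whose date + interval < now
    static Long findDueJob(Connection conn) throws SQLException {
        String sql =
            "SELECT id FROM jobs " +
            " WHERE last_run + NUMTODSINTERVAL(interval_seconds, 'SECOND') < SYSTIMESTAMP " +
            " FETCH FIRST 1 ROWS ONLY";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getLong(1) : null;
        }
    }

    // b) claim it atomically: if a parallel worker already advanced last_run,
    //    zero rows match and we skip. Assumes auto-commit (or commit here).
    static boolean claim(Connection conn, long jobId) throws SQLException {
        String sql =
            "UPDATE jobs SET last_run = SYSTIMESTAMP " +
            " WHERE id = ? " +
            "   AND last_run + NUMTODSINTERVAL(interval_seconds, 'SECOND') < SYSTIMESTAMP";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, jobId);
            return ps.executeUpdate() == 1;
        }
    }

    public void loop(Connection conn) throws SQLException, InterruptedException {
        while (true) {
            Long id = findDueJob(conn);
            if (id != null && claim(conn, id)) {
                runJob(id);          // c) do the actual work
            } else {
                Thread.sleep(5000);  // nothing due: back off for a few seconds
            }
        }
    }

    private void runJob(long id) { /* fetch the tweet, store it, etc. */ }
}
```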
My question is whether that is a reasonable implementation strategy? If so, how do I bring this process to life on the cloud (I am using Heroku, but may switch to EC2 in the future)? I still haven't written any code so I would welcome other suggestions (maybe I misunderstood the use cases/applications for queue systems).
This sounds so close to a scheduled job that you might as well tread the well-beaten path and do it the more conventional way. There's no reason why you can't schedule a job to run once every few seconds.
However, this idea of looking for work sounds dodgy. What happens if two workers find the same task to run at the same time, for instance? Also, are there no triggers in the application that could indicate work needs doing? It seems strange to have code 'looking for work'.
You can go a very long way with simple periodic background tasks, so I would exhaust all possibilities in that area before rolling your own.