Spring and scheduled tasks on different Data Centers

I have one Spring scheduler, which I will be deploying in two different data centers.
My data centers will run in active/passive mode. I am looking for a mechanism where the scheduler in the passive data center starts working when that data center becomes active.
We could do it by manually switching some configuration to true/false, but I am looking for an automated process.
- Initial state:
Data center A active - Scheduler M is running.
Data center B passive - Scheduler M is turned off.
- Maybe after 3 days:
Data center A passive - Scheduler M is turned off.
Data center B active - Scheduler M is started.

I don't know your business requirements, but unless you want multiple instances running with only one active, the purpose of having a load balancer would be to spread the load across multiple instances of the same application rather than to stick with only one instance.
Anyway, I think an easy way of doing this without a very sophisticated mechanism (which would bring a lot of complexity, depending on where you run your application) would be this:
1. Have a shared location, such as a semaphore table in your database, storing the ID of the application instance that owns the scheduler process.
2. Set a timeout for each task. Say the scheduler is supposed to run every two minutes: set the timeout to two minutes.
3. Have your schedulers always kick off on all application instances.
4. Once the task kicks off, first check whether this instance is the one owning the processing. If yes, do the work; if not, go to point 7.
5. After doing the work, record the timestamp of the task completion in the semaphore table.
6. Wait for the next kick-off.
7. If this instance is not the one owning the processing, check in the semaphore table when the task last ran. If the time since the last run is greater than the timeout set for that process, take ownership of the process (recording your application instance ID in the semaphore table).
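The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: SemaphoreRow and OwnershipCheck are hypothetical names, the "table" is a single in-memory object rather than a database row, and the synchronized method stands in for whatever transaction or row lock the real table would need. One simplification compared with the steps as written: an instance that takes over ownership also does the work on that same kick-off.

```java
import java.util.Objects;

// Hypothetical in-memory stand-in for the single row of the semaphore table.
// A real implementation would read and update this row in the shared database.
class SemaphoreRow {
    String ownerId;        // ID of the instance that currently owns the scheduler
    long lastCompletedAt;  // timestamp (ms) of the last completed run

    SemaphoreRow(String ownerId, long lastCompletedAt) {
        this.ownerId = ownerId;
        this.lastCompletedAt = lastCompletedAt;
    }
}

class OwnershipCheck {
    private final SemaphoreRow row;
    private final long timeoutMillis;

    OwnershipCheck(SemaphoreRow row, long timeoutMillis) {
        this.row = row;
        this.timeoutMillis = timeoutMillis;
    }

    // Called by every instance on every kick-off (step 3).
    // Returns true if this instance did the work on this kick-off.
    synchronized boolean onKickOff(String instanceId, long nowMillis, Runnable work) {
        if (!Objects.equals(instanceId, row.ownerId)) {
            // Step 7: not the owner. Take over only if the last completed run
            // is older than the timeout; otherwise skip this kick-off.
            if (nowMillis - row.lastCompletedAt <= timeoutMillis) {
                return false;
            }
            row.ownerId = instanceId;
        }
        // Steps 4 and 5: the owner does the work and records the completion time.
        work.run();
        row.lastCompletedAt = nowMillis;
        return true;
    }
}
```

With a two-minute timeout, a passive instance keeps skipping its kick-offs as long as the owner keeps recording completions, and takes over on the first kick-off after the owner has been silent for more than two minutes.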
We applied this and it ran very well with one of our applications. In reality it was much more complex than explained above, as we had a lot of application instances and had to avoid starting an ownership battle between them. To address this we put in place a "permission to process request" concept, so that no matter how many instances wanted to take control, only one was granted it.
For another application with similar requirements we used a much easier way to achieve this, but the price we paid was some extra learning curve in using ILock from the Hazelcast IMDG framework. That is really very easy, but keep in mind that the Hazelcast community edition comes with absolutely no security, and paying for a Hazelcast license just to achieve this may be a bit of an expense.
Again, it all depends on your use case. For us the semaphore table was good enough in the first scenario but proved bad in the second one, as multiple processes trying to update the same table at the same time caused a lot of database contention, which took us to Hazelcast.
Another idea would be a custom health check implementation that could trigger activating one scheduler or the other depending on the response received.
Hope that helps, just ideas from our experience. Good luck.

Related

Using Quartz for long running job

I'm planning to use the Quartz scheduler to process one-time jobs.
My use case is that I need to migrate BLOBs from one storage to another. A BLOB can be as big as 100GB, so a particular job can take a really long time to get its work done.
I'm using Quartz because of its clustering support, fault tolerance, retry capabilities in case a job fails, etc. The only thing I'm concerned about is that I might get a lot of misfired triggers and a lot of database locking, which could hamper live production traffic on those database hosts. I will probably be scheduling tens of thousands of jobs in one shot.
A few of the things that I have figured out are:
I can set a high value for org.quartz.jobStore.misfireThreshold so that misfires do not happen. I don't really care about when a job gets picked up, as it's a background job with no SLA as such. The only thing I care about is that each job eventually gets picked up and gets its work done.
I can also set the batch mode properties org.quartz.scheduler.batchTriggerAcquisitionMaxCount and org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow. I understand that the batch max count should be roughly equal to the thread pool size, which gives the biggest bang for performance, but what should the value of the fire-ahead time window be?
I'm using Quartz with Spring Boot and will be leveraging org.quartz.impl.jdbcjobstore.JobStoreCMT. As I understand it, the execute method of the job gets wrapped in a transaction. Will this cause any problem, since the transaction will be open for a long time and the job might take hours to complete? Is this OK? I will be using an Oracle database.
Am I missing something here? Can someone share their experience with a similar use case?
Thanks!

How to keep webserver responsive while executing many asynchronous background tasks

I am working on a web application that lets its users optionally execute long-running processes 'in background'. An example would be some long-running report generation, or deleting thousands of objects simultaneously.
I've implemented this using an ExecutorService defined as a FixedThreadPool with a ThreadFactory. The ThreadFactory is built like this:
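// ThreadFactoryBuilder comes from Guava (com.google.common.util.concurrent)
ThreadFactory threadFactory = new ThreadFactoryBuilder()
    .setNameFormat(clientId + "-BackgroundTask-%d")  // e.g. "acme-BackgroundTask-0"
    .setDaemon(true)                                  // do not block JVM shutdown
    .setPriority(Thread.MIN_PRIORITY)                 // only a hint to the OS scheduler
    .build();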
I execute the task like this:
Future<TaskStatus> future = clientExecutors.get(clientId).submit(
backgroundTask::execute);
taskFutures.put(backgroundTask.getTaskId(), future);
How can I force my webserver to always prioritize handling new incoming requests (as fast as possible) over executing background tasks?
In other words: it should never happen that a user has to wait a long time while browsing the site just because there are a lot of background tasks executing. As you can see from above, I tried to do this by setting .setPriority(Thread.MIN_PRIORITY). However, that does not seem to be sufficient.
Furthermore, for now I've set an arbitrary value for the FixedThreadPool size (10) and use it globally for the entire background handling of the application (and all its customers).
Instead, I would like to define a thread pool for each customer, to make sure each customer has the same privilege to run a certain number of tasks in the background. Say each customer has a FixedThreadPool of size 5, and on the server I'll have a maximum of 50 different customers. That would add up to 250 background tasks running at the same time.
The most important requirement here is: it does not matter how long these background tasks need to execute (say 2 minutes, or 20 minutes). What is important is that each customer has the ability to send 5 tasks to be executed in the background, and that each of those is worked on equally.
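As an illustration of the per-customer pools described above, here is a minimal sketch using only the JDK. ClientExecutors and its method names are hypothetical, and the thread settings simply mirror the ThreadFactoryBuilder configuration shown earlier.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical registry that lazily creates one fixed-size pool per client.
class ClientExecutors {
    private static final int THREADS_PER_CLIENT = 5;

    private final ConcurrentMap<String, ExecutorService> pools = new ConcurrentHashMap<>();

    // Returns the pool for this client, creating it on first use.
    ExecutorService forClient(String clientId) {
        return pools.computeIfAbsent(clientId,
                id -> Executors.newFixedThreadPool(THREADS_PER_CLIENT, threadFactory(id)));
    }

    // Daemon threads with minimum priority, named "<clientId>-BackgroundTask-<n>".
    static ThreadFactory threadFactory(String clientId) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread thread = new Thread(runnable,
                    clientId + "-BackgroundTask-" + counter.getAndIncrement());
            thread.setDaemon(true);
            thread.setPriority(Thread.MIN_PRIORITY);
            return thread;
        };
    }

    // Stops accepting new tasks on every pool; already queued tasks still run.
    void shutdownAll() {
        pools.values().forEach(ExecutorService::shutdown);
    }
}
```

A task would then be submitted with something like executors.forClient(clientId).submit(backgroundTask::execute). Note that this only caps the number of concurrent tasks per client. It does not cap total CPU use, and thread priority is only a hint to the operating system, so 50 clients with 5 CPU-bound tasks each can still starve the request threads.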
I've tested running 30 CPU-intensive background tasks, and it turns out that while these are running and the CPU is near 100%, new incoming requests take a very long time to be handled.
So obviously, I am doing it wrong.
Update 12.09.2017
I've read about microservices, and while it sounds great, I see a great challenge in splitting the necessary parts from our monolithic application, mostly because nearly every operation might turn into a long-running process given a big enough data selection.
Furthermore, wouldn't I run into the same problem with my microservice, i.e. the server running the microservice would suffer the same performance degradation? The only good thing would be that the rest of the web app would not suffer from it anymore.
I've read some posts about introducing Thread.sleep(1), or Thread.sleep in general, into CPU-heavy operations to reduce the amount of CPU they use. I've also read about someone who introduced this as an aspect so that he could even change the wait time dynamically, in order to have some control over how much CPU is used.
However, my gut tells me that isn't right either. What do you think about introducing Thread.sleep to lower the amount of CPU used for a task? Is this common practice? If not, what would be the right approach?

I would strongly consider changing your system architecture to offload these long-running requests to a separate instance instead of running them in-process with the general request-serving application. In general, I think it is an anti-pattern to handle both batch and online (or long- and short-running) processing in the same application instance.
Ideally you'd build a standalone microservice to handle these requests, but you could also simply deploy X instances of your existing application and configure your load balancer to route requests for the long-running invocation paths (e.g. POST /myapp/longrunningjob) only to the instances dedicated to running these long-running processes.

Update operation concurrency on multiple nodes

I have a single application, deployed on two different nodes in the cloud. The application has a scheduler which triggers every 5 minutes and performs some update operation in the database. How can I prevent the two instances from causing anomalies in the database? Is there a way for one instance to know that the other has already been triggered, or any sort of inter-node communication available in Cloud Foundry?
Many Thanks

A couple options come to mind for Cloud Foundry:
1. Create a distributed "lock" with your database. This could be as simple as a table or record in the DB that the scheduler checks out first, before it does anything else. Once it has the lock, the scheduler can work. If it fails to obtain the lock, it goes back to sleep. Then, when it's done, it returns the lock.
If you have lots of work to do, you could divide it into sections and have a lock for each section; that way you could spread the work out across your different instances. This gets more complicated though, so you'd have to weigh the advantages against the extra complication to see if it's worth it for your use case.
2. Only run the scheduler on the first node. You can determine the first node by looking at your application instance number, using either the env variable CF_INSTANCE_INDEX or VCAP_APPLICATION, which contains JSON and has an instance_index property. For either option, the value will be 0 for the first instance. If it's 0, the scheduler runs. If it's greater than zero, the scheduler doesn't run.
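The instance-index check from the second option could look roughly like this in Java. It is a minimal sketch: FirstInstanceGuard is a hypothetical helper, the environment is passed in as a map so the check can be exercised without a real Cloud Foundry container, and treating a missing CF_INSTANCE_INDEX as "not the first instance" is an assumption made here, not something Cloud Foundry prescribes.

```java
import java.util.Map;

// Hypothetical helper: lets the scheduled work run only on instance 0.
class FirstInstanceGuard {

    // True only when CF_INSTANCE_INDEX is present and equals "0".
    // Assumption: if the variable is missing (e.g. when running locally),
    // this instance is treated as NOT being the first one.
    static boolean isFirstInstance(Map<String, String> env) {
        String index = env.get("CF_INSTANCE_INDEX");
        return index != null && index.trim().equals("0");
    }

    // Runs the given work only on the first instance; other instances skip it.
    static void runIfFirstInstance(Map<String, String> env, Runnable scheduledWork) {
        if (isFirstInstance(env)) {
            scheduledWork.run();
        }
    }
}
```

Inside the scheduled method, the call would be something like FirstInstanceGuard.runIfFirstInstance(System.getenv(), this::updateDatabase), where updateDatabase is a placeholder for the real update operation.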
Hope that helps!

How quartz detect nodes fails

My production environment runs a Java scheduler job using Quartz 2.1.4 on a WebLogic cluster with 4 machines. For a few months the scheduled job executed normally on only one cluster node (node 1), but last night node 2 suddenly decided that node 1 had failed and took over the executing job. In fact, node 1 had no errors (according to the server, network, database, and application logs). This event caused duplicate messages to be created, because two processes were executing concurrently.
What mechanism does Quartz use to detect node failures? A ping scan, a heartbeat via UDP broadcast, database response time, or something else? Is there any configuration for it?
I have read the Quartz configuration guide (http://quartz-scheduler.org/documentation/quartz-2.1.x/configuration/ConfigJDBCJobStoreClustering), but there is no answer there.
I am using JDBCJobStore. After checking the details, we found a database (Oracle) statement that executed abnormally long (between 5 and 30 seconds). The incident happened during this period of time. Do you think it is related?
My configuration is:
`
org.quartz.threadPool.threadCount=10
org.quartz.threadPool.threadPriority=5
org.quartz.jobStore.misfireThreshold = 10000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
`
Anyone have this information? Thanks.

I know the answer is very late, but maybe somebody like both of us will still need it.
Short version: it is all handled by the DB. The important property is org.quartz.jobStore.clusterCheckinInterval.
Long version (all credits go to http://flylib.com/books/en/2.65.1.91/1/ ) :
Detecting Failed Scheduler Nodes
When a Scheduler instance performs the check-in routine, it looks to
see if there are other Scheduler instances that didn't check in when
they were supposed to. It does this by inspecting the SCHEDULER_STATE
table and looking for schedulers that have a value in the
LAST_CHECK_TIME column that is older than the property
org.quartz.jobStore.clusterCheckinInterval (discussed in the next
section). If one or more nodes haven't checked in, the running
Scheduler assumes that the other instance(s) have failed.
Additionally the next paragraph might also be important:
Running Nodes on Separate Machines with Unsynchronized Clocks
As you can ascertain by now, if you run nodes on different machines and the
clocks are not synchronized, you can get unexpected results. This is
because a timestamp is being used to inform other instances of the
last time one node checked in. If that node's clock was set for the
future, a running Scheduler might never realize that a node has gone
down. On the other hand, if a clock on one node is set in the past, a
node might assume that the node has gone down and attempt to take over
and rerun its jobs. In either case, it's not the behavior that you
want. When you're using different machines in a cluster (which is the
normal case), be sure to synchronize the clocks. See the section
"Quartz Clustering Cookbook," later in this chapter for details on how
to do this.
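To make the check-in comparison from the first quote concrete, here is a simplified sketch. It is not Quartz's actual code: CheckinMonitor and its parameters are hypothetical, the map stands in for the rows of the scheduler state table, and the real implementation may apply additional tolerance before declaring a node failed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified illustration of timestamp-based failure detection.
class CheckinMonitor {

    // Returns the IDs of instances whose last check-in timestamp is older
    // than the check-in interval, as seen from this node's clock (nowMillis).
    static List<String> findFailedInstances(Map<String, Long> lastCheckinByInstance,
                                            long checkinIntervalMillis,
                                            long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastCheckinByInstance.entrySet()) {
            if (nowMillis - entry.getValue() > checkinIntervalMillis) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }
}
```

The sketch also shows why both unsynchronized clocks and slow database statements matter: a healthy node whose check-in row is written late, or with a timestamp from a clock that runs behind, looks exactly like a failed node to the instance doing the comparison.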

how to implement custom cloud worker

I am designing a cloud app and need a worker process which scours my database looking for work, and then performs it.
Most of the info I seem to find on the subject of background tasks in the cloud involves some kind of scheduler and/or queuing system.
What I have doesn't quite fit into the "run this task every 5 minutes" or "add this to the queue to be executed later" models. I think the main difference to my problem is that the workers themselves find work to do, rather than being assigned it by a periodic scheduler or an external process that generates work.
What I have is basically a giant table where each entry has three fields:
job: a small task to be performed; let's say it gets the last message from a Twitter account and stores it in the database
the interval at which to perform that job: say every 5 minutes, N.B. the interval is arbitrary and different for each entry in the table
the last date when the job was performed
The way I would implement this is to have a worker with an infinite loop. When it enters the loop, it scours the database, a) looking for items whose date + interval < currentTime; b) when it finds one, it sets date = currentTime; and c) it then executes the job. If there is no work at the moment, it sleeps for a few seconds, then tries again.
I will have many parallel workers scouring the database simultaneously, which is why I do b) first and then c) in the paragraph above. Since there are parallel workers, actions a) and b) are performed as one atomic operation on the database to prevent work from being duplicated. If a worker crashes after a) and b) but before it manages to finish the work, it's no big deal, and the workers can just do it at the next interval. The reason is that the work is not performed in a time-invariant system, so a backlog of failed jobs has no benefit: the tasks have to be performed at their exact intervals, so it's better to skip one interval than to have uneven intervals between executions.
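The atomic "find and claim" step, a) plus b), can be sketched like this. It is a minimal in-memory illustration, not a database implementation: JobEntry and JobTable are hypothetical names, and the synchronized method stands in for whatever the real database would use to make the two actions atomic, such as a conditional UPDATE or a row lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// One row of the hypothetical job table.
class JobEntry {
    final String job;           // description of the small task
    final long intervalMillis;  // how often the job should run
    long lastRunAt;             // "date": last time the job was claimed

    JobEntry(String job, long intervalMillis, long lastRunAt) {
        this.job = job;
        this.intervalMillis = intervalMillis;
        this.lastRunAt = lastRunAt;
    }
}

class JobTable {
    private final List<JobEntry> entries = new ArrayList<>();

    synchronized void add(JobEntry entry) {
        entries.add(entry);
    }

    // Steps a) and b) as one atomic operation: find an entry that is due
    // and immediately stamp it, so no other worker can claim the same entry.
    synchronized Optional<JobEntry> claimDueJob(long nowMillis) {
        for (JobEntry entry : entries) {
            if (entry.lastRunAt + entry.intervalMillis < nowMillis) {
                entry.lastRunAt = nowMillis;
                return Optional.of(entry);
            }
        }
        return Optional.empty();
    }
}
```

Each worker loop would call claimDueJob(System.currentTimeMillis()), execute the returned job if one is present (step c), and sleep for a few seconds otherwise. A worker that crashes after claiming simply leaves the entry stamped, so the job is skipped until its next interval, which matches the behavior described above.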
My question is whether that is a reasonable implementation strategy? If so, how do I bring this process to life on the cloud (I am using Heroku, but may switch to EC2 in the future)? I still haven't written any code so I would welcome other suggestions (maybe I misunderstood the use cases/applications for queue systems).

This sounds so close to using something like a scheduled job that you might as well tread the well beaten path and do it the more conventional way. There's no reason why you can't schedule a job to run once every few seconds.
However, this idea of looking for work sounds dodgy. What happens if two workers find the same task to run at the same time for instance? Also, are there not triggers in the application which can indicate that work needs doing? It seems strange that you have code 'looking for work'.
You can go a very long way with simple periodic background tasks, so I would exhaust all possibilities in that area before rolling your own.
