Using Quartz for long running job - spring

I'm planning to use Quartz scheduler to process a one-time job.
My use case is, I need to migrate BLOB from one storage to another and blob's can be as big as 100GB, so a particular job can run really long enough to get the work done.
The reason I'm using Quartz because of its clustering support, fault tolerance and retry capabilities in case job fails etc. Only thing I'm concerned about is, I might have a lot of miss fire trigger scenario and a lot of database lock which can hamper live production traffic on those database hosts. I will probably be scheduling 10s of thousands of job in one shot.
Few of the things that I figured out is
I can set a high value for org.quartz.jobStore.misfireThreshold so that miss fire does not happen. I don't really care about the time when the job get's picked up as it's background job and no SLA as such. Only thing I care about is that eventually job getting picked up and getting work done.
I can also set batch mode properties org.quartz.scheduler.batchTriggerAcquisitionMaxCount and org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow. I understand the batch max count property should be like equal to the thread pool size which can give the biggest bang on performance but what should be the value of fire ahead of time window be?
I'm using Quartz with Spring boot and will be leveraging org.quartz.impl.jdbcjobstore.JobStoreCMT. What I understand is execute method of the job get wrapped in the transaction, will this cause any problem since transaction will be open for a long time as the job might take hours to complete? Is this something ok? I will be using Oracle database.
Am I missing something here? Can someone share their experience with a similar use case?
Thanks!

Related

Spring and scheduled tasks on different Data Centers

I have one spring scheduler , which I will be deploying in 2 different data center.
My data centers will be in active and passive mode. I am looking for a mechanism where passive data center scheduler start working where that data center become active .
We can do it using manually changing some configurations to true/false but , I am looking for a automated process.
-Initial state:
Data center A active - Scheduler M is running.
Data center B passive - Scheduler M is turned off.
-May be after 3 days.
Data center A passive - Scheduler M turned off.
Data center B active - Scheduler M is starting
I don't know your business requirements but unless you want multiple instances running but only one active, the purpose you will have a load balancer would be to spread the load to multiple instances of the same application rather to stick with only one instance.
Anyway I think an easy way of doing this without using a very sophisticated mechanism (coming with a lot of complexity depending where you run your application) would be this:
Have shared location such as a semaphore table in your database storing the ID of the application instance owning the scheduler process
Have a timeout set for each task. Say if the scheduler is supposed to run every two minutes set the timeout to two minutes.
Have your schedulers always kick off on all application instances
Once the tasks kicks off first check if it is the one owning the processing. If yes do the work, if not go at point 7.
After doing the work record the time stamp of the task completion in the semaphore table
Wait for the time to pass for the next kick off
If not the one owning the processing check when the task last run in the semaphore table. If the time since last run is greater than the timeout set for that process take the ownership of the process (recording your application instance id in the semaphore table)
We applied this and it ran very well with one of our applications. In reality it was much more complex than explained above as we had a lot of application instances and we had to avoid starting an ownership battle between them. To address this we put in place a "Permission to process request" concept so no matter how many instances wanted to take control it was only one which was granted.
For another application with similar requirements we used a much much easier way to achieve this but the price we paid was some extra learning curve in using ILock from Hazelcast IMGB framework. That is really very easy but keep in mind the Hazelcat community edition comes with absolutely no security and paying for a Hazelcast license just to achieve this may be a bit of expense.
Again all depends on you use case, for us the semaphore table was good enough in first scenario but prove bad in the second one as the multiple processes trying to update the same table at the same time ended up with a lot of database contention which took us to Hazelcast.
Other ideas would be a custom health check implementation that could trigger activating one scheduler or the other depending of response received.
Hope that helps, just ideas from our experience. Good luck.

Spring Boot, Cron job synchronization

In my Spring Boot application, based on the Cron job(runs every 5 minutes) I need to process 2000 products in my database.
Right now the process time of these 2000 products takes more than 5 minutes. I ran into the issue where the second Cron job runs when the first one is not completed yet.
Is there in Spring/Cron out of the box functionality that will allow to synchronize these jobs and wait for the previous job completion before starting the next one?
Please advise how to properly implement such kind of system. Anyway, the following technologies are also available Neo4j, MongoDB, Kafka. Please advise how to properly design/implement this functionality using the Spring/Cron separately or even together with the mentioned technologies.
1) You may try to use #Scheduled(fixedDelay = 5*60*1000). It will guarantee that next invocation will happen strictly in 5 minutes after previous one is finished. But this may break your scheduling requirements
2) You can limit the underlying ThreadExecutor's pool size to 1 thread, so next invocation will have to wait until previous is finished, but this, again, can break the logic, since it would affect all periodic tasks invoked by #Scheduled
3) You can use Quartz instead of spring's native #Scheduled. It's more complicated to configure, but allows to achieve the desired behaviour via #DisallowConcurrentExecution annotation or via setting JobDetail::isConcurrentExectionDisallowed in your job details

How to keep webserver responsive while executing many asynchronous background tasks

I am working on a web application that provides its users to optionally execute long-running processes 'in background'. An example would be some long-running report generation, or deleting thousands of objects simultaneously.
I've implemented this using an ExecutorService defined as FixedThreadPool using a ThreadFactory. The ThreadFactory is built like this:
ThreadFactoryBuilder()
.setNameFormat(clientId + "-BackgroundTask-%d")
.setDaemon(true)
.setPriority(Thread.MIN_PRIORITY)
.build()
I execute the task like this:
Future<TaskStatus> future = clientExecutors.get(clientId).submit(
backgroundTask::execute);
taskFutures.put(backgroundTask.getTaskId(), future);
How can I enforce my webserver to always priorize handling new incoming requests (as fast as possible) over executing background tasks?
In other words: It should never ever happen, that a user has to wait long time while browsing the site, just because there are a lot of background-tasks executing. As you can see from above, I tried to do this by setting .setPriority(Thread.MIN_PRIORITY). However that does not seem to be sufficient.
Furthermore, as for now, I've set some arbitrary value for the FixedThreadPool size (10) and use it globally for the entire background-handling of the application (and all its customers).
Instead I would like to define a threadpool for each customer, to make sure each customer has the same privilege to run a certain amount of tasks in the background. Say, each customer has a FixedThreadPool of size 5, and on the server I'll have a max. of 50 different customers. That would add up to 250 running background tasks at the same time.
The most important requirement here is: it does not matter, how long these background-tasks need to execute (say 2 minutes, or 20 minutes). What is important, is that each customer has the ability to send 5 tasks to be executed in background, and each of those are worked on equally.
I've tested running 30 cpu-intensive background tasks and it turns out that while these are running and cpu is near 100%, new incoming requests take a very long time to be handled.
So obviously, I am doing it wrong.
Update 12.09.2017
I've read about microservices and while it sounds great I see a great challenge in splitting the necessary parts from our monolithic application. Mostly because nearly every operation might turn into a long running process given a big enough data selection.
Furthermore, wouldn't I run into the same problem with my microservice, i.e. the server running the microservice would suffer the same performance degradation. Well the only good thing would, that the rest of the web app would not suffer from it anymore.
I've read some posts about introducing Thread.sleep(1) or Thread.sleep in general into CPU-heavy operations to reduce the amount of CPU used in these operations. I've also read about someone who introduced this as an aspect so that he can even change the amount of time waited dynamically in order to have some control about how much cpu would be used.
However, my gut tells me that ain't right either. What do you think about introducing Thread.sleep to lower the amount of CPU used for a task? Is this common practice? If not, what would be the right approach?
I would highly consider changing your system architecture to offload these long-running requests to a separate instance instead of running them in-process with the general request-service application. In general I think it is an anti-pattern to handle both batch / online (or long / short running) processing in the same application instance.
Ideally you'd build a standalone microservice to handle these requests, but you could also simply just deploy X instances of your existing application, and configure your load balancer to route requests to the long running invocation paths (e.g. POST /myapp/longrunningjob) only to the instances dedicated to running these long-running processes.

Update operation concurrency on multiple nodes

I have a single application , maintained on two different nodes on cloud. I have a scheduler in the application which triggers every 5 minutes, which perform some update operation in database. How can I avoid the two operations to cause anomaly in database. Is there a way one application may know, that other instance is already been triggered or any sort of inter node communication that may happen in cloud foundry.
Many Thanks
A couple options come to mind for Cloud Foundry:
Create a distributed "lock" with your database. This could be as simple as a table or record in the DB that the scheduler checks out first before it does anything else. Once it has the lock, the scheduler can work. If it fails to obtain the lock, it goes back to sleep. Then when it's done, it returns the lock.
If you have lots of work to do, you could divide it into sections and have locks for each section, that way you could spread the work out across your different instances. This gets more complicated though, so you'd have to weigh the advantages against the extra complication to see if it's worth it for your use case.
Only run the scheduler on the first node. You can determine the first node by looking at your application instance number. Either the env variable CF_INSTANCE_INDEX or VCAP_APPLICATION, which contains JSON and has an instance_index property. For either option, the value will be 0 for the first instance. If it's 0, the scheduler runs. If it's greater than zero, the scheduler doesn't run.
Hope that helps!

how to implement custom cloud worker

I am designing a cloud app and need a worker process which scours my database looking for work, and then performs it.
Most of the info I seem to find on the subject of background tasks in the cloud involves some kind of scheduler and/or queuing system.
What I have doesn't quite fit into the "run this task every 5 minutes" or "add this to the queue to be executed later" models. I think the main difference to my problem is that the workers themselves find work to do, rather than being assigned it by a periodic scheduler or an external process that generates work.
What I have is basically a giant table where each entry has three fields:
job: a small task to be performed, lets say it gets the last message from a twitter account and stores it in the database
the interval at which to perform that job: say every 5 minutes, N.B. the interval is arbitrary and different for each entry in the table
the last date when the job was performed
The way I would implement this is to have a worker which has an infinite loop. When it enters the loop, it scours the database a)looking for items whose date + interval < currentTime, b)when it finds one, it sets date = currentTime, and c)then executes the job. If there is no work ATM, it sleep for a few seconds, then tries again.
I will have many parallel workers scouring the database simultaneously, which is why I do b) first and then c) in the paragraph above. Since there are parallel workers, action a) and b) are atomic operations on the database to prevent work being duplicated. If the worker crashes after a) and b), but before it manages to finish the work, it's no big deal, and the workers can just do it at the next interval; reason for this is that the work is not performed in a time-invariant system so a backlog scenario of failed jobs has no benefit as the tasks have to be performed at their exact intervals, so it's better to skip 1 interval than to have uneven intervals between which the tasks were executed.
My question is whether that is a reasonable implementation strategy? If so, how do I bring this process to life on the cloud (I am using Heroku, but may switch to EC2 in the future)? I still haven't written any code so I would welcome other suggestions (maybe I misunderstood the use cases/applications for queue systems).
This sounds so close to using something like a scheduled job that you might as well tread the well beaten path and do it the more conventional way. There's no reason why you can't schedule a job to run once every few seconds.
However, this idea of looking for work sounds dodgy. What happens if two workers find the same task to run at the same time for instance? Also, are there not triggers in the application which can indicate that work needs doing? It seems strange that you have code 'looking for work'.
You can go a very long way with simple periodic background tasks, so I would exhaust all possibilities in that area before rolling your own.

Resources