How to implement a custom cloud worker - Heroku

I am designing a cloud app and need a worker process which scours my database looking for work, and then performs it.
Most of the info I seem to find on the subject of background tasks in the cloud involves some kind of scheduler and/or queuing system.
What I have doesn't quite fit into the "run this task every 5 minutes" or "add this to the queue to be executed later" models. I think the main difference to my problem is that the workers themselves find work to do, rather than being assigned it by a periodic scheduler or an external process that generates work.
What I have is basically a giant table where each entry has three fields:
job: a small task to be performed, let's say it gets the last message from a Twitter account and stores it in the database
the interval at which to perform that job: say every 5 minutes, N.B. the interval is arbitrary and different for each entry in the table
the last date when the job was performed
The way I would implement this is to have a worker that runs an infinite loop. On each pass, it scours the database a) looking for items where date + interval < currentTime, b) when it finds one, it sets date = currentTime, and c) it then executes the job. If there is no work at the moment, it sleeps for a few seconds, then tries again.
I will have many parallel workers scouring the database simultaneously, which is why I do b) first and then c) in the paragraph above. Since the workers run in parallel, actions a) and b) must together form a single atomic operation on the database to prevent work being duplicated. If a worker crashes after a) and b) but before it manages to finish the work, it's no big deal: the workers can just do it at the next interval. The reason is that the work is not performed in a time-invariant system, so a backlog of failed jobs has no benefit. The tasks have to be performed at their exact intervals, so it's better to skip one interval than to have uneven intervals between executions.
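A minimal sketch of steps a) and b) as an atomic claim, assuming a SQL table named jobs with columns job, interval_s and last_run (all hypothetical names). SQLite is used here only so the example is self-contained. The claim relies on a conditional UPDATE: it succeeds only if last_run still has the value the worker read, so two workers that pick the same row cannot both win. On Postgres, SELECT ... FOR UPDATE SKIP LOCKED inside a transaction is another common way to get the same effect.

```python
import sqlite3
import time


def create_jobs_table(conn):
    # Hypothetical schema: one row per recurring job.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " job TEXT NOT NULL,"
        " interval_s REAL NOT NULL,"
        " last_run REAL NOT NULL)"
    )


def claim_due_job(conn, now):
    """Steps a) and b): find one due job and stamp it.

    Returns (id, job) when this worker won the claim, otherwise None.
    """
    row = conn.execute(
        "SELECT id, job, last_run FROM jobs"
        " WHERE last_run + interval_s < ?"
        " ORDER BY last_run LIMIT 1",
        (now,),
    ).fetchone()
    if row is None:
        return None  # nothing is due
    job_id, job, seen_last_run = row
    # Conditional update: only succeeds if no other worker has stamped
    # this row since we read it.
    cur = conn.execute(
        "UPDATE jobs SET last_run = ? WHERE id = ? AND last_run = ?",
        (now, job_id, seen_last_run),
    )
    conn.commit()
    return (job_id, job) if cur.rowcount == 1 else None


def worker_loop(conn, run_job, idle_sleep_s=2.0, max_iterations=None):
    """Step c): execute claimed jobs, sleep when idle.

    max_iterations exists only so the loop can be stopped in tests.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        claimed = claim_due_job(conn, time.time())
        if claimed is None:
            time.sleep(idle_sleep_s)
        else:
            run_job(*claimed)
```

Because the row is stamped before the job runs, a crash during step c) simply skips that interval, which matches the behaviour described above.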
My question is whether that is a reasonable implementation strategy? If so, how do I bring this process to life on the cloud (I am using Heroku, but may switch to EC2 in the future)? I still haven't written any code so I would welcome other suggestions (maybe I misunderstood the use cases/applications for queue systems).

This sounds so close to a scheduled job that you might as well tread the well-beaten path and do it the more conventional way. There's no reason why you can't schedule a job to run once every few seconds.
However, this idea of looking for work sounds dodgy. What happens if two workers find the same task to run at the same time, for instance? Also, are there no triggers in the application that can indicate that work needs doing? It seems strange that you have code 'looking for work'.
You can go a very long way with simple periodic background tasks, so I would exhaust all possibilities in that area before rolling your own.

Related

Spring and scheduled tasks on different Data Centers

I have one Spring scheduler, which I will be deploying in two different data centers.
My data centers will run in active and passive mode. I am looking for a mechanism where the passive data center's scheduler starts working when that data center becomes active.
We could do it by manually changing some configuration flags to true/false, but I am looking for an automated process.
-Initial state:
Data center A active - Scheduler M is running.
Data center B passive - Scheduler M is turned off.
-Maybe after 3 days:
Data center A passive - Scheduler M turned off.
Data center B active - Scheduler M is starting
I don't know your business requirements, but unless you specifically want multiple instances running with only one active, the usual purpose of a load balancer is to spread the load across multiple instances of the same application rather than to stick with only one instance.
Anyway, I think an easy way of doing this without a very sophisticated mechanism (which brings a lot of complexity, depending on where you run your application) would be this:
1. Have a shared location, such as a semaphore table in your database, that stores the ID of the application instance owning the scheduler process.
2. Set a timeout for each task. Say, if the scheduler is supposed to run every two minutes, set the timeout to two minutes.
3. Have your schedulers always kick off on all application instances.
4. Once the task kicks off, first check whether this instance is the one owning the processing. If yes, do the work; if not, go to point 7.
5. After doing the work, record the timestamp of the task completion in the semaphore table.
6. Wait for the time to pass until the next kick-off.
7. If this instance is not the one owning the processing, check when the task last ran in the semaphore table. If the time since the last run is greater than the timeout set for that process, take ownership of the process (recording your application instance ID in the semaphore table).
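The ownership check and takeover in the steps above could be sketched roughly as follows. This is only an illustration under assumptions: the table name scheduler_semaphore, its columns, and the function names are all hypothetical, SQLite stands in for the shared database, and the rule that the first instance to arrive registers itself as owner is an addition not described above.

```python
import sqlite3


def create_semaphore_table(conn):
    # Hypothetical schema: one row per scheduled task.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_semaphore ("
        " task TEXT PRIMARY KEY,"
        " owner_id TEXT NOT NULL,"
        " last_run REAL NOT NULL)"
    )


def should_run(conn, task, instance_id, now, timeout_s):
    """Return True if this instance owns the task or has just taken it over."""
    row = conn.execute(
        "SELECT owner_id, last_run FROM scheduler_semaphore WHERE task = ?",
        (task,),
    ).fetchone()
    if row is None:
        # Assumption: nobody owns the task yet, so try to register as owner.
        cur = conn.execute(
            "INSERT OR IGNORE INTO scheduler_semaphore (task, owner_id, last_run)"
            " VALUES (?, ?, ?)",
            (task, instance_id, now),
        )
        conn.commit()
        return cur.rowcount == 1
    owner_id, last_run = row
    if owner_id == instance_id:
        return True  # we own the processing, so do the work
    if now - last_run > timeout_s:
        # The owner looks dead. The conditional update lets only one
        # challenger win the takeover.
        cur = conn.execute(
            "UPDATE scheduler_semaphore SET owner_id = ?, last_run = ?"
            " WHERE task = ? AND owner_id = ? AND last_run = ?",
            (instance_id, now, task, owner_id, last_run),
        )
        conn.commit()
        return cur.rowcount == 1
    return False


def record_completion(conn, task, instance_id, now):
    # The owner records the completion timestamp after doing the work.
    conn.execute(
        "UPDATE scheduler_semaphore SET last_run = ? WHERE task = ? AND owner_id = ?",
        (now, task, instance_id),
    )
    conn.commit()
```

Each instance would call should_run at every kick-off and record_completion after a successful run, so a passive data center picks up the work once the active one stops refreshing last_run.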
We applied this and it ran very well with one of our applications. In reality it was much more complex than explained above as we had a lot of application instances and we had to avoid starting an ownership battle between them. To address this we put in place a "Permission to process request" concept so no matter how many instances wanted to take control it was only one which was granted.
For another application with similar requirements we used a much easier way to achieve this, but the price we paid was some extra learning curve in using ILock from the Hazelcast IMDG framework. That is really very easy, but keep in mind that the Hazelcast community edition comes with absolutely no security, and paying for a Hazelcast license just to achieve this may be a bit of an expense.
Again, it all depends on your use case. For us the semaphore table was good enough in the first scenario but proved bad in the second one, as multiple processes trying to update the same table at the same time ended up causing a lot of database contention, which took us to Hazelcast.
Other ideas would be a custom health check implementation that could trigger activating one scheduler or the other depending on the response received.
Hope that helps, just ideas from our experience. Good luck.

Cron vs queued task

My application has an Order model with an execution_datetime attribute. I'd like to send some distinct notifications. For example
execution_datetime minus 12 hours: email to carrier
execution_datetime minus 3 hours: sms to customer
execution_datetime plus 1 hour: email to customer
The above timings are not strict and can be approximated; slight deviations are acceptable. Also, the execution_datetime can change in the meantime...
I'm unsure whether to use cron or queued tasks for this. Some thoughts of my own:
Cron:
Business logic will need to be written to fetch applicable orders and execute accordingly
Is execution guaranteed? Should some sort of database flag be implemented to indicate a notification has been sent, and then perhaps fetch all due orders that are unflagged as some sort of failsafe?
Queued tasks:
Task is scheduled on creation of the order? If so, suppose the execution time is changed. How to modify the scheduled task? You'd need to somewhere keep track of the task ID?
Or perhaps a cron job that mass schedules applicable tasks every day?
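As an illustration of the cron option with a "sent" flag, here is a rough sketch. The rule names, the in-memory orders dict, and the sent set are hypothetical stand-ins for the Order model and a database flag. Because due times are derived from the current execution_datetime on every tick, a changed execution_datetime is picked up automatically.

```python
from datetime import datetime, timedelta

# (notification name, offset relative to execution_datetime)
RULES = [
    ("email_carrier", timedelta(hours=-12)),
    ("sms_customer", timedelta(hours=-3)),
    ("email_customer", timedelta(hours=1)),
]


def due_notifications(orders, sent, now):
    """Return (order_id, name) pairs that are due and not yet flagged as sent.

    orders: dict of order_id -> execution_datetime
    sent:   set of (order_id, name) pairs already delivered
    """
    due = []
    for order_id, execution_datetime in orders.items():
        for name, offset in RULES:
            if (order_id, name) not in sent and execution_datetime + offset <= now:
                due.append((order_id, name))
    return due


def run_cron_tick(orders, sent, now, send):
    """One cron run: send everything that is due, then flag it."""
    for order_id, name in due_notifications(orders, sent, now):
        send(order_id, name)
        # Flag only after send() returned, so a failed send is retried
        # on the next tick.
        sent.add((order_id, name))
```

Flagging after sending means a crash between the two steps can produce a duplicate notification; flagging first would instead risk a lost one.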
I look forward to your suggestions.
Great question! I am interested in this discussion. Let me chip in with a scenario from my personal experience.
In my application, I have a Listing model with a promotion_ends_at column. Obviously, the listing promotion ends sometime in the future.
So, like you also mentioned, there are two ways to do this.
1. When the listing is created, I could queue a job that will end the promotion on the listing in the future. The delay of that job would be the time until the promotion has to end (and that could be months away).
2. I could also have a cron job that runs regularly and handles the listings whose promotions should end on a specific date.
We were using SQS as our queue service, and since the maximum delay on SQS is 15 minutes, option 1 was not feasible. We then moved to Redis, where we could easily queue jobs with a long delay.
However, like you also said, the promotion_ends_at column could be updated during that time. So either you would have to keep track of the job to de-queue it, or you could re-check whether the job should still run when it is about to execute.
For example, you could fresh() the model and check whether your condition is still valid. In my case, I would fresh my Listing and check if the promotion_ends_at is in the past. However, this means that we would have a lot of stale jobs that would probably be discarded anyway.
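A small sketch of that re-check, with a plain dict standing in for the Listing table (the dict layout and the function name are hypothetical). The job body reloads the current state and does nothing if the end date has moved into the future since the job was queued.

```python
from datetime import datetime


def end_promotion_if_due(listings, listing_id, now):
    """Delayed-job body: act only if the promotion has really ended.

    Returns True when the promotion was ended, False when the job was stale.
    """
    # Stand-in for reloading the model from the database.
    listing = listings.get(listing_id)
    if listing is None:
        return False  # the listing was deleted in the meantime
    if listing["promotion_ends_at"] > now:
        return False  # the end date was pushed back, so discard this job
    listing["promoted"] = False
    return True
```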
We finally went with a simple cron job that mass schedules the jobs on the day they need to be run. I also think that delaying jobs is business logic, and maybe the queue shouldn't be held responsible for running jobs delayed far into the future.

Update operation concurrency on multiple nodes

I have a single application deployed on two different nodes in the cloud. The application has a scheduler that triggers every 5 minutes and performs some update operation in the database. How can I prevent the two instances from causing anomalies in the database? Is there a way for one instance to know that the other instance has already been triggered, or any sort of inter-node communication available in Cloud Foundry?
Many Thanks
A couple options come to mind for Cloud Foundry:
Create a distributed "lock" with your database. This could be as simple as a table or record in the DB that the scheduler checks out first before it does anything else. Once it has the lock, the scheduler can work. If it fails to obtain the lock, it goes back to sleep. Then when it's done, it returns the lock.
If you have lots of work to do, you could divide it into sections and have locks for each section, that way you could spread the work out across your different instances. This gets more complicated though, so you'd have to weigh the advantages against the extra complication to see if it's worth it for your use case.
Only run the scheduler on the first node. You can determine the first node by looking at your application instance number, using either the env variable CF_INSTANCE_INDEX or VCAP_APPLICATION, which contains JSON and has an instance_index property. For either option, the value will be 0 for the first instance. If it's 0, the scheduler runs. If it's greater than zero, the scheduler doesn't run.
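The instance-index check could look something like this. The function names are made up for illustration, and treating a missing index (for example when running outside Cloud Foundry) as "do not run" is an assumption, not something the platform dictates.

```python
import json
import os


def instance_index(environ=os.environ):
    """Read the instance index from CF_INSTANCE_INDEX or VCAP_APPLICATION."""
    raw = environ.get("CF_INSTANCE_INDEX")
    if raw is not None:
        return int(raw)
    vcap = environ.get("VCAP_APPLICATION")
    if vcap:
        index = json.loads(vcap).get("instance_index")
        return None if index is None else int(index)
    return None  # neither variable is set


def should_run_scheduler(environ=os.environ):
    # Only the first instance (index 0) runs the scheduler.
    # Assumption: if no index can be found, the scheduler stays off.
    return instance_index(environ) == 0
```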
Hope that helps!

Ruby on Rails, Resque

I have a Resque job class that is responsible for producing a report on user activity. The class queries the database and then performs numerous calculations/data parsing to send out an email to certain people. My question is, should Resque jobs like this, which have numerous methods (200 lines or so of code), be filled with all class methods and respond to the single ResqueClass.perform method? Or should I be instantiating a new instance of this Resque class to represent the single report that is being produced? If both approaches properly calculate the data and email it, is there a convention or best practice on how it should be handled for background jobs?
Thank You
Both strategies are valid. I generally approach this from the perspective of concurrency. While your job is running, the resque worker servicing your job is busy, so if you have N workers and N of these jobs running, you're going to have to wait until one is done before anything else in the queue gets processed.
Maybe that's ok - if you just have one report at a time then you in effect will dedicate one worker to running the report, your others can do other things. But if you have a pile of these and it takes a while, you might impact other jobs in your queue.
The downside is that if your report dies, you may need logic to pick up where you left off. If you instantiate the report once per user, you'd simply need to retry the failed jobs - no "where was I" logic is required.

Infinite loop or repetitive run for daemon

Which is the better way to write a "daemon" based on Oracle schedules:
One that is started once, then runs in an infinite loop and sleeps for 5 seconds if there is nothing to do (so as not to waste CPU cycles).
One that is started, checks whether there is something to do and, if not, ends execution and is run again after 5 seconds by the schedule.
Which one do you prefer, and why? Or maybe there is another implementation?
I personally prefer an infinite loop to a scheduled task. With an infinite loop you can keep a broader overview across activations, e.g. you can very easily count the number of failures in a row (or similar) and add error recovery.
A scheduled task is effectively stateless unless you manually give it state (File/Db/???)
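To illustrate the state an infinite loop can carry between passes, here is a rough sketch that counts consecutive failures and gives up after a threshold. The function name, the poll callback, and the parameters are all hypothetical; the sleep and max_iterations arguments exist only to make the loop testable.

```python
import time


def run_daemon(poll, max_consecutive_failures=3, idle_sleep_s=5.0,
               sleep=time.sleep, max_iterations=None):
    """Call poll() forever; poll() returns True if it did work, False if idle.

    The failure counter survives across passes, which a stateless
    scheduled task would have to persist somewhere itself.
    """
    failures = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            did_work = poll()
            failures = 0  # any success resets the streak
        except Exception:
            failures += 1
            if failures >= max_consecutive_failures:
                raise  # too many failures in a row, so stop the daemon
            did_work = False
        if not did_work:
            sleep(idle_sleep_s)  # nothing to do, so avoid burning CPU
    return iterations
```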
It sounds like you might want to look at using a queue to do the processing rather than a scheduled job. The process can block on the queue waiting for new work.
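A minimal sketch of a worker that blocks on a queue, using Python's in-process queue.Queue purely as a stand-in for whatever queue the daemon would really read from. The None sentinel used to stop the worker is a convention chosen for this example.

```python
import queue
import threading


def worker(q, handle):
    """Process items until a None sentinel arrives."""
    while True:
        item = q.get()  # blocks without polling until work is available
        try:
            if item is None:
                break  # sentinel: shut down cleanly
            handle(item)
        finally:
            q.task_done()


if __name__ == "__main__":
    q = queue.Queue()
    t = threading.Thread(target=worker, args=(q, print))
    t.start()
    q.put("job 1")
    q.put(None)
    t.join()
```

There is no sleep interval to tune here: the worker wakes up as soon as something is put on the queue.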

Resources