Repeated tasks - spawn new processes or run continuously? - methods

We have about 10 different Python scripts that download data from the web, read data from a database and write data back to that database. They do so repeatedly every 10 seconds (or 10 seconds after the last task has completed).
The question is, what is the best approach at running these tasks? I can think of a few ways:
a while True that runs the task then sleeps for the interval. It could be guarded by a watchdog like supervisord, making sure it is always up.
having the script execute the task just once, and invoking the script externally once every 10 seconds by another process.
having the script execute the task lets say for 1 hour (every 10 seconds for an hour), and having a watchdog make sure that task runs again once the hour is over.
I would like to avoid long running processes that actually do something because I don't want to deal with memory problems etc over long periods of time.
Additional Information
The scripts are different because they each retrieve data from a different source, and query, calculate and insert different data into the database.
The tasks are performed every 10 seconds since the data being retrieve is in real-time, and we need to not only keep updating it very frequently, but also keep all the historical data in the database.
There are a lot of resources being used by the scripts - MySQL connections, HTTP connections, Redis connections, etc. We have encountered issues with using the long-running approach before, specifically with MySQL connections (things like MySQL server has gone away, even though all connections had been closed). Hence the inclination toward having the scripts run in shorter periods of time.
What are some common approaches at this?

Unless your scripts somehow leak memory (quite unlikely), they should all be the same. So, for sheer simplicity (your time programming/debugging is much more expensive than a few miliseconds of the machine's time, even each 10 seconds!) I'd go for the single script that checks each 10 seconds.
OTOH, checking each 10 seconds sounds like busywork. Can't you set up so that whatever you are monitoring tells you when there are changes? Or batch the records up so you can retrieve, say, a day's worth at at time?

If you are running on linux, cron has granularity of a minute. We have processes we run constantly. Rather than watch them, the script will open a semaphore that gets released when the program finishes normally or not. This way if it runs long and it gets called again by cron, the copy will exit when it can't get the lock. This way you can call it a often as you need to without it stepping on a possibly still running copy.

Related

SOS-Berlin JobScheduler process queue logic

We're running into an issue with the SOS-Berlin JobScheduler running on Windows that is difficult to diagnose* and I would appreciate any guidance.
*Difficult because I don't know Scala (though I do know C++ and Java). It's difficult to navigate this code-base (some of it's in German).
We have a process-class called Foo, that will sometimes burst up outside the limit of how many processes can be run. So, for example, we limit the process-class to 30 processes and 60 want to run. This leaves 30 running and 30 "waiting for process."
The problem is that JobScheduler doesn't seem to prioritize the 30 that are waiting for a process. Instead, any new job that gets fired after the burst receives processes, leaving some jobs waiting indefinitely. Once the number of jobs "waiting for process" hits zero, the jobs clear out immediately.
Further, it seems that when there are a large number of jobs "waiting for process," the run time for tasks doubles or triples. A job that normally takes 20 seconds to run, will spike to 1-2 minutes, further amplifying the issue as processes are not released back to the pool.
Admittedly, we're running an older version of JS, which we're planning to upgrade this/next week. However, I'm wondering if there is something fundamental we're missing. We've turned down the logging, looked for DB locks, added memory to the heap, shut-down some other processes on the server. We've also increased the process pool, but we don't want to push it too far, lest we crush the server. Nothing seems to be alleviating the issue.
Any tuning help would be appreciated!
As a follow-up, we determined the cause of the issue.
Another user had been using the temp directory to store intermediate generated files. The user was not clearing out these files, resulting in 100's of thousands of files in the directory. They were not very large so we didn't notice. For some reason Job Scheduler started to choke based on this. I'm not clear on the reasons.
Clearing the temp directory, scolding the user, and fixing his script fixed the issue.

How to keep webserver responsive while executing many asynchronous background tasks

I am working on a web application that provides its users to optionally execute long-running processes 'in background'. An example would be some long-running report generation, or deleting thousands of objects simultaneously.
I've implemented this using an ExecutorService defined as FixedThreadPool using a ThreadFactory. The ThreadFactory is built like this:
ThreadFactoryBuilder()
.setNameFormat(clientId + "-BackgroundTask-%d")
.setDaemon(true)
.setPriority(Thread.MIN_PRIORITY)
.build()
I execute the task like this:
Future<TaskStatus> future = clientExecutors.get(clientId).submit(
backgroundTask::execute);
taskFutures.put(backgroundTask.getTaskId(), future);
How can I enforce my webserver to always priorize handling new incoming requests (as fast as possible) over executing background tasks?
In other words: It should never ever happen, that a user has to wait long time while browsing the site, just because there are a lot of background-tasks executing. As you can see from above, I tried to do this by setting .setPriority(Thread.MIN_PRIORITY). However that does not seem to be sufficient.
Furthermore, as for now, I've set some arbitrary value for the FixedThreadPool size (10) and use it globally for the entire background-handling of the application (and all its customers).
Instead I would like to define a threadpool for each customer, to make sure each customer has the same privilege to run a certain amount of tasks in the background. Say, each customer has a FixedThreadPool of size 5, and on the server I'll have a max. of 50 different customers. That would add up to 250 running background tasks at the same time.
The most important requirement here is: it does not matter, how long these background-tasks need to execute (say 2 minutes, or 20 minutes). What is important, is that each customer has the ability to send 5 tasks to be executed in background, and each of those are worked on equally.
I've tested running 30 cpu-intensive background tasks and it turns out that while these are running and cpu is near 100%, new incoming requests take a very long time to be handled.
So obviously, I am doing it wrong.
Update 12.09.2017
I've read about microservices and while it sounds great I see a great challenge in splitting the necessary parts from our monolithic application. Mostly because nearly every operation might turn into a long running process given a big enough data selection.
Furthermore, wouldn't I run into the same problem with my microservice, i.e. the server running the microservice would suffer the same performance degradation. Well the only good thing would, that the rest of the web app would not suffer from it anymore.
I've read some posts about introducing Thread.sleep(1) or Thread.sleep in general into CPU-heavy operations to reduce the amount of CPU used in these operations. I've also read about someone who introduced this as an aspect so that he can even change the amount of time waited dynamically in order to have some control about how much cpu would be used.
However, my gut tells me that ain't right either. What do you think about introducing Thread.sleep to lower the amount of CPU used for a task? Is this common practice? If not, what would be the right approach?
I would highly consider changing your system architecture to offload these long-running requests to a separate instance instead of running them in-process with the general request-service application. In general I think it is an anti-pattern to handle both batch / online (or long / short running) processing in the same application instance.
Ideally you'd build a standalone microservice to handle these requests, but you could also simply just deploy X instances of your existing application, and configure your load balancer to route requests to the long running invocation paths (e.g. POST /myapp/longrunningjob) only to the instances dedicated to running these long-running processes.

Bash script - Maintain multiple instances running

How can I ensure, that multiple instances of certain program are always running?
Let's say that I want to make sure that 4 instances of a certain program are always running.
If one instance is killed, new one should start.
If 5 instances are running, one should be killed.
This is not really a shell question, because the approach is the same, whichever shell you are using.
I think the cleanest solution is to have a "watchdog", which checks the running processes (using ps) and, if necessary, starts a new one or kills an unnecessary one.
One way - which I have used in a similar situation - is to write a cron job, which regularly (say: every 5 minutes) starts the watchdog and let it do his work.
If such an interval is too long for your case (i.e. if you need checking it more often than every minute), you could have the watchdog run continuously, in a loop. Still, you will need a cron job, which controls in turn the watchdog from time to time - just in case the watchdogs dies. In this case you might consider running it as a daemon.

how to implement custom cloud worker

I am designing a cloud app and need a worker process which scours my database looking for work, and then performs it.
Most of the info I seem to find on the subject of background tasks in the cloud involves some kind of scheduler and/or queuing system.
What I have doesn't quite fit into the "run this task every 5 minutes" or "add this to the queue to be executed later" models. I think the main difference to my problem is that the workers themselves find work to do, rather than being assigned it by a periodic scheduler or an external process that generates work.
What I have is basically a giant table where each entry has three fields:
job: a small task to be performed, lets say it gets the last message from a twitter account and stores it in the database
the interval at which to perform that job: say every 5 minutes, N.B. the interval is arbitrary and different for each entry in the table
the last date when the job was performed
The way I would implement this is to have a worker which has an infinite loop. When it enters the loop, it scours the database a)looking for items whose date + interval < currentTime, b)when it finds one, it sets date = currentTime, and c)then executes the job. If there is no work ATM, it sleep for a few seconds, then tries again.
I will have many parallel workers scouring the database simultaneously, which is why I do b) first and then c) in the paragraph above. Since there are parallel workers, action a) and b) are atomic operations on the database to prevent work being duplicated. If the worker crashes after a) and b), but before it manages to finish the work, it's no big deal, and the workers can just do it at the next interval; reason for this is that the work is not performed in a time-invariant system so a backlog scenario of failed jobs has no benefit as the tasks have to be performed at their exact intervals, so it's better to skip 1 interval than to have uneven intervals between which the tasks were executed.
My question is whether that is a reasonable implementation strategy? If so, how do I bring this process to life on the cloud (I am using Heroku, but may switch to EC2 in the future)? I still haven't written any code so I would welcome other suggestions (maybe I misunderstood the use cases/applications for queue systems).
This sounds so close to using something like a scheduled job that you might as well tread the well beaten path and do it the more conventional way. There's no reason why you can't schedule a job to run once every few seconds.
However, this idea of looking for work sounds dodgy. What happens if two workers find the same task to run at the same time for instance? Also, are there not triggers in the application which can indicate that work needs doing? It seems strange that you have code 'looking for work'.
You can go a very long way with simple periodic background tasks, so I would exhaust all possibilities in that area before rolling your own.

How to make Ruby run some task every 10 minutes?

I would like to do a cron job every 10 minutes, but my system only does 1 hour. So I'm looking for a method to do this. I've seen Timer and sleep but I'm not sure how to do this or even better yet a resource for achieving this.
Take a look at http://rufus.rubyforge.org/rufus-scheduler/
rufus-scheduler is a Ruby gem for scheduling pieces of code (jobs). It understands running a job AT a certain time, IN a certain time, EVERY x time or simply via a CRON statement.
rufus-scheduler is no replacement for cron/at since it runs inside of Ruby.
To do this reliably, invest in a VPS and create the 10-minute cron job as desired. Trying to emulate cron all on your own is very likely to fail in unforeseen ways.
Creating a sleeping process is not the way to go about this; if your server doesn't give you the freedom to make your own cron as you like it, you probably can't create your own background process for this sort of thing, either. You might be able to, on each request, take a look and see how many of the jobs need done (if it was 25 minutes since last request, you might have to do two), and go back and do them retroactively.
But, seriously. You need your own server to do this dependably.

Resources