I am using a load balancer, and two servers use the database to listen for jobs. The problem: when I dispatch a job and it gets picked up by one server but does not complete before the queue's retry time runs out, it starts running on the second server and fails with a message saying that the 'Job has attempted to run too many times or run for too long'. Meanwhile the first server keeps working on the job and finishes it successfully. I have tried everything, from using Queue::before to writing functions that handle the logic for a job that is in progress, but I had no luck. Maybe someone can help. Thank you in advance.
Related
I've got multiple servers sharing a database. On each of them a cron job fires every 5 minutes; it checks whether a text message log entry exists and, if it doesn't, creates the log entry and sends out a text message. I thought there would never be a situation where text messages are sent multiple times, as one server should be first.
Well - I was wrong and that scenario did happen:
A - check if log exists - it doesn't
B - check if log exists - it doesn't
A - create log
B - create log
A - send message
B - send message
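The interleaving above can be replayed deterministically. This is only a minimal sketch in Python with an in-memory SQLite table standing in for the shared database; the table name text_log, the key reminder-42 and the function name are made up for illustration.

```python
import sqlite3


def run_interleaved(key="reminder-42"):
    """Replay the A/B check-then-create sequence against one shared table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE text_log (message_key TEXT)")
    sent = []

    def log_exists():
        row = db.execute(
            "SELECT 1 FROM text_log WHERE message_key = ?", (key,)
        ).fetchone()
        return row is not None

    # Step 1: A and B both check before either one has written anything.
    missing = {server: not log_exists() for server in ("A", "B")}
    # Step 2: each server that saw no log creates one.
    for server in ("A", "B"):
        if missing[server]:
            db.execute("INSERT INTO text_log (message_key) VALUES (?)", (key,))
    # Step 3: each of those servers sends the message.
    for server in ("A", "B"):
        if missing[server]:
            sent.append(server)
    return sent
```

Calling run_interleaved() returns ['A', 'B']: both servers send, because the check and the write are two separate steps with nothing tying them together.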
I've changed this behaviour to introduce a queue, which should mitigate the issue. While the crons will still fire and multiple jobs will be queued, workers should pick up the jobs at different times, thus preventing the message from being sent twice. Though it might as well end up being:
A - pick up job 1
B - pick up job 2
A - check if log exists - it doesn't
B - check if log exists - it doesn't
And so on. Or A and B might just as well pick up the same job at exactly the same time.
The solution would be, I guess, to run one worker server. But then I have the situation that the same job is queued many times by multiple servers, and I can't check whether it is already enqueued, as that brings back the first scenario.
I'm at a loss on how to proceed here. While a setup with multiple servers and one worker server will work, I don't want to end up with instances of the same job (coming from different servers) multiple times in the queue.
Maybe the solution to go for is to have one cron/queue/worker server, but I don't have enough experience with Laravel in a multi-server environment to set it up.
The other problematic thing for me is: how do I test this? I guess I can't test it locally unless there's a way to spin up VM instances that are synchronized with each other.
The easy answer:
The code that checks the database for the existing entry could use a database transaction with an isolation level high enough to make sure that everyone else trying to do the same thing at the same time is blocked and waits for the job to finish/commit.
A really naive solution (assuming MySQL) would be LOCK TABLES entries WRITE; followed by the logic, then UNLOCK TABLES; when you're done.
This also means that no one can access the table while your job is doing the check. I hope the check is really quick, because you'll block all access to the table for a small time period every five minutes.
WRITE lock:
The session that holds the lock can read and write the table.
Only the session that holds the lock can access the table. No other session can access it until the lock is released.
Lock requests for the table by other sessions block while the WRITE lock is held.
Source: https://dev.mysql.com/doc/refman/5.7/en/lock-tables.html
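To make the "check while holding the lock" idea concrete, here is a rough sketch in Python. It uses SQLite's BEGIN IMMEDIATE as a stand-in for the MySQL table lock, which is an assumption made only so the example is self-contained; the table text_log and the helper names are hypothetical. One difference to keep in mind: in MySQL the second session would block and wait, whereas this sketch sets timeout=0 so the second connection fails fast and simply tries again later.

```python
import os
import sqlite3
import tempfile


def open_conn(path):
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    # timeout=0: raise immediately instead of waiting for the lock.
    return sqlite3.connect(path, isolation_level=None, timeout=0)


def send_once(conn, key, server, sent):
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before checking
    try:
        exists = conn.execute(
            "SELECT 1 FROM text_log WHERE message_key = ?", (key,)
        ).fetchone()
        if exists is None:
            conn.execute("INSERT INTO text_log (message_key) VALUES (?)", (key,))
            sent.append(server)  # stands in for sending the text message
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def demo():
    path = os.path.join(tempfile.mkdtemp(), "shared.db")
    a, b = open_conn(path), open_conn(path)
    a.execute("CREATE TABLE text_log (message_key TEXT)")
    sent = []

    # While A holds the write lock, B cannot start its own check.
    a.execute("BEGIN IMMEDIATE")
    try:
        b.execute("BEGIN IMMEDIATE")
        b_was_blocked = False
    except sqlite3.OperationalError:
        b_was_blocked = True
    a.execute("ROLLBACK")

    # Run one after the other: only the first one sends.
    send_once(a, "reminder-42", "A", sent)
    send_once(b, "reminder-42", "B", sent)
    a.close()
    b.close()
    return b_was_blocked, sent
```

Here demo() returns (True, ['A']): B is kept out while A holds the lock, and once B gets its turn it sees the log entry and does nothing.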
That was a really boring answer, so I'll move on to the answer you're probably more interested in...
The server architecture answer:
Your wish to only have one job per time interval in your queue means that you should only have one machine dispatching the jobs. This is most easily done with one dedicated machine that only dispatches jobs from scheduled commands. (Laravel 5.5 introduced the ability to dispatch jobs directly from the scheduler; see Scheduling Queued Jobs.)
You can then have several worker machines processing the queue, and only one of them will pick up the job and execute it. Two worker machines will never execute the same job at the same time if everything works as usual*.
I would split the web machines from the worker machines so that they can scale independently. I prefer having my web machines dedicated to web traffic; they do not process jobs, so a large backlog of queued jobs will not affect my HTTP response times.
So, I recommend the following machine types in your setup:
The scheduler - one single machine that runs the schedule and dispatches jobs.
Worker machines that handle your queue.
Web machines that handle visitors' traffic.
All machines will have identical source code for your Laravel application. They will also have an identical configuration. The only thing that is unique per machine type is ...
The scheduler has php artisan schedule:run in the crontab.
The workers have supervisor (or something similar) that runs php artisan queue:work.
The web servers have nginx + php-fpm and handle incoming web requests.
This setup makes sure that you only get one job every 5 minutes, since only one machine is pushing it. It also makes sure that the CPU load generated by the workers doesn't affect the web requests.
One issue with my answer is obvious: that single scheduler machine is a single point of failure. If it dies, none of these scheduled jobs will be dispatched to the queue any more. That touches areas like server monitoring and health checks, which are out of scope for your question and also highly dependent on your hosting provider.
Regarding that little asterisk: I can make up weird scenarios where a job is executed on several machines. This involves jobs that sleep for longer than the timeout, in an environment without support for terminating the job. The first worker will then keep executing the job (since it cannot be terminated), while a second worker will consider the job timed out and retry it.
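The asterisk scenario can be modelled in a few lines. This is a simplified model in Python, not Laravel's actual queue code: the field names reserved_at and attempts, and the parameters retry_after and max_tries, are assumptions chosen to mirror the idea that a reserved job becomes available again once its reservation is older than the retry time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    id: int
    attempts: int = 0
    reserved_at: Optional[float] = None
    done: bool = False


def try_reserve(job, now, retry_after, max_tries):
    """Return 'run', 'busy' or 'failed' for a worker polling at time `now`."""
    if job.done:
        return "busy"
    # A reservation only protects the job until it is retry_after seconds old.
    if job.reserved_at is not None and now < job.reserved_at + retry_after:
        return "busy"
    # The reservation expired (or never existed): this counts as a new attempt.
    if job.attempts >= max_tries:
        return "failed"  # "attempted too many times or run too long"
    job.attempts += 1
    job.reserved_at = now
    return "run"
```

With retry_after=90, max_tries=1 and a job that needs 120 seconds: worker A gets 'run' at t=0, worker B gets 'busy' at t=60, and worker B gets 'failed' at t=95 even though A is still happily working and will finish at t=120. If retry_after is raised above the longest time the job can take (say 180), B keeps getting 'busy' until A is done, which is why the retry time is usually kept longer than the job timeout.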
Since Laravel 5.6+ you can ensure your scheduled tasks only run on a single instance using the onOneServer function e.g.
$schedule->command('loggingTask')
    ->everyFiveMinutes()
    ->onOneServer();
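This requires a shared cache that every server connects to, because the feature seems to rely on a mutual exclusion lock held in that cache (the Laravel 5.6 documentation names the memcached and redis cache drivers; with Redis it is probably a RedisLock).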
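The general idea behind such a lock can be sketched as "store the key only if nobody else has stored it yet". The Python below illustrates that idea only and is not Laravel's implementation: SharedCache is a hypothetical in-process stand-in for the central cache, and the key format and the slot value are made up. A real cache lock would also expire after some time, which is not modelled here.

```python
import threading


class SharedCache:
    """Stand-in for a central cache (for example Redis) shared by all servers."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()

    def add(self, key, value):
        """Store value only if key is absent; True means this caller won."""
        with self._guard:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def run_scheduled_task(cache, server, task_name, slot, ran):
    # slot identifies the scheduled tick, so each 5-minute run gets its own key.
    if cache.add(f"{task_name}:{slot}", server):
        ran.append(server)  # only the winner executes the task


def demo():
    cache, ran = SharedCache(), []
    threads = [
        threading.Thread(
            target=run_scheduled_task,
            args=(cache, server, "loggingTask", "12:05", ran),
        )
        for server in ("A", "B", "C")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ran
```

However the three "servers" are scheduled, demo() returns a list with exactly one entry, because the add operation is atomic and the shared cache is the single place where the decision is made.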
Using a queue you shouldn't really have such a problem because popping a task off a queue should be an atomic operation.
Source
I have a Laravel application where the Application servers are behind a Load Balancer. On these Application servers, I have cron jobs running, some of which should only be run once (or run on one instance).
I did some research and found that people seem to favor a lock-system, where you keep all the cron jobs active on each application box, and when one goes to process a job, you create some sort of lock so the others know not to process the same job.
I was wondering if anyone had more details on this procedure in regards to AWS, or if there's a better solution for this problem?
You can build distributed locking mechanisms on AWS using DynamoDB with strongly consistent reads. You can also do something similar using Redis (ElastiCache).
Alternatively, you could use Lambda scheduled events to send a request to your load balancer on a cron schedule. Since only one back-end server would receive the request that server could execute the cron job.
These solutions tend to break when your autoscaling group experiences a scale-in event and the server processing the task gets deleted. I prefer to have a small server, like a t2.nano, that isn't part of the cluster and schedule cron jobs on that.
Check out this package for Laravel implementation of the lock system (DB implementation):
https://packagist.org/packages/jdavidbakr/multi-server-event
Also, this pull request solves this problem using the lock system (cache implementation):
https://github.com/laravel/framework/pull/10965
If you need to run stuff only once globally (so not once on every server) and 'lock' the thing that needs to be run, I highly recommend using AWS SQS because it offers exactly that: run a cron to fetch a ticket. If you get one, parse it. Otherwise, do nothing. So all crons are active on all machines, but tickets are 'in flight' when some machine requests a ticket and that specific ticket cannot be requested by another machine.
Is it possible to somehow view a list of completed Sidekiq jobs, for example to find all PurchaseWorkers with params (1)? Yesterday a delayed method in my app that was supposed to run didn't, and the associated entity (let's say 'purchase') got stuck in limbo with state "processing". I am trying to understand what the reason is: either the job wasn't enqueued at all, or it was enqueued but for some reason exited unexpectedly. There were no errors in the Sidekiq log.
Thanks.
This is old but I wanted to see the same thing since I'm not sure if jobs I scheduled ran or not!
Turns out, Sidekiq doesn't have anything built in to see jobs that completed and still doesn't seem to.
If it err'd and never completes it should be in the 'dead' queue. But to check that something actually ran seems to be beyond Sidekiq by default.
The FAQ suggests installing 3rd party plugins to track and log information: https://github.com/mperham/sidekiq/wiki/FAQ#how-can-i-tell-when-a-job-has-finished One of them allows for having a callback to do follow up (maybe add a record for completed jobs elsewhere?)
You can also set up Sidekiq to log somewhere other than STDOUT (the default), so you can output log information about your jobs: in this case, logging that a job completed, or catching errors if for some reason a failing job never lands in the retry or dead queue. See https://github.com/mperham/sidekiq/wiki/Logging
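The "add a record for completed jobs elsewhere" idea could look roughly like this. It is a language-neutral sketch written in Python rather than Ruby and does not use any Sidekiq API; the completed_jobs table and the run_and_record and find_jobs helpers are hypothetical names.

```python
import json
import sqlite3


def make_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE completed_jobs (worker TEXT, args TEXT, status TEXT)")
    return db


def _record(db, worker_name, args, status):
    db.execute(
        "INSERT INTO completed_jobs (worker, args, status) VALUES (?, ?, ?)",
        (worker_name, json.dumps(list(args)), status),
    )
    db.commit()


def run_and_record(db, worker_name, func, *args):
    """Run the job body and leave a row behind saying how it ended."""
    try:
        func(*args)
    except Exception as exc:
        _record(db, worker_name, args, f"failed: {exc}")
        raise  # let the queue's own retry handling see the error
    _record(db, worker_name, args, "completed")


def find_jobs(db, worker_name, *args):
    rows = db.execute(
        "SELECT status FROM completed_jobs WHERE worker = ? AND args = ?",
        (worker_name, json.dumps(list(args))),
    ).fetchall()
    return [row[0] for row in rows]
```

After run_and_record(db, "PurchaseWorker", some_function, 1), a call to find_jobs(db, "PurchaseWorker", 1) tells you whether that job ever ran to the end. An empty result then points towards "never enqueued or never picked up" rather than "ran and died".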
To see jobs still in queue you can use the Rails console and look at the queue by queue name https://www.rubydoc.info/gems/sidekiq/Sidekiq/Queue
One option is the default stats provided by Sidekiq - https://github.com/mperham/sidekiq/wiki/Monitoring#using-the-built-in-dashboard
The best option is to use the Web UI provided here - https://github.com/mperham/sidekiq/wiki/Monitoring#web-ui
I'm creating a mechanism in my web server whereby a scheduled task will execute every 15 minutes and notify users if any activity has occurred within that time frame. It would work as follows:
Annotate a method with @Scheduled and schedule it to run every 15 minutes
When the task runs, scrape the database for any changes within 15 minutes of the current time
A couple problems I can see:
If I have to restart the server and it's down for longer than 15 minutes, I would need to look back longer than 15 minutes so that no activity is missed.
I'm running a number of Tomcat servers and only one of them needs to execute the task. Otherwise, duplicate emails will be sent to users.
Has anyone dealt with this before? I'm thinking that this should really be a task external to the web servers... that would solve the issue of duplicate emails being sent, but it wouldn't solve the server bounce issue.
Any ideas on how to solve would be greatly appreciated!
I would have done the following steps to perform the scheduling:
On application startup, query the database for tasks (only those which don't have the dirty flag set to false) and schedule them.
On each run of a scheduled task, set a dirty flag to indicate that the task has run.
Because I will be retrieving only those tasks which are marked as dirty, the issue of multiple emails should not occur even on server startup.
From gearman's main page, they mention running with multiple job servers so if a job server dies, the clients can pick up a new job server. Given the statement and diagram below, it seems that the job servers do not communicate with each other.
Our question is what happens to those jobs that are queued in the job server that died? What is the best practice to have high-availability for these servers to make sure jobs aren't interrupted in a failure?
You are able to run multiple job servers and have the clients and workers connect to the first available job server they are configured with. This way if one job server dies, clients and workers automatically fail over to another job server. You probably don't want to run too many job servers, but having two or three is a good idea for redundancy.
Source
As far as I know there is no proper way to handle this at the moment, but as long as you run both job servers with permanent queues (using MySQL or another datastore - just don't use the same actual queue for both servers), you can simply restart the job server and it'll load its queue from the database. This will allow all the queued tasks to be submitted to available workers, even after the server has died.
There is, however, no automagical way of doing this when a job server goes down, so if both the job server and the datastore go down (for example, a server running both locally goes down), the tasks are left in limbo until it gets back online.
The permanent queue is only read on startup (and inserted / deleted from as tasks are submitted and completed).
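The behaviour described here (insert on submit, delete on completion, read only at startup) can be sketched like this. It is a toy model in Python with SQLite as the datastore, not gearmand's actual persistence code; the JobServer class and the queue table are hypothetical.

```python
import os
import sqlite3
import tempfile
from collections import deque


class JobServer:
    """Toy job server with a persistent queue in its own datastore."""

    def __init__(self, store_path):
        self.db = sqlite3.connect(store_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, payload TEXT)"
        )
        self.db.commit()
        # The persistent queue is read only here, at startup.
        self.pending = deque(
            self.db.execute("SELECT id, payload FROM queue ORDER BY id")
        )

    def submit(self, payload):
        cur = self.db.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))
        self.db.commit()
        self.pending.append((cur.lastrowid, payload))
        return cur.lastrowid

    def complete(self, job_id):
        self.db.execute("DELETE FROM queue WHERE id = ?", (job_id,))
        self.db.commit()
        self.pending = deque(job for job in self.pending if job[0] != job_id)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "queue.db")
    server = JobServer(path)
    first = server.submit("resize-image")
    server.submit("send-email")
    server.complete(first)
    server.db.close()  # the job server "dies" with one job still queued
    restarted = JobServer(path)  # restart against the same datastore
    return list(restarted.pending)
```

demo() returns [(2, 'send-email')]: the unfinished job is back in the in-memory queue after the restart. It also shows the limitation from the answer: until somebody restarts the server against its datastore, nothing hands that job to a worker.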
I'm not sure about the complexity required to add such functionality to gearmand, or whether it's actually wanted, but simple "task added, task handed out, task completed" notifications between servers shouldn't be too complicated to handle.