I'm creating a mechanism in my web server whereby a scheduled task will execute every 15 minutes and notify users if any activity has occurred within that time frame. It would work as follows:
Annotate a with #Scheduled and schedule to run every 15 minutes
When the task runs, scrape the database for any changes within 15 minutes of the current time
A couple problems I can see:
If I have to restart the server and it's down for longer than 15 minutes, I would need to look back longer than 15 minutes so that no activity is missed.
I m running a number of tomcat servers and only one of them needs to execute the task. Otherwise, duplicate emails will be sent to users.
Has anyone dealt with this before? I'm thinking that this should really be a task external to the web servers... that would solve the issue of duplicate emails being sent, but it wouldn't solve the server bounce issue.
Any ideas on how to solve would be greatly appreciated!
I would have done the following steps to perform the scheduling:
On Application startup query for tasks from database (only those which don't have a dirty flag set to false) and schedule it.
On each run of scheduled task put a dirty flag to suggest the task has run
Because I will be retrieving those tasks only which are marked as dirty, the issue of multiple emails should not occur even on server startup.
Related
I've got multiple servers sharing a database - on each of them a cron job fires ever 5 min checking if a text message log entry doesn't exist, creates a text message log entry and sends out a text message. I thought that there would never be a situation where text messages are sent multiple times, as one server should be first.
Well - I was wrong and that scenario did happen:
A - check if log exists - it doesn't
B - check if log exists - it doesn't
A - create log
B - create log
A - send message
B - send message
I've changed this behaviour to introduce queue, which should mitigate the issue. While the crons will still fire, multiple jobs will be queued, and workers should pick up given jobs at different times, thus preventing of sending of message twice. Though it might as well end up being:
A - pick up job 1
B - pick up job 2
A - check if log exists - it doesn't
B - check if log exists - it doesn't
Etc or A and B might as well pickup the same job at exactly the same time.
The solution would be, I guess, to run one worker server. But then I've the situation that jobs from multiple servers are queued many times, and I can't check if they're already enqueued as we end up with first scenario.
I'm at loss on how to proceed here - while multiple server, one worker server setup will work, I don't want to end up with instances of the same job (coming from different servers) multiple times in the queue.
Maybe the solution to go for is to have one cron/queue/worker server, but I don't have experience with Laravel/multiserver environment to set it up.
The other problematic thing for me is - how to test this? I can't, I guess, test it locally unless there's a way I can spin VM instances that are synchronized with each other.
The easy answer:
The code that checks the database for the existing database entry could use a database transaction with a level high enough to make sure that everyone else that is trying to do the same thing at the same time will be blocked and wait for the job to finish/commit.
A really naive solution (assuming mysql) would be LOCK TABLES entries WRITE; followed by the logic, then UNLOCK TABLES when you're done.
This also means that no one can access the table while your job is doing the check. I hope the check is really quick, because you'll block all access to the table for a small time period every five minutes.
WRITE lock:
The session that holds the lock can read and write the table.
Only the session that holds the lock can access the table. No other session can access it until the lock is released.
Lock requests for the table by other sessions block while the WRITE lock is held.
Source: https://dev.mysql.com/doc/refman/5.7/en/lock-tables.html
That was a really boring answer, so I'll move on to the answer you're probably more interested in...
The server architecture answer:
Your wish to only have one job per time interval in your queue means that you should only have one machine dispatching the jobs. This is easiest done with one dedicated machine that only dispatches jobs from scheduled commands. (Laravel 5.5 introduced the ability to dispatch jobs directly from the scheduler; see Scheduling Queued Jobs)
You can then have an several worker machines processing the queue, and only one of them will pick up the job and execute it. Two worker machines will never execute the same job at the same time if everything works as usual*.
I would split up the web machines from the worker machines so that they can scale independently. I prefer having my web machines dedicated to web traffic, they are not processing jobs to make sure that any large amount of queued jobs will not affect my http response times.
So, I recommend the following machine types in your setup;
The scheduler - one single machine that runs the schedule and dispatches jobs.
Worker machines that handles your queue.
Web machines that handles visitors' traffic.
All machines will have identical source code for your Laravel application. They will also also have an identical configuration. The only think that is unique per machine type is ...
The scheduler has php artisan schedule:run in the crontab.
The workers have supervisor (or something similar) that runs php artisan queue:work.
The web servers have nginx + php-fpm and handles incoming web requests.
This setup will make sure that you will only get one job per 5 minute since there is only one machine that is pushing it. This setup will also make sure that the cpu load generated by the workers aren't affecting the web requests.
One issue with my answer is obvious; that single scheduler machine is a single point of failure. If it dies you will no longer have any of these scheduled jobs dispatched to the queue. That touches areas like server monitoring and health checks, which is out-of-scope of your question and are also highly dependant on your hosting provider.
Regarding that little asterisk; I can make up weird scenarios where a job is executed on several machines. This involves jobs that sleeps for longer than the timeout, while at the same time you've got an environment without support for terminating the job. This will cause the first worker to keep executing the job (since it cannot terminate it), and a second worker will consider the job as timed-out and retry it.
Since Laravel 5.6+ you can ensure your scheduled tasks only run on a single instance using the onOneServer function e.g.
$schedule->command('loggingTask')
->everyFiveMinutes()
->onOneServer();
This requires an APC or Redis cache to be set up because it seems to use a mutual exclusion lock, probably RedisLock if Redis is set up.
Using a queue you shouldn't really have such a problem because popping a task off a queue should be an atomic operation.
Source
I am a Newbie to WMB and We have a requirement for Kicking Off a Msg Flow at 10 P.M each night.
After alot of Googling, I have suggested 2 ways to them -
1. Use a CronJob to put a msg on the Input Q to start the flow.
2. Use Timeout Notification node.
They have declined option 1 saying that IBM doesn't support cron jobs anymore, so we can't put that on the server.
For option 2, they are still fine but I have a question - Today i deploy the flow at the exact same time when I want it to et triggered after 24 Hrs but what happens when the Server is rebooted or the Flow is Stopped and started.
Will the timer also start again from that moment and If Yes is there any Workaround this problem of rebooting or restarting that can be followed, so the flow is kicked off on the exact same time at 10 P.M irrespective even if it was redeployed or something like that.
We also have TWS in our environment, But I could not find any Integration documents or scenarios of TWS integration with IIB, Could you kindly give your valuable advices or comments - How can I reach to an efficient solution.
Thanks
Sumit
Yes, the timer will start when the flow containing the Timeout Notification starts.
A workaround could be to construct a flow that starts with a Timeout Notification, and in a Compute node connected directly to the output of the Timeout Notification, you check the CURRENT_TIMESTAMP, and only continue executing the flow if it is around the right time. To make this work the Timeout Notification should have a timeout value set, that allows for the required precision of the instant.
Use timer control node to pass specify timer to trigger
And for that particular the timer notification node will receive message and kick starts
We use Spring Batch for some long running maintenance jobs. Very occasionally a job may get stuck on database/network hiccups. Is there a way email can be sent out upon those occasions, say, if any one step takes more than 2 hours to finish, a group of people will get an email alert?
To send the mail, you can use this class from Spring : org.springframework.mail.javamail.JavaMailSenderImpl
and to check your condition, you can just implement a loop inside a org.springframework.batch.core.StepListener.
This is if u want to receive a mail AFTER a step has finished and took more than 2 hours to finish.
To receive a mail while the step is still running, and has passed the 2 hours limit, it's harder and would require some multithreading development or some external job able to monitor your main job (through org.springframework.batch.core.explore.JobExplorer).
Thanks for the replies. We end up using some other monitoring tools systematically monitoring the batch job's performance and other systems' performances.
I am wondering if there is a way to monitor these automatically. Right now, in our production/QA/Dev environments - we have bunch of services running that are critical to the application. We also have automatic ETLs running on windows task scheduler at a set time of the day. Currently, I have to log into each server and see if all the services are running fine or not, or check event logs for any errors, or check task scheduler to see if ETLs ran well etc etc... I have to do all the manually... I am wondering if there is a tool out there that will do the monitoring for me and send emails only in case something needs attention (like ETLs fail to run, or service get stopped for whatever reason or errors in event log etc). Thanks for the help.
Paessler PRTG Network Monitor can do all that. we have very good experience with it.
http://www.paessler.com/prtg/features
Nagios is the best tool for monitoring. It checks for the server status as well the defined services in it and if any service goes down or system goes down, sends the mail to specified mail id.
Refer the : http://nagios.org/
Thanks for the above information. I looked at the above options but they have a price.. what I did is an inexpensive way to address my concerns..
For my windows task scheduler jobs that run every night - I installed this tool/service from codeplex that is working great.
http://motash.codeplex.com/documentation#CommentsAnchor
For Windows services - I am just setting the "Recovery" Tab in each service "property" with actions to do when it fails. (like restart, reboot, or run a program which could be an email that will notify)
I built a simple tool (https://cronitor.io) for monitoring periodic/scheduled tasks. The name is a play on "cron" from the unix world, but it is system/task agnostic. All you have to do is make an http request to a unique tracking URL whenever your job runs. If your job doesn't check-in according to the rules you define then it will send you an email/sms message.
It also allows you to track the duration of your jobs by making calls at the beginning and end of your task. This can be really useful for long running jobs since you can be alerted if they start taking too long to run. For example, I once had a backup task that was scheduled every hour. About six months after I set it up it started taking longer than an hour to run!
There is https://eyewitness.io - which is for monitoring server cron tasks, queues and websites. It makes sure each of your cron jobs run when they are supposed to, and alerts you if they failed to be run.
Suppose I include a rather long-running startup task into my Azure role - running something like up to several minutes. What happens if the startup task runs "too long".
I'm currently testing on Compute Emulator and observe the following.
I have a 450 megabytes .zip file together with Info-Zip unzip. The startup task unzips the archive. Deployment starts and I look into Task Manager. Numerous service processes start, then unzip.exe is run. After about two minutes all those processes stop and then start anew and unzip.exe starts again.
So it looks like a deployment is allowed to run for about two minutes, then is forcefully reset and started again.
Is this the expected behavior? Does it persist on real cloud? Are there any hard limits on how long a role startup can take? How do I address this situation except moving the unpacking into RoleEntryPoint.OnStart()?
I had the same question, so tried an experiment. I ran a Startup Task - taskType="simple" so that it would block the Roles from beginning to execute - and let it run for 50 hours. The Fabric Controller did not complain and the portal did not show any error. It finished its long "do nothing" loop after the 50 hours was up, then this Startup Task exited, and my Web Role started up fine.
So my emperical test says Startup Tasks can take a long time! At least 50 hours.
This should inform the load balancer that your process is still busy:
http://msdn.microsoft.com/en-us/library/microsoft.windowsazure.serviceruntime.roleinstancestatuscheckeventargs.setbusy.aspx
I have run startup tasks that run for a pretty long time (think 20-30 mins) and the role is simply in a 'Busy' state. I don't think there is a hard limit for how long the role will stay in that state as long as the Startup task is still executing and did not exit with a non-zero return code (in fact, this is a gotcha for most first time startup task creators when they pop a prompt). The FC is technically still running just fine, so there would be no reason to 'recover' the role (i.e. heartbeats are still going).
The dev emulator just notices when the role hasn't started and warns you. If you click the 'keep waiting' option, it will continue to run the Startup task to completion. The cloud does not do this of course (warn you).
Never tried a task that ran super long, so there might be a very long limit. I seem to recall 3 hrs was a magic number in some timeout cases like role recycles, but I have never tried...
There are some heartbeats that the Azure Fabric Agent will do against the role. If these are not acknowledged (say a long-running blocking process), this could cause the role to be flagged as unavailable.
You might try putting your startup process into a background thread that runs independently. This should help you keep the role from being recycled while the process is starting up. Just keep in mind you may need to make some adjustments if you get requests before the role fully starts up. There's also a way (that I can't seem to recall ATM) to flag the role and take it out of the load balancer temporarially while your process completes.