Why are all of my Airflow dags one run behind? - etl

I'm setting up Airflow right now and loving it, except for the fact that my dags are perpetually running behind. See the picture below - this was taken on 2/19 at 15:50 UTC, and you can see that for each of the dags, they should have run exactly one more time between the last time they ran and the present time (there are a couple for which this is not true - those ones are currently turned off). Is there some piece of configuration I missed?

False alarm! Airflow just labels execution times differently than I expected. It turns out an hourly job that runs at 15:00 is labeled "14:00" and includes data from 14:00 up to 14:00+1:00.
From https://airflow.apache.org/scheduler.html:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be trigger soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let’s Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.

Execution time is the lower bound of the batch.
Ex:
Say your schedule is hourly and this is the run corresponding to the 13:00 slot.
Your execution_time will be 12:00.
This is because we usually run the batch for 12:00 - 13:00 at 13:00 (after the data for the batch is available).
But in my experience, we sometimes want to work with the time the run is actually scheduled for (because we want the run to start on schedule and there are checks inside the DAG/job that verify data readiness). In those cases, I just end up using next_execution_time (13:00) instead of execution_time (12:00).
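The labeling rule described above can be sketched with plain date arithmetic. This is not Airflow's API; the function names are hypothetical, and it assumes a fixed-length schedule interval.

```python
from datetime import datetime, timedelta


def execution_label(trigger_time: datetime, interval: timedelta) -> datetime:
    """Return the lower bound of the period covered by a run fired at trigger_time."""
    return trigger_time - interval


def covered_period(trigger_time: datetime, interval: timedelta):
    """Return (start, end) of the data window the run is responsible for."""
    start = execution_label(trigger_time, interval)
    return start, start + interval


hourly = timedelta(hours=1)
fired_at = datetime(2016, 1, 1, 15, 0)

# The run that fires at 15:00 is labeled with the start of its window.
print(execution_label(fired_at, hourly).isoformat())  # 2016-01-01T14:00:00
print(covered_period(fired_at, hourly))
```

Under this model the end of the window equals the time the run fires, which matches the "next" timestamp the answer above falls back to.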

Related

How does Laravel's task scheduling work without persisting the last completed date?

Laravel is (correctly) running scheduled tasks via the App\Console\Kernel#schedule method. It does this without the need for a persistence layer. Previously run scheduled tasks aren't saved to the database or stored in any way.
How is this "magic" achieved? I want to have a deeper understanding.
I have looked through the source, and I can see it is somewhat achieved by rounding down the current date and diffing that against the schedule frequency. Combined with the fact that the scheduler is required to run every minute, it can say with a certain level of confidence that it should run a task. That is my interpretation, but I still can't fully grasp how it guarantees running on schedule and how it handles failure or things being off by a few seconds.
EDIT Edit due to clarity issue pointed out in comment.
By "a few seconds" I mean: how does the "round down" method work when it is run every minute, but not at the same second? Example: first run 00:01:00, 00:01:02, 00:02:04
Maybe to clarify further, and to assist in understanding how it works: are there any boundary guarantees on how it functions? If run multiple times per minute, will it execute per-minute tasks multiple times in that minute?
Cron cannot guarantee precision to the second. That is why cron job intervals are generally never shorter than a minute. So, in reality, it doesn't handle "things being off by a few seconds."
What happens in Laravel is this: once the scheduling command is set up, the server asks "Is there a job due?" every minute. If there is none, it doesn't do anything.
For example, take the "daily" cron job. The scheduler doesn't need to know when it last ran the task or anything like that. When it encounters the daily cron job it simply checks whether it is midnight. If it is midnight it runs the job.
Also, take the "every thirty minutes" cron job. Maybe you registered the cron job at 10:25. The first run will still be at 10:30, not at 10:55. It doesn't care what time you registered it or when it last ran. It only checks whether the current minute is "00" or divisible by thirty. So at 10:30 it will run. Again, it will run at 11:00, and so on.
Similarly, a ten-minute cron job by default will only check whether the current minute is divisible by ten. So, regardless of the time you registered the command, it will run only at XX:00, XX:10, XX:20 and so on.
That is why by default it doesn't need to store previously run scheduled tasks. However, you can write them to a file if you want for monitoring purposes.
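The stateless check described above can be modeled in a few lines. This is a simplified sketch, not Laravel's source; the function names are hypothetical and only the minute-level fields of the clock are inspected.

```python
from datetime import datetime


def is_due_every_n_minutes(now: datetime, n: int) -> bool:
    """Due whenever the current minute is a multiple of n; seconds are ignored."""
    return now.minute % n == 0


def is_due_daily(now: datetime) -> bool:
    """Due only during the 00:00 minute."""
    return now.hour == 0 and now.minute == 0


# Registered at 10:25: nothing is stored, the clock alone decides.
print(is_due_every_n_minutes(datetime(2020, 1, 1, 10, 25, 0), 30))   # False
print(is_due_every_n_minutes(datetime(2020, 1, 1, 10, 30, 4), 30))   # True
```

Because seconds never enter the comparison, a run at 10:30:04 is treated the same as one at 10:30:00. In this model, invoking the check twice within the same minute would report the task as due both times.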

How to run a job on specific times every day from Monday to Friday in Rundeck?

Is there any way to run a job every day from Monday to Friday at 45-minute intervals from 2.45pm to 5pm? (The job should run at 2.45pm, 3.30pm, 4.15pm, and 5.00pm.) Please help.
Thank you.
I know of two ways:
One way is to create four jobs scheduled at a single time every Mon-Fri. Give each a single step, which is a job reference to the real job. For example, the first job would be scheduled to run at 2.45pm every Mon-Fri, the second at 3.30pm, etc. You're not making copies of the existing job, just single-step jobs that reference the original.
That approach lets you manage each scheduled time individually.
Another way is to schedule it every 15 minutes from 2.00pm to 5.45pm, and add an initial step that fails if the time is before 2.45pm or after 5.01pm. If you don't want to see these failures, you could add an error step to that first step which marks the job succeeded even if it's really skipping all the real steps.
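A guard for that initial step might look like the following sketch. It accepts only times that fall a whole multiple of 45 minutes after 2.45pm and no later than 5.00pm, since a 15-minute schedule also fires at 3.00pm, 3.15pm and so on. The function and constant names are hypothetical, and `now` is assumed to be a naive datetime in the server's local time.

```python
from datetime import datetime, time, timedelta

FIRST_RUN = time(14, 45)       # 2.45pm
LAST_RUN = time(17, 0)         # 5.00pm
STEP = timedelta(minutes=45)


def should_run(now: datetime) -> bool:
    """True only on Mon-Fri at 14:45, 15:30, 16:15 and 17:00 (seconds ignored)."""
    if now.weekday() > 4:      # Monday is 0, Friday is 4
        return False
    current = now.replace(second=0, microsecond=0)
    first = datetime.combine(now.date(), FIRST_RUN)
    last = datetime.combine(now.date(), LAST_RUN)
    if current < first or current > last:
        return False
    # Inside the window: accept only whole multiples of 45 minutes from 14:45.
    return (current - first) % STEP == timedelta(0)


print(should_run(datetime.now()))
```

A wrapper script could exit with a non-zero status when `should_run` returns False, so that the remaining steps are skipped.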
Paul M. Lambert
Platform Solutions Architect
Rundeck, Inc.

Scheduling a task run

I have a script that must run at a certain hour for the amount of time I specify.
I'm looking at the clockwork gem (https://github.com/tomykaira/clockwork), which seems to be the closest piece of software I might eventually use to accomplish this. Unfortunately it doesn't seem to offer a way to set a duration (start at 3PM, stop at 5PM), meaning I have to split the feature in two: starting the script is going to be clockwork's job, and stopping it is handled in the script itself with a custom solution.
Very suboptimal and messy.
How do people do this in Ruby? TIA
There is a great gem called whenever for this kind of job. With it you can set an exact time for your task, like:
every 1.day, :at => '4:30 am' do
  runner "MyModel.task_to_run_at_four_thirty_in_the_morning"
end
But you'll have to have two stages, one for starting and one for stopping your job, which in my opinion seems more natural than a job that kills itself at some point.
Somewhat janky, but there is another solution. I'm not sure what you are using to host your app, but on Heroku you can set up a scheduler to run every 10 minutes, on the hour, or daily. Then inside the method that the scheduler calls, you can determine the current time. Say you only want to run it between 3pm and 5pm, you would just wrap your code inside an if statement that verifies the current time is between 3pm and 5pm (watch out for time conversions with UTC).
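The time check inside the scheduled method could be sketched as follows. The function name is hypothetical, and the fixed UTC-5 offset is only an assumption for the example; a DST-aware zone object could be passed instead.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_window(now_utc: datetime, local_zone: tzinfo, start: time, end: time) -> bool:
    """Convert a UTC timestamp to local time and test start <= local < end."""
    local = now_utc.astimezone(local_zone)
    return start <= local.time() < end


# Assumed local zone for illustration: a fixed offset of UTC-5.
LOCAL = timezone(timedelta(hours=-5))

now = datetime.now(timezone.utc)
if in_window(now, LOCAL, time(15, 0), time(17, 0)):
    print("inside the 3pm-5pm window, doing the work")
else:
    print("outside the window, skipping")
```

Doing the conversion explicitly keeps the comparison in local time even when the host clock runs on UTC.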
Hope this helps.

Testing Quartz JDBC Job Store

I am using Quartz JDBC Job store (org.quartz.impl.jdbcjobstore.JobStoreTX) and MySQL for scheduling jobs.
I have the following setup:
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource=foo
org.quartz.dataSource.foo.driver=com.mysql.jdbc.Driver
org.quartz.dataSource.foo.URL=jdbc:mysql://localhost:3306/myDB
org.quartz.dataSource.foo.user=user
org.quartz.dataSource.foo.password=*****
org.quartz.dataSource.foo.maxConnections=5
org.quartz.dataSource.foo.validateOnCheckout=true
org.quartz.dataSource.foo.validationQuery=SELECT 1
I am able to schedule a job with Quartz picking up the job from the database when the time is due.
There are some jobs that can be scheduled up to 3 or 4 weeks in the future. How do I test this?
Right now I manually change the system time. For example, if I schedule a job to run on 2/5/2013 12:45 PM, then I change the system clock to 2/5/2013 12:43 PM and wait a couple of minutes to see if Quartz picks up the job from the DB. This works fine for me.
I don't want to change the system clock time every time I need to test. Is there a better way to do this?
I noticed that changing the system time frequently sometimes confuses Quartz, and some jobs are not picked up.
You could use the Quartz TriggerUtils methods to find out whether the future executions are as expected.
More specifically the computeFireTimes(org.quartz.spi.OperableTrigger trigg, Calendar cal, int numTimes) method returns a list of Dates that are the next fire times of a Trigger.
I hope this helps.

How to run a per second cron job every two minutes

I have to set up a cron job on my hosting provider.
This cron job needs to run every second. It's not intensive, just doing a check.
The hosting provider however only allows cron jobs to be run every two minutes. (can't change hosting btw)
So, I'm clueless on how to go about this?
My thoughts so far:
If it can only run every two minutes, I need to make it run every second for two minutes. 1) How do I make my script run for two minutes executing a function every second?
But it's important that there are no interruptions. 2) I have to ensure that it runs smoothly and that it remains constantly active.
Maybe I can also try making it run forever, and run the cron job every two minutes checking whether it is running? 3) Is this possible?
My friend mentioned using multithreading to ensure it's running every second. 4) any comments on this?
Thanks for any advice. I'm using ZF.
Approach #3 is the standard solution. For instance you can have the cron job touch a file every time it runs. Then on startup you can check whether that file has been touched recently, and if it has then exit immediately. Else start running. (Other approaches include using file locking, or else writing the pid to a file and on startup check whether that pid exists and is the expected program.)
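The touch-file variant could be sketched like this. The helper names and the heartbeat path are hypothetical; the worker is assumed to call `touch` on every iteration, and each cron-launched copy calls `recently_touched` on startup.

```python
import os
import tempfile
import time


def touch(path: str) -> None:
    """Create the file if needed and set its modification time to now."""
    with open(path, "a"):
        pass
    os.utime(path, None)


def recently_touched(path: str, max_age_seconds: float, now=None) -> bool:
    """True if the file exists and was modified less than max_age_seconds ago."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) < max_age_seconds
    except OSError:
        return False  # a missing file means no worker is alive


# Hypothetical heartbeat location for the example.
HEARTBEAT = os.path.join(tempfile.gettempdir(), "worker.heartbeat")

if recently_touched(HEARTBEAT, max_age_seconds=5):
    print("another copy is alive, exiting")
else:
    touch(HEARTBEAT)
    print("no live worker found, starting")
```

The age threshold has to be comfortably larger than one loop iteration, otherwise a slow check could let a second copy start.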
As for the one-second interval, I would suggest calling usleep at the end of your check, supplying the number of microseconds from now to when you next want to run (usleep takes microseconds, not milliseconds). If you do a regular sleep then you'll actually run less than once a second because sleeps sometimes last longer than expected, and your check takes time. As long as your check takes under a second to run, this should work fine.
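The idea of sleeping only until the next planned run, so that the check's own duration does not accumulate as drift, could be sketched as below. It is written in Python for illustration rather than PHP; the function name is hypothetical, and the clock and sleep functions are parameters only so the loop can be exercised without waiting.

```python
import time


def run_every_interval(check, iterations, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call check() `iterations` times, aiming at one call per `interval` seconds."""
    next_run = clock()
    for _ in range(iterations):
        check()
        # Schedule relative to the planned time, not to when check() finished.
        next_run += interval
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)


# Three quick iterations for demonstration; a two-minute cron slot would use
# iterations=120 with the default one-second interval.
run_every_interval(lambda: print("checking"), iterations=3, interval=0.01)
```

If a check overruns its slot, the computed delay is zero or negative and the next call starts immediately.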
I don't think cron allows second level resolution. http://unixhelp.ed.ac.uk/CGI/man-cgi?crontab+5
field          allowed values
-----          --------------
minute         0-59
hour           0-23
day of month   1-31
month          1-12 (or names, see below)
day of week    0-7 (0 or 7 is Sun, or use names)
So even if your hosting provider allowed it, cron itself could not run a job that repeats every second. However, you can use a command such as watch to execute your script repeatedly.
