My example DAG:
Task1-->Task2-->Task3
I have a pipeline with a BashOperator task that should not stop (at least for a few hours).
Task1: It watches a folder for zip files and extracts them to another folder
#!/bin/bash
inotifywait -m /<path> -e create -e moved_to |
while read -r dir action file; do
    echo "The file '$file' appeared in directory '$dir' via '$action'"
    # extract only the CSV files, then delete the processed archive
    unzip -o -q "/<path>/$file" "*.csv" -d /<output_path>
    rm "/<path>/$file"
done
Task2: PythonOperator (loads the CSV into a MySQL database after cleaning)
The problem is that my task is always running because of the loop, and I want it to proceed to the next task after (execution_date + x hours).
I was thinking of changing the trigger rules of the downstream task. I have tried execution_timeout in the BashOperator, but the task then shows as failed on the graph.
What should be my approach to solve this kind of problem?
There are several ways to address the issue you are facing.
Option 1: Use execution_timeout on the parent task and trigger_rule='all_done' on the child task. This is basically what you suggested, but to clarify: Airflow doesn't mind that one of the tasks in the pipeline has failed. In your use case a timeout is a valid end state for the task, so this works, but it is not very intuitive, since people usually associate "failed" with something having gone wrong. It is understandable that this is not the preferred solution.
Option 2: Airflow has AirflowSkipException. You can set a timer in your Python code. If the timer exceeds the time you defined, then do:
from airflow.exceptions import AirflowSkipException
raise AirflowSkipException("Snap. Time is OUT")
This sets the parent task's status to Skipped, and the child task can then use trigger_rule='none_failed'. This way, if the parent task fails, it is due to an actual failure and not the timeout. A valid execution yields either a success or a skipped status.
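A minimal sketch of the timer idea in Option 2, assuming the watching logic can be moved into a Python callable. The names watch_until_deadline, poll_once, max_seconds and poll_interval are hypothetical, and the ImportError fallback is only there so the sketch runs without Airflow installed.

```python
import time

try:
    from airflow.exceptions import AirflowSkipException
except ImportError:
    # Stand-in so the sketch can run without Airflow installed (assumption).
    class AirflowSkipException(Exception):
        pass


def watch_until_deadline(poll_once, max_seconds, poll_interval=1.0):
    """Call poll_once() repeatedly, then mark the task as skipped on timeout."""
    deadline = time.monotonic() + max_seconds
    while time.monotonic() < deadline:
        poll_once()  # hypothetical: e.g. unzip any new files in the folder
        time.sleep(poll_interval)
    # Raised on purpose: Airflow records the task as skipped, not failed.
    raise AirflowSkipException("Time budget exhausted, skipping")
```

In a DAG this function would be the python_callable of a PythonOperator, with trigger_rule='none_failed' on the downstream task. Any other exception raised inside poll_once still fails the task as usual.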
Has anyone had a problem with snakemake not recognizing a timed-out job? I submit jobs to a cluster using qsub with a time-out set per rule:
snakemake --jobs 29 -k -p --latency-wait 60 --use-envmodules \
--cluster "qsub -l walltime={resources.walltime},nodes=1:ppn={threads},mem={resources.mem_mb}mb"
If a job fails within a script, the next one in line is executed. However, when a job hits the time-out defined in a rule, the next job in line is not executed, which reduces the total number of jobs running in parallel on the cluster over time. According to the MOAB scheduler (PBS server), a timed-out job returns a -11 exit status. As far as I understand, any non-zero exit status means failure. Or does this only apply to positive integers?
Thanks in advance for any hint:)
If you don't provide a --cluster-status script, snakemake internally checks job status by touching some hidden files in the submitted job script. When a job times out, snakemake (on the node) doesn't get a chance to report the failure to the main snakemake instance as qsub will kill it.
You can try a cluster profile, or just grab a suitable cluster status script (be sure to make it executable with chmod and have qsub report a parsable job id).
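As a rough illustration of what such a status script can look like, here is a sketch that assumes Torque-style "qstat -f" output containing job_state and exit_status fields; check the actual output of your scheduler before relying on it. snakemake calls the script with the job id as its first argument and expects it to print success, running or failed. The function name classify is hypothetical.

```python
import re
import subprocess
import sys


def classify(qstat_output):
    """Map `qstat -f` text to the words snakemake expects."""
    state = re.search(r"job_state\s*=\s*(\S+)", qstat_output)
    exit_status = re.search(r"exit_status\s*=\s*(-?\d+)", qstat_output)
    if state and state.group(1) != "C":
        return "running"  # queued, running, held, exiting, ...
    if exit_status and int(exit_status.group(1)) == 0:
        return "success"
    # Any non-zero exit status counts as failed, including negative
    # ones such as -11. An empty qstat answer also ends up here.
    return "failed"


def main(argv):
    job_id = argv[1]
    result = subprocess.run(
        ["qstat", "-f", job_id], capture_output=True, text=True
    )
    print(classify(result.stdout))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Note that a job the scheduler has already purged from its records is reported as failed by this sketch, which may or may not be what you want.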
When building files with rake, the build system is smart enough to tell whether or not it needs to actually run a task if e.g. the file already exists and the dependencies are not more recent.
Is there a standard way to skip other tasks? I'm thinking of something maybe like
task :containers do
  sh "docker-compose up"
end
# the following doesn't exist
task :containers, if: `docker ps | grep mycontainer`.empty?
You can use the next keyword to "skip out" of a task whenever you want, e.g.
task :containers do
  # skip the task when the container is already running
  next unless `docker ps | grep mycontainer`.empty?
  sh "docker-compose up"
end
and that won't interrupt the flow of other tasks in the queue.
Alternatively, you could just wrap your task code in an if statement inside the task definition, and maybe print something out if the condition fails.
I am trying to run a cron job that executes my shell script, which in turn runs Hive and Pig scripts. The cron job is set to run every 2 minutes, but it starts again before my shell script has finished. Will this affect my results, or will the next run start only once the script has finished executing? I am in a bit of a dilemma here. Please help.
Thanks
I think there are two ways to better resolve this, a long way and a short way:
Long way (probably most correct):
Use something like Luigi to manage job dependencies, then run that with Cron (it won't run more than one of the same job).
Luigi will handle all your job dependencies for you and you can make sure that a particular job only executes once. It's a little more work to get set-up, but it's really worth it.
Short Way:
Lock files have already been mentioned, but you can do this on HDFS too. That way it doesn't depend on where you run the cron job from.
Instead of checking for a lock file, put a flag on HDFS when you start and finish the job, and have this as a standard thing in all of your cron jobs:
# at start
hadoop fs -touchz /jobs/job1/2016-07-01/_STARTED
# at finish
hadoop fs -touchz /jobs/job1/2016-07-01/_COMPLETED
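# Then check the flags before running (run_job stands in for your actual job):
if ! hadoop fs -test -e /jobs/job1/2016-07-01/_STARTED &&
   ! hadoop fs -test -e /jobs/job1/2016-07-01/_COMPLETED; then
    hadoop fs -touchz /jobs/job1/2016-07-01/_STARTED
    run_job
    hadoop fs -touchz /jobs/job1/2016-07-01/_COMPLETED
    hadoop fs -rm /jobs/job1/2016-07-01/_STARTED
fi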
At the start of the script, have a check:
#!/bin/bash
if [ -e /tmp/file.lock ]; then
    rm /tmp/file.lock  # remove the lock file and continue
else
    exit  # no lock file exists, which means the previous execution has not completed
fi
....  # your script here
touch /tmp/file.lock  # recreate the lock file so the next run can proceed
There are many other ways of achieving the same thing; this is just a simple example.
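One of those other ways, sketched in Python: create the lock file atomically, so that checking for it and creating it cannot race between two cron runs. Note that the meaning is the reverse of the script above: here the lock file exists only while a run is in progress. The names run_exclusively, lock_path and job are hypothetical.

```python
import os


def run_exclusively(lock_path, job):
    """Run job() only if no other run currently holds lock_path."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so the
        # existence check and the creation happen in one atomic step.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # a previous run is still in progress
    try:
        os.close(fd)
        job()
        return True
    finally:
        os.remove(lock_path)  # release the lock even if job() raised
```

If the process is killed hard, the lock file is left behind and has to be removed by hand, which is the usual trade-off of file-based locks.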
How can I kill a Hudson job from a bash script when the log file doesn't change (i.e. Hudson has frozen)?
Context: I have a bash script that checks whether a log file has changed after X seconds. I want to modify it so that if the timeout is reached and there is no error in the console, which means the Hudson job has frozen, I am notified about it.
It might be easier to use the Build Timeout plugin.
Finally the solution was to use the following command:
#!/bin/bash
# if the log file does not change
if [ "something" ]; then
    kill -9 $(pidof eclipse)
fi
This kills the Eclipse instance (which calls Hudson) and continues with the build of the other elements, which is fine for my task.
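The "log file does not change" condition is left as a placeholder in the script above. One possible way to express it, as a sketch only, is to compare the file's modification time with the current time. The names log_is_stale, log_path and timeout_seconds are hypothetical.

```python
import os
import time


def log_is_stale(log_path, timeout_seconds, now=None):
    """Return True if log_path has not been modified for timeout_seconds."""
    now = time.time() if now is None else now
    # getmtime returns the last modification time in seconds since the epoch
    return now - os.path.getmtime(log_path) > timeout_seconds
```

A wrapper script could call this periodically and only kill the process, or send a notification, when it returns True.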
I have a question.
I want to run multiple instances of the same job in parallel from within a script: I have a loop in which I invoke the jobs with dsjob, without the "-wait" and "-jobstatus" options.
I want the jobs to complete before the script terminates, but I don't know how to check whether a job instance has finished.
I thought of using the wait command, but it is not appropriate.
Thanks in advance
First, make sure the job option "Allow Multiple Instance" is selected.
Second:
#!/bin/bash
. /home/dsadm/.bash_profile
INVOCATION=(1 2 3 4 5)
cd "$DSHOME/bin"
# iterate over every invocation id in the array
for id in "${INVOCATION[@]}"
do
    ./dsjob -run -mode NORMAL -wait test "demo.$id"
done
project -- test
job -- demo
$id -- invocation id
The two lines in the shell script that source .bash_profile and change to $DSHOME/bin ensure that the environment and path are set up correctly.
Run the jobs like you say without the -wait, and then loop around running dsjob -jobinfo and parse the output for a job status of 1 or 2. When all jobs return this status, they are all finished.
You might find, though, that you check the status of the job before it actually starts running and you might pick up an old status. You might be able to fix this by first resetting the job instance and waiting for a status of "Not running", prior to running the job.
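The polling loop described above could be sketched as follows. It assumes that "dsjob -jobinfo" prints a line of the form "Job Status : RUN OK (1)"; verify this against the output of your installation. The names parse_status, fetch_jobinfo and wait_for_all are hypothetical.

```python
import re
import subprocess
import time

# Per the answer above, a status of 1 or 2 means the job has finished.
FINISHED = {1, 2}


def parse_status(jobinfo_output):
    """Extract the numeric code from a line like 'Job Status : RUN OK (1)'."""
    match = re.search(r"Job Status\s*:.*\((\d+)\)", jobinfo_output)
    return int(match.group(1)) if match else None


def fetch_jobinfo(project, job_instance):
    """Run dsjob -jobinfo for one job instance and return its output."""
    result = subprocess.run(
        ["dsjob", "-jobinfo", project, job_instance],
        capture_output=True, text=True,
    )
    return result.stdout


def wait_for_all(project, instances, fetch=fetch_jobinfo,
                 interval=10.0, max_polls=360):
    """Poll until every instance reports a finished status or polls run out."""
    pending = set(instances)
    for _ in range(max_polls):
        pending = {i for i in pending
                   if parse_status(fetch(project, i)) not in FINISHED}
        if not pending:
            return True
        time.sleep(interval)
    return False  # some instance never reached status 1 or 2
```

The max_polls cap is a design choice: an instance that ends in any other status would otherwise keep the loop waiting forever.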
Invoke the jobs in a loop without the -wait or -jobstatus option.
After the loop, check the job status with the dsjob command.
Example - dsjob -jobinfo projectname jobname.invocationid
You can write one more loop for this and use the sleep command inside it.
Write your further logic according to the status of the jobs.
That said, it is better to create a Job Sequence that invokes this multi-instance job simultaneously using different invocation ids:
Create one sequence job if these are part of the same process.
Otherwise, create different sequences, or directly create different scripts that trigger these jobs simultaneously with their invocation ids, and schedule them at the same time.
Best option: create a standard, generalized script in which everything is created or gets its value from the command-line parameters.
Example - log files named on the basis of jobname + invocation-id
Then schedule the same script with different parameters or invocations.