Is it possible to tell Slurm to execute a specific script, for example post-script.sh, after a submitted job has completed?
Not by submitting a new job, just by running it on the login node.
Something like...
#SBATCH --at-end-run="bash post-script.sh"
Or is the only option to check every N minutes whether the job has completed?
The short answer is that there is no such option in Slurm.
If post-script.sh can run on a compute node, the best option would be:
if it is short: to add it at the end of the job submission script;
if it is long: to submit it in its own job and use the --dependency option to make it start after the first job has completed.
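For example, a minimal sketch of the --dependency approach (the job script names are illustrative):
JOBID=$(sbatch --parsable job.sh)                 # submit the main job and capture its job ID
sbatch --dependency=afterok:$JOBID post-job.sh    # post-job.sh wraps post-script.sh and starts only if the first job succeeds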
If you have root privileges, you can use strigger to run post-script.sh after the job has completed. That would run on the slurmctld server.
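For example, a hedged sketch of such a trigger (the job ID is illustrative, and the script must be available on the slurmctld host):
strigger --set --jobid=1234 --fini --program=/path/to/post-script.sh    # run the program when job 1234 finishes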
If post-script.sh must run on the login node, for external network access for instance, then the options mentioned first would work if you are able/allowed to SSH from a compute node to a login node. This is sometimes prevented/forbidden, but if not, then you can run ssh login.node bash post-script.sh at the end of the submission script or in a job of its own.
If that is not a possibility, then "busy polling" is indeed needed. You can do it in a Bash loop, making sure not to put too large a burden on the Slurm server (polling every 5 minutes is OK; every 5 seconds is useless and harmful to the system).
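For example, a minimal polling sketch to run on the login node (the job ID is illustrative):
JOBID=1234
while squeue -h -j "$JOBID" 2>/dev/null | grep -q .; do   # loop while the job is still listed in the queue
  sleep 300                                               # poll every 5 minutes
done
bash post-script.sh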
You can also use a dedicated workflow management tool such as Maestro that will allow you to define a job and a dependent task to run on the login node.
See some general information about workflows on HPC systems here.
Related
I’m trying to run a maintenance workflow on all agents.
It’s a cleaning job like “docker system prune”. I want it to run on all self-hosted agents (about 12) weekly, on Sunday night.
I noticed that workflows can run on a schedule event. This is great.
But I didn’t find a way to make all self-hosted agents run the workflow. Any suggestions?
I believe that this problem is more about your operating system(s) than about workflows on the GitHub side. It might be possible to do it as a workflow, but then your agents would need either to run on the host operating system or to have access to the Docker socket (I don't know how you are hosting them). Either way, it might be insecure, depending on whether the host is used for something else as well.
As the GitHub docs state, you are responsible for your operating system.
In general, you can schedule maintenance jobs with cron, which is probably the most widely used option. How to install it depends on your operating system.
To add a scheduled job, run crontab -e, select an editor, and add a line at the end:
0 3 * * 0 /usr/bin/docker system prune -f
to run it at 03:00 every Sunday.
However, if you really want to use workflows, you could read some docs here. They state that "Labels allow you to send workflow jobs to specific types of self-hosted runners, based on their shared characteristics." So you could create a specific maintenance job for every runner, each with a different label. This requires many scheduled jobs, since runners are not intended to run the same job multiple times.
I have a scenario where I would like a build to start running on one agent (Job 1) and then, after doing some work, run a step on a special agent pool of machines with specially licensed software (Job 2). When that is done, I'd like the rest of the build to complete on the original agent (Job 3).
I have been able to use "Variable Tools for Azure DevOps Services" to successfully pass any number of variables between agent jobs, even when they are running on different machines. It is no problem for me to pass a UNC path from Job1 to Job2 / Job3, etc.
However, what I am seeing is that no matter what I do, agent jobs always run in parallel, and there is no way to get them to run serially unless they are locked to the same agent on the same machine, which defeats the whole purpose.
Does anyone know of a way to accomplish this? Right now, in tests, I have to use "Start-Sleep" or something similar and repeatedly monitor an external event. A terribly inelegant workaround.
I found the answer. A job's properties contain a field called "dependencies". You can make jobs run serially by setting a dependency on the previous job.
In Azure DevOps, the agent job settings offer several options for dependencies and run order; you can select whichever option matches your requirements.
I recently submitted a training job with a command that looked like:
gcloud ai-platform jobs submit training foo --region us-west2 --master-image-uri us.gcr.io/bar:latest -- baz qux
(more on how this command works here: https://cloud.google.com/ml-engine/docs/training-jobs)
There was a bug in my code which caused the job to just keep running rather than terminate. Two weeks and $61 later, I discovered my error and cancelled the job. I want to make sure I don't make that kind of mistake again.
I'm considering using the timeout command within the training container to kill the process if it takes too long (typical runtime is about 2 or 3 hours), but rather than trusting the container to kill itself, I would prefer to configure GCP to kill it externally.
Is there a way to achieve this?
As a workaround, you could write a small script that runs your command, then sleeps for the amount of time you want to allow, and finally runs a cancel-job command.
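A hedged sketch of that workaround, reusing the job name from the question and assuming a 6-hour budget (adjust to your typical runtime):
JOB=foo
gcloud ai-platform jobs submit training "$JOB" --region us-west2 --master-image-uri us.gcr.io/bar:latest -- baz qux
sleep 6h                                                    # wait for the time budget to elapse
STATE=$(gcloud ai-platform jobs describe "$JOB" --format='value(state)')
if [ "$STATE" = "RUNNING" ]; then                           # cancel only if the job is still running
  gcloud ai-platform jobs cancel "$JOB" --quiet
fi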
As a timeout definition is not available in the AI Platform training service, I took the liberty of opening a Public Issue with a Feature Request to record the lack of this option. You can track the PI progress here.
Besides the script mentioned above, you can also try:
a TimeOut Keras callback, or the timeout= Optuna parameter (depending on which library you actually use)
a cron-triggered Cloud Function
What does "source job" refer to in the description of action_unknown?
action_unknown
The action to perform when the user has multiple jobs on the node
and the RPC does not locate the **source job**. If the RPC mechanism works
properly in your environment, this option will likely be relevant only
when connecting from a login node. Configurable values are:
newest (default)
Pick the newest job on the node. The "newest" job is chosen based
on the mtime of the job's step_extern cgroup; asking Slurm would
require an RPC to the controller. Thus, the memory cgroup must be in
use so that the code can check mtimes of cgroup directories. The user
can ssh in but may be adopted into a job that exits earlier than the
job they intended to check on. The ssh connection will at least be
subject to appropriate limits and the user can be informed of better
ways to accomplish their objectives if this becomes a problem.
allow
Let the connection through without adoption.
deny
Deny the connection.
https://slurm.schedmd.com/pam_slurm_adopt.html
pam_slurm_adopt will try to capture an incoming SSH session into the cgroup corresponding to a job currently running on the host. This option decides what to do when there are several jobs running for the user who initiates the ssh command.
The 'source job' is the job ID of the process that initiates the ssh call. Typically, if you use an interactive ssh session from the frontend, there is no 'source job', but if the ssh command is run from within a submission script, then the 'source job' is the one corresponding to that submission script.
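As an illustration, the SSH session in the following sketch is initiated from within a job, so that job is the 'source job' pam_slurm_adopt will look for on the target node (the two-node layout and node selection are just an assumption for the example):
#!/bin/bash
#SBATCH --nodes=2
# pick another node allocated to this job and open an SSH session to it;
# pam_slurm_adopt on that node can then adopt the session into this job's cgroup
OTHER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | tail -n 1)
ssh "$OTHER_NODE" hostname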
Is it possible for CakePHP to execute a CakePHP shell task in the background, e.g. for running long reports? I would also want to report the current status back to the user by updating a table during report generation and querying it via Ajax.
Yes, you can run shells in the background via normal system calls like
/path/to/cake/console/cake -app /path/to/app/ <shell> <task>
The tricky part is to start one asynchronously from PHP; the best option would be to put jobs in a queue and run the shell as a cron job every so often, which then processes the queue. You can then also update the status of the job in the queue and poll that information via AJAX.
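For instance, a cron entry along these lines would process the queue every five minutes (the "report process_queue" shell and task names are hypothetical):
*/5 * * * * /path/to/cake/console/cake -app /path/to/app/ report process_queue >> /var/log/report_queue.log 2>&1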
Consider implementing it as a daemon: http://pear.php.net/package/System_Daemon