I have a Server 2008 scheduled job that does the following:
Tests whether the active/passive MSCS service is running on this node.
If so, maps a shared folder.
Moves a batch of files from the shared folder to the clustered drive.
Unmaps the shared folder.
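For reference, the script is essentially of this shape; the drive letters, share name and paths below are placeholders rather than the real ones:

    rem movefiles.cmd : hedged sketch of the job described above
    rem If the clustered drive is not owned by this node, do nothing
    if not exist S:\ exit /b 0
    net use Z: \\sourceserver\dropfolder
    move /y Z:\*.* S:\incoming\
    net use Z: /delete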
The job runs every 5 minutes. The files are produced by another system and, although they are not completely time-critical, delaying the process much beyond this is not acceptable to the business.
So far so good.
What I'm seeing is that although the script works correctly, the job history says that the second time it attempts to run the job, 'there is a previous copy of the job still running'.
Does anyone have any thoughts on:
Why this is happening?
And how to go about debugging it?
If this were Unix/Linux it would not be a problem, but here it is a complete mystery to me.
Related
I was running some jobs with SLURM on my PC, and the computer rebooted.
Once the computer was back on, I saw in squeue that the jobs that had been running before the reboot were no longer running because of a drain state. It seemed they had been automatically requeued after the reboot.
I couldn't submit more jobs because the node was drained, so I ran scancel on the jobs that had been automatically requeued.
The problem is that I cannot free the node. I tried a few things:
Restarting slurmctld and slurmd
"undraining" the nodes as explained in another question, but no success. The commands ran without any output (I assume this is good), but the state of the node did not change.
I then tried manually rebooting the system to see if anything would change
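For reference, the restart and "undrain" attempts were roughly the following (node name as in the scontrol output below):

    sudo systemctl restart slurmctld slurmd
    sudo scontrol update nodename=neuropc state=resume
    sudo scontrol update nodename=neuropc state=undrain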
Running scontrol show node neuropc gives
[...]
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
[...]
Reason=Low RealMemory [slurm#2023-02-05T22:06:33]
Weirdly, the System Monitor shows that all 8 cores keep showing activity between 5% and 15%, whereas the Processes tab shows only one application (TeamViewer) using less than 4% of the processor.
So I suspect the jobs I was running somehow kept running after the reboot, or are still being held by SLURM.
I use Ubuntu 20.04 and Slurm 19.05.5.
To strictly answer the question: no, they cannot. They might or might not be requeued, depending on the Slurm configuration, and restarted either from scratch or from the latest checkpoint if the job is able to do checkpoint/restart. But there is no way a running process can survive a server reboot.
This answer solved my problem. Copying it here:
The cause could be that RealMemory=541008 in slurm.conf is too high for your system. Try lowering the value. Let's suppose you do indeed have 541 GB of RAM installed: change it to RealMemory=500000, do a scontrol reconfigure and then a scontrol update nodename=transgen-4 state=resume.
If that works, you could try to raise the value a bit.
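In command form, the suggested fix looks roughly like this (the node name and memory figure are the examples from the quoted answer, not values checked on my machine):

    slurmd -C                 # shows the node's actual hardware, including RealMemory
    # in slurm.conf, lower RealMemory to at most what slurmd -C reports, e.g. RealMemory=500000
    sudo scontrol reconfigure
    sudo scontrol update nodename=transgen-4 state=resume
    sinfo -N -l               # confirm the node has left the DRAIN state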
When using Condor to distribute jobs across a dedicated computer cluster, one first submits the jobs and then waits for them to actually start running. Depending on multiple factors, they might stay in an idle state for quite some time, even hours.
Let's say I just compiled the code that is going to be run in the jobs. I can submit the jobs via a Condor submission file. I then realize I would like to change the original code, either because there is a bug in it or because I want to try different parameters. If the code finishes compiling while the jobs are still idle, which version is going to be run on the cluster? In other words, does Condor somehow store a snapshot of the code when the jobs are submitted, or does it just pick it up when the jobs start running?
Although I think the first option sounds far more reasonable, I have evidence from my own work that the second is what actually happens.
When condor_submit is run, the executable is copied to the spool directory under the scheduler. This is called spooling. If you want to be able to change the executable after submission, probably the best thing to do is to make your submitted executable a shell script that calls the real executable, and put the real executable in the transfer_input_files list.
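A minimal sketch of that wrapper approach, assuming illustrative file names (run.sh, real_program and job.sub are not from the original question):

    #!/bin/sh
    # run.sh : thin wrapper submitted as the executable (this is what gets spooled)
    chmod +x ./real_program    # in case the execute bit is lost during file transfer
    exec ./real_program "$@"

    # job.sub : real_program is shipped by file transfer, so the copy that runs is
    # whichever one exists in the submit directory when the job actually starts
    executable              = run.sh
    transfer_input_files    = real_program
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    log    = job.log
    output = job.out
    error  = job.err
    queue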
Fixed: I ran an application called Application Verifier while debugging my allinone project a while ago. Application Verifier kept running in the background and kept a log every time I ran allinone.exe. I suspect the growing number of log files was the reason allinone.exe started to slow down. I was able to find that out using Process Monitor, which I didn't know about earlier. Thank you!
We're using a scheduled task to run a simple executable every 2 minutes; it has a working directory set (no quotes around the path), but other than that most options are left at their defaults.
Running the task manually by right-clicking on it and selecting Run works just fine; however, it never executes automatically. When the time comes to run the task, it just increments the "Next Run Time" field by 2 minutes and that's it. The "Last Run Time" field is always the last time the task was manually executed.
The Last Run Result is always 0x0.
I've tried setting it to run as the current user, or as an alternate user set up with administrative privileges and a stored password, but still no luck.
There doesn't appear to be anything immediately obvious in the system event log either to indicate the cause of the failure.
As a bit of background, this is a headless Win10 Pro machine (only ever accessed via LogMeIn) running control software for external hardware. It reboots every morning at 03:00 and, since it's on a physically isolated network, automatically logs in to a user account with administrative privileges and no password.
I suspect it may be a permissions issue with the insecure way the system is set up, however at this point there's little to go on.
Any ideas?
Thanks
For posterity, this turned out to be an issue with the way the repetition on the scheduled task was set up. Initially it was set up as a daily task to run every 2 minutes for 24 hours. Whilst this showed the correct next run time in the task scheduler window, for some reason it never executed - Bug?
The fix was instead to set the task up as a one time event that repeats every 2 minutes indefinitely, which seems to be working properly now.
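For reference, the working setup corresponds to roughly this schtasks command (task name and path are placeholders; /ri is the repetition interval in minutes, and /du caps the repetition duration since schtasks has no literal "indefinitely"):

    schtasks /create /tn "RunMyAppEvery2Min" /tr "C:\apps\myapp.exe" /sc once /st 00:00 /ri 2 /du 9999:59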
So my situation is that I am running an app via the Windows Task Scheduler. The app runs once a day at 1 pm; it does some queries and transfers data to an FTP site. All of that works great, except on weekends when I am not here: the app runs and the GUI remains displayed for me to review, which seems to stop the scheduler from running it again until I shut the app down. So on Saturday it will run and the app will remain displayed for me to review when I get back on Monday, but on Sunday, when the scheduler attempts to run it again, it will fail because the app has not been closed.
First, I'd like to confirm that this is how the Task Scheduler is supposed to work. Second, what are my alternatives for scheduling it to run every day while keeping the GUI displayed so that I can review it? The app can run multiple times, as each session does not interfere with the others. So if I'm gone for a week on vacation, I would expect that when I get back, 7 instances of the app have been run and are waiting for my review.
Thanks
AGP
Your best bet is to eliminate the UI and log messages to the Event Log or a log file. The UI could still be spawned from the CLI as a separate process if you prefer, but it should be launched as its own non-child process.
Alternatively, you could run a batch file instead of the process directly. In the batch file, invoke "START path_to_exe" instead of the EXE. That will cause the batch file to "finish" instantly and the EXE to run in its own process. This is not a good long-term solution, but it will address your immediate problem.
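A minimal version of that batch file would look something like this (the path is a placeholder; the empty quotes are the window title that START expects when the program path itself is quoted):

    @echo off
    rem launch.cmd : returns immediately; the app keeps running in its own process
    start "" "C:\path\to\yourapp.exe"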
This is the default behavior of the Scheduled Task system, as it doesn't know that the job is complete until the application actually exits. Therefore, if your application is still open after 24 hours, the next run will simply be skipped because the current run is "still going" as far as the scheduler is concerned.
Personally, I would revisit the way you handle your job process, as you are setting up a scenario that will be hard to manage long-term.
I recommend writing to a log file instead of displaying a UI for any output and/or errors. This way, the application can write, then exit, and you can review the log at your convenience. This is a very common solution for automated processes.
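If the application can be made to run without its GUI and to report on standard output (an assumption about your app, not something the scheduler gives you for free), even a simple redirect in a small batch file run by the task gives you a reviewable log:

    rem runjob.cmd : run the app headless and append everything it prints to a log
    "C:\apps\yourapp.exe" >> "C:\logs\yourapp.log" 2>&1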