Code updates during submission of condor jobs - cluster-computing

When using Condor to distribute jobs across a dedicated computer cluster, one first submits the jobs to the cluster and then waits for them to actually start running. Depending on multiple factors, they might stay in an idle state for quite some time, even hours.
Let us say I have just compiled the code that the jobs are going to run, and I submit the jobs via a Condor submit file. I then realize I would like to change the original code, either because there is a bug in it or because I want to try different parameters. If the recompiled code is ready while the jobs are still idle, which version is going to run on the cluster? In other words, does Condor somehow store a snapshot of the code when the jobs are submitted, or does it just pick it up when the jobs start running?
Although the first option sounds far more reasonable to me, I have evidence from my own work that the second is what actually happens.

When condor_submit is run, the executable is copied to the spool directory under the scheduler. This is called spooling. If you want to be able to change the executable after submission, probably the best thing to do is to make your submitted executable a shell script that calls the real executable, and put the real executable into the transfer_input_files list.
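To make that concrete, here is a minimal sketch of the wrapper approach. The file names (run_job.sh, my_binary) are placeholders, but executable, transfer_input_files and should_transfer_files are standard submit-description commands:

#!/bin/bash
# run_job.sh -- thin wrapper submitted as the job's executable.
# In the submit file (names here are only an example):
#   executable            = run_job.sh
#   transfer_input_files  = my_binary
#   should_transfer_files = YES
# Only this wrapper is spooled at submit time; my_binary is picked up
# from the submit directory when the job actually starts running.
chmod +x ./my_binary   # the execute bit can be lost during file transfer
exec ./my_binary "$@"

With a layout like this you can recompile my_binary while the jobs sit idle, and whatever version is in the submit directory when a job starts should be the one that runs.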

Related

Queue Task Scheduling

I have many commands scheduled in the Kernel file. Let's say they all run ->daily(): do they run one after another or all at the same time? And if they run together, how can I make them run one after another?
Kernel commands can start at the same time, even if other tasks are still running. To change that, use the withoutOverlapping() method on your command, e.g.:
$schedule->command('command:start')->withoutOverlapping();

Considerations when porting an MS VC++ program (single machine) to a Rocks cluster

I am trying to port an MS VC++ program to run on a Rocks cluster! I am not very good with Linux, but I am eager to learn, and I imagine porting it wouldn't be an impossible task for me. However, I do not understand how to take advantage of the cluster nodes, because it seems that the code only executes on the front-end server (obviously).
I have read a little about MPI and it seems like I should use MPI to communicate between nodes. The program is currently written such that I have a main thread that synchronizes all worker threads. The main thread also receives commands to manipulate the simulation or query its state. If the simulation is properly set up, communication between executing threads can be significantly minimized. What I don't understand is how I start the process on the compute nodes, and how I handle failures in nodes? And maybe there are other things I should consider when porting my program to run on a cluster?
The first step is porting the threaded MS VC++ program to run on a single Linux machine.
Once you have gotten past that point, then modify your program to use MPI in addition to threads (or instead of threads). You can do this on a single computer as well.
To run the program on multiple nodes of your cluster, you will need to submit it to whatever scheduling system your cluster uses. The command for this depends on the scheduling software installed on your Rocks cluster; ask your administrator. It may look something like mpirun -np 32 yourprogram.
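As a rough illustration only: many Rocks installations use Sun Grid Engine, in which case submission might look like the script below. The job name, the "mpi" parallel environment, and the slot count are assumptions; your site's names will differ, so confirm them with your administrator.

#!/bin/bash
# submit.sh -- example SGE-style batch script (directive names are site-specific)
#$ -N my_simulation          # job name (placeholder)
#$ -cwd                      # run from the directory you submit from
#$ -pe mpi 32                # request 32 slots in a parallel environment called "mpi"
mpirun -np $NSLOTS ./yourprogram   # NSLOTS is set by SGE to the number of granted slots

# Submitted with something like: qsub submit.sh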
Handling failures in the nodes is a broad question. Your first pass should probably just report the failure and then fail the program. If the program doesn't take too long to compute on the cluster, then restarting it, adjusting for the failed node, may be good enough. Beyond that, your application can write intermediate information to disk so it can resume where it left off. This is called checkpointing your application. Thus, when a node fails, the job fails, but restarting the job doesn't start from the beginning. Much more advanced would be trying to actually detect node failures and reschedule the work unit that was on the failed node. This assumes that the work unit doesn't have non-idempotent side effects. This sort of thing gets really complicated; checkpointing is likely good enough.

Run a process each time a new file is created in a directory in linux

I'm developing an app; the operating system I'm using is Linux. I need to run, if possible, a Ruby script on each file created in a directory, and I need to keep this script running all the time. The first thing I thought of is inotify:
The inotify API provides a mechanism for monitoring file system events. Inotify can be used to monitor individual files, or to monitor directories.
It's exactly what I need. Then I found "rb-inotify", a wrapper for inotify.
Do you think there is a better way of doing what I need than using inotify? Also, I really don't understand how I'm supposed to use rb-inotify.
I just create, for example, an .rb file with:
require 'rb-inotify'

notifier = INotify::Notifier.new
notifier.watch("directory/to/check", :create) do |event|
  # do task with the event.name file
end
notifier.run
Then I just run ruby myRBNotifier.rb, and it will keep looping forever. How do I stop it? Any ideas? Is this a good approach?
I'd recommend looking at god. It's designed for this sort of task, and makes it pretty easy to build a monitoring system for background and daemon apps.
As for the main code itself, inotify isn't cross-platform, so if there is a possibility you'll need to run on Windows or Mac OS, then you'll need a different solution. It's not too hard to write a little piece of code that checks your target directory periodically for changes. If you need to know what changed, read and cache the directory entries, then compare them the next time your code runs. Use sleep between runs to wait some period of time before looping.
The old-school method of doing similar things is to use cron to fire off a job at regular intervals. That job can be your script, which checks whether the file list has changed by comparing it to the cached version, then acts as needed if something is different.
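As a rough sketch of that cron approach (the paths, cache file name, and the process_file.rb handler are made up for illustration):

#!/bin/bash
# check_new_files.sh -- compare the current directory listing to a cached one
WATCH_DIR=/path/to/watched/dir
CACHE=/tmp/dir_listing.cache

ls -1 "$WATCH_DIR" > /tmp/dir_listing.new
if [ -f "$CACHE" ]; then
  # names present now but absent from the cached listing are new files
  comm -13 "$CACHE" /tmp/dir_listing.new | while read -r f; do
    ruby process_file.rb "$WATCH_DIR/$f"
  done
fi
mv /tmp/dir_listing.new "$CACHE"

# Example crontab entry to run it every minute:
# * * * * * /path/to/check_new_files.sh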
Just run your script in the background with
ruby myRBNotifier.rb &
When you need to stop it, find the process id and use kill on it:
ps ux
kill [whatever pid your process gets from the OS]
Does that answer your question?
If you're running on a Mac/Unix machine, look at the launchctl man page. You can set up a process to run and execute a Ruby script whenever a file changes. It's highly configurable.

Windows job scheduler still executing when time for next operation

I have a Server 2008 scheduled job that does the following:
Tests to see if the active/passive MSCS service is running on this node.
If so, maps a shared folder.
Moves a bunch of files from the shared folder to the clustered drive.
Unmaps the shared folder.
The job runs every 5 minutes. The files are produced by another system and, though not completely time critical, delaying the process much more than this is not acceptable to the business.
So far so good.
What I'm seeing is that although the script works correctly, the job history says that the second time it attempts to run the job, 'there is a previous copy of the job still running'.
Does anyone have any thoughts on:
Why this is happening?
And how to go about debugging it?
If this were Unix/Linux this would not be a problem but this is a complete mystery to me.

Windows Task Scheduler will run app only once

So my situation is that I am running an app through the Windows Task Scheduler. The app runs once a day at 1 pm, does some queries, and transfers data to an FTP site. All of that works great, except on the weekends when I am not here: the app runs and the GUI stays displayed for me to review, which seems to stop it from running on the scheduler until I shut the app down. So on Saturday it will run and remain displayed for me to review when I get back on Monday, but on Sunday, when the scheduler attempts to run it again, it will fail because the app has not been closed down.
First, let me confirm that this is how the Task Scheduler is supposed to work. Second, what are my alternatives for scheduling it to run every day while keeping the GUI displayed so that I can review it? The app can run multiple times, as each session does not interfere with the others. So if I'm gone for a week on vacation, I would expect that when I get back, seven instances of the app will have been run and will be waiting for my review.
Thanks
AGP
Your best bet is to eliminate the UI and log messages to the Event Log or a log file. The UI could be spawned from the CLI as a separate process if you prefer, but it should run as its own non-child process.
Alternatively, you could run a batch file instead of the process directly. In the batch file, invoke "START path_to_exe" instead of the EXE. That will cause the batch file to "finish" instantly, and the exe to be run in its own process. This is not a good long term solution, but will give you a temporary solution to your immediate problem.
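A minimal sketch of that batch-file workaround (the path is a placeholder; the empty "" after START is the window title, which START would otherwise mistake a quoted path for):

@echo off
REM launch.bat -- scheduled instead of the EXE itself.
REM START returns immediately, so the Task Scheduler sees the job as
REM finished while the app keeps running in its own process.
START "" "C:\path\to\your_app.exe"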
This is the default behavior of the Scheduled Task system, as it doesn't know that the job is complete until the application actually exits. Therefore, if your application is still open after 24 hours, the next run will simply be skipped because the current run is "still going" as far as the scheduler is concerned.
Personally, I would revisit the way you handle your job process, as you are setting up a scenario that will be hard to manage long term.
I recommend writing to a log file instead of displaying a UI for any output and/or errors. This way, the application can write, then exit, and you can review the log at your convenience. This is a very common solution for automated processes.
