Running Oprofile with MPI - bash

I'm having issues using Oprofile to profile a parallel program that I call via mpirun. The command I'd like to use is:
$ operf mpirun -n 4 [program and arguments]
Unfortunately, when I do this, operf starts logging, but something funny happens when the MPI program is finished - operf seems to not recognize that it's returned (MPI-spawned processes no longer appear in htop, but operf still does), and things just hang waiting for me to interrupt them.
Is there an option I can pass to operf or mpirun which will make the two play nicely together? Failing that, is there a bash trick I can use to automatically kill operf when my MPI program is finished?
Edit: Previously thought that it Oprofile wasn't always generating results, but it turns out that I was just confused and looking in the wrong location. The only problem is that operf doesn't recognize that the MPI program has terminated.

Try using this line, it works like a charm:
sudo operf mpirun --allow-run-as-root -x LD_LIBRARY_PATH="build/" -np 2 (path_to_the_file)

Related

GNU parallel kills command in bash script on HPC

I am running GNU parallel to run a bash script but it seems GNU parallel automatically kills my program and I am not sure why. It run normally when I run the inside script individually.
I wonder why this happen and how to solve it?
Your help is really appreciated!
Here is my code:
parallel --progress --joblog ${home}/data/hsc-admmt/Projects/log_a.sh -j 5 :::: a.sh
Here is the message at the end of output of GNU parallel
/scratch/eu82/bt0689/data/hsc-admmt/Projects/sim_causal_mantel_generate.sh: line 54: 3050285 Killed $home/data/hsc-admmt/Tools/mtg2 -plink plink_all${nsamp}${nsnp}_1 -simreal snp.lst1
I just had the same problem and found this possible explanation:
“Some jobs need a lot of memory, and should only be started when there is enough memory free. Using --memfree GNU parallel can check if there is enough memory free. Additionally, GNU parallel will kill off the youngest job if the memory free falls below 50% of the size. The killed job will put back on the queue and retried later.”
(from https://www.gnu.org/software/parallel/parallel_tutorial.html).
In my case the killed job was not resumed. I’m not sure if this is the reason for your problem but it would explain mine since the error only occurs to me when I parallelize my script for more than 3 jobs.

How to send input to a console/CLI program running on remote host using bash?

I have a script that I normally launch using the following syntax:
ssh -Yq user#host "xterm -e '. /home/user/bin/prog1 $arg1;prog2'"
(note: I've removed some of the complexities of the command, so please excuse if there are any syntax errors in the ssh command; it should not be relevant to the question)
This launches an xterm window that runs prog1, and after completion runs prog2. prog2 is a console-style program that performs some setup, then several seconds later waits for user input.
Is there a way via bash script (preferably without downloading external packages) that I can send data to prog2 that's running on $host?
I've looked into << and expect, but it's way over my head. My intuition is that there's probably a straightforward way of doing this, but I can't figure out what terms to search for. I also understand that I can remotely send keystrokes to a host using xdotools or something similar, but I'm hesitant to request a new package installation unless I know that's the only reasonable solution.
Thanks!

How to pause the execution of a program after 10 seconds and get a backtrace?

A legacy program most likely gets into an infinite loop on certain pathological inputs. I have >1000 such instances, however, I suspect that the vast majority of them trigger the same bug. Therefore, I would like to reduce the >1000 instances to the fundamentally different ones. The first step is to pause the application after, say, 10 seconds and collect the backtrace.
If I run:
gdb --batch --command=backtrace.txt --args ./legacy_program
with backtrace.txt
run
bt
and I hit Ctrl + C after 10 seconds in the same terminal I get exactly the backtrace I want.
Now, I would like to do that automatically. I have tried sending SIGINT (the expected equivalent of Ctrl + C) from another terminal but I do not get the backtrace anymore. Here are some of my failed attempts based on
GDB how to stop execution without a breakpoint?
Neither of these have any effect:
pkill -SIGINT gdb
kill -SIGINT 5717
where 5717 is the pid of the only gdb running. Sending SIGINT to the legacy_program the same way does kill it but then I do not get the backtrace:
Program received signal SIGINT, Interrupt.
Quit
How can I programmatically pause the execution of the legacy_program after 10 seconds and get a backtrace?
This post was motivated by my frustration not being able to find an answer to this question here at StackOverflow.
Also note that
[it is not merely OK to ask and answer your own question, it is explicitly encouraged.](https://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/)
Apparently, it is a known (bug) feature in gdb, see
GDB is not trapping SIGINT. Ctrl+C terminates program when should break gdb. Try sending SIGSTOP instead from the other terminal:
pkill -STOP legacy_program
It works on my machine.
Note that you do not have to run the legacy_program in the debugger. Enable core dumps
ulimit -c unlimited
and send the program SIGTRAP to make it crash, then get the backtrace from the core dump. So, start the program:
./legacy_program
From another terminal:
pkill -TRAP legacy_program
The backtrace can be obtained like this:
gdb --batch -ex=bt ./legacy_program core

Making qsub block until job is done?

Currently, I have a driver program that runs several thousand instances of a "payload" program and does some post-processing of the output. The driver currently calls the payload program directly, using a shell() function, from multiple threads. The shell() function executes a command in the current working directory, blocks until the command is finished running, and returns the data that was sent to stdout by the command. This works well on a single multicore machine. I want to modify the driver to submit qsub jobs to a large compute cluster instead, for more parallelism.
Is there a way to make the qsub command output its results to stdout instead of a file and block until the job is finished? Basically, I want it to act as much like "normal" execution of a command as possible, so that I can parallelize to the cluster with as little modification of my driver program as possible.
Edit: I thought all the grid engines were pretty much standardized. If they're not and it matters, I'm using Torque.
You don't mention what queuing system you're using, but SGE supports the '-sync y' option to qsub which will cause it to block until the job completes or exits.
In TORQUE this is done using the -x and -I options. qsub -I specifies that it should be interactive and -x says run only the command specified. For example:
qsub -I -x myscript.sh
will not return until myscript.sh finishes execution.
In PBS you can use qsub -Wblock=true <command>

Can a standalone ruby script (windows and mac) reload and restart itself?

I have a master-workers architecture where the number of workers is growing on a weekly basis. I can no longer be expected to ssh or remote console into each machine to kill the worker, do a source control sync, and restart. I would like to be able to have the master place a message out on the network that tells each machine to sync and restart.
That's where I hit a roadblock. If I were using any sane platform, I could just do:
exec('ruby', __FILE__)
...and be done. However, I did the following test:
p Process.pid
sleep 1
exec('ruby', __FILE__)
...and on Windows, I get one ruby instance for each call to exec. None of them die until I hit ^C on the window in question. On every platform I tried this on, it is executing the new version of the file each time, which I have verified this by making simple edits to the test script while the test marched along.
The reason I'm printing the pid is to double-check the behavior I'm seeing. On windows, I am getting a different pid with each execution - which I would expect, considering that I am seeing a new process in the task manager for each run. The mac is behaving correctly: the pid is the same for every system call and I have verified with dtrace that each run is trigging a call to the execve syscall.
So, in short, is there a way to get a windows ruby script to restart its execution so it will be running any code - including itself - that has changed during its execution? Please note that this is not a rails application, though it does use activerecord.
After trying a number of solutions (including the one submitted by Byron Whitlock, which ultimately put me onto the path to a satisfactory end) I settled upon:
IO.popen("start cmd /C ruby.exe #{$0} #{ARGV.join(' ')}")
sleep 5
I found that if I didn't sleep at all after the popen, and just exited, the spawn would frequently (>50% of the time) fail. This is not cross-platform obviously, so in order to have the same behavior on the mac:
IO.popen("xterm -e \"ruby blah blah blah\"&")
The classic way to restart a program is to write another one that does it for you. so you spawn a process to restart.exe <args>, then die or exit; restart.exe waits until the calling script is no longer running, then starts the script again.

Resources