makePSOCKcluster hangs on win x64 after calling system


I am experiencing a hard-to-debug problem with makePSOCKcluster from the parallel package on R x64 on Windows. It does not happen on R i386 on Windows, nor on any OS X or Linux system. Unfortunately it does not happen consistently either, only occasionally and seemingly at random.
What happens is that the makePSOCKcluster function times out and freezes the R session, but only if some (arbitrary) system() calls were performed earlier in the session. The video and script below illustrate the problem more clearly.
Some things I tried without success:
Disabling antivirus/firewalls.
Waiting a couple of seconds between calling system and makePSOCKcluster.
Using different system calls.
How would I narrow this down further? Here are the video and the script used in the video:
cmd_exists <- function(command){
  iswin <- identical(.Platform$OS.type, "windows");
  if(iswin){
    test <- suppressWarnings(try(system(command, intern=TRUE, ignore.stdout=TRUE, ignore.stderr=TRUE, show.output.on.console=FALSE), silent=TRUE));
  } else {
    test <- suppressWarnings(try(system(command, intern=TRUE, ignore.stdout=TRUE, ignore.stderr=TRUE), silent=TRUE));
  }
  !is(test, "try-error")
}
options(hasgit = cmd_exists("git --version"));
options(haspandoc = cmd_exists("pandoc --version"));
options(hastex = cmd_exists("texi2dvi --version"));
cluster <- parallel::makePSOCKcluster(1);

makePSOCKcluster, or more generally makeCluster, can hang for any number of reasons when creating the so-called worker processes. Creating a worker involves starting a new R session using the Rscript command, which executes the .slaveRSOCK function. That function creates a socket connection back to the master and then executes the slaveLoop function, where it eventually executes the tasks sent to it by the master. When something goes wrong while starting any of the worker processes, the master hangs while executing socketConnection, waiting for the worker to connect to it even though that worker may have died or never been created successfully.
Using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, then go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.
Here's an example:
> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
Next open a new terminal window (command prompt, or whatever), and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.
To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:
$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:
> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()
Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions.


Terminate shell pipe from interactive go cli

I have a Go program that consumes "live" input from a shell pipe, eg:
tail -f some/file | my-program
my-program is an interactive program built with rivo/tview. I want to be able to close my program with Ctrl-C and have it also terminate the tail -f that supplies input to it.
Currently I have to hit Ctrl-C twice to get back to my shell prompt. Any way I can get back to my prompt by hitting Ctrl-C once?
Adjusted my program per @torek's explanation of process groups and observation that I can get the process group ID using unix.Getpgid(pid):
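package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// do stuff with piped input
	pid := os.Getpid()
	pgid, err := unix.Getpgid(pid)
	if err != nil {
		log.Fatalf("could not get process group id for pid: %v\n", pid)
	}
	// Note: FindProcess(pgid) looks up the process whose PID equals the
	// group ID (the group leader), so Signal below reaches that one
	// process rather than every member of the group.
	processGroup, err := os.FindProcess(pgid)
	if err != nil {
		log.Fatalf("could not find process for pid: %v\n", pgid)
	}
	processGroup.Signal(os.Interrupt)
}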
This delivers my desired behavior from my original question.
I opted to not use syscall because of the warning I found:
Deprecated: this package is locked down. Callers should use the corresponding package in the golang.org/x/sys repository instead. That is also where updates required by new systems or versions should be applied. See https://golang.org/s/go1.4-syscall for more information.
I plan to update my program to detect whether or not it was given a pipe using the strategy outlined in this article, so when a pipe is detected, I'll do the above process group signaling on interrupt.
Any issues with that?

We'll assume a Unix-like system, using a shell that understands and engages in job control (and they all do now). When you run a command, the shell creates something called a process group or "pgroup" to hold each of the processes that make up the command. If the command is a pipeline (as this one is), each process in the pipeline gets the same pgroup-ID (see setpgid).
If the command is run in the foreground (without &), the controlling terminal has this particular pgid assigned to it. Pressing one of the signal-generating keys, such as CTRL-C or CTRL-\, sends the corresponding signal (SIGINT and SIGQUIT in these cases) to the pgroup, using an internal killpg or equivalent. This sends the signal to every member of the pgroup.
(Backgrounding a process is simply *cough* a matter of taking back the pgid on the controlling tty, then restarting the processes in the pipeline. To make that happen is not so simple, though, as indicated by the "restarting" here.)
The likely source of the problem here is that an interactive program will place the controlling terminal into cbreak or raw mode and disable some or all signalling from keyboard keys, so that, for instance, CTRL-C no longer causes the kernel's tty module to send a signal at all. Instead, when the program sees a key that should cause suspension (CTRL-Z) or termination, it has to do its own suspending or terminating. Programmers sometimes assume that this consists of simply suspending or terminating the program itself. But since the rest of the pipeline never got the signal in question, that is not enough, unless the shell pipeline consists solely of the interactive program.
The fix is to have the program send the signal to its own pgroup, after doing any necessary cleanup (temporarily or permanently) of the controlling terminal.
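As a minimal sketch of that fix (in Python rather than Go, for brevity, and assuming a Unix-like system), the program can look up its own process group and signal it once the terminal has been restored. The names signal_process_group, interrupt_own_pipeline, and the restore_terminal hook are hypothetical; only os.getpgrp and os.killpg are real standard-library calls.

```python
import os
import signal


def signal_process_group(pgid, sig=signal.SIGINT):
    """Deliver `sig` to every process in the process group `pgid`."""
    os.killpg(pgid, sig)


def interrupt_own_pipeline(restore_terminal=lambda: None):
    """Clean up the terminal, then interrupt our own process group.

    `restore_terminal` is a hypothetical hook for undoing raw/cbreak mode
    before the signal arrives (this process receives it too).
    """
    restore_terminal()
    signal_process_group(os.getpgrp(), signal.SIGINT)
```

Calling interrupt_own_pipeline() from the program's own Ctrl-C key handler would deliver SIGINT to tail -f and to the program itself, much as the terminal would have done outside raw mode.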

LLDB Restart process without user input

I am trying to debug a concurrent program in LLDB and am getting a seg fault, but not on every execution. I would like to run my process over and over until it hits a seg fault. So far, I have the following:
b exit
breakpoint com add 1
Enter your debugger command(s). Type 'DONE' to end.
> run
> DONE
The part that I find annoying, is that when I get to the exit function and hit my breakpoint, when the run command gets executed, I get the following prompt from LLDB:
There is a running process, kill it and restart?: [Y/n]
I would like to automatically restart the process, without having to manually enter Y each time. Anyone know how to do this?

You could kill the previous instance by hand with kill - which doesn't prompt - then the run command won't prompt either.
Or:
(lldb) settings set auto-confirm 1
will give the default (capitalized) answer to all lldb queries.
Or if you have Xcode 6.x (or current TOT svn lldb) you could use the lldb driver's batch mode:
$ lldb --help
...
-b
--batch
Tells the debugger to running the commands from -s, -S, -o & -O,
and then quit. However if any run command stopped due to a signal
or crash, the debugger will return to the interactive prompt at the
place of the crash.
So for instance, you could script this in the shell, running:
lldb -b -o run
in a loop, and this will stop if the run ends in a crash rather than a normal exit. In some circumstances this might be easier to do.
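The loop itself could be sketched like this in Python, assuming lldb is on the PATH and accepts the program to debug after a -- separator. lldb_batch_command and run_repeatedly are hypothetical names used only for illustration. Each iteration blocks until lldb exits, so when a run crashes, lldb stays at its interactive prompt and the loop pauses there.

```python
import subprocess


def lldb_batch_command(program, *args):
    """Build the `lldb -b -o run` command line for one debugging run."""
    return ["lldb", "-b", "-o", "run", "--", program, *args]


def run_repeatedly(command, max_runs):
    """Run `command` up to `max_runs` times and return the exit codes."""
    exit_codes = []
    for _ in range(max_runs):
        # Blocks until the command exits. With lldb -b, a crash leaves the
        # debugger at its prompt, so the loop simply waits there.
        exit_codes.append(subprocess.run(command).returncode)
    return exit_codes
```

For example, run_repeatedly(lldb_batch_command("./my_program"), max_runs=1000) would rerun a (hypothetical) ./my_program until one run crashes or the limit is reached.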

Automate a Ruby command without it exiting

This should hopefully be an easy question to answer. I am attempting to have mumble-ruby run automatically. I have everything up and running, except that when I run this simple script it starts but then ends. In short:
Running this from terminal I get "Press enter to terminate script" and it works.
Running this via a cronjob runs the script but ends it and runs cli.disconnect (I assume).
I want the below script to run automatically via a cronjob at a specified time and not end until the server shuts down.
#!/usr/bin/env ruby
require 'mumble-ruby'
cli = Mumble::Client.new('IP Address', Port, 'MusicBot', 'Password')
cli.connect
sleep(1)
cli.join_channel(5)
stream = cli.stream_raw_audio('/tmp/mumble.fifo')
stream.volume = 2.7
print 'Press enter to terminate script';
gets
cli.disconnect

Assuming you are on a Unix/Linux system, you can run it in a screen session. (This is a Unix command, not a scripting function.)
If you don't know what screen is, it's basically a "detachable" terminal session. You can open a screen session, run this script, and then detach from that screen session. That detached session will stay alive even after you log off, leaving your script running. (You can re-attach to that screen session later if you want to shut it down manually.)
screen is pretty neat, and every developer on Unix/Linux should be aware of it.
How to do this without reading any docs:
open a terminal session on the server that will run the script
run screen - you will now be in a new shell prompt in a new screen session
run your script
type ctrl-a then d (without ctrl; the "d" is for "detach") to detach from the screen (but still leave it running)
Now you're back in your first shell. Your script is still alive in your screen session. You can disconnect and the screen session will keep on trucking.
Do you want to get back into that screen and shut the app down manually? Easy! Run screen -r (for "reattach"). To kill the screen session, just reattach and exit the shell.
You can have multiple screen sessions running concurrently, too. (If there is more than one screen running, you'll need to provide an argument to screen -r.)
Check out some screen docs!
Here's a screen howto. Search "gnu screen howto" for many more.

Lots of ways to skin this cat... :)
My thought was to take your script (call it foo) and remove the last 3 lines. In your /etc/rc.d/rc.local file (NOTE: this applies to Ubuntu and Fedora, not sure what you're running - but it has something similar) you'd add nohup /path_to_foo/foo > /dev/null 2>&1 & to the end of the file so that it runs in the background with both stdout and stderr discarded. You can also run that command right at a terminal if you just want to start it and leave it running. You have to make sure that foo is made executable with chmod +x /path_to_foo/foo.

Use an infinite loop. Try:
loop do
  sleep(3600)
end
You can use exit to terminate when you need to. The loop body runs only once an hour, so it doesn't eat up processing time. An infinite loop before your disconnect method will prevent it from being called until the server shuts down.

How to launch crashing (rarely) application in subprocess

I have a Python application that needs to execute a proprietary application (which crashes from time to time) about 20,000 times a day.
The problem is that when the application crashes, Windows automatically triggers WerFault, which keeps the program hanging, so Python's subprocess.call() waits forever for user input (the application has to run on weekends, on holidays, 24/7, so this is not acceptable).
I thought about using sleep; poll; kill; terminate, but that would mean losing the ability to use communicate(). The application can run from a few milliseconds to 2 hours, so a fixed timeout would be ineffective.
I also tried turning on automatic debugging (using a script that takes a crash dump of the application and terminates it), but somehow this howto doesn't work on my server (WerFault still appears and waits for user input).
Several other tutorials like this didn't have any effect either.
Question:
Is there a way to prevent WerFault from displaying (waiting for user input)? This is more of a system than a programming question.
Alternative question: is there a graceful way in Python to detect an application crash (whether WerFault was displayed)?

Simple (and ugly) answer: monitor for WerFault.exe instances from time to time, especially the one associated with the PID of the offending application, and kill it. Dealing with WerFault.exe is complicated, but you don't want to disable it (see the Windows Error Reporting service).
Get a list of processes by name that match WerFault.exe. I use the psutil package. Be careful with psutil because processes are cached; use psutil.get_pid_list() (named psutil.pids() since psutil 2.0).
Decode its command line by using argparse. This might be overkill, but it leverages existing Python libraries.
Identify the process that is holding your application according to its PID.
This is a simple implementation.
import argparse

import psutil


def kill_proc_kidnapper(self, child_pid, kidnapper_name='WerFault.exe'):
    """
    Look among all instances of the 'WerFault.exe' process for the specific one
    that took control of another faulting process.
    When 'WerFault.exe' is launched, the PID is specified using the -p argument:
    'C:\\Windows\\SysWOW64\\WerFault.exe -u -p 5012 -s 68'
                            |                  |
                            +-> kidnapper      +-> child_pid
    The function uses `argparse` to decode the process command line and get the
    PID. If the PID matches `child_pid`, we have found the correct parent
    process and can kill it.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('-u', action='store_false', help='User name')
    parser.add_argument('-p', type=int, help='Process ID')
    parser.add_argument('-s', help='??')
    kidnapper_p = None
    child_p = None
    # psutil.pids() is the current name of get_pid_list(); build a fresh
    # Process object per PID instead of relying on cached ones.
    for pid in psutil.pids():
        try:
            proc = psutil.Process(pid)
            name = proc.name()
            if kidnapper_name not in name:
                continue
            cmdline = proc.cmdline()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process vanished or cannot be inspected; skip it
        # Skip argv[0] (the executable path) when parsing the options.
        args, unknown_args = parser.parse_known_args(cmdline[1:])
        print(name, cmdline)
        if args.p == child_pid:
            # We found the kidnapper, aim.
            print('kidnapper found: {0}'.format(proc.pid))
            kidnapper_p = proc
    if psutil.pid_exists(child_pid):
        child_p = psutil.Process(child_pid)
    if kidnapper_p and child_p:
        print('Killing "{0}" ({1}) that kidnapped "{2}" ({3})'.format(
            kidnapper_p.name(), kidnapper_p.pid, child_p.name(), child_p.pid))
        self.taskkill(kidnapper_p.pid)
        return 1
    else:
        if not kidnapper_p:
            print('Kidnapper process "{0}" not found'.format(kidnapper_name))
        if not child_p:
            print('Child process "({0})" not found'.format(child_pid))
        return 0
Now, the taskkill function invokes the taskkill command with the correct PID.
import subprocess


def taskkill(self, pid):
    """
    Kill task and entire process tree for this process
    """
    print('Task kill for PID {0}'.format(pid))
    cmd = 'taskkill /f /t /pid {0}'.format(pid)
    subprocess.call(cmd.split())
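On the communicate() concern from the question: since Python 3.3, communicate() accepts a timeout and can be retried after subprocess.TimeoutExpired without losing output, so a periodic check can run between retries. A minimal sketch, where communicate_with_watchdog and the on_tick callback are hypothetical names (on_tick is the place where something like kill_proc_kidnapper could be called):

```python
import subprocess


def communicate_with_watchdog(cmd, on_tick, interval=5.0):
    """Run `cmd`, calling `on_tick(pid)` every `interval` seconds until it exits.

    Returns (returncode, stdout_bytes, stderr_bytes).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    while True:
        try:
            # Retrying communicate() after a timeout does not lose output.
            out, err = proc.communicate(timeout=interval)
            return proc.returncode, out, err
        except subprocess.TimeoutExpired:
            # Still running (or stuck behind a crash dialog): run the check.
            on_tick(proc.pid)
```

For example, on_tick could be lambda pid: monitor.kill_proc_kidnapper(pid), where monitor is whatever object holds the two methods above. There is no fixed overall timeout, so a run that legitimately takes two hours is left alone.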

I see no reason as to why your program needs to crash, find the offending piece of code, and put it into a try-statement.
http://docs.python.org/3.2/tutorial/errors.html#handling-exceptions

Can a standalone ruby script (windows and mac) reload and restart itself?

I have a master-workers architecture where the number of workers is growing on a weekly basis. I can no longer be expected to ssh or remote console into each machine to kill the worker, do a source control sync, and restart. I would like to be able to have the master place a message out on the network that tells each machine to sync and restart.
That's where I hit a roadblock. If I were using any sane platform, I could just do:
exec('ruby', __FILE__)
...and be done. However, I did the following test:
p Process.pid
sleep 1
exec('ruby', __FILE__)
...and on Windows, I get one ruby instance for each call to exec. None of them die until I hit ^C on the window in question. On every platform I tried this on, it is executing the new version of the file each time, which I have verified by making simple edits to the test script while the test marched along.
The reason I'm printing the pid is to double-check the behavior I'm seeing. On Windows, I am getting a different pid with each execution, which I would expect, considering that I am seeing a new process in the task manager for each run. The Mac is behaving correctly: the pid is the same for every system call, and I have verified with dtrace that each run is triggering a call to the execve syscall.
So, in short, is there a way to get a windows ruby script to restart its execution so it will be running any code - including itself - that has changed during its execution? Please note that this is not a rails application, though it does use activerecord.

After trying a number of solutions (including the one submitted by Byron Whitlock, which ultimately put me onto the path to a satisfactory end) I settled upon:
IO.popen("start cmd /C ruby.exe #{$0} #{ARGV.join(' ')}")
sleep 5
I found that if I didn't sleep at all after the popen, and just exited, the spawn would frequently (>50% of the time) fail. This is not cross-platform obviously, so in order to have the same behavior on the mac:
IO.popen("xterm -e \"ruby blah blah blah\"&")

The classic way to restart a program is to write another one that does it for you. You spawn a process running restart.exe <args>, then die or exit; restart.exe waits until the calling script is no longer running, then starts the script again.
