Debugging an unexpected process exit with a strange exit code

We have a small Java program that runs as a daemon on our test machines and starts servers for testing.
On Windows, it runs the servers 'under' procdump so as to capture a core dump if the server crashes.
Recently, we've been seeing the servers start successfully, but then exit with the code STATUS_DEBUGGER_INACTIVE (0xC0000354). Judging by our stderr/stdout logging, this is definitely not a value the server returned from main, and we never return that value anyway.
We get this exit code by scraping the procdump output for the PID of the monitored process, then using JNA to open a Win32 handle to the server process and calling GetExitCodeProcess.
I believe that procdump may be dying/being killed at the same time, since there's no usual 'process exited without creating dump' message.
To try to debug this, I added logic to our Java program which enables silent process exit monitoring for the server process (see https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/registry-entries-for-silent-process-exit). I tested it by starting a server and killing it with the Task Manager, and a dump was created.
But when running 'in production', I'm not seeing any dumps created, even though the unexpected exits are still occurring. What does it mean if this 'silent process exit monitoring' doesn't catch my process exit?
I haven't been able to find much about STATUS_DEBUGGER_INACTIVE online, but I did find this https://github.com/adobe/chromium/blob/cfe5bf0b51b1f6b9fe239c2a3c2f2364da9967d7/base/process_util_win.cc#L41
What is this 'special meaning' and where is it documented?

Related

Ruby force close IO.popen java process

How can I force quit/kill a Java process that was started with IO.popen("command", "r+")?
I am running a small Java program from a Ruby script as follows:
pipe = IO.popen("nice -n 19 java -Xmx2g -Djava.awt.headless=true -jar java_program.jar", 'r+')
Then I use stdio to send arguments back and forth, like this:
pipe.puts "data"
result = pipe.gets
This works fine for most data I send, but for some the java process seems to lock up or something, and I would like to force close / kill the java process.
I am currently doing the following, which does not seem to kill the java process (from another thread which watches over this stuff):
thepid = nil
thepid = pipe.pid if pipe.respond_to?(:pid)
pipe.puts('') if pipe.respond_to?(:puts) # This is the java_program's approach to closing: writing an empty string to stdio causes it to start its shutdown procedure.
pipe.close if pipe.respond_to?(:close)
Process.kill('KILL', thepid) if thepid && Process.getpgid( thepid )
The Java process lingers and refuses to die. What can I do to actually force the process to exit (it uses lots of RAM :( )?
Also: Is there a cross platform way of doing this?

What you may be seeing here is that you're killing the nice process and not the java process it launches.
You could avoid this by launching the java process directly and then altering the nice level using renice on the PID you get.
It's also worth checking that you're trying to kill the correct process. As you point out, spawning a second instance by accident would mean you're killing the wrong one.
A tool like popen3 allows for a lot more control over the child process and gives you the ability to feed input and capture output directly.
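A minimal sketch of the direct-launch approach, assuming a POSIX system where renice is on the PATH. The helper names start_child and force_kill are hypothetical, and a Ruby one-liner stands in for the java command so the sketch runs wherever Ruby does. Passing the command as an array makes IO.popen start it without an intermediate shell, so pipe.pid is the PID of the program itself.

```ruby
require 'rbconfig'

# Start the child directly (no shell, no `nice` wrapper) so that
# pipe.pid refers to the program we actually want to control.
def start_child(argv)
  pipe = IO.popen(argv, 'r+')
  # Lower the priority after launch. Assumption: renice is on the PATH;
  # if it is missing, system returns nil and the child keeps its priority.
  system('renice', '-n', '19', '-p', pipe.pid.to_s,
         out: File::NULL, err: File::NULL)
  pipe
end

# Forcefully stop the child, then reap it so no zombie is left behind.
def force_kill(pipe)
  begin
    Process.kill('KILL', pipe.pid)
  rescue Errno::ESRCH
    # The process had already exited.
  end
  pipe.close # for a popen pipe, close waits for the child and sets $?
  $?
end

# Stand-in for something like:
#   ['java', '-Xmx2g', '-Djava.awt.headless=true', '-jar', 'java_program.jar']
child = start_child([RbConfig.ruby, '-e', 'sleep 60'])
status = force_kill(child)
puts "child #{status.pid} killed by signal: #{status.signaled?}"
```

Closing the pipe after the kill matters: it collects the child's exit status, which is also a simple way to confirm that the process is really gone.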

using $? when running several commands in parallel in bash

I'm creating a startup/shutdown script for WebSEAL. It's written to allow several instances to be stopped/started in parallel. The only problem is verifying that it completed without issue. With other infrastructures, I could simply grep for a particular keyword in the output (which I redirect to a log file), but WebSEAL does not give any success/error message.
Instead, I thought to use $? to store the exit status in a dynamically named variable that will be checked after the startups have occurred (during log consolidation).
Here is the code that starts/stops and then creates the variable
${PDCOMMAND} >> ${LOGDIR}/${APP}.txt 2>&1 &
let return_${APP}=$?
PDCOMMAND is a valid startup/stop command, e.g. pdweb start my_instance
APP is the name of the instance, e.g. my_instance
The goal is that return_${APP} (return_my_instance) will have a value of 0 (success) or 1 (failure) when I check it at a later point in the script.
Are there problems using $? for a command that may not technically have completed at the time it was read, or is it set upon completion of that command? So let's say I have 3 instances
instance_1, instance_2, instance_3
if I ran the following:
pdweb start instance_1 &
let return_instance_1=$?
pdweb start instance_2 &
let return_instance_2=$?
pdweb start instance_3 &
let return_instance_3=$?
would return_instance_[1|2|3] have the correct values if they started in unequal amounts of time? If instance_3 starts before instance_1, for example, will it still output the result of instance_3 to return_instance_3?
Basically, I'm trying to figure out how the shell treats an asynchronous command with regard to the exit status.
Thanks in advance

No; the exit status is only available once the command finishes. (That's why it's called "exit status".) When you start a command with &, the shell does not wait for it, so the $? you read on the next line is always 0 and says nothing about whether the start succeeded. If you successfully spawned a service and it is up and running, it does not yet have an exit status.
If I am guessing correctly what you are trying to accomplish, you could record the value of $! after starting each instance, wait for a "reasonable" time (a few seconds?) and check that the processes you started are still running. If they have terminated, there was a problem.

makePSOCKcluster hangs on win x64 after calling system

I am experiencing a hard to debug problem with makePSOCKcluster from the parallel package on R x64 on Windows. It does not happen on R i386 on Windows, nor on any OSX or Linux. Unfortunately it does not happen consistently either, only occasionally and quite randomly.
What happens is that the makePSOCKcluster function times out and freezes the R session, but only if earlier in the session some (arbitrary) system() calls were performed. The video and script below illustrate the problem more clearly.
Some stuff I tried without success:
Disable antivirus/firewalls.
Waiting a couple of seconds between calling system and makePSOCKcluster.
Using different system calls.
How would I narrow this down further? The script used in the video is:
cmd_exists <- function(command){
  iswin <- identical(.Platform$OS.type, "windows");
  if(iswin){
    test <- suppressWarnings(try(system(command, intern=TRUE, ignore.stdout=TRUE, ignore.stderr=TRUE, show.output.on.console=FALSE), silent=TRUE));
  } else {
    test <- suppressWarnings(try(system(command, intern=TRUE, ignore.stdout=TRUE, ignore.stderr=TRUE), silent=TRUE));
  }
  !is(test, "try-error")
}
options(hasgit = cmd_exists("git --version"));
options(haspandoc = cmd_exists("pandoc --version"));
options(hastex = cmd_exists("texi2dvi --version"));
cluster <- parallel::makePSOCKcluster(1);

makePSOCKcluster, or more generally makeCluster, can hang for any number of reasons when creating the so-called worker processes. Creating a worker involves starting a new R session with the Rscript command, which executes the .slaveRSOCK function. That function creates a socket connection back to the master and then executes the slaveLoop function, where it eventually runs the tasks sent to it by the master. When something goes wrong while starting any of the worker processes, the master hangs in socketConnection, waiting for the worker to connect even though that worker may have died or never been created successfully.
Using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, then go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.
Here's an example:
> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
Next open a new terminal window (command prompt, or whatever), and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.
To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:
$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:
> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()
Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions.

Script which launches another application will bring it down on exit

I have a script which launches another application using nohup my_app &, but when the initial script dies the launched process also goes down. As I understand it, since it was run with nohup, that should not happen. The original script was also called with nohup.
What went wrong there?

A very reliable script that has been used successfully for years, and has always terminated cleanly after invoking nohup, uses this construct:
nohup ${BinDir}/${Watcher} >${DataDir}/${Watcher}.nohup.out 2>&1 &
Perhaps the problem is that output is not being managed?

nohup does not guarantee that a (child) process keeps running when the (parent) process is killed. nohup is used, for example, when you connect to a server over ssh and start a process there. If you log out, the process will normally terminate, because logging out sends the SIGHUP signal to the process. Using nohup avoids this behaviour, so your process is still running after you have logged out.
If you need a program that keeps running in the background even after its parent process has terminated, try running it as a daemon.

It depends what my_app does: it might set up its own signal handling. You probably know that nohup ignores the hang-up signal SIGHUP, and this is inherited by the target program. If that target program does its own signal handling, then it might be setting SIGHUP to, for example, SIG_DFL, the default action (which is to die).
To check, run strace -f -o out or truss -f -o out on the command. This will give you all the kernel calls in the file called 'out'. You should be able to spot the SIGHUP handling being changed, if it is.

Can a standalone ruby script (windows and mac) reload and restart itself?

I have a master-workers architecture where the number of workers is growing on a weekly basis. I can no longer be expected to ssh or remote console into each machine to kill the worker, do a source control sync, and restart. I would like to be able to have the master place a message out on the network that tells each machine to sync and restart.
That's where I hit a roadblock. If I were using any sane platform, I could just do:
exec('ruby', __FILE__)
...and be done. However, I did the following test:
p Process.pid
sleep 1
exec('ruby', __FILE__)
...and on Windows, I get one ruby instance for each call to exec. None of them die until I hit ^C in the window in question. On every platform I tried this on, it executes the new version of the file each time, which I have verified by making simple edits to the test script while the test marched along.
The reason I'm printing the pid is to double-check the behavior I'm seeing. On Windows, I am getting a different pid with each execution, which I would expect, considering that I am seeing a new process in the Task Manager for each run. The Mac is behaving correctly: the pid is the same for every run, and I have verified with dtrace that each run is triggering a call to the execve syscall.
So, in short, is there a way to get a Windows Ruby script to restart its execution so it will be running any code, including itself, that has changed during its execution? Please note that this is not a Rails application, though it does use ActiveRecord.

After trying a number of solutions (including the one submitted by Byron Whitlock, which ultimately put me onto the path to a satisfactory end) I settled upon:
IO.popen("start cmd /C ruby.exe #{$0} #{ARGV.join(' ')}")
sleep 5
I found that if I didn't sleep at all after the popen, and just exited, the spawn would frequently (>50% of the time) fail. This is not cross-platform obviously, so in order to have the same behavior on the mac:
IO.popen("xterm -e \"ruby blah blah blah\"&")

The classic way to restart a program is to write another one that does it for you. You spawn a process running restart.exe <args>, then exit; restart.exe waits until the calling script is no longer running, then starts the script again.
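The helper-process idea could be sketched in Ruby roughly as follows. Everything here is hypothetical: the file name restart_helper.rb, the function names, and the choice of polling with signal 0. On POSIX systems signal 0 performs only the existence and permission check; that it behaves the same way on a given Windows Ruby build is an assumption worth verifying. PIDs can also be reused, so this is a sketch rather than a robust implementation.

```ruby
require 'rbconfig'

# True while a process with this PID exists.
def process_alive?(pid)
  Process.kill(0, pid) # signal 0: check only, nothing is delivered
  true
rescue Errno::ESRCH
  false                # no such process
rescue Errno::EPERM
  true                 # exists, but we may not signal it
end

# Poll until the process is gone. Returns false if it is still
# alive when the timeout (in seconds) expires.
def wait_for_exit(pid, timeout: 30, interval: 0.1)
  deadline = Time.now + timeout
  while process_alive?(pid)
    return false if Time.now >= deadline
    sleep interval
  end
  true
end

# Wait for the old script to exit, then start a fresh copy of it.
# Returns the new PID, or nil if the old process never went away.
def restart_after_exit(pid, script, args)
  return nil unless wait_for_exit(pid)
  new_pid = Process.spawn(RbConfig.ruby, script, *args)
  Process.detach(new_pid) # we do not wait for the restarted script
  new_pid
end

# Hypothetical command line: ruby restart_helper.rb <pid> <script> [args...]
# The script that wants a restart would spawn this helper with its own
# Process.pid and $0, then exit.
if __FILE__ == $PROGRAM_NAME && ARGV.length >= 2
  restart_after_exit(Integer(ARGV[0]), ARGV[1], ARGV.drop(2))
end
```

Using spawn plus detach in the helper, rather than exec, sidesteps the question's observation that exec yields a new process on Windows anyway.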
