Why not launch external crash dump handler at the time the application crashes? - windows

I am in the process of designing a crash handler solution for one of our applications that creates a crash dump file using the MiniDumpWriteDump() function. While reading up on the topic I have seen the recommendations to invoke MiniDumpWriteDump() from an external process to maximize the chance that the dump file contains the correct information. The common solution seems to be to run a watchdog process in parallel to the application process. When the application crashes it somehow contacts the watchdog process, providing it with the information that is required to create the crash dump. Then the application goes to sleep until it is terminated by the watchdog process.
I can imagine such a watchdog process being run continually as a background service. This has many implications, starting with "who creates the service?", but also "which user does the service run as?", and "how does the application contact the service?" etc. It seems a pretty heavy-weight solution which I don't feel is appropriate for the scope of my task.
A simpler approach is suggested by this SO answer: Launch a guard process on application startup that is tightly coupled to the application process. This is pretty good, but it still leaves me with the tasks of 1) keeping the information somewhere in the application how I can contact the guard process in case of a crash; and 2) making sure to terminate the guard process if the application process shuts down normally.
The simplest solution of all would be to launch the crash dump handler process at the time the crash occurs, passing all the information that is required to create the crash dump as arguments to the process. This information consists of
The process ID of the application process that crashed
The thread ID of the thread that crashed
The adress of the EXCEPTION_POINTERS structure that describes the exception that caused the crash
This "fire and forget" approach is compelling because it does not require any state retention, nor any complicated over-time process management. In fact, the approach seems so overwhelmingly simple that I cannot help but feel that I am overlooking something.
What are the arguments against such an approach?

The main argument against the "fire and forget" approach, as I called it, is that it is not safe to launch a new process at a time when the application is already in a state where it is about to crash.
Because of that I went for the "guard process" approach. It brings a number of challenges with it, for which Hans Passant has outlined a solution.
I also added a bit of code in this answer that should help with deep-copying the all-important EXCEPTION_POINTERS data structure.
Using WER, as proposed in the comments, also looks like a good alternative to writing your own guard process. I must admit I have not investigated this any further, though.

Related

Detect UI operation which will "hang" the application if running in service mode

Fellow experts!
I have faced the following dilemma: some of our tools (executables) are started as scheduled tasks, some are started as services and others as usual desktop apps with interactive Windows user. We are using the code sharing strategy for source management (this is not debatable for this question).
So the solution I want to find is the following:
Detect UI operation at run-time which leads to hanging service/background task (such as say call to Application.ShowException, ShowMessage, MessageDialog, TForm.Show etc.). And when such an action detected I want to raise the exception instead. Then the operation will fail, we will have stack trace etc. but the process will not hang up! The most problematic hang up is when some event processing is done in transaction and then in some of the code used to process event suddenly (because of error in code, design, whatever) there is UI code executed then the process hangs and the DB parts can be locked!
What I think I need to do is: Use DDetours library to intercept WinAPI calls to a certain routines and raise exception instead (so that the process does not hang, but just fail in some method). Also I know that the creation of forms and windows does not hang the app, but only the tries to show them to the user.
Is there some known method of handling this problem? Or maybe there is some list of WinAPI routine set which hangs in service mode?
Thank you in advance.

Why is WmiPrvSE.exe holding onto a handle to my Process' Job Object?

I have a .NET application which spawns multiple child 'worker processes'. I am using the Windows Job Object API and the JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE setting to ensure the child processes always get killed if the parent process is terminated.
However, I have observed a number of orphaned processes still running on the machine after the parent has been closed. Using Process Explorer, I can see they are correctly still assigned to the Job, and that the Job has the correct 'Kill on Job Close' setting configured.
The documentation for JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE states:
"Causes all processes associated with the job to terminate when the last handle to the job is closed."
This would seem to imply that a handle to the Job was still open somewhere... I did a search for handles to my Job object, and found instances of WmiPrvSE.exe in the results. If I kill the relevant WmiPrvSE.exe process, the outstanding handle to Job is apparently closed, and all the orphaned application processes get terminated as expected.
How come WmiPrvSE.exe has a handle to my Job?
You may find this blog in sorting out what WmiPrvSE is doing.
WmiPrvSE is the WMI Provider host. That means it hosts WMI providers, which are DLLs. So it's almost surely the case that WmiPrvSE doesn't have a handle to your job, but one of the providers it hosts does. In order to figure out which provider is the culprit, one way is to follow the process here and then see which of the separate processes holds the handle.
Once you have determined which provider is holding the handle you can either try to deduce, based on what system components the provider manages, what kind of query would have a handle to your Job. Or you can just disable the provider, if you don't care about losing access to the management of the components the provider provides.
If you can determine what kind of query would be holding a handle, you may be able to deduce what program is issuing the query. Or maybe the eventlog can tell you that (first link above).
To get more help please provide additional details in the OP, such as which providers are running in WmiPrvSE, any relevant eventlog events, and any other diagnostics info you obtain.
EDIT 1/27/16
An approach to find out what happened that caused WMIPrvSE to obtain your job's handle is to use Windbg's !htrace extension. You need to run !htrace -enable after you load you .EXE but before you execute it in Windbg. Then you can break in later and execute !htrace <handle> to see stack traces when the handle was manipulated. You may want to start with this article on handle implementation.

How does OSX Activity Monitor match XPC tasks to their initiator process?

When an application process launches an XPC helper process, it doesn't actually do the fork()/exec() itself in the classic UNIX style. Instead, it sends a message to launchd, which does the dirty work for it. Thus, if you query the parent process on the XPC process, it comes back as the launchd process.
However, if you open Activity Monitor in the hierarchical process view, the XPC helper processes are all shown below the original application that requested them, for example:
In the software I'm working on, knowing this relationship between processes would be extremely useful. So far we've been using the regular BSD parent process information, but as everything moves towards XPC, this isn't much use anymore.
So:
Where is the "original" parent process information stored for XPC processes?
How does Activity Monitor access it?
There is a kext involved, so I'd be happy to pull this information straight out in the kernel instead of userspace, but I can't seem to even figure out where it's stored.
Update: Discussion on Apple's darwin-kernel mailing list: http://lists.apple.com/archives/darwin-kernel/2015/Mar/msg00001.html
I imagine that launchd knows what you are looking for.
The Service Management framework has a method that might give you what you are looking for easily.
CFDictionaryRef SMJobCopyDictionary(CFStringRef domain, CFStringRef jobLabel); function.

What processĀ API do I need to hook to track services?

I need to track to a log when a service or application in Windows is started, stopped, and whether it exits successfully or with an error code.
I understand that many services do not log their own start and stop times, or if they exit correctly, so it seems the way to go would have to be inserting a hook into the API that will catch when services/applications request a process space and relinquish it.
My question is what function do I need to hook in order to accomplish this, and is it even possible? I need it to work on Windows XP and 7, both 64-bit.
I think your best bet is to use a device driver. See PsSetCreateProcessNotifyRoutine.
Windows Vista has NotifyServiceStatusChange(), but only for single services. On earlier versions, it's not possible other than polling for changes or watching the event log.
If you're looking for a user-space solution, EnumProcesses() will return a current list. But it won't signal you with changes, you'd have to continually poll it and act on the differences.
If you're watching for a specific application or set of applications, consider assigning them to Job Objects, which are all about allowing you to place limits on processes and manage them externally. I think you could even associate Explorer with a job object, then all tasks launched by the user would be associated with your job object automatically. Something to look into, perhaps.

How do I code a watchdog timer to restart a Windows service?

I'm very interested in the answer to another question regarding watchdog timers for Windows services (see here). That answer stated:
I have also used an internal watchdog system running in another thread. That thread looks at the main thread for activity like log output or a toggling event. If the activity is not seen then the service is considered hung and I shutdown the service.
In this case you can configure windows to auto-restart a stopped service and that might clear the problem (as long as it's not an internal logic bug).
Also services I work with have text logs that are written to a log. In addition for services that are about to "sleep for a bit", I log the time for the next wake up. I use MTAIL to watch a log for output."
Could anyone give some sample code how to use an internal watchdog running in another thread, since I currently have a task to develop a windows service which will be able to self restart in case it failed, hung up, etc.
I really appreciate your help.
I'm not a big fan of running a watchdog as a thread in the process you're watching. That means if the whole process hangs for some reason, the watchdog won't work.
Watchdogs are an idea lifted from the hardware world and they had it right. Use an external circuit as simple as possible (so it can be provably correct). Typical watchdogs simply ran an timer and, if the process hadn't done something before the timer expired (like access a memory location the watchdog was watching), the whole thing was reset. When the watchdog was "kicked", it would restart the timer.
The act of the process kicking the watchdog protected that process from summary termination.
My advice would be to write a very simple stand-alone program which just monitored an event (such as file update time being modified). If that event didn't occur within the required time, kill the process being watched (and let Windows restart it).
Then have your watched program periodically rewrite that file.
Other approaches you might want to consider besides regularly modifying the lastwritetime of a file would be to create a proper performance counter or even a WMI object. We do the later in our build infrastructure, the 'trick' is to find a meaningful work unit in the service being monitored and pulse your 'heartbeat' each time a unit is finished.
The advantage of WMI or Perf Counters over a the file approach is that you then become visible to a whole bunch of professional MIS / management tools. This can add a lot of value.
You can configure from service properties to self restart in case of failure
Services -> right-click your service -> Properties -> First failure : restart the service -> Second failure : restart the service -> Subsequent failure : restart

Resources