Silent Process Exit due to KiSchedulerApcTerminate - windows

I have a Windows 10 machine where some UI process runs happily for days until it silently exits without any visible exception.
I have enabled ETW tracing, but pretty much all call stacks come up with ???? marks. I suppose no proper ETW provider rundown happened in this process. After enabling Process Destruction call stacks I found this one:
[Root]
ntoskrnl.exe!KiPageFault
ntoskrnl.exe!KiInitiateUserApc
ntoskrnl.exe!KiDeliverApc
ntoskrnl.exe!KiSchedulerApcTerminate
ntoskrnl.exe!PsExitCurrentUserThread
ntoskrnl.exe!PspExitThread
ntoskrnl.exe!PspExitProcess
Does this indicate a page-in error, so that I should suspect a hardware failure of the hard disk or memory?
I have found a failed file I/O request at that time:
The specified request is not a valid operation for the target device.
(0xc0000010) = Event SubType, FSCTL = Event Type, C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.ServiceModel\v4.0_4.0.0.0__b77a5c561934e089\System.ServiceModel.dll
which makes me think that the hard disk has an issue.

Related

Detect UI operation which will "hang" the application if running in service mode

Fellow experts!
I have faced the following dilemma: some of our tools (executables) are started as scheduled tasks, some are started as services, and others as ordinary desktop apps with an interactive Windows user. We are using a code-sharing strategy for source management (this is not debatable for this question).
So the solution I want to find is the following:
Detect, at run time, any UI operation that would hang a service or background task (for example a call to Application.ShowException, ShowMessage, MessageDialog, TForm.Show, etc.), and raise an exception instead. The operation will then fail and we will have a stack trace, but the process will not hang. The most problematic hang occurs when an event is processed inside a transaction and some of the code handling the event unexpectedly (because of a coding error, a design flaw, whatever) executes UI code: the process hangs and parts of the DB can stay locked!
What I think I need to do is use the DDetours library to intercept calls to certain WinAPI routines and raise an exception instead (so that the process does not hang, but just fails in some method). I also know that creating forms and windows does not hang the app; only attempts to show them to the user do.
Is there a known method of handling this problem? Or is there a list of WinAPI routines that hang in service mode?
Thank you in advance.

Resolve Windows socket error WSAENOBUFS (10055)

Our application has a feature to actively connect to the customers' internal factory network and send a message when inspection events occur. The customer enters the IP address and port number of their machine and application into our software.
I'm using a TClientSocket in blocking mode and have provided callback functions for the OnConnect and OnError events. Assuming the abovementioned feature has been activated, when the application starts I call the following code in a separate thread:
// Attempt active connection
try
  m_socketClient.Active := True;
except
end;
// Later...
// If `OnConnect` and socket is connected...send some data!
// If `OnError`...call `m_socketClient.Active := True;` again
When IP + port are valid, the feature works well. But if not, after several thousand errors (and many hours or even days) eventually Windows socket error 10055 (WSAENOBUFS) occurs and the application crashes.
Various articles such as this one from ServerFramework and this one from Microsoft talk about exhausting the Windows non-paged pool and mention (1) actively managing the number of outstanding asynchronous send operations and (2) releasing the data buffers that were used for the I/O operations.
My question is how to achieve this, and it is three-fold:
A) Am I doing something wrong that memory is being leaked? For example, is there some missing cleanup code in the OnError handler?
B) How do you monitor I/O buffers to see if they are being exhausted? I've used Process Explorer to confirm my application is the cause of the leak, but ideally I'd need some programmatic way of measuring this.
C) Apart from restarting the application, is there a way to ask Windows to clear out or release I/O operation data buffers?
Code samples in Delphi, C/C++, C# fine.
A) The cause of the resource leak was a programming error. When the OnError event occurs, Socket.Close() should be called to release low-level resources associated with the socket.
B) The leak does not show up in the standard Working Set memory use of the process. Instead, the open handles belonging to your process need to be monitored, which is possible with GetProcessHandleCount. See this answer in Delphi, which was tested and works well. This answer in C++ was not tested, but it is the accepted answer, so it should work. Of course, you should be able to use GetProcessHandleCount directly in C++.
C) After much research, I must conclude that just like a normal memory leak, you cannot just ask Windows to "clean up" after you! The handle resource has been leaked by your application and you must find and fix the cause (see A and B above).
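The fix described in A) can be sketched as follows. This is a minimal illustration in Python rather than Delphi, and the function name connect_with_retry and its parameters are hypothetical, not part of TClientSocket. The point it shows is only that every failed attempt closes its own socket before the next attempt is made, so a wrong IP address or port cannot accumulate handles over time.

```python
import socket
import time


def connect_with_retry(host, port, attempts=3, delay=0.1, timeout=1.0):
    """Try to connect up to `attempts` times; return a connected socket or None.

    Every failed attempt closes its socket before retrying, so no handle
    outlives the attempt that created it.
    """
    for attempt in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return sock  # the caller now owns the socket and must close it
        except OSError:
            # Counterpart of calling Socket.Close() in the OnError handler.
            sock.close()
            if attempt + 1 < attempts:
                time.sleep(delay)
    return None
```

In a long-running application the retry would normally be unbounded with a longer delay; the bounded loop here just keeps the sketch easy to try out.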

Why not launch external crash dump handler at the time the application crashes?

I am in the process of designing a crash handler solution for one of our applications that creates a crash dump file using the MiniDumpWriteDump() function. While reading up on the topic I have seen the recommendations to invoke MiniDumpWriteDump() from an external process to maximize the chance that the dump file contains the correct information. The common solution seems to be to run a watchdog process in parallel to the application process. When the application crashes it somehow contacts the watchdog process, providing it with the information that is required to create the crash dump. Then the application goes to sleep until it is terminated by the watchdog process.
I can imagine such a watchdog process being run continually as a background service. This has many implications, starting with "who creates the service?", but also "which user does the service run as?", and "how does the application contact the service?" etc. It seems a pretty heavy-weight solution which I don't feel is appropriate for the scope of my task.
A simpler approach is suggested by this SO answer: Launch a guard process on application startup that is tightly coupled to the application process. This is pretty good, but it still leaves me with the tasks of 1) keeping the information somewhere in the application how I can contact the guard process in case of a crash; and 2) making sure to terminate the guard process if the application process shuts down normally.
The simplest solution of all would be to launch the crash dump handler process at the time the crash occurs, passing all the information that is required to create the crash dump as arguments to the process. This information consists of
The process ID of the application process that crashed
The thread ID of the thread that crashed
The address of the EXCEPTION_POINTERS structure that describes the exception that caused the crash
This "fire and forget" approach is compelling because it does not require any state retention, nor any complicated over-time process management. In fact, the approach seems so overwhelmingly simple that I cannot help but feel that I am overlooking something.
What are the arguments against such an approach?
The main argument against the "fire and forget" approach, as I called it, is that it is not safe to launch a new process at a time when the application is already in a state where it is about to crash.
Because of that I went for the "guard process" approach. It brings a number of challenges with it, for which Hans Passant has outlined a solution.
I also added a bit of code in this answer that should help with deep-copying the all-important EXCEPTION_POINTERS data structure.
Using WER, as proposed in the comments, also looks like a good alternative to writing your own guard process. I must admit I have not investigated this any further, though.

Windows Kernel Driver Boot\winlogon complete callback

Can I get an event callback to my kernel driver when the boot process has completed, or when a user logs in?
The simple answer is no.
The long answer is yes, but why?
I'll answer the second part, because it's easier. You can easily register to receive a notification when any process is launched. A short examination of Windows Internals will tell you that from Vista and up, the process userinit.exe is the first process to be executed in any given user session.
To the first part, this very much changes depending on your definition of boot process. Is it when a GUI is loaded? Is it when the computer can receive network requests? Does it matter which network requests (TCP/IP, SMB, RPC)?
The answer to each of these is very different.
When win32K has finished loading
When the TCP/IP stack drivers finish loading
When specific services (RPC, Server service) are done loading
What is the problem you're trying to solve?

How do I code a watchdog timer to restart a Windows service?

I'm very interested in the answer to another question regarding watchdog timers for Windows services (see here). That answer stated:
I have also used an internal watchdog system running in another thread. That thread looks at the main thread for activity like log output or a toggling event. If the activity is not seen then the service is considered hung and I shutdown the service.
In this case you can configure windows to auto-restart a stopped service and that might clear the problem (as long as it's not an internal logic bug).
Also services I work with have text logs that are written to a log. In addition for services that are about to "sleep for a bit", I log the time for the next wake up. I use MTAIL to watch a log for output.
Could anyone give some sample code showing how to use an internal watchdog running in another thread? I currently have a task to develop a Windows service that can restart itself if it fails, hangs, etc.
I really appreciate your help.
I'm not a big fan of running a watchdog as a thread in the process you're watching. That means if the whole process hangs for some reason, the watchdog won't work.
Watchdogs are an idea lifted from the hardware world and they had it right. Use an external circuit as simple as possible (so it can be provably correct). Typical watchdogs simply ran a timer and, if the process hadn't done something before the timer expired (like access a memory location the watchdog was watching), the whole thing was reset. When the watchdog was "kicked", it would restart the timer.
The act of the process kicking the watchdog protected that process from summary termination.
My advice would be to write a very simple stand-alone program that just monitors an event (such as a file's update time being modified). If that event doesn't occur within the required time, kill the process being watched (and let Windows restart it).
Then have your watched program periodically rewrite that file.
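A minimal sketch of that heartbeat-file idea, in Python. The function names kick and is_hung, the heartbeat path, and the choice to treat a missing file as "hung" are all assumptions made for illustration. The watched program calls kick regularly; the stand-alone watchdog calls is_hung on its own schedule and decides whether to kill the process.

```python
import os
import time


def kick(heartbeat_path):
    """Called by the watched program: refresh the file's modification time."""
    with open(heartbeat_path, "a"):
        pass  # make sure the file exists
    os.utime(heartbeat_path, None)  # set access/modification time to now


def is_hung(heartbeat_path, timeout_seconds, now=None):
    """Called by the watchdog: True if the heartbeat is missing or too old."""
    if now is None:
        now = time.time()
    try:
        last_kick = os.path.getmtime(heartbeat_path)
    except OSError:
        return True  # assumption: no heartbeat file at all counts as hung
    return (now - last_kick) > timeout_seconds
```

The watchdog itself then stays tiny: a loop that sleeps, calls is_hung, and terminates the watched process when it returns True, leaving the restart to Windows.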
Other approaches you might want to consider besides regularly modifying the lastwritetime of a file would be to create a proper performance counter or even a WMI object. We do the latter in our build infrastructure; the 'trick' is to find a meaningful work unit in the service being monitored and pulse your 'heartbeat' each time a unit is finished.
The advantage of WMI or Perf Counters over the file approach is that you then become visible to a whole bunch of professional MIS / management tools. This can add a lot of value.
You can configure the service to restart itself on failure from its properties:
Services -> right-click your service -> Properties -> Recovery -> First failure: Restart the Service -> Second failure: Restart the Service -> Subsequent failures: Restart the Service

Resources