I want to create a test that runs my program for a long time and outputs the count of available handles from time to time. How can I do this with some WINAPI function?
This is a great article on how to debug handle leaks:
http://blogs.technet.com/b/yongrhee/archive/2011/12/19/how-to-troubleshoot-a-handle-leak.aspx
but it isn't suitable in my case: I have no idea how to automate a debugger in my test.
That's not how it works. The number of handles you can consume is limited by a quota; by default it is 10,000 handles. There are three types of handles, each governed by its own quota:
kernel handles, returned by functions that are exported by kernel32.dll: files, pipes, sockets, synchronization objects, etcetera. The best way to identify them is by the way they are released; kernel handles always require CloseHandle(). There is no hard upper limit on the number of kernel handles beyond the quota; failure occurs when the kernel memory pool runs out of space
user32 handles, window and menu objects. Beyond the quota, a hard upper limit exists for the number of handles that can be allocated in one desktop session: the sum of all user32 handles of all processes running on the same desktop cannot exceed an upper limit, I think it is 65,535 handles
gdi handles, device contexts and drawing objects like bitmaps and brushes, etcetera. Beyond the quota, they are subject to the same hard upper limit as user32 handles.
A program will always fail when it consumes one of the three quota limits, but it can fail earlier if other processes consume a lot of user32 or gdi objects or the kernel memory pool is under pressure.
The sane thing to do is not to log the number of handles still available, which you can't find out, but instead to log how many handles you've consumed. You can call GetGuiResources() to track the number of consumed user32 and gdi handles; GetProcessHandleCount() returns the number of kernel handles in use by your process.
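For example, a minimal sketch of such a logging loop (the one-second interval and console output are just illustrative choices):

#include <windows.h>
#include <stdio.h>

int main()
{
    HANDLE self = GetCurrentProcess();
    for (;;)
    {
        DWORD kernel = 0;
        GetProcessHandleCount(self, &kernel);               // kernel handles
        DWORD gdi  = GetGuiResources(self, GR_GDIOBJECTS);  // gdi handles
        DWORD user = GetGuiResources(self, GR_USEROBJECTS); // user32 handles
        printf("kernel: %lu  gdi: %lu  user32: %lu\n", kernel, gdi, user);
        Sleep(1000);
    }
    return 0;
}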
But instead of writing code, by far the simplest way is to use Task Manager's Processes tab. Use View + Select Columns (on Windows 8, right-click the column headers) and tick Handles, User Objects and GDI Objects. You'll get a live update of the handle counts for the three handle types while your program executes, and immediate feedback while you debug your code.
In a Windows application I have a class which wraps up a filename and a buffer. You construct it with a filename, and you can query the object to see if the buffer is filled yet, returning nullptr if not and the buffer address if so. When the object falls out of scope, the buffer is released:
class file_buffer
{
public:
file_buffer(const std::string& file_name);
~file_buffer();
void* buffer();
private:
...
};
I want to put the data into memory asynchronously, and as far as I see it I have two choices: either create a buffer and use overlapped IO through ReadFileEx, or use MapViewOfFile and touch the address on another thread.
At the moment I'm using ReadFileEx, which presents some problems, as requests greater than about 16MB are prone to failure: I can try splitting up the request, but then I get synchronisation issues, and if the object falls out of scope before the IO is complete I have buffer-cleanup issues. Also, if multiple instances of the class are created in quick succession, things get very fiddly.
Mapping and touching the data on another thread would seem to be considerably easier since I won't have the upper limit issues: also if the client absolutely has to have the data right now, they can simply dereference the address, let the OS worry about page faults and take the blocking hit.
This application needs to support single core machines, so my question is: will page faults on another software thread be any more expensive than overlapped IO on the current thread? Will they stall the process? Does overlapped IO stall the process in the same way or is there some OS magic I don't understand? Are page faults carried out using overlapped IO anyway?
I've had a good read of these topics:
http://msdn.microsoft.com/en-us/library/aa365199(v=vs.85).aspx (IO Concepts in File Management)
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx (File mapping)
but I can't seem to infer how to make a performance tradeoff.
You will definitely want to go with memory-mapped files. Overlapped IO (with FILE_FLAG_NO_BUFFERING) has been advocated as "the fastest way to get data into RAM" by some people for years, but this is only true in very contrived cases with very specific conditions. In the normal, average case, turning off the buffer cache is a serious anti-optimization.
Now, overlapped IO without FILE_FLAG_NO_BUFFERING has all the quirks of overlapped IO, and is about 50% slower (for a reason I still cannot understand).
I did some rather extensive benchmarking a year ago. The bottom line is: memory-mapped files are faster, better, less surprising.
Overlapped IO uses more CPU, is much slower when using the buffer cache, asynchronous reverts to synchronous under some well-documented and some undocumented conditions (e.g. encryption, compression, and... pure chance? request size? number of requests?), stalling your application at unpredictable times.
Submitting requests can sometimes take "funny" amounts of time, and CancelIO sometimes doesn't cancel anything but waits for completion. Processes with outstanding requests are unkillable. Managing buffers with outstanding overlapped writes is non-trivial extra work.
File mapping just works. Full stop. And it works nicely. No surprises, no funny stuff. Touching every page has very little overhead and delivers as fast as the disk is able to deliver, and it takes advantage of the buffer cache. Your concern about a single-core CPU is not a problem: if the touch-thread faults, it blocks, and as always when a thread blocks, another thread gets CPU time instead.
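A minimal sketch of the map-and-touch approach, assuming a read-only file called data.bin (error handling omitted; names are illustrative):

#include <windows.h>
#include <thread>

int main()
{
    HANDLE file = CreateFileW(L"data.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    const char* view = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    // Touch one byte per page on another thread so the OS faults the data in.
    std::thread toucher([view, size] {
        volatile char sink = 0;
        for (LONGLONG off = 0; off < size.QuadPart; off += 4096)
            sink += view[off];
    });

    // ... use 'view' here at any time; dereferencing an unread page simply
    // blocks this thread on a page fault while the toucher keeps going ...

    toucher.join();
    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}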
I'm even using file mapping for writing now, whenever I have more than a few bytes to write. This is somewhat non-trivial (you have to manually grow/preallocate files and mappings, and truncate to the actual length when closing), but with some helper classes it's entirely doable. Write 500 MiB of data, and it takes "zero time" (you basically do a memcpy; the actual write happens in the background, any time later, even after your program has finished). It's stunning how well this works, even if you know that it's the natural thing for an operating system to do.
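For illustration, a sketch of that write path (no error handling; the helper name is my own). CreateFileMapping with an explicit size grows the file up front, and SetEndOfFile truncates it to the real length on close:

#include <windows.h>
#include <string.h>

void write_via_mapping(const wchar_t* path, const void* data, LONGLONG len)
{
    HANDLE file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0,
                              nullptr, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    // Creating the mapping with an explicit size preallocates the file.
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                        (DWORD)(len >> 32), (DWORD)len, nullptr);
    void* view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0);

    memcpy(view, data, (size_t)len);  // "zero time": the OS writes back lazily

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    // Truncate to the actual length (matters when you preallocated more).
    LARGE_INTEGER end;
    end.QuadPart = len;
    SetFilePointerEx(file, end, nullptr, FILE_BEGIN);
    SetEndOfFile(file);
    CloseHandle(file);
}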
Of course you had better not have a power failure before the OS has written out all pages, but that's true for any kind of writing. What's not on the disk yet is not on the disk -- there's really not much more to say to it than that. If you must be sure about that, you have to wait for a disk sync to complete, and even then you can't be sure the lights aren't going out while you wait for the sync. That's life.
I don't claim to understand this better than you, as it seems you have done some investigation, and to be totally sure you will need to experiment. But this is my understanding of the issues, in reverse order:
File mapping and overlapped IO in Windows are different implementations and neither relies on the other under the hood, but both use the asynchronous block device layer. As I imagine it, in the kernel every IO is actually asynchronous, but some user operations wait for it to finish and so they create the illusion of synchronicity.
From point 1, if a thread does IO, other threads from the same process will not stall. That is, unless system resources are scarce or these other threads do IO themselves and face some kind of contention. This will be true no matter the kind of IO the first thread does: blocking, non-blocking, overlapped, memory-mapped.
In memory-mapped files, the data is read at least one page at a time, probably more because of read-ahead, but you cannot be sure about that. So the probing thread will have to touch the mapped memory at least once on every page. That will be something like probe/block, probe, probe, probe, probe/block, probe... That might be a bit less efficient than one big overlapped read of several MB. Or maybe the kernel programmers were smart and it is even more efficient. You will have to do a little profiling... hey, you could even go without the probing thread and see what happens.
Cancelling overlapping operations is a PITA, so my recommendation will be to go with the memory-mapped files. That is way easier to set up and you get extra functionality:
the memory is usable even before it is fully in memory
the memory can/will be shared by several instances of the process
if the memory is in the cache, it will be ready instantaneously instead of just quickly.
if the data is read-only, you can protect the memory from writing, catching bugs.
We have an application with a GDI leak that will eventually hit 10,000 allocated GDI objects and crash. I tried increasing GDIProcessHandleQuota to 20,000, but the program still crashed when it reached 10,000 objects. We're currently working on patching this leak, but out of curiosity: is there a way to increase the GDI limit for a single process? Or is 10k an individual application's hard limit?
10K is a hard limit.
GDI objects represent graphical device interface resources like fonts, bitmaps, brushes, pens, and device contexts (drawing surfaces). As it does for USER objects, the window manager limits processes to at most 10,000 GDI objects [...]
Mark Russinovich has a series of articles that go in-depth about the various limits in Windows. You might find these two useful:
Pushing the Limits of Windows: USER and GDI Objects – Part 1
Pushing the Limits of Windows: USER and GDI Objects – Part 2
Another good article from Raymond Chen:
Why is the limit of window handles per process 10,000?
There is a solution that might work. I deal with a misbehaving vendor app here that allocates tons of GDI objects, and this solution allows it to work most of the time...
Do
reg query "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems" /v windows
Look for SharedSection=, which should be three numbers separated by commas. Increase the middle number by 1024 at a time and see if that solves your problem. This variable controls the amount of "desktop heap", and increasing it has in the past allowed me to get a misbehaving GDI-heavy app running.
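For illustration only, the value is embedded in the much longer windows string value and looks something like this (the exact numbers vary by Windows version; the middle one is the interactive desktop heap size in KB):

SharedSection=1024,20480,768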
Look at KB184802 for a little more info. Search for SharedSection to find the relevant part of the page.
I am able to increase my GDI objects from 10000 to 15000 by changing ONLY the GDIProcessHandleQuota, but this requires a reboot to take effect. I did not have to change my SharedSection values, only the reboot was required.
While 10000 seems like a big number, my application has a large UI with lots of buttons, brushes, images, icons, etc. Once the application starts up, the number of objects only increases if the user does something that merits an increase. No GDI objects are leaking from the application. To test my solution I did add a "leak" method, so I could watch in the task manager what happened as the number of GDI objects increased beyond various limits.
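For what it's worth, a sketch of such a deliberate "leak" helper (my own illustrative code, not the poster's): it allocates brushes without ever calling DeleteObject, so you can watch the GDI Objects column climb in Task Manager until CreateSolidBrush starts returning NULL at the quota:

#include <windows.h>

int leak_gdi_objects(int count)
{
    int created = 0;
    for (int i = 0; i < count; ++i)
    {
        HBRUSH brush = CreateSolidBrush(RGB(i & 255, 0, 0));
        if (brush == nullptr)
            break;       // quota exhausted
        ++created;       // deliberately never DeleteObject(brush)
    }
    return created;
}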
I noticed that all the _EPROCESS objects are linked to each other via the ActiveProcessList link. What is the purpose of this List. For what does the OS use this list of Active Processes?
In Windows NT, the schedulable unit is the thread. Processes serve as a container of threads, and also as an abstraction that defines what virtual memory map is active (and some other things).
All operating systems need to keep this information available. At different times, different components of the operating system could need to search for a process that matches a specific characteristic, or would need to assess all active processes.
So, how do we store this information? Why not a gigantic array in memory? Well, how big is that array going to be? Are we comfortable limiting the number of active processes to the size of this array? What happens if we can't grow the array? Are we prepared to reserve all that memory up front to keep track of the processes? In the low process use case, isn't that a lot of wasted memory?
So we can keep them on a linked list.
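For illustration, the shape of such an intrusive list (the field names follow the public symbols; the traversal is a sketch, not compilable driver code):

typedef struct _LIST_ENTRY {
    struct _LIST_ENTRY *Flink;   // forward link
    struct _LIST_ENTRY *Blink;   // backward link
} LIST_ENTRY;

// Every EPROCESS embeds a LIST_ENTRY (ActiveProcessLinks). Walking the
// list from the head and applying CONTAINING_RECORD recovers each
// enclosing process structure:
//
// for (LIST_ENTRY *e = PsActiveProcessHead.Flink;
//      e != &PsActiveProcessHead; e = e->Flink) {
//     EPROCESS *p = CONTAINING_RECORD(e, EPROCESS, ActiveProcessLinks);
//     /* inspect p */
// }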
There are some occasions in NT where we care about process context but not thread context. One of those is I/O completion. When an I/O operation is handled asynchronously by the operating system, the eventual completion of that I/O could be in a process context that is different from the requesting process context. So, we need some records and information about the originating process so that we can "attach" to this process. "Attaching" to the process swaps us into the appropriate context with the appropriate user-mode memory available. We don't care about thread context, we care about process context, so this works.
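A driver-side sketch of that attach/detach dance, using the documented KeStackAttachProcess/KeUnstackDetachProcess pair (illustrative only; 'process' must be a valid, referenced process object):

#include <ntddk.h>

void touch_user_memory_of(PEPROCESS process)
{
    KAPC_STATE apc_state;
    KeStackAttachProcess((PRKPROCESS)process, &apc_state); // swap in its address space
    // ... the target's user-mode memory is now the current view ...
    KeUnstackDetachProcess(&apc_state);                    // restore our own context
}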
Assume you have a reference counted object in shared memory. The reference count represents the number of processes using the object, and processes are responsible for incrementing and decrementing the count via atomic instructions, so the reference count itself is in shared memory as well (it could be a field of the object, or the object could contain a pointer to the count, I'm open to suggestions if they assist with solving this problem). Occasionally, a process will have a bug that prevents it from decrementing the count. How do you make it as easy as possible to figure out which process is not decrementing the count?
One solution I've thought of is giving each process a UID (maybe their PID). Then when processes decrement, they push their UID onto a linked list stored alongside the reference count (I chose a linked list because you can atomically append to head with CAS). When you want to debug, you have a special process that looks at the linked lists of the objects still alive in shared memory, and whichever apps' UIDs are not in the list are the ones that have yet to decrement the count.
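A sketch of that CAS push, using offsets instead of raw pointers, since the shared segment may map at different addresses in different processes (the layout and names are my own assumptions):

#include <atomic>
#include <cstdint>

struct uid_node
{
    uint32_t uid;
    uint32_t next;     // index of the next node in the pool, 0 = end of list
};

struct uid_list
{
    std::atomic<uint32_t> head;   // index of the first node, 0 = empty
};

// Atomically push a node that has already been placed in the shared
// node pool at index 'node_idx'.
void push(uid_list& list, uid_node* pool, uint32_t node_idx)
{
    uid_node& node = pool[node_idx];
    uint32_t old_head = list.head.load();
    do {
        node.next = old_head;   // re-linked on every retry
    } while (!list.head.compare_exchange_weak(old_head, node_idx));
}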
The disadvantage to this solution is that it has O(N) memory usage where N is the number of processes. If the number of processes using the shared memory area is large, and you have a large number of objects, this quickly becomes very expensive. I suspect there might be a halfway solution where with partial fixed size information you could assist debugging by somehow being able to narrow down the list of possible processes even if you couldn't pinpoint a single one. Or if you could just detect which process hasn't decremented when only a single process hasn't (i.e. unable to handle detection of 2 or more processes failing to decrement the count) that would probably still be a big help.
(There are more 'human' solutions to this problem, like making sure all applications use the same library to access the shared memory region, but if the shared area is treated as a binary interface and not all processes are going to be applications written by you that's out of your control. Also, even if all apps use the same library, one app might have a bug outside the library corrupting memory in such a way that it's prevented from decrementing the count. Yes I'm using an unsafe language like C/C++ ;)
Edit: In single process situations, you will have control, so you can use RAII (in C++).
You could do this using only a single extra integer per object.
Initialise the integer to zero. When a process increments the reference count for the object, it XORs its PID into the integer:
object.tracker ^= self.pid;
When a process decrements the reference count, it does the same.
If the reference count is ever left at 1, then the tracker integer will be equal to the PID of the process that incremented it but didn't decrement it.
This works because XOR is commutative and associative ((A ^ B) ^ C == A ^ (B ^ C)), so if a process XORs the tracker with its own PID an even number of times, that is the same as XORing it with PID ^ PID, which is zero and leaves the tracker value unaffected.
You could alternatively use an unsigned value (which is defined to wrap rather than overflow) - adding the PID when incrementing the usage count and subtracting it when decrementing it.
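A minimal sketch of the XOR variant (names are illustrative; both fields live in the shared memory region):

#include <atomic>
#include <cstdint>

struct shared_object
{
    std::atomic<uint32_t> refcount;
    std::atomic<uint32_t> tracker;   // XOR of the PIDs of current holders
};

void acquire(shared_object& obj, uint32_t pid)
{
    obj.refcount.fetch_add(1);
    obj.tracker.fetch_xor(pid);      // XOR our PID in
}

void release(shared_object& obj, uint32_t pid)
{
    obj.tracker.fetch_xor(pid);      // XOR our PID out again
    obj.refcount.fetch_sub(1);
}

// If refcount is stuck at 1, tracker now equals the PID of the one
// process that incremented but never decremented.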
Fundamentally, shared-memory shared state is not a robust solution, and I don't know of a way of making it robust.
Ultimately, if a process exits all its non-shared resources are cleaned up by the operating system. This is incidentally the big win from using processes (fork()) instead of threads.
However, shared resources are not: file handles that others have open are obviously not closed, and neither is shared memory. Shared resources are only released after the last process sharing them exits.
Imagine you have a list of PIDs in the shared memory. A process could scan this list looking for zombies, but then PIDs can get reused, or the app might have hung rather than crashed, or...
My recommendation is that you use pipes or other message passing primitives between each process (sometimes there is a natural master-slave relationship, other times all need to talk to all). Then you take advantage of the operating system closing these connections when a process dies, and so your peers get signalled in that event. Additionally you can use ping/pong timeout messages to determine if a peer has hung.
If, after profiling, it is too inefficient to send the actual data in these messages, you could use shared memory for the payload as long as you keep the control channel over some kind of stream that the operating system clears up.
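A minimal sketch of that control channel on Windows (pipe setup omitted; the message handling is illustrative): a blocking ReadFile fails with ERROR_BROKEN_PIPE when the peer exits or crashes, so the operating system itself signals peer death:

#include <windows.h>
#include <stdio.h>

void watch_peer(HANDLE pipe)
{
    char buf[64];
    DWORD read = 0;
    for (;;)
    {
        if (!ReadFile(pipe, buf, sizeof buf, &read, nullptr))
        {
            if (GetLastError() == ERROR_BROKEN_PIPE)
                printf("peer died, reclaim its references\n");
            break;
        }
        // ... handle ping/pong or payload-ready messages here ...
    }
}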
The most efficient tracing systems for resource ownership don't even use reference counts, let alone lists of reference-holders. They just have static information about the layouts of every data type that might exist in memory, also the shape of the stack frame for every function, and every object has a type indicator. So a debugging tool can scan the stack of every thread, and follow references to objects recursively until it has a map of all the objects in memory and how they refer to each other. But of course systems that have this capability also have automatic garbage collection anyway. They need help from the compiler to gain all that information about the layout of objects and stack frames, and such information cannot actually be reliably obtained from C/C++ in all cases (because object references can be stored in unions, etc.) On the plus side, they perform way better than reference counting at runtime.
Per your question, in the "degenerate" case, all (or almost all) of your process's state would be held in shared memory - apart from local variables on the stack. And at that point you would have the exact equivalent of a multi-threaded program in a single process. Or to put it another way, processes that share enough memory start to become indistinguishable from threads.
This implies that you needn't specify the "multiple processes, shared memory" part of your question. You face the same problem anyone faces when they try to use reference counting. Those who use threads (or make unrestrained use of shared memory; same thing) face another set of problems. Put the two together and you have a world of pain.
In general terms, it's good advice not to share mutable objects between threads, where possible. An object with a reference count is mutable, because the count can be modified. In other words, you are sharing mutable objects between (effective) threads.
I'd say that if your use of shared memory is complex enough to need something akin to GC, then you've almost got the worst of both worlds: the expense of process creation without the advantages of process isolation. You've written (in effect) a multi-threaded application in which you are sharing mutable objects between threads.
Local sockets are a very cross-platform and very fast API for interprocess communication; the only one that works basically identically on all Unices and Windows. So consider using that as a minimal communication channel.
By the way, are you consistently using smart pointers in the processes that hold references? That's your only hope of getting reference counting even half right.
Use the following (a sketch: cas() is an atomic compare-and-swap such as InterlockedCompareExchange, atomic_inc/atomic_dec are atomic increment/decrement, and find_zero_slot() is a placeholder for the optimistic search):

int pids[MAX_PROCS];   // all zero initially, lives in shared memory
int counter;

Increment:

int i;
do {
    i = find_zero_slot(pids);   // find i such that pids[i] == 0 (optimistic)
} while (cas(&pids[i], 0, my_pid) == false);
my_pos = i;
atomic_inc(&counter);

Decrement:

pids[my_pos] = 0;
atomic_dec(&counter);

So you know all the processes using this object. Make MAX_PROCS big enough and search for a free slot starting at a random position; if the number of processes is significantly lower than MAX_PROCS, the search will be very fast.
Besides doing things yourself, you can also use a tool like AQTime, which has a reference-counted memory checker.
It's highly likely that there is a limitation on how many synchronization objects (semaphores, events, critical sections) one process, and all processes on a given machine, can use. What exactly is this limitation?
For Windows, the per-process limit on kernel handles (semaphores, events, mutexes) is 2^24.
From MSDN:
Kernel object handles are process specific. That is, a process must either create the object or open an existing object to obtain a kernel object handle. The per-process limit on kernel handles is 2^24. However, handles are stored in the paged pool, so the actual number of handles you can create is based on available memory. The number of handles that you can create on 32-bit Windows is significantly lower than 2^24.
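If you want to see where the practical limit lies on your machine, a throwaway sketch like this (my own illustration) creates unnamed events until creation fails; be aware it will chew through paged pool while it runs:

#include <windows.h>
#include <stdio.h>

int main()
{
    unsigned long long n = 0;
    for (;;)
    {
        HANDLE h = CreateEventW(nullptr, FALSE, FALSE, nullptr);
        if (h == nullptr)
            break;   // quota or paged pool exhausted
        ++n;
    }
    printf("created %llu event handles before failure\n", n);
    return 0;
}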
It depends on the quota that is available to the process. I think in XP it is set to 10,000 per process, but it can grow. I am not sure what the upper limit is.
Just checked it again: the 10,000 limit is for GDI handles, not for kernel objects.