I have an application that is using about 100k more of the Desktop Heap in this version than it did in the last version. Is there a way I can see what is on the Desktop Heap and how big the individual objects are? Using Dheapmon I was able to see what percentage of the heap I was using, but I want more details.
Stolen from a comment on a blog post here
Let me give a little background on how desktop heap allocations are made. The desktop heaps are in kernel mode virtual address space, so individual desktop heap allocations have to be made by a component running in kernel mode. In particular, win32k.sys is the only kernel mode component that makes desktop heap allocations. win32k.sys is the kernel mode side of Win32, and it includes both the window manager (USER) and GDI. It is the window manager piece of win32k.sys that uses desktop heap. The functionality of the window manager is exposed to processes running in user mode through user32.dll. It is user32.dll that exports user mode callable functions that are implemented in win32k.sys. So if a process does not load user32.dll, it will not use desktop heap.
Regarding your question about setting a breakpoint that will catch desktop heap allocations... yes, there is such a function - win32k!DesktopAlloc. However, this is a kernel mode function, and to set a breakpoint on it will require that you use a kernel debugger.
That all sounds scarily complicated to me, as someone who has never ventured away from user mode in Windows.
When I had a similar problem I just put breakpoints all over the startup portion of our application. At each break I would watch the number of allocated handles and what Dheapmon told me. Doing a sort of binary search, I started to narrow down where the allocations were happening.
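A hedged sketch of the same idea done in code rather than under the debugger: log the USER/GDI handle counts at checkpoints during startup and compare runs. GetGuiResources is the real API; the checkpoint names and the demo window below are purely illustrative.

#include <windows.h>
#include <stdio.h>

// Print the calling process's USER and GDI object counts at a named checkpoint.
void LogHandleCounts(const char* checkpoint)
{
    DWORD userObjects = GetGuiResources(GetCurrentProcess(), GR_USEROBJECTS);
    DWORD gdiObjects  = GetGuiResources(GetCurrentProcess(), GR_GDIOBJECTS);
    printf("%-25s USER objects: %lu  GDI objects: %lu\n",
           checkpoint, (unsigned long)userObjects, (unsigned long)gdiObjects);
}

int main()
{
    LogHandleCounts("startup");

    // Illustrative work that consumes USER objects.
    HWND hwnd = CreateWindowExA(0, "STATIC", "demo", WS_OVERLAPPED,
                                0, 0, 100, 100, NULL, NULL, NULL, NULL);
    LogHandleCounts("after CreateWindowEx");

    if (hwnd)
        DestroyWindow(hwnd);
    LogHandleCounts("after DestroyWindow");
    return 0;
}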
Dheapmon is the only tool I know of for looking directly at the desktop heap, but have you tried looking at your application with a tool like Winspector to look for glaring differences between the two versions (say, some type of control in your application now contains far more windows)? Any chance the application has switched to a newer version of IE? I seem to remember IE7 being much more desktop heap-intensive than IE6...
You can walk a heap using the Win32 API call HeapWalk. You can call GetProcessHeaps to get all the heaps available to the process if you need to walk more than just the default heap.
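A minimal sketch of that approach (note that HeapWalk only sees the process's user-mode heaps, not the kernel-mode desktop heap the question is about):

#include <windows.h>
#include <stdio.h>

int main()
{
    HANDLE heaps[64];
    DWORD count = GetProcessHeaps(ARRAYSIZE(heaps), heaps);
    if (count == 0 || count > ARRAYSIZE(heaps))
        return 1; // error, or more heaps than this fixed buffer holds

    for (DWORD i = 0; i < count; ++i)
    {
        printf("Heap %lu (%p):\n", (unsigned long)i, heaps[i]);

        PROCESS_HEAP_ENTRY entry = {0};   // lpData == NULL starts the walk
        HeapLock(heaps[i]);
        while (HeapWalk(heaps[i], &entry))
        {
            if (entry.wFlags & PROCESS_HEAP_ENTRY_BUSY)
                printf("  allocated block at %p, size %lu\n",
                       entry.lpData, (unsigned long)entry.cbData);
        }
        HeapUnlock(heaps[i]);
    }
    return 0;
}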
Related
The new 1-bit exploit of "all" Windows versions uses a bug in the kernel code that handles scrollbars. That got me thinking: why does Windows handle scrollbars in the kernel, rather than in user mode? Historical reasons? Does any other OS do this?
TL;DR: Microsoft sacrificed security for performance.
Scrollbars are a bit special on Windows. Most scrollbars are not real windows but are implemented as decorations on the "parent" window. This leads us to a more general question; why are windows implemented in kernel mode on Windows?
Let's look at the alternatives:
1. Per-process, in user mode.
2. A single "master" process in user mode.
Alternative 1 has a big advantage when dealing with your own windows; no context switch/kernel transition. The problem is of course that windows from different processes live on the same screen and somebody has to be responsible for deciding which window is active and coordinate changes when the user switches to a different window. This somebody would have to be a special system process or the kernel because this information cannot be per-process, it has to be stored somewhere global. This dual information design is going to be complicated because the per-process information cannot be trusted by the global window manager. I'm sure there are a ton of other downsides to this theoretical design but I'm not going to spend more time on it here.
Windows NT 3 implemented a variant of alternative 2. The window manager was moved into kernel mode in NT 4 mainly for performance reasons:
...the Window Manager (USER) and Graphics Device Interface (GDI) have been moved from the Win32 subsystem to the Windows NT Executive. Win32 user-mode device drivers, including graphics display and printer drivers, have also been moved to the Executive. These changes are designed to simplify graphics handling, reduce memory requirements, and improve performance.
...and further down in the same document there are more technical details and justifications:
When Windows NT was first designed, the Win32 environment subsystem was designed as a peer to the environment subsystems supporting applications in MS-DOS, POSIX, and OS/2. However, applications and other subsystems needed to use the graphics, windowing, and messaging functions in the Win32 subsystem. To avoid duplicating these functions, the Win32 subsystem was used as a server for graphics functions to all subsystems.
This design worked respectably for Windows NT 3.5 and 3.51, but it underestimated the volume and frequency of graphics calls. Having functions as basic as messaging and window control in a separate process generated substantial memory overhead from client/server message passing, data gathering, and managing multiple threads. It also required multiple context switches, which consume CPU cycles as well as memory. The volume of graphics support calls per second degraded the performance of the system. It was clear that a redesign of this facet in Windows NT 4.0 could reclaim these wasted system resources and improve performance.
The other subsystems are not that relevant these days but the performance issues remain.
If we look at a simple function like IsWindowVisible then there is not a lot of overhead when the window manager is in kernel mode: The function will execute a couple of instructions in user mode and then switch the CPU to ring 0 where the entire operation (validate the window handle passed in and if valid, retrieve the visible property) is performed in kernel mode. It then switches back to user mode and that is about it.
If the window manager lives in another process then you at least double the number of kernel transitions, you must somehow pass the function's input and output to and from the window manager process, and you must somehow cause the window manager process to execute while you wait for the result. NT 3 did this by using a combination of shared memory, LPC and an obscure feature called paired threads.
Please correct me if I am wrong. My understanding is that Mac OS X has a WindowServer process that composites windows from all applications and draws the final composite image on screen. The question is then where the WindowServer process obtains the "window data" (in some form such as bitmaps) of the other applications. Is it implemented through a shared memory mechanism between the applications and the WindowServer process? Any info or pointers/documentation on this would be helpful!
Also, is iOS implemented similarly regarding this aspect?
Thanks!
The mechanism by which your window bitmaps get marshaled to the WindowServer process is an undocumented implementation detail that is effectively "opaque", so even if you went to the effort to figure out how it works right now, it might change from release to release. That said...
If I had to take a guess on how it works, my guess would be that there is a block of shared memory that backs each window, and when your window goes to draw its view hierarchy, [NSGraphicsContext currentContext] is set up to point to a CGContext that's backed by that block of shared memory. When the window drawing sequence finishes, I would guess that one or more mach messages are sent from your process to the WindowServer process to tell it that it's time to present the just-drawn frame.
On iOS, it seems the SpringBoard process plays the window server role, and I imagine it works similarly; however, again, all these details are undocumented implementation details and therefore opaque. Since CoreGraphics is present in both OS X and iOS, it stands to reason that the mechanisms are similar.
You can find some evidence for this hypothesis using vmmap and the debugger (or dtrace). For instance, you can set up breakpoints (or dtrace probes) on all the different functions that can map virtual memory regions into your process (mmap, vm_allocate, etc.) then do a before/after comparison of vmmap output across the act of opening a new window. You'll see that there are new VM regions that have been mapped into your process, but you'll not have seen any corresponding hits on your breakpoints/dtrace probes (i.e. nothing in your process mapped these regions). This is evidence of the window server process having mapped regions of shared memory into your process. The meta-info about these regions is communicated to your process using mach messages (most likely). Trying this on a trivial sample application, opening a new window and looking at the difference in vmmap output shows this region that is very likely the backing store for our recently created window:
CG backing stores 00000001c73f2000-00000001c74cc000 [ 872K] rw-/rw- SM=SHM
I downloaded a disk and memory editor called HxD (available at http://mh-nexus.de/en/hxd/). I wonder how it is able to access (read and modify) virtual memory assigned to all applications running on my system (Windows XP Pro SP3). From what I know, Windows is running in protected mode, making such endeavours impossible. Yet it's not, how can that be?
Windows does indeed protect the memory of applications. Every application has its own address space and can simply not access anything outside it.
But Windows also has functions that allow you to access memory in other processes. Not by simply dereferencing a pointer, but by calling a function to get the data from the other process.
This functionality seems strange, but it is essential if you want to write a debugger, or other kinds of diagnostics utilities.
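A minimal sketch of the mechanism being described, assuming a known target process ID and a hypothetical address to read (the process ID, address, and buffer size below are placeholders, and this is the standard OpenProcess/ReadProcessMemory route rather than whatever the editor actually uses internally):

#include <windows.h>
#include <stdio.h>

int main()
{
    DWORD targetPid = 1234;                      // hypothetical target process ID
    LPCVOID remoteAddress = (LPCVOID)0x400000;   // hypothetical address inside the target

    // Ask the OS for a handle with read access to the other process's memory.
    HANDLE hProcess = OpenProcess(PROCESS_VM_READ, FALSE, targetPid);
    if (!hProcess)
    {
        printf("OpenProcess failed: %lu\n", GetLastError());
        return 1;
    }

    unsigned char buffer[256];
    SIZE_T bytesRead = 0;
    if (ReadProcessMemory(hProcess, remoteAddress, buffer, sizeof(buffer), &bytesRead))
        printf("Read %lu bytes from process %lu\n",
               (unsigned long)bytesRead, (unsigned long)targetPid);
    else
        printf("ReadProcessMemory failed: %lu\n", GetLastError());

    CloseHandle(hProcess);
    return 0;
}

Writing works the same way through WriteProcessMemory, with PROCESS_VM_WRITE and PROCESS_VM_OPERATION access requested instead.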
If the program is run in administrator mode, it can also load a driver dynamically and inspect any process from kernel mode. Examples are debuggers or tools like Process Explorer from Sysinternals.
We are working on a Vista/Windows 7 application that will be running in 64-bit mode using VS2008/C++. We will need to cache hundreds of 2-3 MB blobs of data in RAM for performance reasons, up to some memory limit. Our usage profile is such that we cannot read the data in fast enough if it is all on the disk. Total cached memory usage will be larger than 1 GB. For this to work well, we need to ensure that Windows does not page this memory out, as that would defeat the purpose of why we are doing this.
I've done a fair amount of research and cannot find documentation that states exactly how to do this. I've seen several references that infer memory mapped files work this way. Is there an expert who can clarify this for me?
I'm aware there are other programs that we could adapt to do this, for example splitting the blobs and loading them into memcache or in-memory databases, but they all have too many problems with performance or code complexity.
Suggestions?
You can use VirtualLock. However, you'll surely hit the quota with the amount you're talking about. Given that you should never run any other code on this machine, you'll be better off just disabling the paging file. Control Panel + System + Advanced.
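If you do try VirtualLock, here is a hedged sketch of the sequence (the sizes are illustrative; VirtualLock is limited by the process's minimum working set size, so that quota is raised first with SetProcessWorkingSetSize):

#include <windows.h>
#include <stdio.h>

int main()
{
    SIZE_T lockBytes = (SIZE_T)1 * 1024 * 1024 * 1024;     // amount we want locked (illustrative)
    SIZE_T minWorkingSet = lockBytes + 64 * 1024 * 1024;   // headroom for the rest of the process
    SIZE_T maxWorkingSet = minWorkingSet + 64 * 1024 * 1024;

    // Raise the working set quota so VirtualLock can pin this much memory.
    if (!SetProcessWorkingSetSize(GetCurrentProcess(), minWorkingSet, maxWorkingSet))
    {
        printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());
        return 1;
    }

    void* block = VirtualAlloc(NULL, lockBytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!block || !VirtualLock(block, lockBytes))
    {
        printf("allocate/lock failed: %lu\n", GetLastError());
        return 1;
    }

    // ... use 'block' as the cache; these pages stay resident while locked ...

    VirtualUnlock(block, lockBytes);
    VirtualFree(block, 0, MEM_RELEASE);
    return 0;
}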
From user mode, you can't (EDIT: at least not for the sizes you're talking about). User mode allocations all come down to either the VirtualAlloc API (on top of which GlobalAlloc/LocalAlloc/the C runtime's functions are written) or the memory-mapped file API. Neither API supports this, and therefore it's impossible to obtain on Win32. It is possible from within kernel mode, but somehow I suspect this is a user-mode application :)
Note that the memory manager is not going to decide to page out your RAM without a good reason to do so.
Now, you could of course, if you control the machine completely (this is for internal use or something) disable the pagefile on the machine in question, but that does not seem to solve your problem.
It's possible! You can force pages to be locked in memory from a user mode app by allocating them with AWE (Address Windowing Extensions): VirtualAlloc + AllocateUserPhysicalPages + MapUserPhysicalPages.
Note: I have read that you can use the AWE APIs from either a 32-bit or a 64-bit app, but I've only tried it with a 32-bit app. (Of course, since it's AWE, you can manually remap memory to access more than 2 GB of RAM.)
Note: You first have to hold SeLockMemoryPrivilege. (Which seems to require the app to run as Administrator in my testing so far.)
Note: Using AWE implies some limitations on what you can do with those particular pages of memory, e.g. no VirtualProtect().
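A hedged sketch of that sequence, assuming SeLockMemoryPrivilege has already been granted to and enabled for the process (the size is illustrative and error handling is minimal):

#include <windows.h>
#include <stdio.h>

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    SIZE_T bytes = (SIZE_T)256 * 1024 * 1024;        // illustrative cache size
    ULONG_PTR pageCount = bytes / si.dwPageSize;
    ULONG_PTR* pfns = new ULONG_PTR[pageCount];

    // 1. Allocate physical pages; AWE pages are never paged out.
    if (!AllocateUserPhysicalPages(GetCurrentProcess(), &pageCount, pfns))
    {
        printf("AllocateUserPhysicalPages failed: %lu\n", GetLastError());
        return 1;
    }

    // 2. Reserve a virtual address range flagged for physical-page mapping.
    void* region = VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);

    // 3. Map the physical pages into that range.
    if (!region || !MapUserPhysicalPages(region, pageCount, pfns))
    {
        printf("mapping failed: %lu\n", GetLastError());
        return 1;
    }

    // ... 'region' now addresses locked physical RAM ...

    MapUserPhysicalPages(region, pageCount, NULL);                 // unmap
    FreeUserPhysicalPages(GetCurrentProcess(), &pageCount, pfns);  // release the pages
    VirtualFree(region, 0, MEM_RELEASE);
    delete[] pfns;
    return 0;
}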
Perhaps the answer? (from a VMware tutorial)
To edit the Registry and disable paging of kernel-mode stacks:
1. Click Start > Run and type regedit.
2. In the left pane of the Registry Editor, navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager.
3. In the right pane, right-click GlobalFlag and select Modify.
4. With Base set to Hexadecimal, type the value 80000, which corresponds to FLG_DISABLE_PAGE_KERNEL_STACKS.
5. Click OK and exit the Registry Editor.
6. Reboot the guest system for this change to take effect.
hope it helps
I'm compiling a VC8 C++ project in a WinXP VMware session. It's a hell of a lot slower than gcc 3.2 in a RedHat VMware session, so I'm looking at Task Manager. It's saying a very large percentage of my compile process is spent in the kernel. That doesn't sound right to me.
Is there an equivalent of strace for Win32? At least something which will give me an overview of which kernel functions are being called. There might be something that stands out as being the culprit.
The Windows Resource Kit contains a tool called kernrate. It's a sampling profiler. It can profile the entire system or a particular process. By default its resolution is at the module level, but it can be tuned down to a few bytes. You should be fine with the default resolution, as you'll see which modules/drivers are consuming most of the time.
Here is some info regarding its use.
Not exactly strace, but there is a way of getting visibility into the kernel call stack, and by sampling it at times of high CPU usage, you can usually estimate what's using up all the time.
Install Process Explorer and make sure you configure it with symbol server support. You can do this by:
Installing WinDbg (the Debugging Tools for Windows) to get an updated dbghelp.dll
Setting Process Explorer to use this version of dbghelp.dll via the path in the Options | Configure Symbols menu of Process Explorer.
Also in the same dialog, setting the symbols path so that it includes the MS symbol server and a local cache.
Here's an example value for the symbol path:
SRV*C:\symbolcache*http://msdl.microsoft.com/download/symbols
(You can set _NT_SYMBOL_PATH environment variable to the same value to have the debugging tools use the same symbol server and cache path.) This path will cause dbghelp.dll to download symbols to local disk when asked for symbols for a module that doesn't have symbols locally.
After having set up Process Explorer like this, you can then get a process's properties, go to the Threads tab, and double-click on the busiest thread. This will cause Process Explorer to temporarily hook into the process and scan the thread's stack, and then go and look up the symbols for the various return addresses on the stack. The return addresses' symbols, and the module names (for non-MS third-party drivers), should give you a strong clue as to where your CPU time is being spent.
VMware support should be able to address that question. It's probably something in the VMware implementation.
You can use, for example, IrpTracker to get an idea of what is going on in the kernel.
Another option is using a kernel debugger, e.g. WinDbg. If the CPU load is very high, just randomly breaking into the debugger and looking at the call stack can give you an idea of which driver is behind the CPU load. But as I stated, I would guess it will be some VMware component. It is worth checking whether the problem persists on the same computer running WinXP without virtualization.