Comparing cold start to warm start - Windows

Our application takes significantly more time to launch after a reboot (cold start) than if it has already been opened once (warm start).
Most (if not all) of the difference seems to come from loading DLLs; when the DLLs are in cached memory pages they load much faster. We tried using ClearMem to simulate rebooting (since it's much less time-consuming than actually rebooting) and got mixed results: on some machines it seemed to simulate a reboot very consistently, and on others it did not.
To sum up my questions are:
Have you experienced differences in launch time between cold and warm starts?
How have you dealt with such differences?
Do you know of a way to dependably simulate a reboot?
Edit:
Clarifications for comments:
The application is mostly native C++ with some .NET (the first .NET assembly that's loaded pays for the CLR).
We're looking to improve load time, obviously we did our share of profiling and improved the hotspots in our code.
Something I forgot to mention was that we got some improvement by re-basing all our binaries so the loader doesn't have to do it at load time.

As for simulating reboots, have you considered running your app from a virtual PC? Using virtualization you can conveniently replicate a set of conditions over and over again.
I would also consider some type of profiling app to spot the bit of code causing the time lag, and then making the judgement call about how much of that code is really necessary, or if it could be achieved in a different way.

It would be hard to truly simulate a reboot in software. When you reboot, all devices in your machine get their reset bit asserted, which should cause all memory system-wide to be lost.
In a modern machine you've got memory and caches everywhere: there's the VM subsystem which is storing pages of memory for the program, then you've got the OS caching the contents of files in memory, then you've got the on-disk buffer of sectors on the hard drive itself. You can probably get the OS caches to be reset, but the on-disk buffer on the drive? I don't know of a way.

How did you profile your code? Not all profiling methods are equal and some find hotspots better than others. Are you loading lots of files? If so, disk fragmentation and seek time might come into play.
Maybe even sticking basic timing information into the code, writing out to a log file and examining the files on cold/warm start will help identify where the app is spending time.
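As a rough sketch of the "basic timing information" idea (the log file name and function are made up for illustration, not part of the original app):

#include <windows.h>
#include <cstdio>

// Call LogTiming("label") at interesting points during startup, then compare the
// log produced after a cold boot against the one from a warm start.
// (Not thread-safe; fine for a quick experiment.)
static void LogTiming(const char* label)
{
    static std::FILE* log = std::fopen("startup_timing.log", "w");
    if (log)
    {
        std::fprintf(log, "%lu ms  %s\n", GetTickCount(), label);
        std::fflush(log);
    }
}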
Without more information, I would lean towards filesystem/disk cache as the likely difference between the two environments. If that's the case, then you either need to spend less time loading files upfront, or find faster ways to load files.
Example: if you are loading lots of binary data files, speed up loading by combining them into a single file, then slurp the whole file into memory in one read and parse the contents from there. Fewer disk seeks and less time spent reading off the disk. Again, maybe that doesn't apply.
I don't know offhand of any tools to clear the disk/filesystem cache, but you could write a quick application to read a bunch of unrelated files off of disk to cause the filesystem/disk cache to be loaded with different info.
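For example, a crude sketch of such a cache-flushing reader (the filler file paths are placeholders; the files should be large and unrelated to your app):

#include <cstdio>
#include <vector>

// Read each filler file front to back so its pages displace previously cached data.
static void ReadWholeFile(const char* path)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return;
    std::vector<char> buf(1 << 20);                      // 1 MB at a time
    while (std::fread(&buf[0], 1, buf.size(), f) > 0)
        ;                                                // discard the data; we only want it read from disk
    std::fclose(f);
}

int main()
{
    const char* fillerFiles[] = { "C:\\fillers\\big1.bin", "C:\\fillers\\big2.bin" };
    for (size_t i = 0; i < sizeof(fillerFiles) / sizeof(fillerFiles[0]); ++i)
        ReadWholeFile(fillerFiles[i]);
    return 0;
}

Note that this only evicts data from the OS file cache; it does nothing about the drive's own on-board buffer, as pointed out above.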

@Morten Christiansen said:
One way to make apps cold-start faster (sort of) is the approach used by e.g. Adobe Reader: load some of the program's files at system startup, thereby hiding the cold start from the users. This is only usable if the program is not supposed to start up immediately.
That makes the customer pay for initializing our app at every boot, even when it isn't used. I really don't like that option (and neither does Raymond).

One successful way to speed up application startup is to switch your DLLs to delay-load. This is a low-cost change (some fiddling with project settings) but can make startup significantly faster. Afterwards, run depends.exe in profiling mode to figure out which DLLs are loaded during startup anyway, and revert the delay-load on those. Remember that you may also delay-load most of the Windows DLLs you need.
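Outside the project-settings UI, the change boils down to a linker switch; a minimal sketch (the DLL and object names here are just examples):

rem You still link against the DLL's import library; delayimp.lib supplies the delay-load helper.
link myapp.obj heavy.lib delayimp.lib /DELAYLOAD:heavy.dll /OUT:myapp.exe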

A very effective technique for improving application cold launch time is optimizing function link ordering.
The Visual Studio linker lets you pass in a file that lists all the functions in the module being linked (or just some of them; it doesn't have to be all of them), and the linker will place those functions next to each other in memory.
When your application is starting up, there are typically calls to init functions throughout your application. Many of these calls will be to a page that isn't in memory yet, resulting in a page fault and a disk seek. That's where slow startup comes from.
Optimizing your application so all these functions are together can be a big win.
Check out Profile Guided Optimization in Visual Studio 2005 or later. One of the things that PGO does for you is function link ordering.
It's a bit difficult to work into a build process, because with PGO you need to link, run your application, and then re-link with the output from the profile run. This means your build process needs to have a runtime environment, deal with cleaning up after bad builds, and all that, but the payoff is typically a 10% or more faster cold launch with no code changes.
There's some more info on PGO here:
http://msdn.microsoft.com/en-us/library/e7k32f4k.aspx
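As a rough sketch of both approaches (the file names and decorated function names are invented, and the exact switches depend on your Visual Studio version):

rem Function order list: one decorated name per line in startup_functions.txt, e.g.
rem   ?InitRenderer@@YAXXZ
rem   ?LoadSettings@@YAXXZ
link *.obj /ORDER:@startup_functions.txt /OUT:myapp.exe

rem PGO workflow (VS2005-era link-time code generation switches):
cl /GL /c *.cpp
link *.obj /LTCG:PGINSTRUMENT /OUT:myapp.exe
rem Run a representative cold-start scenario to generate the profile data:
myapp.exe
link *.obj /LTCG:PGOPTIMIZE /OUT:myapp.exe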

As an alternative to a function order list, you can simply group the code that will be called at startup into its own sections:
#pragma code_seg(".startUp")   // functions defined from here on are placed in the .startUp section
//... startup-only functions
#pragma code_seg               // revert to the default code section
#pragma data_seg(".startUp")   // same idea for data that is only touched at startup
//... startup-only data
#pragma data_seg               // revert to the default data section
It should be easy to maintain as your code changes, and it has the same benefit as a function order list.
I am not sure whether a function order list can specify global variables as well, but using #pragma data_seg as above simply works for that.

One way to make apps cold-start faster (sort of) is the approach used by e.g. Adobe Reader: load some of the program's files at system startup, thereby hiding the cold start from the users. This is only usable if the program is not supposed to start up immediately.
On another note, .NET 3.5 SP1 supposedly has much improved cold-start speed, though by how much, I cannot say.

It could be the NICs (LAN cards): your app may depend on certain other services that require the network to come up. Profiling your application alone may not tell you this, so you should also examine your application's dependencies.

If your application is not very complicated, you can just copy all the executables to another directory and run them from there; it should be similar to a reboot. (Cut and paste doesn't seem to work; Windows is smart enough to know that a file moved to another folder is still cached in memory.)

Related

Does process termination automatically free all memory used? Any reason to do it explicitly?

In Windows NT and later I assume that when a process expires, either because it terminated itself or was forcefully terminated, the OS automatically reclaims all the memory used by the process. Are there any situations in which this is not true? Is there any reason to free all the memory used by a user-mode application explicitly?
Whenever a process ends, all memory pages mapped to it are returned to the available state. This could qualify as "reclaiming the memory," as you say. However, it does not do things such as running destructors (if you are using C++).
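A small illustration of that distinction (the class is made up): the OS reclaims the heap pages either way, but the destructor's side effects only happen if you clean up yourself.

#include <cstdio>

struct TempFileCleaner
{
    ~TempFileCleaner() { std::puts("deleting temp files"); }  // the side effect we care about
};

int main()
{
    TempFileCleaner* cleaner = new TempFileCleaner;
    // If main() returned here without the delete below, the process would exit, the OS
    // would reclaim the heap pages, but ~TempFileCleaner would never run and the
    // temp files would be left behind.
    delete cleaner;  // explicit cleanup: the destructor runs
    return 0;
}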
I highly recommend freeing all memory, not from a resources perspective, but from a development perspective. Trying to free up memory encourages you to think about the lifespan of memory, and helps you make sure you actually clean up properly.
This does not matter in the short run, but I have dealt with countless software programs which assumed that they owned the process, so didn't have to clean up after themselves. However, there are a lot of reasons to want to run a program in a sandbox. Many randomized testing scenarios can run much faster if they don't have to recreate the process every time. I've also had several programs which thought they would be standalone, only to find a desire to integrate into a larger software package. At those times, we found out all the shortcuts that had been taken with memory management.

What could cause the application as well as the system to slow down?

I am debugging an application which slows down the system very badly. The application loads a large amount of data (some 1,000 files, each about half a megabyte) from the local hard disk. The files are loaded as memory-mapped files and are mapped only when needed. This means that at any given point in time the virtual memory usage does not exceed 300 MB.
I also checked the handle count using handle.exe from Sysinternals and found that at most some 8,000-odd handles are open. When the data is unloaded it drops to around 400. There are no handle leaks after each load/unload operation.
After 2-3 load/unload cycles, during one load, the system becomes very slow. I checked the virtual memory usage of the application as well as the handle count at this point, and both were well within limits (VM about 460 MB, not much fragmentation either; handle count 3,200).
I want to know how an application could make the system very slow to respond. What other tools can I use to debug this scenario?
Let me be more specific: by "system" I mean the whole of Windows slows down. Task Manager itself takes 2 minutes to come up, and most often a hard reboot is required.
The fact that the whole system slows down is very annoying: it means you cannot easily attach a profiler, and it also makes it difficult to stop the profiling session in order to view the results (since you said it requires a hard reboot).
The best tool suited for the job in this situation is ETW (Event Tracing for Windows). These tools are great and will give you the exact answer you are looking for.
Check them out here:
http://msdn.microsoft.com/en-us/library/cc305210.aspx
and
http://msdn.microsoft.com/en-us/library/cc305221.aspx
and
http://msdn.microsoft.com/en-us/performance/default.aspx
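If you have the Windows Performance Toolkit installed, a typical capture looks roughly like this (the kernel flag group and trace file name are just one reasonable choice):

xperf -on DiagEasy
rem ... reproduce the slowdown ...
xperf -d slowdown.etl
rem then open slowdown.etl in xperfview / Windows Performance Analyzer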
Hope this works.
Thanks
Tools you can use at this point:
Perfmon
Event Viewer
In my experience, when things happen to a system that prevent Task Manager from popping up, they're usually of the hardware variety -- the system event log in Event Viewer is often full of warnings or errors indicating that some hardware device is timing out.
If Event Viewer doesn't indicate that any kind of loggable hardware error is causing the slowdown, then try Perfmon -- add counters for system objects to track file read, exceptions, context switches etc. per second and see if there's something obvious there.
Frankly, the sort of behavior described is meant to be impossible - by design - for user-mode code to cause. Windows NT and its successors go to a lot of effort to insulate applications from each other and to prevent rogue applications from making the system unusable. So my suspicion is that some kind of hardware fault is to blame. Is there any chance you can simply run the same test on a different PC?
If you don't have profilers, you may have to do the same work by hand...
Have you tried commenting out all read/write operations, just to check whether the slowdown disappears?
"Divide and conquer" strategies will help you find where the problem lies.
If you run it under an IDE, run it until it gets real slow, then hit the "pause" button. You will catch it in the act of doing whatever takes so much time.
You can use tools like IBM Rational Quantify or Intel VTune to detect performance issues.
[EDIT]
As Benoît did, one good approach is measuring how long each task takes, to identify which one is eating the CPU.
But remember: as you are working with many files, it is likely cache misses that cause the swapping of memory to disk.
When Task Manager is taking 2 minutes to come up, are you getting a lot of disk activity, or is it CPU-bound?
I would try Process Explorer from Sysinternals. When your system is in the slowed-down state and you try running, say, Notepad, pay attention to the page fault deltas.
Windows is very greedy about caching file data. I would try removing file I/O as someone suggested, and also making sure you close the file mapping as soon as you are done with a file.
I/O is probably causing your slowdown, especially if your files are on the same disk as the OS. Another way to test that would be to move your files to another disk and see if that alleviates the problem.
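Tying into the advice above about closing file mappings promptly, here is a minimal sketch of mapping one file at a time and releasing both the view and the mapping as soon as you are done (error handling trimmed, function name made up):

#include <windows.h>

void ProcessOneFile(const wchar_t* path)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return;

    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mapping != NULL)
    {
        const void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
        if (view != NULL)
        {
            // ... read whatever is needed from the mapped data ...
            UnmapViewOfFile(view);   // release the view as soon as the data is consumed
        }
        CloseHandle(mapping);        // then the mapping object
    }
    CloseHandle(file);               // and the file handle itself
}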

Large application/file load-time

I'm sure many have noticed that when you have a large application (i.e. something requiring a few MBs of DLLs) it loads much faster the second time than the first time.
The same happens if you read a large file in your application. It's read much faster after the first time.
What affects this? I suppose it is the hard-drive cache, or is the OS adding some memory caching of its own?
What techniques do you use to speed up the loading times of large applications and files?
Thanks in advance
Note: the question refers to Windows
Added: What affects the cache size of the OS? In some apps, files load slowly again after a minute or so; does the cache fill up within a minute?
Two things can affect this. The first is hard-disk caching (done by the disk itself, which has little impact, and by the OS, which tends to have more impact). The second is that Windows (and other OSes) have little reason to unload DLLs when they're finished with them unless the memory is needed for something else. This is because DLLs can easily be shared between processes.
So DLLs have a habit of hanging around even after the applications that were using them disappear. If another application decides the DLL is needed, it's already in memory and just has to be mapped into the process's address space.
I've seen some applications pre-load their required DLLs (the feature is usually called QuickStart; I think both MS Office and Adobe Reader do this) so that the perceived load times are better.
Windows's memory manager is actually pretty slick -- it services memory requests AND acts as the disk cache. With enough free memory on the system, lots of files that have been recently accessed will reside in memory. Until the physical memory is needed, those DLLs will remain in cache -- all courtesy of the Cache Manager.
As far as how to help: look into delay-loading your DLLs. You get the advantage of calling LoadLibrary only when needed, but automatically, so you don't have to sprinkle LoadLibrary/GetProcAddress calls throughout your code. (Well, automatic in the sense that you just add a linker command switch):
http://msdn.microsoft.com/en-us/library/yx9zd12s.aspx
Or you could pre-load like Office and others do (as mentioned above), but I personally hate that -- it slows down the computer at initial boot-up.
I see two possibilities:
Preload your libraries at system startup, as already mentioned. Office, OpenOffice and others do just that. I am not a great fan of that solution: it makes your boot time longer and eats lots of memory.
Load your DLLs dynamically (see LoadLibrary) only when needed; unfortunately this is not possible with every DLL. For example, why load at startup a DLL that exports files in XYZ format when you are not sure it will ever be needed? Load it when the user selects that export format (see the sketch after this list). I have a dream where Adobe Acrobat uses this approach, instead of bogging me down with loads of plugins I never use every time I want to display a PDF file!
Depending on your needs you might have to use both techniques: preload some big, heavily used libraries, and load only specific plugins on demand...
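A minimal sketch of the on-demand approach from the second point (the DLL and export names are invented for illustration):

#include <windows.h>

typedef BOOL (WINAPI *ExportToXyzFn)(const wchar_t* path);

BOOL ExportFileAsXyz(const wchar_t* path)
{
    // Load the export DLL only when the user actually picks the XYZ format.
    HMODULE dll = LoadLibraryW(L"xyzexport.dll");
    if (dll == NULL)
        return FALSE;

    ExportToXyzFn exportFn = (ExportToXyzFn)GetProcAddress(dll, "ExportToXyz");
    BOOL ok = (exportFn != NULL) ? exportFn(path) : FALSE;

    FreeLibrary(dll);   // optionally unload again once the export is finished
    return ok;
}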
One item that might be worth looking at is "rebasing". Each DLL has a preset "base" address at which it prefers to be loaded into memory. If an application loads the DLL at a different address (because the preferred one is not available), the DLL is loaded at the new address and "rebased". Roughly speaking, this means that parts of the DLL are patched on the fly. This only applies to native images, as opposed to .NET assemblies.
This really old MSDN article covers rebasing:
http://msdn.microsoft.com/en-us/library/ms810432.aspx
Not sure whether much of it still applies (it's a very old article)... but here's an enticing quote:
Prefer one large DLL over several small ones; make sure that the operating system does not need to search for the DLLs very long; and avoid many fixups if there is a chance that the DLL may be rebased by the operating system (or, alternatively, try to select your base addresses such that rebasing is unlikely).
By the way, if you're dealing with .NET then NGEN'ing your app/DLLs should help speed things up (NGEN = native image generation).
Yep, anything read in from the hard drive is cached, so it will load faster the second time. The basic assumption is that it's rare to use a large chunk of data from the HD only once and then discard it (this is usually a good assumption in practice). Typically I think it's the operating system (kernel) that implements the cache, taking up a chunk of RAM to do so, although modern hard drives also have some built-in cache of their own. (I once wrote a small kernel as an academic project; caching of HD data in memory was one of its features.)
One additional factor which affects program startup time is SuperFetch, introduced with Windows Vista (Windows XP had a simpler prefetcher). Essentially it monitors disk access during program startup, recognizes file access patterns and then attempts to "bunch up" the required data for quicker access (e.g. by rearranging the data sequentially on disk according to its load order).
As the others mentioned, generally speaking any read operation is likely to be cached by the Windows disk cache, and reused unless the memory is needed for other operations.
NGEN'ing the assemblies might help with startup time; however, runtime performance might be affected (sometimes NGEN'd code is not as optimal as JIT-compiled code).
NGENing can be done in the background as well: http://blogs.msdn.com/davidnotario/archive/2005/04/27/412838.aspx
Here's another good article NGen and Performance http://msdn.microsoft.com/en-us/magazine/cc163808.aspx
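For reference, a minimal sketch of NGEN use from an SDK/Visual Studio command prompt ("MyApp.exe" is a placeholder):

ngen install MyApp.exe
rem Or defer the work to the NGEN service for idle-time compilation:
ngen install MyApp.exe /queue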
The system cache is used for anything that comes off disk. That includes file metadata, so if you are using applications that open a large number of files (say, directory scanners), then you can easily flush the cache if you also have applications running that eat up a lot of memory.
For the stuff I work on, I prefer to use a small number of large files (from 64 MB up to 1 GB) and asynchronous, unbuffered I/O. And a good ol' defrag every once in a while.
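For what it's worth, the "asynchronous, unbuffered" part is mostly a matter of two CreateFile flags; a sketch (the file name is a placeholder, and reads must then be sector-aligned and go through OVERLAPPED structures):

#include <windows.h>

// Open a large data file for asynchronous reads that bypass the system cache.
HANDLE h = CreateFileW(L"data.pak", GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);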

Can you start and stop BoundsChecker (DevPartner)?

I'm trying to use boundschecker to analyze a rather complex program. Running the program with boundschecker is almost too slow for it to be of any use since it takes me almost a day to run the program to the point in the code where I suspect the issue exists. Can anyone give me some ideas for how to inspect only certain parts of my software using boundschecker (DevPartner) in Visual Studio 2005?
Thanks in advance for all your help!
I last used BoundsChecker a few years ago, and had the same problems. With large projects, it makes everything run so slowly that it is useless. We ended up ditching it.
But we still needed some of its functionality, though, like you, not for the whole program. So we had to do it ourselves.
In our case, we mainly used it to try and track down memory leaks. If that's your objective as well, there are other options.
1. Visual Studio does a pretty good job of telling you about memory leaks when your program exits.
2. It reports leaks in the order that they were created.
3. It will tell you exactly where the leaked memory was created if your source files have this at the top:
#ifdef _DEBUG
#undef THIS_FILE
static char THIS_FILE[]=__FILE__;
#define new DEBUG_NEW
#endif
Those help a lot, but they're often not enough. Adding that snippet everywhere isn't always feasible, and if you use factory classes, knowing where memory was allocated doesn't help at all. So when all else fails, we take advantage of #2.
Add something like the following:
#define LEAK(str) { char* s = (char*)malloc(100); strcpy(s, str); }  // deliberately leaks a small marker string (keep str under 100 characters)
Then, pepper your code with "LEAK("leak1");" or whatever. Run the program, and exit it. Your new leaked strings will display in Visual Studio's leak dump surrounding the existing leaks. Keep adding/moving your LEAK statements and re-running the program to narrow your search until you've pinpointed the exact location. Then fix the leak, remove your debugging leaks, and you're all set!
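A complementary CRT debug-heap trick (not BoundsChecker; the allocation number below is hypothetical): the leak dump prints an allocation number like {1234}, and you can ask the debug heap to break on exactly that allocation on the next run, provided the allocation order is reproducible.

#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

int main()
{
    _CrtSetBreakAlloc(1234);   // allocation number taken from a previous leak dump
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);  // dump leaks at exit
    // ... run the program as usual ...
    return 0;
}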
BoundsChecker tracks all memory allocations and releases in extreme detail. It knows, for instance, that such and such a memory allocation was done from the C runtime heap, which in turn was taken from a Win32 heap, which in turn started life as memory allocated by VirtualAlloc. If the application was instrumented (FinalCheck), it also has detailed information as to which pointers reference the memory.
This is one reason (of many) why the thing is slow.
If BC were to connect to an application late, it would have none of this data built up, and would have to either (1) dig it all up at once, or (2) start guessing about things. Neither solution is very practical.
One way to lighten up BoundsChecker is to exclude all but the few modules you are interested in from instrumentation. I know that's not great, because if you knew where the leak was you wouldn't need BoundsChecker. What I usually recommend is that you use BC's Active Check mode first, with only Memory Tracking enabled. You miss the API validations, but you can always rerun those separately. After you run Active Check and get clues about which modules tend to be problematic, only then do you enable instrumentation for the module or modules of interest and their dependencies.
We know Final Check is annoyingly slow, but as Mistiano correctly states, with Final Check BC keeps not only a graph of all allocated blocks but also of all pointers and contexts referencing them. Therein lies the magic of how Final Check can nail leaks and corruptions at the point of occurrence, not just at application shutdown or on a fault.
Shameless plug: I work on the DevPartner team. We are releasing DPS 10.5 on February 4, 2011, with x64 application support in BC. Unlike the relatively ancient and undersold BC64 for Itanium, which only provided Active Check, DPS 10.5 provides full Final Check support for x64 applications, both for pure C++ and for native modules running in .NET processes. See microfocus.com under MF Developer for details.

Finding GDI/User resource usage from a crash dump

I have a crash dump of an application that is supposedly leaking GDI objects. The app is running on XP and I have no problems loading the dump into WinDbg to look at it. Previously we used the Gdikdx.dll extension to look at GDI information, but this extension is not supported on XP or Vista.
Does anyone have any pointers for finding GDI object usage in WinDbg?
Alternatively, I do have access to the failing program (and its stress-testing suite), so I can reproduce the problem on a running system if you know of any 'live' debugging tools for XP and Vista (or Windows 2000, though this is not our target).
I've spent the last week working on a GDI leak finder tool. We also perform regular stress testing, and it never lasted longer than a day's worth without stopping due to USER/GDI object handle overconsumption.
My attempts have been pretty successful, as far as I can tell. Of course, I spent some time beforehand looking for an alternative, quicker solution. It is worth mentioning that I had some previous, semi-lucky experience with the GDILeaks tool from the MSDN article mentioned above. Not to mention that I had to solve a few problems before putting it to work, and this time it just didn't give me what I wanted, the way I wanted it. The downside of their approach is the heavyweight debugger interface (it slows down the profiled target by orders of magnitude, which I found unacceptable). Another downside is that it did not work all the time - on some runs I simply could not get it to report or compute anything! Its complexity (judging by the amount of code) was another scare-away factor. I'm not a big fan of GUIs, as it is my belief that I'm more productive with no windows at all ;o). I also found it hard to make it find and use my symbols.
One more tool I used before settling on writing my own was LeakBrowser.
Anyway, I finally settled on an iterative approach to achieve the following goals:
minor performance penalties
implementation simplicity
non-invasiveness (used for multiple products)
relying as much as possible on what was already available
I used Detours (under its non-commercial license) for the core functionality (it is an injectable DLL). I put JavaScript to use for automatic code generation (a 15K script generating 100K of source code - no way I'd code that manually, and no C preprocessor involved!), plus a WinDbg extension for data analysis and snapshot/diff support.
To make a long story short: after I was finished, it was a matter of a few hours to collect information during another stress test, and another hour to analyze it and fix the leaks.
I'll be more than happy to share my findings.
P.S. I did spend some time trying to improve on the previous work. My intention was to minimize false positives (I've seen just about too many of those while developing), so the tool also checks for allocation/release consistency and avoids taking into account allocations that are never leaked.
Edit: Find the tool here
There was an MSDN Magazine article from several years ago that talked about GDI leaks. It points to several different places with good information.
In WinDbg, you may also try the !poolused command for some information.
Finding resource leaks from a crash dump (post-mortem) can be difficult -- if it was always the same place, using the same variable, that leaked the memory, and you're lucky, you could see the last place where it was leaked, etc. It would probably be much easier with a live program running under the debugger.
You can also try using Microsoft Detours, but the license doesn't always work out. It's also a bit more invasive and advanced.
I have created a WinDbg script for that. See my answer to:
Command to get GDI handle count from a crash dump
To track the allocation stack you could set a ba (break-on-access) breakpoint just past the last allocated GDICell object, to break at the point when another GDI allocation happens. That could be a bit complex because the address changes, but it could be enough to find pretty much any leak.
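For live runs (rather than the dump), one cheap option is to log the process's GDI/USER handle counts periodically during the stress test with GetGuiResources and watch for steady growth; a sketch:

#include <windows.h>
#include <cstdio>

void LogGuiHandleCounts()
{
    // Counts of GDI and USER objects owned by the current process.
    DWORD gdiCount  = GetGuiResources(GetCurrentProcess(), GR_GDIOBJECTS);
    DWORD userCount = GetGuiResources(GetCurrentProcess(), GR_USEROBJECTS);
    std::printf("GDI objects: %lu, USER objects: %lu\n", gdiCount, userCount);
}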

Resources