Program is slower when compiled - performance

Any suggestions on why a VB6 program would be slower when compiled than when running in the debugger? I'm compiling it with "Optimize for fast code."
Notes:
I measure performance by running the compiled version and the non-compiled version on the same machine. I based the comparison on wall-clock time, since 30 minutes vs. 100 minutes is a big enough difference to be visible.

Several months ago, I configured a debugging tool to attach itself to my program whenever it ran. I totally forgot that I had done this.
Special thanks to Process Monitor for making this very obvious.
Turning it off made the program run fast.
AppVerifier, for those who are curious.

You should select the compile to Native Code option
The compile to P-code option forces your program to run in an interpreted mode, which can be slower.
There are some optimizations in the advanced section. Try them out too.
Some more points to consider:
Are you running the compiled application in the same environment? Is it taking the same data as input?
How do you know that it is slow? What if your timing code is wrong?

How do you measure the performance?
It is hard to judge the performance from what you have described. You have to ensure the running environment is exactly the same before comparing performance.
Are you running on the same machine? Do you connect to a DB? Does the DB have the same workload on each run? You need to isolate these other factors before reaching such a conclusion.

Related

Visual Studio 2012 - measuring speed of a few instructions

In my code I have a few instructions that connect to my database. I would like to know how fast they execute. I know that I can write some kind of timer, but pasting that code into a few places and then removing it after the measurement will surely leave some mess.
I want to know whether VS2012 has any tool to help me with that, or whether there is an add-on for it.
Just a quick one-off use of the Stopwatch class can give you insight.
Do beware that such a test isn't actually that useful. It will repeat very poorly; database connection times depend heavily on network overhead and database server load. And worst of all, there just isn't anything you can do about it in your code. Spending time profiling code that you cannot improve is not a very productive use of your time.
You might actually want to leave that code in place so that the user has some idea why the program is performing poorly. Whether that is useful is hard to tell.
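If the calling code happens to be native C++ rather than .NET (the question doesn't say; Stopwatch is the .NET class), the same one-off measurement is a few lines with std::chrono. This is only a sketch; runQuery is a hypothetical stand-in for the database statement being timed:

#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the database statement being timed.
static void runQuery()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}

int main()
{
    const auto start = std::chrono::steady_clock::now();
    runQuery();
    const auto stop = std::chrono::steady_clock::now();

    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::cout << "query took " << ms << " ms\n";
}

Wrapping the two now() calls in a small RAII timer class keeps the clutter out of the places being measured, and the caveat above still applies: the numbers will vary from run to run with network and server load.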

Testing perceived performance

I recently got a shiny new development workstation. The only disadvantage of this is that the desktop apps I'm developing now run very, very fast, and so I fear that parts of the code that would be annoyingly slow on end users' machines will go unnoticed during my testing.
Is there a good way to slow down an application for testing? I've tried searching around, but all of the results I've been able to find seem pretty fiddly to set up (e.g., manually setting up a high-priority CPU-bound task on the same CPU core as the target app, or running a background process that rapidly interrupts and resumes the target app), and I don't know if the end result is actually a good representation of running on a slower computer (with its slower CPU, slower RAM, slower disk I/O...).
I don't think that this is a job for a profiler; I'm interested in the user's perception of end-to-end performance rather than in where the time goes for particular operations.
Set up a virtual machine, give it as little RAM as needed, and you can also have it use 1, 2, or more CPUs. I like VirtualBox myself. Install your app and test with different RAM configurations.
Personally, I'd get an old used crappy computer that is typical of what the users have and test on that. It should be cheap and you will see pretty fast how bad things are.
I think the only way to deal with this is through proper end-user testing, i.e. get yourself a "typical" system for testing and use that to identify any perceptible performance bottlenecks.
You can try out either Virtual PC or VMWare Player/Workstation, load an OS onto it, and then throttle back the resources. I know that with any of those tools you can reduce the memory to whatever you'd like. You can also specify the number of cores you want to use. You might even be able to adjust the clock speed in VMWare Workstation... I'm not sure.
I upvoted SQLMenace's answer, but I also think that profiling needs to be mentioned: no matter how quickly the code executes, you'll still see what's taking the most time. If you find yourself with some free time, profiling and investigating the results is a good way to spend it.

Finding GDI/User resource usage from a crash dump

I have a crash dump of an application that is supposedly leaking GDI objects. The app is running on XP and I have no problems loading the dump into WinDbg to look at it. Previously we have used the Gdikdx.dll extension to look at GDI information, but this extension is not supported on XP or Vista.
Does anyone have any pointers for finding GDI object usage in WinDbg?
Alternatively, I do have access to the failing program (and its stress-testing suite), so I can reproduce the problem on a running system if you know of any 'live' debugging tools for XP and Vista (or Windows 2000, though this is not our target).
I've spent the last week working on a GDI leak finder tool. We also perform regular stress testing, and it never lasted longer than a day's worth without stopping due to user/GDI object handle overconsumption.
My attempts have been pretty successful as far as I can tell. Of course, I spent some time beforehand looking for an alternative, quicker solution. It is worth mentioning that I had some previous semi-lucky experience with the GDILeaks tool from the MSDN article mentioned in another answer. Not to mention that I had to solve a few problems before putting it to work, and this time it just didn't give me what I wanted, or how I wanted it. The downside of their approach is the heavyweight debugger interface (it slows down the target under investigation by orders of magnitude, which I found unacceptable). Another downside is that it did not work all the time; on some runs I simply could not get it to report or compute anything! Its complexity (judging by the amount of code) was another scare-away factor. I'm not a big fan of GUIs, as it is my belief that I'm more productive with no windows at all ;o). I also found it hard to make it find and use my symbols.
One more tool I used before setting out to write my own was leakbrowser.
Anyway, I finally settled on an iterative approach to achieve the following goals:
minor performance penalties
implementation simplicity
non-invasiveness (used for multiple products)
relying as much as possible on what was already available
I used Detours (non-commercial use) for the core functionality (it is an injectable DLL), put JavaScript to use for automatic code generation (a 15K script generates 100K of source code; there is no way I would code this manually, and no C preprocessor is involved!), plus a WinDbg extension for data analysis and snapshot/diff support.
To make a long story short: after I was finished, it took a matter of a few hours to collect information during another stress test and another hour to analyze it and fix the leaks.
I'll be more than happy to share my findings.
P.S. I did spend some time trying to improve on the previous work. My intention was to minimize false positives (I've seen far too many of those while developing), so the tool also checks for allocation/release consistency and avoids counting allocations that are never leaked.
Edit: Find the tool here
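For the curious, the Detours-based core of such a tool amounts to hooking the GDI allocation and release APIs from an injected DLL and keeping counters (and call stacks) per object type. The sketch below is not the author's tool, just a minimal illustration of the idea: it hooks only CreatePen and reports a raw allocation count when the DLL unloads. A real leak finder would also hook CreateSolidBrush, CreateCompatibleDC, DeleteObject, and friends, and record stacks for the snapshot/diff analysis described above.

#include <windows.h>
#include <detours.h>   // Microsoft Detours; link with detours.lib

// Pointer to the real CreatePen; Detours redirects it through a trampoline.
static HPEN (WINAPI *TrueCreatePen)(int style, int width, COLORREF color) = CreatePen;
static volatile LONG g_penAllocs = 0;

// Hook: count pen allocations, then call the original function.
static HPEN WINAPI HookedCreatePen(int style, int width, COLORREF color)
{
    InterlockedIncrement(&g_penAllocs);
    return TrueCreatePen(style, width, color);
}

BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID)
{
    if (reason == DLL_PROCESS_ATTACH) {
        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourAttach(&(PVOID&)TrueCreatePen, HookedCreatePen);
        DetourTransactionCommit();
    } else if (reason == DLL_PROCESS_DETACH) {
        char msg[64];
        wsprintfA(msg, "pens allocated: %ld\n", g_penAllocs);
        OutputDebugStringA(msg);

        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourDetach(&(PVOID&)TrueCreatePen, HookedCreatePen);
        DetourTransactionCommit();
    }
    return TRUE;
}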
There was an MSDN Magazine article from several years ago that talked about GDI leaks. It points to several different places with good information.
In WinDbg, you may also try the !poolused command for some information.
Finding resource leaks from a crash dump (post-mortem) can be difficult; if the leak always happens in the same place, using the same variable, and you're lucky, you could see the last place where it was leaked, etc. It would probably be much easier with a live program running under the debugger.
You can also try using Microsoft Detours, but the license doesn't always work out. It's also a bit more invasive and advanced.
I have created a WinDbg script for that. Look at the answer to:
Command to get GDI handle count from a crash dump
To track the allocation stack, you could set a ba (Break on Access) breakpoint past the last allocated GDICell object, so that you break just at the point when another GDI allocation happens. That can be a bit complex because the address changes, but it could be enough to find pretty much any leak.

Comparing cold-start to warm start

Our application takes significantly more time to launch after a reboot (cold start) than if it was already opened once (warm start).
Most (if not all) of the difference seems to come from loading DLLs; when the DLLs are in cached memory pages they load much faster. We tried using ClearMem to simulate rebooting (since it's much less time-consuming than actually rebooting) and got mixed results: on some machines it seemed to simulate a reboot very consistently, and on others it did not.
To sum up my questions are:
Have you experienced differences in launch time between cold and warm starts?
How have you dealt with such differences?
Do you know of a way to dependably simulate a reboot?
Edit:
Clarifications for comments:
The application is mostly native C++ with some .NET (the first .NET assembly that's loaded pays for the CLR).
We're looking to improve load time; obviously we did our share of profiling and improved the hotspots in our code.
Something I forgot to mention was that we got some improvement by re-basing all our binaries so the loader doesn't have to do it at load time.
As for simulating reboots, have you considered running your app from a virtual PC? Using virtualization you can conveniently replicate a set of conditions over and over again.
I would also consider some type of profiling app to spot the bit of code causing the time lag, and then making the judgement call about how much of that code is really necessary, or if it could be achieved in a different way.
It would be hard to truly simulate a reboot in software. When you reboot, all devices in your machine get their reset bit asserted, which should cause all memory system-wide to be lost.
In a modern machine you've got memory and caches everywhere: there's the VM subsystem which is storing pages of memory for the program, then you've got the OS caching the contents of files in memory, then you've got the on-disk buffer of sectors on the hard drive itself. You can probably get the OS caches to be reset, but the on-disk buffer on the drive? I don't know of a way.
How did you profile your code? Not all profiling methods are equal and some find hotspots better than others. Are you loading lots of files? If so, disk fragmentation and seek time might come into play.
Maybe even sticking basic timing information into the code, writing out to a log file and examining the files on cold/warm start will help identify where the app is spending time.
Without more information, I would lean towards filesystem/disk cache as the likely difference between the two environments. If that's the case, then you either need to spend less time loading files upfront, or find faster ways to load files.
Example: if you are loading lots of binary data files, speed up loading by combining them into a single file, then do a slurp of the whole file into memory in one read and parse the contents. Fewer disk seeks and less time spent reading off of disk. Again, maybe that doesn't apply.
I don't know offhand of any tools to clear the disk/filesystem cache, but you could write a quick application to read a bunch of unrelated files off of disk to cause the filesystem/disk cache to be loaded with different info.
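A quick sketch of such a cache-scrubbing helper, assuming C++17 for std::filesystem; the directory path is only a placeholder and should point at a few gigabytes of files unrelated to your application:

#include <filesystem>
#include <fstream>
#include <iostream>
#include <vector>

// Read every regular file under 'root' so the filesystem cache gets
// filled with unrelated data before the next "cold" start measurement.
int main()
{
    const std::filesystem::path root = "C:\\LargeUnrelatedData";   // placeholder path
    std::vector<char> buffer(1 << 20);                             // 1 MB read chunks
    unsigned long long bytesTouched = 0;

    const auto options = std::filesystem::directory_options::skip_permission_denied;
    for (const auto& entry : std::filesystem::recursive_directory_iterator(root, options)) {
        if (!entry.is_regular_file())
            continue;
        std::ifstream file(entry.path(), std::ios::binary);
        while (file.read(buffer.data(), static_cast<std::streamsize>(buffer.size())))
            bytesTouched += static_cast<unsigned long long>(file.gcount());
        bytesTouched += static_cast<unsigned long long>(file.gcount());
    }
    std::cout << "touched " << bytesTouched << " bytes\n";
}

This only displaces the OS file cache; as noted in another answer, the drive's own on-disk buffer is beyond its reach.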
@Morten Christiansen said:
One way to make apps start cold-start faster (sort of) is used by e.g. Adobe reader, by loading some of the files on startup, thereby hiding the cold start from the users. This is only usable if the program is not supposed to start up immediately.
That makes the customer pay for initializing our app at every boot even when it isn't used; I really don't like that option (neither does Raymond).
One successful way to speed up application startup is to switch DLLs to delay-load. This is a low-cost change (some fiddling with project settings) but can make startup significantly faster. Afterwards, run depends.exe in profiling mode to figure out which DLLs load during startup anyway, and revert the delay-load on those. Remember that you may also delay-load most Windows DLLs you need.
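Concretely, delay-loading a DLL is the /DELAYLOAD linker option plus the delay-load helper library; on the command line it amounts to something like the following, where heavydependency.dll is just a placeholder name:

link myapp.obj /DELAYLOAD:heavydependency.dll delayimp.lib

In the IDE, the equivalent settings are under the project's Linker > Input properties.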
A very effective technique for improving application cold launch time is optimizing function link ordering.
The Visual Studio linker lets you pass in a file that lists all the functions in the module being linked (or just some of them; it doesn't have to be all of them), and the linker will place those functions next to each other in memory.
When your application is starting up, there are typically calls to init functions throughout your application. Many of these calls will be to a page that isn't in memory yet, resulting in a page fault and a disk seek. That's where slow startup comes from.
Optimizing your application so all these functions are together can be a big win.
Check out Profile Guided Optimization in Visual Studio 2005 or later. One of the things that PGO does for you is function link ordering.
It's a bit difficult to work into a build process, because with PGO you need to link, run your application, and then re-link with the output from the profile run. This means your build process needs a runtime environment and has to deal with cleaning up after bad builds and all that, but the payoff is typically a 10% or more faster cold launch with no code changes.
There's some more info on PGO here:
http://msdn.microsoft.com/en-us/library/e7k32f4k.aspx
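Roughly, the PGO flow with that era of toolchain looks like the sketch below (file names are placeholders; the instrumented run writes .pgc count files that the final link consumes):

cl /O2 /GL myapp.cpp                 (compile with whole-program optimization)
link /LTCG:PGINSTRUMENT myapp.obj    (produce an instrumented build)
myapp.exe                            (run the cold-launch scenario to collect counts)
link /LTCG:PGOPTIMIZE myapp.obj      (re-link using the collected profile data)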
As an alternative to a function order list, just group the code that will be called at startup into the same sections:
#pragma code_seg(".startUp")
// ... functions that run during startup go here ...
#pragma code_seg
#pragma data_seg(".startUp")
// ... data used during startup goes here ...
#pragma data_seg
It should be easy to maintain as your code changes, but has the same benefit as the function order list.
I am not sure whether a function order list can specify global variables as well, but using #pragma data_seg simply works.
One way to make apps start cold-start faster (sort of) is used by e.g. Adobe reader, by loading some of the files on startup, thereby hiding the cold start from the users. This is only usable if the program is not supposed to start up immediately.
Another note, is that .NET 3.5SP1 supposedly has much improved cold-start speed, though how much, I cannot say.
It could be the NICs (LAN cards) and the fact that your app depends on certain other services that require the network to come up. Profiling your application alone may not quite tell you this, so you should examine the dependencies of your application.
If your application is not very complicated, you can just copy all the executables to another directory; it should be similar to a reboot. (Cut and paste does not seem to work; Windows is smart enough to know that a file moved to another folder is still cached in memory.)

Operating System Overheads while profiling?

I am profiling C code in Microsoft VS 2005 on an Intel Core 2 Duo platform.
I measure the time (secs:millisecs) consumed by my function. But I have some doubts about the accuracy of this measurement, as the operating system will not run my application continuously; it schedules other apps/services in between the execution of my code. (Although I have no major applications running while I do the profile run, Windows still has plenty of code of its own that it runs by preempting my app.) Because of all this, I believe the profiling number (the time taken by my app to run) is not accurate.
So my question is: is there any way to find out the operating system's scheduling overhead on a typical Windows system (I run Windows XP)? E.g., if my application says it ran for 60 milliseconds, how much of that 60 msec was really used by my app, and how much of it was spent sitting idle because the OS had preempted it in favor of some other scheduled task?
or
At least, is there any ballpark number for such OS overhead, based on your experience doing something similar?
@Kogus: Even if I run outside the debugger (as a standalone app from a command prompt), it could still be preempted by the OS, causing an incorrect measurement of the time consumed by my app.
Isn't that so?
-AD
I think you are going to have some problems with the granularity. See similar questions GetLocalTime() API time resolution and Is gettimeofday() guaranteed to be of microsecond resolution?
Also, you may want to take a look at the Windows Resource Kit Tools, which include timeit.exe (similar to time on Unix/Linux) to give you elapsed and process times.
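On Windows, one way around the granularity problem is QueryPerformanceCounter, which typically resolves to well under a microsecond. A minimal sketch (the loop is only a placeholder workload; this is still wall-clock time and includes any time spent preempted):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   /* counts per second */
    QueryPerformanceCounter(&start);

    /* placeholder workload; put the function being measured here */
    volatile long sum = 0;
    for (long i = 0; i < 10000000; ++i)
        sum += i;

    QueryPerformanceCounter(&stop);
    double us = (stop.QuadPart - start.QuadPart) * 1000000.0 / freq.QuadPart;
    printf("elapsed: %.1f microseconds\n", us);
    return 0;
}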
Suggestion: try running on multi-CPU systems.
The best way of doing this is a dedicated profiling tool. There are lots out there. I haven't used one for C for a few years; someone else will hopefully be able to give better advice. As you are using Visual Studio 2005, this might be a good place to start:
AQ, but I've never used it.
1 - Put some debug logging in your code (include timestamps of course), and run it outside of the debugger
2 - Run again in the debugger
3 - Repeat many times, to get statistically valid data.
4 - Compare.
If there is a significant difference in the average execution time of the standalone vs. the debugger, then you are right to be suspicious of the OS (or the overhead of the debugger hooks themselves...). If no difference, then don't sweat it.
Edit0: Obviously the debug messages have some overhead of their own. You may want to leave those in the code even when you are running from the debugger. That way, both the standalone and the debugger are running the very same code.
Edit1: I misunderstood the question. I thought your concern was that, while debugging, the OS might interrupt your app more frequently than in a normal mode of execution. If you want to know how much time your app actually spent working, just compare the elapsed time to the "CPU Time" in Task Manager.
Edit2: Compare the time returned by GetProcessTimes for your process to the actual execution time. The difference is the time spent by the CPU on somebody else.
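A minimal sketch of that comparison, with Sleep standing in for the real workload: the kernel+user total from GetProcessTimes is CPU time actually charged to your process, and the gap to wall-clock time is time spent preempted or blocked (GetTickCount is coarse, so treat small differences as noise):

#include <windows.h>
#include <stdio.h>

/* Convert a FILETIME (100-nanosecond units) to milliseconds. */
static double FileTimeToMs(FILETIME ft)
{
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart / 10000.0;
}

int main(void)
{
    DWORD wallStart = GetTickCount();

    Sleep(100);   /* placeholder workload: mostly idle, so CPU time will be tiny */

    DWORD wallMs = GetTickCount() - wallStart;

    FILETIME creationTime, exitTime, kernelTime, userTime;
    GetProcessTimes(GetCurrentProcess(), &creationTime, &exitTime, &kernelTime, &userTime);
    double cpuMs = FileTimeToMs(kernelTime) + FileTimeToMs(userTime);

    printf("wall clock: %lu ms, CPU time: %.1f ms, not running: %.1f ms\n",
           (unsigned long)wallMs, cpuMs, wallMs - cpuMs);
    return 0;
}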
