Performance issue between builds - C++11

I've been developing a small indie game in my spare time and have run across an inexplicable issue: some builds of the game will randomly run several hundred frames per second slower than other builds. For example, when rendering some text and no 3D scene, I can achieve 1800 FPS on my own hardware. Add one 3D sphere (10k verts, pixel shaded): 1700 FPS. Add two more spheres: 800 FPS. Remove all spheres: 1100 FPS, even though the code is now rendering the same scene (just the FPS counter) that I previously got 1800 FPS on. I've tried cleaning and rebuilding the project and restarting the compiler. This is in Release mode and I turned on all the optimizations I could find. Any suggestions as to the cause?
I ran a quick profile, and Visual Studio seems to think that over 90% of my time was spent in D3D9_43.dll, suggesting that it's not a bug in my app, though that still doesn't explain why it manifests in only some builds.
I rebooted my machine and it's back up to 1800 FPS. I think it's a bug in the DirectX SDK tools (amongst many others). Going to delete this question.

I don't know if MSVC does this, but GCC does:
When GCC cannot determine which branch of a condition is the most likely, it essentially throws the dice.
If MSVC does the same, it may be that in each build an important branch is being predicted one way or the other, and that makes the difference.
You can fix that with a PGO (profile-guided optimization) build: the compiler instruments the code, you run the instrumented build, and the compiler then lays out the branches according to the behavior it observed. At least, the predictions will be correct if your test run is a representative sample.
That said, the results are not usually so dramatic. If you had more objects in the scene and more code involved, the differences would even out.
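The answer above recommends PGO; as a smaller-scale illustration of the same idea, here is a hedged sketch (not from the original question) of handing the compiler a branch hint manually with GCC's __builtin_expect (C++20 later added [[likely]]/[[unlikely]] for the same purpose):

    #include <cstdio>

    // Sketch: telling the compiler which way a hot branch usually goes, so it
    // can lay out the fast path contiguously instead of "throwing the dice".
    #if defined(__GNUC__)
    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #else
    #define LIKELY(x)   (x)
    #endif

    void render_object(bool visible)   // hypothetical example function
    {
        if (LIKELY(visible)) {
            std::puts("draw");   // hot path: kept in the fall-through position
        } else {
            std::puts("skip");   // cold path: can be moved out of the hot stream
        }
    }

A PGO build achieves the same effect automatically, for every branch, based on measured frequencies rather than a programmer's guess.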

Another possibility: CPU speed scaling.
If your program spends most of its time executing on the GPU, CPU usage may not rise high enough to push the CPU into full speed.
Try setting Windows power management to High performance instead of Balanced, and see if it changes anything.

Related

Unreasonable CPU consumption for server build with nographics

I have built my game in server mode on Mac OS and attached the profiler to it. In the profiler I can see unreasonably high CPU load.
Other scripts take a lot of CPU time. How can this be optimized?
Otherwise, VSync takes a lot of time. How can that be, since I have disabled VSync, built as a server, run with -nographics, and even removed the cameras and UI?
I don't know what your game is doing in terms of calculations from custom scripts, but if it's not just an empty project, then I don't see why 33ms is unreasonable. Check your server's specs, maybe? Also, VSync just idles for the needed amount of time to reach the fixed target FPS, meaning it's not actually under load, even though the profiler shows it as a big lump of color. Think of it more as headroom: how much processing you can still do per frame and keep your target FPS.

Profile Build vs Normal Build: CPU Usage?

Short Version:
Before the TL;DR section, my main question is this: what is different between building to profile with Instruments and a regular build that would result in my app's CPU load being reduced by over 200%?
When building to run, it uses well over 200% CPU as reported by Activity Monitor, but with everything else the same, when building for profiling with the Time Profiler, the CPU load drops to under 5%, which is a dramatic (orders of magnitude) difference.
TL;DR Version:
As an exercise to learn Cocoa, Swift and DSP (yes all three at once), I am working on writing a simple radio scanner OS X application using the cheap rtl-sdr dongles.
I have written a simple Swift wrapper around librtlsdr, a simple UI to be able to set the frequency, and a couple of simple DSP routines. My wrapper around librtlsdr uses an NSOperationQueue, and my DSP routines use GCD queues, in order to move the I/O- and CPU-intensive routines off the main thread / queue.
Currently, everything is working to the extent that I can successfully demodulate an AM transmission.
I have implemented a simple low-pass FIR filter and while working on the algorithm, I was surprised when I realized that I couldn’t use much more than about 30 coefficients before my filter routine started taking too long and the audio became choppy. As well, Activity Monitor shows up to 300% CPU usage for my app, which seems crazy high considering my filter contains nothing but a nested loop to do some multiply and accumulate operations. Anything higher than about 40 coefficients and the UI becomes unresponsive.
For the DSP-minded: it's a decimating filter where I use the entire sample set for filtering (960,000 sps) but only compute the output samples I need for the rate reduction (48,000), using a rectangular-windowed sinc function for the coefficients, precomputed. Not the most efficient algorithm, but on my quad-core i7 MacBook Pro and iMac, it should still scream.
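For readers trying to picture the hot loop, here is a minimal sketch of a decimating FIR of the kind described (in C++ rather than the poster's Swift; all names are illustrative):

    #include <cstddef>
    #include <vector>

    // Decimating FIR: one output sample per `decimation` input samples, each
    // output being a dot product of the coefficients with the recent input.
    // At 48,000 outputs/sec and ~40 taps that is only ~2 million
    // multiply-accumulates per second, which is why 300% CPU is surprising.
    std::vector<float> decimating_fir(const std::vector<float>& input,
                                      const std::vector<float>& coeffs,
                                      std::size_t decimation)
    {
        std::vector<float> output;
        output.reserve(input.size() / decimation);
        for (std::size_t n = coeffs.size(); n < input.size(); n += decimation) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < coeffs.size(); ++k)
                acc += coeffs[k] * input[n - k];   // multiply-accumulate
            output.push_back(acc);
        }
        return output;
    }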
To get some insight into where my program was using up all the CPU cycles, I decided to give Instruments a go. Product->Profile, choosing the Time Profiler, and running my app gave me some interesting information.
1) My filter routine was NOT using the most CPU cycles.
2) Activity Monitor showed that my app wasn't even at 5% CPU usage.
So I decided to find out how far I could push things before I saw any stress on the CPU, and I got up to a 50,000-tap filter before the audio started to be noticeably choppy and the CPU usage went close to 300%. So, to recap: normal build and run, I max out at about 35-40 filter taps; profile build and run, I max out at about 50,000 filter taps.
Also worthy of note: while profiling with 50,000 filter taps, the UI still responds instantly and I can change frequency and start / stop the radio, and the audio is merely choppy. During a normal run, the UI starts to freeze as soon as I start the radio, with no audio at all, and that happens after I get to only about 50 taps.
Again, why the dramatic difference in CPU usage between running while profiling and running a standard build? What's different, aside from the elevated privileges for Instruments, and what do I need to do to make the profiled behavior the normal behavior for my app?
JE
This is all about build configurations. When you profile an app with Xcode, it gets built with optimizations, because Xcode uses the "Release" build configuration for profiling. As the name suggests, the "Release" configuration is also used for your final product, which is therefore always a build optimized for speed. The default "Debug" build configuration, which comes into play when you build and run your app in Xcode by pressing ⌘R, doesn't apply any compiler optimizations. This is the reason why your app is slower when not being profiled.
You can learn more about build configurations here: https://developer.apple.com/library/mac/recipes/xcode_help-project_editor/Articles/BasingBuildConfigurationsonConfigurationFiles.html#//apple_ref/doc/uid/TP40010155-CH13-SW1

Drastic performance improvement in .NET CF after app gets moved out of the foreground. Why?

I have a large (500K lines) .NET CF (C#) program, running on CE6/.NET CF 3.5 (v.3.5.10181.0). This is running on a Freescale i.MX31 (ARM) @ 400MHz. It has 128MB RAM, with ~80MB available to applications. My app is the only significant one running (this is a dedicated, embedded system). Managed memory in use (as reported by GC.Collect) is about 18MB.
To give a better idea of the app size, here are some stats culled from .NET CF Remote Performance Monitor after starting up the application:
GC:
Garbage Collections 131
Bytes Collected by GC 97,919,260
Managed Bytes in use after GC 17,774,992
Total Bytes in use after GC 24,117,424
GC Compactions 41
JIT:
Native Bytes Jitted: 10,274,820
Loader:
Classes Loaded 7,393
Methods Loaded 27,691
Recently, I have been trying to track down a performance problem. I found that my benchmark after running the app in two different startup configurations would run in approximately 2 seconds (slow case) vs. 1 second (fast case). In the slow case, the time for the benchmark could change randomly from EXE run to EXE run from 1.1 to 2 seconds, but for any given EXE run, would not change for the life of the application. In other words, you could re-run the benchmark and the time for the test stays the same until you restart the EXE, at which point a new time is established and consistent.
I could not explain the 1.1 to 2x slowdown via any conventional mechanism, or by narrowing the slowdown to any particular part of the benchmark code. It appeared that the overall process was just running slower, almost like a thread was spinning and taking away some of "my" CPU.
Then, I randomly discovered that just by switching away from my app (the GUI loses the foreground) to another app, my performance issue disappears. It stays gone even after returning my app to the foreground. I now have a tentative workaround where my app after startup launches an auxiliary app with a 1x1 size window that kills itself after 5ms. Thus the aux app takes the foreground, then relinquishes it.
The question is, why does this speed up my application?
I know that code gets pitched when a .NET CF app loses the foreground. I also notice that when performing a "GC Heap" capture with .NET CF Remote Performance Monitor, a Code Pitch is logged, and this also triggers the performance improvement in my app. So I suspect that code pitching is somehow related to, or even responsible for, fixing the performance. But I'm at a loss as to how to determine whether that is really the case, or how to explain why pitching code could help in this way. Does pitching out lots of code somehow improve the locality of reference of code pages (which are re-JITted, presumably near each other in memory) enough to help this much? (My benchmark spans probably three dozen routines and hundreds of lines of code.)
Most importantly, what can I do in my app to reliably avoid this slower condition? Any pointers to relevant .NET CF / JIT / code pitching information would be greatly appreciated.
Your app going to the background auto-triggers a GC.Collect, which collects, may compact the GC Heap and may pitch code. Have you checked to see if a manual GC.Collect without going to the background gives the same behavior? It might not be pitching that's giving the perf gain, it might be collection or compaction. If a significant number of dead roots are swept up, walking the root tree may be getting faster. Can't say I've specifically seen this issue, so this is all conjecture.
Send your app a WM_HIBERNATE before your performance loop. It will clean things up.
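For illustration, the suggestion amounts to something like the following native sketch (a .NET CF app would P/Invoke the same SendMessage call); the window handle is hypothetical, and 0x3FF is the value conventionally used for WM_HIBERNATE on Windows CE:

    #include <windows.h>

    // WM_HIBERNATE is the Windows CE "release what memory you can" message;
    // define it if the SDK headers in use don't expose it.
    #ifndef WM_HIBERNATE
    #define WM_HIBERNATE 0x03FF
    #endif

    // Ask the app's own top-level window to hibernate before a timing-sensitive
    // loop, prompting the runtime to collect, compact and pitch code up front.
    void RequestHibernate(HWND hwndApp)   // hwndApp: the app's main window
    {
        SendMessage(hwndApp, WM_HIBERNATE, 0, 0);
    }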
We have a similar issue with our .NET CF application.
Over time, our application progressively slows down, eventually to a halt, with what I anticipate is high CPU load or, as @wil-s says, as if a thread is spinning and consuming CPU. The only assumption / conclusion I've come to so far is that either we have a rogue thread in our code, or there's an under-the-covers issue in .NET CF, maybe with the JITter.
Closing the application and re-launching returns our application to normal expected performance.
I have yet to implement a code change to test issuing WM_HIBERNATE or launching a dummy app that quits itself (as above) to force a code pitch, but I'm fairly sure this will resolve our issue based on the above comments (so many thanks for that).
However, I'm subsequently interested to know whether a root cause was ever found for this specific issue?
Incidentally, and seemingly off topic (but bear with me), we're using a Freescale i.MX28 processor and are experiencing unpredictable FlashDisk corruption: we're seeing 2K blocks of 0xFFs (erased blocks) in random files located on NAND flash.
I'm mentioning this as I now believe the CPU and FlashDisk corruption issues are linked, due to this article as well as this one:
https://electronics.stackexchange.com/questions/26720/flash-memory-corruption-due-to-electricals
In the article, @jwygralak67 comments:
I recently worked through a flash corruption issue, on a WinCE system, as part of a development team. We would sporadically find 2K blocks of flash that were erased (all bytes 0xFF). For about 6 months we tested everything from ESD, to dirty power, to EMI and RFI interference; we bought brand new devices and tracked usage to make sure we weren't exceeding the erase cycle limit and burning out blocks; we went through our (application level) software with a fine-toothed comb.
In the end it turned out to be an obscure bug in the very low level flash driver code, which only occurred under periods of heavy CPU load. The driver came from a 3rd party. We informed them of the issue we found, but I don't know if they ever released a patch.
Unfortunately, we're yet to make contact with him.
With all of this in mind, if we can work around the high CPU load, then maybe the corruption will no longer be present. Another case of conjecture!
On that assumption, however, this still doesn't give a firm root cause for either situation, which is what I'm desperately seeking!
Any knowledge or insight, however small, would be very gratefully received.
@ctacke - we've spoken before regarding OpenNETCF via email, so I'm pleased to see your name!

Qt 4.6.x under MacOS/X: widget update performance mystery

I'm working on a Qt-based MacOS/X audio metering application, which contains audio-metering widgets (potentially a lot of them), each of which is supposed to be updated every 50ms (i.e. at 20Hz).
The program works, but when lots of meters are being updated at once, it uses up lots of CPU time and can bog down (spinny-color-wheel, oh no!).
The strange thing is this: Originally this app would just call update() on the meter widget whenever the meter value changed, and therefore the entire meter-widget would be redrawn every 50ms. However, I thought I'd be clever and compute just the area of the meter that actually needs to be redrawn, and only redraw that portion of the widget (e.g. update(x,y,w,h), where y and h are computed based on the old and new values of the meter). However, when I implemented that, it actually made CPU usage four times higher(!)... even though the app was drawing 50% fewer pixels per second.
Can anyone explain why this optimization actually turns out to be a pessimization? I've posted a trivial example application that demonstrates the effect, here:
http://www.lcscanada.com/jaf/meter_test.zip
When I compile (qmake;make) the above app and run it like this:
$ ./meter.app/Contents/MacOS/meter 72
Meter: Using numMeters=72 (partial updates ENABLED)
... top shows the process using ~50% CPU.
When I disable the clever-partial-updates logic, by running it like this:
$ ./meter.app/Contents/MacOS/meter 72 disable_partial_updates
Meter: Using numMeters=72 (partial updates DISABLED)
... top shows the process using only ~12% CPU. Huh? Shouldn't this case take more CPU, not less?
I tried profiling the app using Shark, but the results didn't mean much to me. FWIW, I'm running Snow Leopard on an 8-core Xeon Mac Pro.
GPU drawing is a lot faster than letting the CPU calculate which part to redraw (at least this applies for OpenGL; I have the OpenGL SuperBible, and it states that OpenGL is built to redraw everything, not to draw deltas, as that is potentially a lot more work to do). Even if you use software rendering, the libraries are highly optimized to do their job properly and fast. So just redrawing is the state of the art.
FWIW top on my Linux box shows ~10-11% without partial updates and 12% using partial updates. I had to request 400 meters to get that CPU usage though.
Perhaps it's just that the overhead of Qt setting up a paint region actually dwarfs your paint time? After all, your painting is really simple; it's just two rectangular fills.
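For illustration only (the poster's real code is in the linked zip), the two update paths being compared boil down to something like this sketch; MeterWidget and its members are hypothetical names:

    #include <QWidget>
    #include <QtGlobal>

    // Hypothetical meter widget showing the two update strategies discussed.
    class MeterWidget : public QWidget
    {
    public:
        explicit MeterWidget(bool partialUpdates, QWidget *parent = 0)
            : QWidget(parent), m_level(0.0f), m_partialUpdates(partialUpdates) {}

        void setLevel(float newLevel)
        {
            const float oldLevel = m_level;
            m_level = newLevel;
            if (m_partialUpdates) {
                // "Clever" path: invalidate only the band between the old and
                // new meter positions (fewer pixels, but a more complex region).
                const int yOld = levelToY(oldLevel);
                const int yNew = levelToY(newLevel);
                update(0, qMin(yOld, yNew), width(), qAbs(yOld - yNew) + 1);
            } else {
                // Simple path: invalidate the whole widget every time.
                update();
            }
        }

    private:
        int levelToY(float level) const
        {
            // Map a 0..1 level to a y coordinate (0 = top of the widget).
            return int((1.0f - level) * (height() - 1));
        }

        float m_level;
        bool m_partialUpdates;
    };

If the answer above is right, the cost difference comes not from the pixels painted but from the per-update overhead of building and clipping to the smaller paint region.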

Compact Framework and JIT. How long could it take?

We have/had a phantom delay in our app. This was traced to the initialisation of a singleton when the object was touched for the first time, and it was blamed on JIT. I'm not utterly convinced by this, as there is no mechanism for measuring JIT (or is there?) and the entire delay was seven seconds. Seven seconds of JIT?!? Could that be for real?
Either way, I have difficulty blaming things that one cannot easily measure. When I had a glance at the issue a while back, I commented out a bunch of code and watched the seven-second delay "jump" elsewhere in the app. This suggests it is somehow happening in a background process somewhere (and I guess this would count JIT in as a potential cause).
Just for fun: if there were a static object that happened to reference a lot of other objects, does anyone have a rule of thumb for how long the JIT might take? Does anyone have further references so I can understand more about the JIT, so I stand a chance of learning whether or not JIT is/was to blame for this slowdown?
I've only seen JIT take a really long time (greater than 1 second) in a weird bug that had to do with templated items inside a templated collection (see edit below).
At any rate, the fact you see it "move" definitely indicates to me that it probably isn't the issue. To try to determine this definitively I'd look at using RPM to see what's happening right before and after the delay.
Expected JIT time is a really nebulous thing, since there are so many factors that can affect it. Processor speed is an obvious one, but less obvious might be things like app storage media and device memory pressure.
Storage media can affect JIT speed because the JITter has to pull the IL from the media when it needs to JIT it, and if pulling it is slow, then JITting it will be slow.
Memory pressure is a tough one, and it can have serious repercussions on a CE device. The issue here is that when you start running out of memory, the EE will start pitching JITted code during collection - everything but the call stack. Now if you're in a routine that, for example, calls out to some worker or helper code, or has a thread running, then that helper method could be getting pitched, JITted, pitched, JITted, etc. This is referred to as "thrash."
Identifying the latter is fairly easy with RPM (fixing it may not be so easy): watch whether the amount of code pitched rises frequently, and look for a strong correlation between a rise in the number of pitches and your perceived lock-ups.
Edit: I finally found the bug description here.
JIT (and GC) timers etc. can be found here:
Performance Counters in the .NET Compact Framework
(http://msdn.microsoft.com/en-us/library/ms172525.aspx)
Monitoring Application Performance on the .NET Compact Framework Part I - Enabling performance counters (http://blogs.msdn.com/davidklinems/archive/2005/10/04/476988.aspx)
Analyzing Device Application Performance with the .Net Compact Framework Remote Performance Monitor (http://blogs.msdn.com/stevenpr/archive/2006/04/17/577636.aspx)
Performance Counters in the .NET Framework
(http://msdn.microsoft.com/en-us/library/w8f5kw2e(VS.80).aspx)
Regards,
tamberg
