I'm working on a Qt-based MacOS/X audio metering application, which contains audio-metering widgets (potentially a lot of them), each of which is supposed to be updated every 50ms (i.e. at 20Hz).
The program works, but when lots of meters are being updated at once, it uses up lots of CPU time and can bog down (spinny-color-wheel, oh no!).
The strange thing is this: Originally this app would just call update() on the meter widget whenever the meter value changed, and therefore the entire meter-widget would be redrawn every 50ms. However, I thought I'd be clever and compute just the area of the meter that actually needs to be redrawn, and only redraw that portion of the widget (e.g. update(x,y,w,h), where y and h are computed based on the old and new values of the meter). However, when I implemented that, it actually made CPU usage four times higher(!)... even though the app was drawing 50% fewer pixels per second.
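For reference, the partial-update rectangle described above can be computed roughly as follows. This is a minimal sketch, not the code from the test app: `meter_dirty_rect` and its parameters are hypothetical names, and it assumes a vertical meter filled from the bottom, meter values normalized to 0.0 to 1.0, and a top-left origin with y growing downward.

```python
def meter_dirty_rect(old_value, new_value, width, height):
    """Return (x, y, w, h) of the band that changed, or None if nothing changed.

    Assumptions (not from the original app): values are in [0.0, 1.0], the
    meter fills from the bottom, and y grows downward from the top edge.
    """
    def fill_top(value):
        # Clamp the value, then convert it to the y coordinate of the fill's top edge.
        clamped = min(max(value, 0.0), 1.0)
        return height - int(round(clamped * height))

    old_top = fill_top(old_value)
    new_top = fill_top(new_value)
    if old_top == new_top:
        return None  # Same pixel row: no repaint needed.

    # Only the band between the old and new fill levels has changed.
    y = min(old_top, new_top)
    h = abs(old_top - new_top)
    return (0, y, width, h)


# A 20x200 meter rising from 25% to 75% only needs its middle band repainted.
print(meter_dirty_rect(0.25, 0.75, 20, 200))
```

The returned tuple corresponds to the x, y, w, h arguments of the update(x,y,w,h) call mentioned above.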
Can anyone explain why this optimization actually turns out to be a pessimization? I've posted a trivial example application that demonstrates the effect, here:
http://www.lcscanada.com/jaf/meter_test.zip
When I compile (qmake;make) the above app and run it like this:
$ ./meter.app/Contents/MacOS/meter 72
Meter: Using numMeters=72 (partial updates ENABLED)
... top shows the process using ~50% CPU.
When I disable the clever-partial-updates logic, by running it like this:
$ ./meter.app/Contents/MacOS/meter 72 disable_partial_updates
Meter: Using numMeters=72 (partial updates DISABLED)
... top shows the process using only ~12% CPU. Huh? Shouldn't this case take more CPU, not less?
I tried profiling the app using Shark, but the results didn't mean much to me. FWIW, I'm running Snow Leopard on an 8-core Xeon Mac Pro.
GPU drawing is a lot faster than having the CPU calculate which part to redraw. (At least this holds for OpenGL: I have the book OpenGL SuperBible, and it states that OpenGL is built to redraw, not to draw deltas, since that is potentially a lot more work.) Even if you use software rendering, the libraries are highly optimized to do their job properly and fast. So simply redrawing everything is the state of the art.
FWIW top on my Linux box shows ~10-11% without partial updates and 12% using partial updates. I had to request 400 meters to get that CPU usage though.
Perhaps it's just that the overhead of Qt setting up a paint region actually dwarfs your paint time? After all your painting is really simple, it's just two rectangular fills.
I have a windowed WinAPI/OpenGL app. The scene is drawn rarely (compared to games) in WM_PAINT, mostly triggered by user input such as WM_MOUSEMOVE, clicks, etc.
I noticed that when the app has been idle (the user is not moving the scene with the mouse) and some mouse action then starts, the first frame is drawn with an unpleasant delay of around 300 ms. The following frames are fast again.
I implemented a 100 ms timer that only calls InvalidateRect, which is later followed by WM_PAINT and a scene draw. This "fixed" the problem, but I don't like this solution.
I'd like to know why this is happening, along with some tips on how to tackle it.
Does the OpenGL rendering context save resources when it is not used? Or could this be caused by some system behaviour, like processor underclocking or power saving? (Although I noticed that the processor runs underclocked even when the app is under "load".)
This sounds like the Windows virtual memory system at work. The total memory use of all active programs is usually greater than the amount of physical memory installed on your system, so Windows swaps out idle processes to disc according to whatever rules it follows, such as the relative priority of each process and how long it has been idle.
You are preventing the swap out (and delay) by artificially making the program active every 100ms.
If a swapped out process is reactivated, it takes a little time to retrieve the memory content from disc and restart the process.
It's unlikely that OpenGL is responsible for this delay.
You can improve the situation by starting your program with a higher priority.
https://superuser.com/questions/699651/start-process-in-high-priority
You can also use the VirtualLock function to prevent Windows from swapping out part of the memory, but this is not advisable unless you REALLY know what you are doing!
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366895(v=vs.85).aspx
EDIT: You can certainly improve things by adding more memory; 4GB sounds low for a modern PC, especially if you run Chrome with multiple tabs open.
If you want to be scientific before spending any hard-earned cash :-), then open Performance Monitor and look at Cache Faults/sec. This will show the swap activity on your machine. (I have 16GB on my PC, so this number is mostly very low.) To make sure you learn something, I would check Cache Faults/sec before and after the memory upgrade, so you can quantify the difference!
Finally, there is nothing wrong with the solution you found already: kick-starting the graphics app every 100ms or so.
The problem was the NVIDIA driver's global 3D setting "Power management mode".
Options "Optimal Power" and "Adaptive" save power and cause the problem.
Only "Prefer Maximum Performance" does the right thing.
Short Version:
Before the TL;DR section, my main question is this: what is different about building to profile with Instruments, compared to a regular build, that would reduce my app's CPU load from over 200% to under 5%?
When building to run, it uses well over 200% CPU as reported by activity monitor, but with everything else the same, when building for profiling, using the Time Profiler, it reduces the CPU load down to <5%, which is a dramatic (orders of magnitude) difference.
TL;DR Version:
As an exercise to learn Cocoa, Swift and DSP (yes all three at once), I am working on writing a simple radio scanner OS X application using the cheap rtl-sdr dongles.
I have written a simple Swift wrapper around librtlsdr, a simple UI to be able to set the frequency, and a couple of simple DSP routines. My wrapper around librtlsdr uses an NSOperationQueue and my DSP routines use GCD queues in order to move the IO and CPU intense routines off the main thread / queue.
Currently, everything is working to the extent that I can successfully demodulate an AM transmission.
I have implemented a simple low-pass FIR filter and while working on the algorithm, I was surprised when I realized that I couldn’t use much more than about 30 coefficients before my filter routine started taking too long and the audio became choppy. As well, Activity Monitor shows up to 300% CPU usage for my app, which seems crazy high considering my filter contains nothing but a nested loop to do some multiply and accumulate operations. Anything higher than about 40 coefficients and the UI becomes unresponsive.
For the DSP minded, it's a decimating filter: I use the entire input sample set (960000 sps) for filtering, but only compute the output samples that I need for the reduced rate (48000 sps), using precomputed coefficients from a rectangular-windowed sinc function. Not the most efficient algorithm, but on my quad-core i7 MacBook Pro and iMac, it should still scream.
To get some insight into where my program was using up all the CPU cycles, I decided to give Instruments a go. Product->Profile, choosing the Time Profiler and running my app gave me some interesting information.
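To make the filter structure concrete, here is a minimal sketch of that kind of decimating FIR. It is an illustration, not my actual Swift code: the function names, the 31-tap length and the 0.025 cutoff (24 kHz at 960 kHz) are assumptions chosen for the example. The point is the nested multiply-accumulate loop that only runs for every 20th input position.

```python
import math


def sinc_lowpass_taps(num_taps, cutoff):
    """Rectangular-windowed sinc low-pass coefficients.

    cutoff is a fraction of the input sample rate (0 < cutoff < 0.5).
    """
    center = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - center
        if t == 0:
            taps.append(2.0 * cutoff)  # Limit of sin(2*pi*fc*t) / (pi*t) at t = 0.
        else:
            taps.append(math.sin(2.0 * math.pi * cutoff * t) / (math.pi * t))
    return taps


def decimating_fir(samples, taps, factor):
    """Filter samples with taps, computing only every factor-th output sample."""
    out = []
    # Start at the first position where a full window of input is available,
    # then step by the decimation factor so skipped outputs cost nothing.
    for end in range(len(taps) - 1, len(samples), factor):
        acc = 0.0
        for k, coeff in enumerate(taps):
            acc += coeff * samples[end - k]  # Multiply and accumulate.
        out.append(acc)
    return out


# 960000 sps -> 48000 sps is a decimation factor of 20.
taps = sinc_lowpass_taps(31, 0.025)
output = decimating_fir([1.0] * 1000, taps, 20)
print(len(output))
```

With this structure the cost per second is (output rate) x (number of taps) multiply-accumulates, so 48000 outputs with 40 taps is under two million operations per second.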
1) My filter routine was NOT using the most CPU cycles.
2) Activity Monitor showed that my app wasn’t even at 5% CPU usage
So I decided to find out how far I can stress things before I see any stress on the CPU and I was up to a 50,000 tap filter before it started to be noticeably choppy and the CPU usage went close to 300%. So… to recap, normal build and run, I max out at about 35-40 filter taps; profile build and run, I max out at about 50,000 filter taps.
Also worthy of note, while profiling with 50,000 filter taps, the UI still responds instantly and I can change frequency, start / stop the radio and it has choppy audio. During a normal run, the UI starts to freeze just as soon as I start the radio with no audio, and that happens after I get to only about 50 taps.
Again, why the dramatic difference in CPU usage between running while profiling and running just a standard build? What’s different, aside from the elevated privileges for Instruments, and what do I need to do to make it the normal behavior for my app?
JE
This is all about build configurations. When you profile an app with Xcode, it gets built with optimizations, because Xcode uses the "release" build configuration for profiling. As the name suggests, the "release" config is also used for your final product, which is therefore always a build optimized for speed. The default "debug" build configuration, which comes into play when you build your app in Xcode by pressing ⌘R, doesn't apply any compiler optimizations. This is the reason why your app is slower when not being profiled.
You can learn more about build configurations here: https://developer.apple.com/library/mac/recipes/xcode_help-project_editor/Articles/BasingBuildConfigurationsonConfigurationFiles.html#//apple_ref/doc/uid/TP40010155-CH13-SW1
I'm using MATLAB R2014b on Win 8.1. I have a figure with two sub-plots. The data for the first sub-plot is about 700,000 points; the second is about 50,000 points. When I display it or manipulate it in any way (zoom, say), there's a huge lag, up to about 30 seconds. Obviously I'd like to improve performance. Here's what I know:
If I break it into 4 plots, each covering 1/4 of the data, performance is fast. Much more than 4 times as fast. The difference seems exponential.
A colleague (running R2014A I believe) has a machine that should be slower but in fact the figure displays with near-realtime speed.
The problem is perhaps how the figure is being rendered. I ran MATLAB's "opengl info" and it reports that the Software flag is false. That should mean it's using the display's hardware rendering.
So maybe the display adapter isn't set quite right. My machine (it's a Lenovo laptop) has two display adapters: Intel HD Graphics 3000 and NVIDIA NVS 4200M. I don't know why there are both or whether there are any relevant settings.
Any thoughts on how to proceed?
It could be that you're running it through your integrated graphics processor (Intel HD Graphics 3000) rather than your dedicated graphics processor (NVIDIA NVS 4200M). If your Lenovo has "switchable graphics" enabled, you should be able to switch to the NVIDIA, or check that you are indeed rendering through it. Right-click on your power manager in the taskbar. If you see a menu item that says "switchable graphics," you can change it to your NVIDIA. Note that you'll have to close MATLAB to switch.
It does sound like a slowdown caused by the rendering configuration. When you run opengl info in MATLAB, what device is listed as "Renderer"?
If you don't need to manipulate it (say you only want an image file), you can always create your figure with figure('Visible','Off') and save it without actually showing the figure on screen.
MATLAB releases ever since R2014b use a new graphics engine which is known to be extremely slow with large data sets; see for example
http://www.mathworks.com/matlabcentral/newsreader/view_thread/337755
The solution has nothing to do with graphics drivers etc. Revert back to MATLAB R2014a and stay there.
I wrote a function, plotECG, that enables you to show plots with millions of samples. It includes sliders for quick scrolling and zooming.
If you have multiple timeseries and want them to be displayed in a synchronized way, you can pass them as matrix all at once and define the key 'AutoStackSignals', followed by a cell array of strings with the signal's names. Then, the signals are shown one below the other in the same axis with the corresponding name as YTickLabel.
https://de.mathworks.com/matlabcentral/fileexchange/59296
A fresh XNA game project application consumes quite some CPU percentage while its window is active. On my desktop PC it's about 30% of one core of a 2-core processor. When the window loses focus, the game goes into idle mode and consumes about 1% of CPU.
In an image viewer application I recently made using XNA, redrawing frames when no image manipulation is going on doesn't make much sense, so I'm using the SuppressDraw() method which, as the name suggests, suppresses spending resources for drawing the next frame, showing the last drawn frame instead. But still, there's a problem where the application keeps wasting CPU for a very simple input update.
How do I reduce the CPU usage for an XNA application when it doesn't require much of it?
According to a discussion on the Xbox Live Indie Games forum, on some processors (and operating systems) XNA apparently takes up 100% CPU time on one core when the default value of Game.IsFixedTimeStep is used.
A common solution (one that worked for me as well) is to put the following in your Game constructor:
IsFixedTimeStep = false;
In my relatively short time learning OpenCL I frequently see my application cause the operating system UI to become significantly less responsive (several seconds for a window to respond to a drag for example). I have encountered this problem on Windows Vista and Mac OS X both with NVidia GPUs.
What can I do when using OpenCL on the same GPU as the display to ensure that my application does not significantly degrade the UI responsiveness like this? Also, can this be done without taking needless performance losses within my application? (I.e., if the user is not doing some UI-intensive task, then I would not expect my application to run any slower than it does now.)
I understand that any answers will be very platform specific (where platform includes OS/GPU/driver combo).
As described in Dr. David Gohara's OpenCL Tutorial Episode 6 (beginning at 43:49), graphics cards cannot be preemptively scheduled at this time. As a result, using the same graphics card both for an intensive OpenCL kernel and the UI (or other GPU-using operations) will result in clunkiness or the visual appearance of freezing. Until graphics cards get preemptively scheduled multitasking (if ever), there's no way to do exactly what you want with just a single graphics card. I don't believe this is a platform-specific issue at all.
However, this problem might be solvable by dividing the problem up. Given the relative speed of whatever single GPU is available (you'll have to do testing to find the right setup), divide up your OpenCL problem to run the kernel multiple times with different parts of the input data, and later combine the output data when all sets of kernels are complete. I would recommend creating kernel sets that take less than 100 milliseconds to run (on a given GPU) so that lag would be, if not unnoticeable, not significantly annoying (the 100 milliseconds figure is a good "rule of thumb" according to this paper).
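As a rough illustration of that splitting step, the sketch below divides a workload into slices sized to fit the 100 millisecond budget. It only does the host-side bookkeeping and makes no OpenCL calls; `split_into_chunks` and `items_per_ms` are hypothetical names, and the throughput figure is an assumption you would have to measure on the target GPU.

```python
def split_into_chunks(total_items, items_per_ms, budget_ms=100.0):
    """Return (offset, count) pairs so each chunk should finish within budget_ms.

    items_per_ms is a measured throughput for the target GPU; it is an
    assumption of this sketch and must come from your own benchmarking.
    """
    if total_items < 0 or items_per_ms <= 0 or budget_ms <= 0:
        raise ValueError("total_items must be >= 0; rates and budget must be > 0")

    # Largest number of items expected to complete inside the time budget.
    chunk_size = max(1, int(items_per_ms * budget_ms))

    chunks = []
    offset = 0
    while offset < total_items:
        count = min(chunk_size, total_items - offset)
        chunks.append((offset, count))
        offset += count
    return chunks


# 1000 work items at a measured 3 items/ms gives chunks of at most 300 items.
for offset, count in split_into_chunks(1000, 3.0):
    print(offset, count)
```

Each (offset, count) pair would then become one kernel launch over that slice of the input, with the partial outputs combined once every launch has finished.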
Based on your comment about your program being a command-line application, I assume your application will only run once at any given time, versus being a continuously running application with real-time output, as a lot of OpenCL demos are. My above answer is only satisfactory for non-continuous applications, since real-time performance isn't inherently expected. However, if your application is supposed to be continuous, the only solution currently available is to add a second, simpler graphics card that will only be used for UI.