Bad performance of CGContextDrawImage due to it calling suspicious debug functions - macos

A call to CGContextDrawImage turned out to be a bottleneck in my Mac OS X application, especially on Retina screens. I managed to mitigate it somewhat by avoiding colorspace transformations when blitting (see "Avoiding colorspace transformations when blitting, Mac OS X 10.11 SDK"), but it still seems to be slower than I would expect it to be.
When investigating the stack dump with Instruments I noticed that a lot of time was spent in two functions with highly suspicious names: vImageDebug_CheckDestBuffer, which calls into _ERROR_Buffer_Write__Too_Small_For_Arguments_To_vImage__CheckBacktrace. See the full stack dump below.
This seems to me like some sort of debug assertion? Am I running a debug version of the vImage library without realising it? Is there something I can do to stop these functions from sucking up all my precious cycles?

The performance problem was solved by making sure the beginning of the pixel data in each scan line of the source bitmap is aligned to 16 bytes. Doing this seems to make the image drawing considerably faster. Presumably this happens by default if you allocate a new image, but we wrapped a CGImage around an existing pixel buffer which wasn't aligned.
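For illustration, a minimal sketch of that fix (not the asker's actual code; the helper name, the BGRA layout and the premultiplied-alpha choice are assumptions), wrapping a CGImage around a pixel buffer whose base pointer and row stride are both 16-byte aligned:

```cpp
// Hypothetical sketch: wrap a CGImage around a caller-owned pixel buffer whose
// scan lines are 16-byte aligned, so CGContextDrawImage can take a fast path.
#include <CoreGraphics/CoreGraphics.h>
#include <cstdlib>

CGImageRef MakeAlignedImage(size_t width, size_t height)
{
    const size_t bytesPerPixel = 4;                     // 8-bit BGRA
    size_t bytesPerRow = width * bytesPerPixel;
    bytesPerRow = (bytesPerRow + 15) & ~size_t(15);     // round row stride up to 16 bytes

    void *pixels = nullptr;
    if (posix_memalign(&pixels, 16, bytesPerRow * height) != 0)  // 16-byte-aligned base
        return nullptr;

    // ... fill 'pixels' with image data here ...

    CGDataProviderRef provider = CGDataProviderCreateWithData(
        nullptr, pixels, bytesPerRow * height,
        [](void *, const void *data, size_t) { free(const_cast<void *>(data)); });

    CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
    CGImageRef image = CGImageCreate(
        width, height, 8, 32, bytesPerRow, colorSpace,
        static_cast<CGBitmapInfo>(kCGBitmapByteOrder32Little | kCGImageAlphaPremultipliedFirst),
        provider, nullptr, false, kCGRenderingIntentDefault);

    CGColorSpaceRelease(colorSpace);
    CGDataProviderRelease(provider);
    return image;
}
```

Because the base pointer is 16-byte aligned and bytesPerRow is a multiple of 16, every scan line starts on a 16-byte boundary, which is the condition described above.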

What about the graphics context (the first parameter)? Are you passing it in from another thread? What happens if you obtain the context on the main thread and then also draw the image on the main thread?

Related

Pixel Buffer Objects, glReadPixels and GL_UNSIGNED_INT_8_8_8_8_REV

I could not get pixel buffer objects (GL_PIXEL_PACK_BUFFER) to make glReadPixels work asynchronously on OS X 10.10 (well, it works, but there is no speed-up).
I switched the glReadPixels type from GL_UNSIGNED_BYTE to GL_UNSIGNED_INT_8_8_8_8_REV, and
glReadPixels dropped from 20 ms to 0.6 ms - in other words, it started to work asynchronously in a real sense.
My question is:
Will setting GL_UNSIGNED_INT_8_8_8_8_REV as a pixel format work on other mac systems or do I need to test them all?
If you want to be confident that it will perform well on all configurations, you'll have to test them all. It will often depend on the GPU vendor if a certain path is slow or fast. The result can also be different between the drivers for different GPU generations, and can even change from software release to software release.
What you're measuring in this specific example is quite odd. GL_UNSIGNED_BYTE and GL_UNSIGNED_INT_8_8_8_8_REV are actually the same format on a little endian machine. There's no good reason why one of them should be faster than the other. It's most likely just an omission when checking if a fast path can be used.
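For reference, a minimal sketch of the PBO readback pattern under discussion, assuming an existing desktop GL context; the GL_BGRA format is an assumption (the question only mentions the type change), and the function and buffer names are illustrative:

```cpp
// Hypothetical sketch of asynchronous readback through a pixel pack buffer.
#include <OpenGL/gl.h>   // macOS desktop GL headers

GLuint g_pbo = 0;

void InitReadbackPBO(int width, int height)
{
    glGenBuffers(1, &g_pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, g_pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

void AsyncReadback(int width, int height)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, g_pbo);
    // With a pack buffer bound, glReadPixels returns immediately; the last
    // argument is a byte offset into the buffer, not a client pointer.
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, nullptr);

    // ... do other CPU work here while the transfer completes ...

    // Mapping blocks only if the transfer hasn't finished yet.
    void *pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (pixels) {
        // ... use the pixel data ...
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```

The asynchrony comes from glReadPixels returning immediately once a pack buffer is bound; the cost is paid later in glMapBuffer if the transfer hasn't finished, which is why the format/type combination that avoids a CPU-side conversion matters so much.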

Possible to keep bad VRAM "occupied"?

I've got an iMac whose VRAM appears to have gone on the fritz. On boot, things are mostly fine for a while, but eventually, as more and more windows are opened (i.e. textures are created on the GPU), I eventually hit the glitchy VRAM, and I get these bizarre "noisy" grid-like patterns of red and green in the windows.
I had an idea, but I'm mostly a newb when it comes to OpenGL and GPU programming in general, so I figured I'd ask here to see if it was plausible:
What if I wrote a little app, that ran on boot, and would allocate GPU textures (of some reasonable quantum -- I dunno, maybe 256K?) until it consumed all available VRAM (i.e. can't allocate any more textures). Then have it upload a specific pattern of data into each texture. Next it would readback the texture from the GPU and checksum the data against the original pattern. If it checks out, then release it (for the rest of the system to use). If it doesn't checksum, hang onto it (forever).
Flaws I can see: a user space app is not going to be able to definitively run through ALL the VRAM, since the system will have grabbed some, but really, I'm just trying to squeeze some extra life out of a dying machine here, so anything that helps in that regard is welcome. I'm also aware that reading back from VRAM is comparatively slow, but I'm not overly concerned with performance -- this is a practical endeavor, to be sure.
Does this sound plausible, or is there some fundamental truth about GPUs that I'm missing here?
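A rough sketch of that idea might look like the following (desktop OpenGL assumed, all names illustrative). One caveat worth flagging: drivers virtualize VRAM and can page textures to system memory, so there is no guarantee this actually exhausts or pins physical VRAM.

```cpp
// Sketch of the question's approach: allocate textures, write a known pattern,
// read it back, keep only the ones whose readback is corrupted.
#include <OpenGL/gl.h>
#include <cstring>
#include <vector>

std::vector<GLuint> HoldBadVRAM(int maxTextures)
{
    const int dim = 256;                                   // 256x256 RGBA = 256 KB per texture
    std::vector<unsigned char> pattern(dim * dim * 4, 0xA5);
    std::vector<unsigned char> readback(pattern.size());
    std::vector<GLuint> badTextures;

    for (int i = 0; i < maxTextures; ++i) {
        GLuint tex = 0;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, dim, dim, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, pattern.data());
        if (glGetError() == GL_OUT_OF_MEMORY) {            // allocation failed: stop probing
            glDeleteTextures(1, &tex);
            break;
        }
        glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, readback.data());
        if (std::memcmp(pattern.data(), readback.data(), pattern.size()) != 0)
            badTextures.push_back(tex);                    // corrupted: hold on to it forever
        else
            glDeleteTextures(1, &tex);                     // checks out: release it for the system
    }
    return badTextures;                                    // keep these alive for the app's lifetime
}
```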
Your approach is interesting, although I think there are other ways that might be easier to implement if you're looking for a quick fix or workaround. If your VRAM is on the fritz, then it's likely that there is a specific location where the corruption is taking place. If you're able to determine consistently that it happens at a certain point (VRAM is consuming x amount of memory, etc.), then you can work with it.
It's quite easy to create a RAM disk, and another possibility would be to allocate regular memory for VRAM. I know both of these are very possible, because I've done it. If someone says something "won't work" (no offense Pavel), it shouldn't discourage you from at least trying. If you're interested in the techniques that I mentioned I'd be happy to provide more info, however, this is about your idea and I'd like to know if you can make it work.
If you are able to write an app that runs on boot even before an OS is loaded, that would be in the bootloader - why wouldn't you just do a self-test of the memory at that point?
Or did you mean a userland app that runs after the OS boots to the login screen? A userland app will not be able to cycle through every address the way you describe, simply because not every page is mapped directly into userland.
If you are sure that the RAM is the problem, did you try replacing the RAM?

EGL/OpenGL ES/switching context is slow

I am developing an OpenGL ES 2.0 application (using angleproject on Windows for development) that is made up of multiple 'frames'.
Each frame is an isolated application that should not interfere with the surrounding frames. The frames are drawn using OpenGL ES 2.0, by the code running inside of that frame.
My first attempt was to assign a frame buffer to each frame. But there was a problem - OpenGL's internal state is changed while one frame is drawing, and if the next frame doesn't comprehensively reset every known piece of OpenGL state, there could be side effects. This defeats my requirement that each frame should be isolated and not affect the others.
My next attempt was to use a context per frame. I created a unique context for each frame, with resource sharing, so that I can eglMakeCurrent to each frame, render each to its own frame buffer/texture, then eglMakeCurrent back to the global context to compose each texture to the final screen.
This does a great job of isolating the instances; however, eglMakeCurrent is very slow. As few as 4 calls can make it take a second or more to render the screen.
What approach can I take? Is there a way I can either speed up context switching, or avoid context switching by somehow saving the OpenGL state per frame?
I have a suggestion that may eliminate the overhead of eglMakeCurrent while allowing you to use your current approach.
The concept of the current EGLContext is thread-local. I suggest creating all contexts in your process's master thread, then creating one thread per context and passing one context to each thread. During each thread's initialization, it calls eglMakeCurrent on the context it owns and never calls eglMakeCurrent again. Hopefully, in ANGLE's implementation, the thread-local storage for contexts is implemented efficiently and does not have unnecessary synchronization overhead.
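A minimal sketch of that arrangement, assuming the contexts and surfaces were already created (with resource sharing) on the main thread; the function names and the signalling between threads are placeholders:

```cpp
// Hypothetical thread-per-context setup: each worker binds its context once
// and never calls eglMakeCurrent again.
#include <EGL/egl.h>
#include <thread>
#include <vector>

void FrameThread(EGLDisplay display, EGLSurface surface, EGLContext context)
{
    // Bind this frame's context once, for the lifetime of the thread.
    eglMakeCurrent(display, surface, surface, context);

    for (;;) {
        // ... block here until the compositor asks for a new frame ...
        // ... render into this frame's FBO/texture with plain GL ES calls ...
        // ... signal the main thread that the shared texture is ready ...
    }
}

void StartFrameThreads(EGLDisplay display,
                       const std::vector<EGLSurface>& surfaces,
                       const std::vector<EGLContext>& contexts)
{
    // One worker per context; only the compositing context stays current on the main thread.
    for (size_t i = 0; i < contexts.size(); ++i)
        std::thread(FrameThread, display, surfaces[i], contexts[i]).detach();
}
```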
The problem here is trying to do this in a generic platform and OS independent way. If you choose a specific platform, there are good solutions. On Windows, there are the wgl and glut libraries that will give you multiple windows with completely independent OpenGL contexts running concurrently. They are called Windows, not Frames. You could also use DirectX instead of OpenGL. Angle uses DirectX. On linux, the solution is X11 for OpenGL. In either case, it's critical to have quality OpenGL drivers. No Intel Extreme chipset drivers. If you want to do this on Android or iOS, then those require different solutions. There was a recent thread on the Khronos.org OpenGL ES forum about the Android case.

How does WinRT handle BitmapImage and Image memory

I am new to programming Windows Store Apps with C# and I am trying to understand how image memory is handled. My app is very simple:
it references a bitmap from a file using a Windows.UI.Xaml.Media.Imaging.BitmapImage object and then uses that as the Source for a Windows.UI.Xaml.Controls.Image object. In my case the image on disk has larger dimensions than what is being displayed on screen, so it is scaled down by the system.
My question is how does WinRT handle the memory for the image? I used the vmmap tool and I see in the Mapped File section there is an entry for my image file. I guess this means that the raw bytes for this file are fully loaded into memory. Since this is a JPG these bytes must be decoded into pixel bytes. It seems from my tests that setting the UriSource of the BitmapImage doesn't actually cause any processing to take place since it takes 0 ms and that instead there is some lazy loading going on.
So the questions are: Which object is the dominator of the uncompressed, unscaled pixel data? Which object is the dominator of the scaled pixel data that gets drawn on screen? Are there tools that can easily show me this? In the Java world I use the Eclipse Memory Analyzer tool. I tried using PerfView, but the results made no sense to me; it seems that tool was meant for analyzing performance.
UPDATE:
At the BUILD conference the team discussed the Windows Performance Toolkit. I never heard anyone mention PerfView, so I believe that WPT is the latest and greatest tool for analyzing memory and performance. Here is a link:
http://msdn.microsoft.com/en-us/performance/cc825801.aspx
A short answer is most likely "optimally". Not to be a smartass, but there are just a lot of different systems out there. Someone mentioned hardware acceleration; you can also consider the number of cores, display memory, disk speed, monitor bit depth and resolution - the list goes on and on.

Qt 4.6.x under MacOS/X: widget update performance mystery

I'm working on a Qt-based MacOS/X audio metering application, which contains audio-metering widgets (potentially a lot of them), each of which is supposed to be updated every 50ms (i.e. at 20Hz).
The program works, but when lots of meters are being updated at once, it uses up lots of CPU time and can bog down (spinny-color-wheel, oh no!).
The strange thing is this: Originally this app would just call update() on the meter widget whenever the meter value changed, so the entire meter widget was redrawn every 50ms. Then I thought I'd be clever and compute just the area of the meter that actually needs to be redrawn, and only redraw that portion of the widget (e.g. update(x,y,w,h), where y and h are computed based on the old and new values of the meter). However, when I implemented that, it actually made CPU usage four times higher(!)... even though the app was drawing 50% fewer pixels per second.
Can anyone explain why this optimization actually turns out to be a pessimization? I've posted a trivial example application that demonstrates the effect, here:
http://www.lcscanada.com/jaf/meter_test.zip
When I compile (qmake;make) the above app and run it like this:
$ ./meter.app/Contents/MacOS/meter 72
Meter: Using numMeters=72 (partial updates ENABLED)
... top shows the process using ~50% CPU.
When I disable the clever-partial-updates logic, by running it like this:
$ ./meter.app/Contents/MacOS/meter 72 disable_partial_updates
Meter: Using numMeters=72 (partial updates DISABLED)
... top shows the process using only ~12% CPU. Huh? Shouldn't this case take more CPU, not less?
I tried profiling the app using Shark, but the results didn't mean much to me. FWIW, I'm running Snow Leopard on an 8-core Xeon Mac Pro.
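For reference, a trimmed-down sketch of the two update strategies being compared; this MeterWidget and its members are illustrative, not the code from the linked zip:

```cpp
// Minimal meter widget showing the "full update" vs "partial update" paths.
#include <QWidget>
#include <QPainter>

class MeterWidget : public QWidget
{
public:
    explicit MeterWidget(QWidget *parent = 0)
        : QWidget(parent), m_value(0.0f), m_partialUpdates(true) {}

    void setValue(float v)                          // v in [0, 1], called every 50 ms
    {
        const float oldValue = m_value;
        m_value = v;
        if (m_partialUpdates) {
            // "Clever" path: invalidate only the band between old and new levels.
            const int yNew = int(height() * (1.0f - m_value));
            const int yOld = int(height() * (1.0f - oldValue));
            const int top  = qMin(yNew, yOld);
            update(0, top, width(), qAbs(yNew - yOld) + 1);
        } else {
            update();                               // simple path: repaint the whole meter
        }
    }

protected:
    void paintEvent(QPaintEvent *)
    {
        QPainter p(this);
        const int yLevel = int(height() * (1.0f - m_value));
        p.fillRect(0, 0, width(), yLevel, Qt::black);                 // unlit part
        p.fillRect(0, yLevel, width(), height() - yLevel, Qt::green); // lit part
    }

private:
    float m_value;
    bool  m_partialUpdates;
};
```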
Letting the GPU redraw everything is often faster than having the CPU calculate which part to redraw (at least that's the case for OpenGL; the OpenGL SuperBible states that OpenGL is built to redraw the whole scene rather than compute deltas, which is potentially a lot more work). Even if you use software rendering, the libraries are highly optimized to do their job properly and fast. So simply redrawing everything is the state of the art.
FWIW top on my Linux box shows ~10-11% without partial updates and 12% using partial updates. I had to request 400 meters to get that CPU usage though.
Perhaps it's just that the overhead of Qt setting up a paint region actually dwarfs your paint time? After all your painting is really simple, it's just two rectangular fills.
