OpenGL core profile incredible slowdown on OS X - performance

I added a new GL renderer to my engine, which uses the core profile. While it runs fine on Windows and/or NVIDIA cards, it is about 10 times slower on OS X (3 fps instead of 30). The weird thing is that my compatibility profile renderer runs fine.
I collected some traces with Instruments and the GL profiler:
It shows that the application spends its time in glDrawRangeElements.
I tried the following things:
use glDrawElements instead (no effect)
flip culling (no effect on speed)
disable some GL_DYNAMIC_DRAW buffers (no effect)
bind index buffer after VAO when drawing (no effect)
converted indices to 4 byte (no effect)
use GL_BGRA textures (no effect)
What I didn't try is to align my vertices to 16 byte boundary and/or convert indices to 4 byte, but seriously, if that would be the issue then why the hell does the standard allow it?
I'm creating the context like this:
NSOpenGLPixelFormatAttribute attributes[] =
{
    NSOpenGLPFAColorSize, 24,
    NSOpenGLPFAAlphaSize, 8,
    NSOpenGLPFADepthSize, 24,
    NSOpenGLPFAStencilSize, 8,
    NSOpenGLPFAOpenGLProfile, NSOpenGLProfileVersion3_2Core,
    0 // the attribute list must be zero-terminated
};
NSOpenGLPixelFormat* format = [[NSOpenGLPixelFormat alloc] initWithAttributes:attributes];
NSOpenGLContext* context = [[NSOpenGLContext alloc] initWithFormat:format shareContext:nil];
[self.view setOpenGLContext:context];
[context makeCurrentContext];
Tried on the following specs:
radeon 6630M, OS X 10.7.5
radeon 6750M, OS X 10.7.5
geforce GT 330M, OS X 10.8.3
Do you have any ideas what I might do wrong? Again, it works fine with the compatibility profile (not using VAOs though).
UPDATE: reported to Apple.
UPDATE: Apple doesn't give a damn to the problem...anyway I created a small test program which is actually good. Now I compared the call stack with Instruments, and found out that when using the engine, glDrawRangeElements does two calls:
while in the test program it calls only the second. Now the first call does something like an immediate mode render (gleFlushPrimitivesTCLFunc, gleRunVertexSubmitterImmediate), so it obviously causes the slowdown.

Finally, I was able to reproduce the slowdown. This is just crazy... It is clearly caused by glBindAttribLocation being called on the "my_Position" attribute. Now I did some testing:
1 is default (as returned by glGetAttribLocation)
if I set it to zero, there's no problem
if I set it to 1, the rendering becomes slow
if I set it to any larger number, it is slow again
Obviously I relink the program (check code). It is not a problem in the implementation, I tested it with "normal" values too.
Test program:
How to repro:
open with Xcode
open common/glext.h (don't be disturbed by the name)
modify the GLDECLUSAGE_POSITION constant from 0 to 1
compile and run => slow
changing back to zero => good

I have managed to get myself the same problem in the following circumstances under OS X Mavericks:
Instanced rendering using array buffers to give each instance its own modelToWorld and inverseNormal matrices; attribute locations are being specified through layout rather than using glGetAttribLocation
leaving one of these array buffers unused in the shader, where location is declared but the attribute isn't actually used for anything in the glsl code
In this case, a call to glDrawElementsInstanced takes up a LOT of CPU time (under normal circumstances, this call uses nearly zero CPU even when drawing several thousand instances).
You can tell that you're getting this specific problem if almost all of the CPU time used within glDrawElementsInstanced is spent in gleDrawArraysOrElements_ExecCore. Making sure that all of the array buffers are actually referenced in your shader code fixes the CPU time back to (nearly) zero.
I suspect that this is one of the situations where leaving a variable out of your main() in GLSL confuses the compiler into deleting all references to that variable, leaving you with a dangling reference to an attribute or uniform.
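One way to catch this situation early is to compare the attribute locations that have an enabled vertex array against the locations the linked program actually reports as active. The following is a minimal C++ sketch, assuming both sets have already been gathered by the caller (for example with glGetActiveAttrib and glGetAttribLocation, which are not called here). The helper name findUnusedAttribArrays is hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical helper: returns the attribute locations that have an enabled
// vertex array but are not active in the linked shader program.
// Both inputs are assumed to be collected by the caller from the GL state.
std::vector<int> findUnusedAttribArrays(const std::set<int>& enabledArrays,
                                        const std::set<int>& activeInShader)
{
    std::vector<int> unused;
    // std::set iterates in sorted order, which set_difference requires.
    std::set_difference(enabledArrays.begin(), enabledArrays.end(),
                        activeInShader.begin(), activeInShader.end(),
                        std::back_inserter(unused));
    return unused;
}
```

Any location returned by this check is a candidate for the slow path described above; either reference the attribute in the shader or disable its array before drawing.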


Adjust value set in IDXGISwapChain2::SetMaximumFrameLatency

I use the combination of DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT, GetFrameLatencyWaitableObject() and SetMaximumFrameLatency(UINT MaxLatency) to control the input lag vs. smoothness of my application. A value of 1 gives the lowest input lag, but sometimes I need a higher value to reduce the jitter/stutter/slowdown that occurs because the CPU and GPU cannot really work in parallel when the value is 1.
I want to be able to dynamically change this value based on the required input lag vs smoothness trade-off.
The problem I have noticed is that while it's possible to, between frames, increase this value by calling SetMaximumFrameLatency with a higher value than set before, I see no effect when decreasing this value by calling the function again with a lower value than the maximum value ever set for this swap chain by a previous call to the same function. So if I ever set it to 2, it is not possible to set it to 1 later. Is this a bug or undocumented "feature"? Or did I do something wrong?
The API itself does not return any error or similar; from the API point of view it appears to apply the new lower value correctly.
To test this, I have BufferCount = 16 and then adjust the max latency value from 1 to 16, which makes the current latency obvious to the eye. It's therefore apparent that DXGI does not apply new lower values.
I've tried to call functions in different orders, close the handle for the waitable object and recreate a new one when modifying the latency, but nothing works. The only workaround so far I'm aware of is to fully recreate the swap chain, which is annoying due to the requirement to unbind all context objects etc.
When initializing the game, I create the swap chain and set an initial latency using SetMaximumFrameLatency.
The game loop is then basically this:
Call WaitForSingleObject on the waitable object handle.
Process inputs.
Render and present a frame.
If it's decided that the latency should change at this point, call SetMaximumFrameLatency with the new value.
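The loop steps above can be sketched in C++ as follows. To keep the example self-contained, the swap chain is hidden behind a hypothetical FrameSource interface: in the real application waitForFrame would wrap WaitForSingleObject on the handle from GetFrameLatencyWaitableObject(), present would wrap the Present call, and setMaxLatency would forward to SetMaximumFrameLatency. The interface, its method names and the logging implementation are assumptions made for illustration and are not part of DXGI.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical abstraction over the swap chain; not a DXGI type.
struct FrameSource {
    virtual ~FrameSource() = default;
    virtual void waitForFrame() = 0;                 // would wrap WaitForSingleObject
    virtual void present() = 0;                      // would wrap Present
    virtual void setMaxLatency(unsigned value) = 0;  // would wrap SetMaximumFrameLatency
};

// Runs one iteration of the loop in the order described in the question.
// requestedLatency is nullptr when no change is wanted for this frame.
void runFrame(FrameSource& chain,
              const std::function<void()>& processInput,
              const std::function<void()>& render,
              const unsigned* requestedLatency)
{
    chain.waitForFrame();    // 1. block until the swap chain can accept a frame
    processInput();          // 2. process inputs
    render();                // 3. render ...
    chain.present();         //    ... and present the frame
    if (requestedLatency)    // 4. apply a latency change between frames
        chain.setMaxLatency(*requestedLatency);
}

// Stand-in implementation that only records the order of the calls.
struct LoggingFrameSource : FrameSource {
    std::vector<std::string> log;
    void waitForFrame() override { log.push_back("wait"); }
    void present() override { log.push_back("present"); }
    void setMaxLatency(unsigned value) override {
        log.push_back("latency=" + std::to_string(value));
    }
};
```

With this structure the latency change always happens after Present and before the next wait, which is the ordering the question describes.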
Other info:
Renderer: Direct3D 11
OS: Windows 11 21H2 version 22000.675
Graphics card: Intel UHD Graphics 620 / Nvidia GeForce MX150 (tried with both cards) with latest drivers, supporting WDDM 3.0
App type: Win32 desktop application

Zero Opengl 3.2 pixelformat matches found?

Today I finally found out what has been stalling my development process: even though no error code is set, the function wglChoosePixelFormatARB returns 0 pixel formats.
I am trying to set up an OpenGL context in my C++ application and I have managed to retrieve the function pointers for the extensions.
glGetIntegerv(GL_MAJOR_VERSION, &maj)
returns 4 so, naturally, I assumed it would be possible to create an OpenGL 3.2 context. However, after finding out there were no matches, I started to comment out some of my requirements to go in the attribList parameter. There were no matches whatsoever.
Only when I, just to be certain, commented out
I finally got matches. Out of the 8 matching pixel formats that the other requirements meet, not ONE of them seems to support version 3 of OGL.
Has anyone ever run into this? I have tried updating/reinstalling my video drivers, but nothing has changed. I am running this on Windows 7, MS Visual Studio 2008, and my graphics card is one from the AMD Radeon HD 7700 Series.
WGL_CONTEXT_MAJOR_VERSION_ARB, WGL_CONTEXT_MINOR_VERSION_ARB and the related attributes are not attributes of the Windows pixel format.
You must not use them with wglChoosePixelFormatARB().
Those options belong in the attribute list of wglCreateContextAttribsARB, as defined by the WGL_ARB_create_context extension.
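A minimal sketch of how the two attribute lists separate, written in C++ with the token values copied from the WGL_ARB_pixel_format and WGL_ARB_create_context extension specifications so that it compiles without the Windows headers. The calls to wglChoosePixelFormatARB and wglCreateContextAttribsARB themselves are omitted, the bit depths are placeholder values, and the helper containsContextAttribs is hypothetical: it only checks a list before it is handed to the pixel format query.

```cpp
#include <cstddef>

// Token values as listed in the WGL extension specifications.
constexpr int WGL_DRAW_TO_WINDOW_ARB           = 0x2001;
constexpr int WGL_SUPPORT_OPENGL_ARB           = 0x2010;
constexpr int WGL_DOUBLE_BUFFER_ARB            = 0x2011;
constexpr int WGL_PIXEL_TYPE_ARB               = 0x2013;
constexpr int WGL_COLOR_BITS_ARB               = 0x2014;
constexpr int WGL_DEPTH_BITS_ARB               = 0x2022;
constexpr int WGL_STENCIL_BITS_ARB             = 0x2023;
constexpr int WGL_TYPE_RGBA_ARB                = 0x202B;
constexpr int WGL_CONTEXT_MAJOR_VERSION_ARB    = 0x2091;
constexpr int WGL_CONTEXT_MINOR_VERSION_ARB    = 0x2092;
constexpr int WGL_CONTEXT_LAYER_PLANE_ARB      = 0x2093;
constexpr int WGL_CONTEXT_FLAGS_ARB            = 0x2094;
constexpr int WGL_CONTEXT_PROFILE_MASK_ARB     = 0x9126;
constexpr int WGL_CONTEXT_CORE_PROFILE_BIT_ARB = 0x0001;

// For wglChoosePixelFormatARB: describes the framebuffer only.
const int pixelFormatAttribs[] = {
    WGL_DRAW_TO_WINDOW_ARB, 1,
    WGL_SUPPORT_OPENGL_ARB, 1,
    WGL_DOUBLE_BUFFER_ARB,  1,
    WGL_PIXEL_TYPE_ARB,     WGL_TYPE_RGBA_ARB,
    WGL_COLOR_BITS_ARB,     32,
    WGL_DEPTH_BITS_ARB,     24,
    WGL_STENCIL_BITS_ARB,   8,
    0
};

// For wglCreateContextAttribsARB: version and profile go here.
const int contextAttribs[] = {
    WGL_CONTEXT_MAJOR_VERSION_ARB, 3,
    WGL_CONTEXT_MINOR_VERSION_ARB, 2,
    WGL_CONTEXT_PROFILE_MASK_ARB,  WGL_CONTEXT_CORE_PROFILE_BIT_ARB,
    0
};

// Hypothetical sanity check: true if a zero-terminated name/value list
// contains any WGL_CONTEXT_* token.
bool containsContextAttribs(const int* attribs)
{
    for (std::size_t i = 0; attribs[i] != 0; i += 2) {
        const int name = attribs[i];
        if ((name >= WGL_CONTEXT_MAJOR_VERSION_ARB && name <= WGL_CONTEXT_FLAGS_ARB) ||
            name == WGL_CONTEXT_PROFILE_MASK_ARB)
            return true;
    }
    return false;
}
```

The first list is what the pixel format query should see; the second is passed only when the context is created on a device context that already has a pixel format set.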

Directx Texture interface to existing memory

I'm writing a rendering app that communicates with an image processor as a sort of virtual camera, and I'm trying to figure out the fastest way to write the texture data from one process to the awaiting image buffer in the other.
Theoretically I think it should be possible with 1 DirectX copy from VRAM directly to the area of memory I want it in, but I can't figure out how to specify a region of memory for a texture to occupy, and thus must perform an additional memcpy. DX9 or DX11 solutions would be welcome.
So far, the docs here: have held the most promise.
"In Windows Vista CreateTexture can create a texture from a system memory pointer allowing the application more flexibility over the use, allocation and deletion of the system memory"
I'm running on Windows 7 with the June 2010 DirectX SDK. However, whenever I try to use the function in the way it specifies, the function fails with an invalid arguments error code. Here is the call I tried as a test:
static char s_TextureBuffer[640*480*4]; //larger than needed
void* p = (void*)s_TextureBuffer;
HRESULT res = g_D3D9Device->CreateTexture(640,480,1,0, D3DFORMAT::D3DFMT_L8, D3DPOOL::D3DPOOL_SYSTEMMEM, &g_ReadTexture, (void**)p);
I tried with several different texture formats, but with no luck. I've begun looking into DX11 solutions, but it's going slowly since I'm used to DX9. Thanks!
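One detail in the test call may be worth checking: the last parameter of CreateTexture is declared as a pointer to a handle, so the runtime presumably reads the buffer address through it. Casting the buffer itself with (void**)p hands over the buffer's address where the address of a pointer variable would be expected. The sketch below illustrates only this C++ pointer detail with a hypothetical stand-in function, createFromMemory. It does not call Direct3D, and whether this is the sole cause of the failure is an assumption.

```cpp
#include <cstring>

static char s_TextureBuffer[640 * 480 * 4]; // zero-initialized static storage

// Hypothetical stand-in for an API that receives a buffer through a
// pointer-to-pointer parameter; it returns the address it would end up using.
void* createFromMemory(void** sharedHandle)
{
    void* target = nullptr;
    // Read the pointer value byte-wise from wherever the argument points.
    std::memcpy(&target, sharedHandle, sizeof target);
    return target;
}

// Passes the address of a pointer variable, so the callee sees the buffer.
void* passAddressOfPointer()
{
    void* p = s_TextureBuffer;
    return createFromMemory(&p);
}

// Casts the buffer itself, as in the question: the callee interprets the
// buffer's first bytes as a pointer (all-zero bytes here, which is a null
// pointer on common platforms).
void* passCastBuffer()
{
    void* p = s_TextureBuffer;
    return createFromMemory(static_cast<void**>(p));
}
```

If this is what happens in the real call, the equivalent change would be to pass the address of the pointer variable instead of the cast buffer.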

OpenCL-GL Interop memory not in sync

I'm having trouble with OpenCL-GL shared memory.
I have an application that works on both Linux and Windows. The CL-GL sharing works on Linux, but not on Windows.
The windows driver says that it supports sharing, the examples from AMD work so it should work. My code for creating the context in windows is:
cl_context_properties properties[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform_(),
    CL_WGL_HDC_KHR, (intptr_t) wglGetCurrentDC(),
    CL_GL_CONTEXT_KHR, (intptr_t) wglGetCurrentContext(),
    0 // the property list must be zero-terminated
};
platform_.getDevices(CL_DEVICE_TYPE_GPU, &devices_);
context_ = cl::Context(devices_, properties, &CL::cl_error_callback, nullptr, &err);
err = clGetGLContextInfoKHR(properties, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR, sizeof(device_id), &device_id, NULL);
context_device_ = cl::Device(device_id);
queue_ = cl::CommandQueue(context_, context_device_, 0, &err);
My problem is that the CL and GL memory in a shared buffer is not the same. I print them out (by memory mapping) and I notice that they differ. Changing the data in the memory works in both CL and GL, but only changes that memory, not both (that is, both buffers seem intact, but not shared).
Also, clGetGLObjectInfo on the cl-buffer returns the correct gl buffer.
Update: I have found that if I create the OpenCL context on the CPU it works. This seems weird, as I'm not using integrated graphics, and I don't believe the CPU is handling OpenGL. I'm using SDL to create the window; could that have something to do with this?
I have now confirmed that the opengl context is running on the gpu, so the problem lies elsewhere.
Update 2: Ok, so this is weird. I tried again today, and suddenly it works. As far as I know I didn't install any new drivers before I shut down the computer yesterday, so I don't know what could have brought this about.
Update 3: Right, I noticed that changing the number of particles caused this to work. When I allocated so many particles that the shared buffer is slightly above one MB it suddenly starts to work.
I solved the problem.
The OpenGL buffer object must be created after the OpenCL context has been created.
If it is created before, the OpenGL data can't be shared.
I use a Radeon HD 5670 with ATI Catalyst 12.10.
This may be an ATI driver problem, because the Nvidia Computing SDK samples don't depend on the order.

Pixel modifying code runs quick in main app, really slow in Delphi 6 DirectShow filter with other problems

I have a Delphi 6 application that sends bitmaps to a DirectShow DLL in real-time, 25 frames a second. The DirectShow DLL is my code too and is also written in Delphi 6 using the DSPACK DirectShow component suite. I have a simple block of code that goes through each pixel in the bitmap modifying the brightness and contrast of the image, if a certain flag is set, otherwise the bitmap is pushed out the DirectShow DLL unmodified (push source video filter). The code used to be in the main application and then I just moved it into the DirectShow DLL. When it was in the main application it ran fine. I could see the changes in the bitmap as expected. However, now that the code resides in the DirectShow DLL it has the following problems:
When the code block below is active the DirectShow DLL is really slow. I have a quad core i5 and it's really slow. I can also see a big spike in the CPU consumption. In contrast, the very same code running in the main application ran fine on an old single core P4. It did hit the CPU noticeably on that old machine but the video was smooth and there were no problems. The images are only 352 x 288 pixels in size.
I don't see the expected changes to the visible bitmap. I can trace the code in the DirectShow DLL and see the numerical values of each pixel properly altered by the code, but the viewable image in the Graph Edit ActiveMovie window looks completely unchanged.
If I deactivate the code, which I can do in real-time, the ActiveMovie window shows video that is as smooth as glass, perfectly rendered with the CPU barely touched. If I reactivate the code the video is now really choppy, probably showing only 1 to 2 frames a second with a long delay before the first frame is shown, and the CPU spikes. Not completely, but a lot more than I would expect.
I tried compiling the DirectShow DLL with everything on including range checking, overflow checking, etc. and there were no warnings or errors during run-time. I then tried compiling for fastest speed and it still had the exact same problems listed above. Something is really wrong and I can't figure out what. Note, I do indeed lock the canvas before modifying the bitmap and unlock it after I'm done. If it weren't for the "everything on" compilation run I noted above I'd say it felt like an FPU Exception was being raised and silently swallowed with every pixel computation, but as I said, no errors or Exceptions are occurring.
UPDATE: I am putting this here so that the solution, which is embedded in one of Roman R's comments, is plainly visible. The problem was that I was not setting the PixelFormat property to pf24Bit before accessing the ScanLine property. As Roman suggested, not doing this must make the TBitmap code create a temporary copy of the bitmap. As soon as I added the line of code below the problems went away, both that of changes not being visible and the soft page faults. It's an insidious problem because the only object that is affected is the pointer you use to access the ScanLine property, since (assumption) it contains a pointer to a temporary copy of the bitmap. That must be why the subsequent TextOut() call still worked, since it worked on the original copy of the bitmap.
clip.PixelFormat := pf24bit; // The missing code line that fixes the problem.
Here's the code block I've been referring to:
function IntToByte(i: Integer): Byte;
begin
  if i > 255 then
    Result := 255
  else if i < 0 then
    Result := 0
  else
    Result := i;
end;
// ---------------------------------------------------------------
procedure brightnessTurboBoost(var clip: TBitmap; rangeExpansionPowerOf2: integer; shiftValue: Byte);
var
  p0: PByte;
  x,y: Integer;
begin
  if (rangeExpansionPowerOf2 = 0) and (shiftValue = 0) then
    exit; // These parameter settings will not change the pixel values.
  for y := 0 to clip.Height-1 do
  begin
    p0 := clip.scanline[y];
    // Can't just do the whole buffer as a big block of bytes since the
    // individual scan lines may be padded for CPU alignment.
    for x := 0 to (clip.Width - 1) * 3 do
    begin
      if rangeExpansionPowerOf2 >= 1 then
        p0^ := IntToByte((p0^ shl rangeExpansionPowerOf2) + shiftValue)
      else
        p0^ := IntToByte(p0^ + shiftValue);
      Inc(p0); // advance to the next byte of the scan line
    end;
  end;
end;
There are a few things to say about this code snippet.
1. First of all, you are using the Scanline property of the TBitmap class. I have not been dealing with Delphi for many years, so I might be wrong about this, but I am under the impression that Scanline is not actually a thin accessor, is it? It might be internally hiding things which can dramatically affect performance, such as "if he wants to access the bits of the image, then we have to first convert it to DIB before returning pointers". So a thing that looks so simple might turn out to be a killer.
2. "if rangeExpansionPowerOf2 >= 1 then" in the inner loop body? You don't really want to evaluate this comparison for every byte. Either make two separate functions or duplicate the whole loop in two versions, for zero and non-zero rangeExpansionPowerOf2, and do this if only once.
3. "for ... to (clip.Width - 1) * 3 do": I am not really sure that Delphi optimizes the upper boundary evaluation to make it only once. You might be doing this multiplication three times for every pixel, while you could do it only once for the whole image.
4. For top performance, IntToByte should definitely be implemented with MMX to avoid ifs and process multiple bytes at once.
Still as you say that images are only 352x288, I would suspect that #1 is ruining the performance.
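The suggestions about taking the rangeExpansionPowerOf2 check out of the inner loop and computing the loop bound only once can be sketched as follows. The example is in C++ rather than Delphi and works on a raw row of bytes instead of a TBitmap; the names clampToByte and adjustRow are hypothetical, and the MMX variant is not attempted here.

```cpp
#include <cstddef>
#include <cstdint>

// Saturates an integer to the 0..255 range (same role as IntToByte).
inline std::uint8_t clampToByte(int v)
{
    return static_cast<std::uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Adjusts one scan line of 24-bit pixels in place.
void adjustRow(std::uint8_t* row, std::size_t widthInPixels,
               int rangeExpansionPowerOf2, std::uint8_t shiftValue)
{
    const std::size_t byteCount = widthInPixels * 3;  // computed once per row
    if (rangeExpansionPowerOf2 >= 1) {                // decided once per row
        for (std::size_t i = 0; i < byteCount; ++i)
            row[i] = clampToByte((row[i] << rangeExpansionPowerOf2) + shiftValue);
    } else {
        for (std::size_t i = 0; i < byteCount; ++i)
            row[i] = clampToByte(row[i] + shiftValue);
    }
}
```

Calling adjustRow once per scan line keeps the per-row padding concern from the original comment intact, since each row is still addressed through its own pointer.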
