OpenCL-GL Interop memory not in sync - windows

I'm having troubles with OpenCL-GL shared memory.
I have a application that's working in both linux and windows. The CL-GL sharing works in linux, but not in windows.
The windows driver says that it supports sharing, the examples from AMD work so it should work. My code for creating the context in windows is:
cl_context_properties properties[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platform_(),
CL_WGL_HDC_KHR, (intptr_t) wglGetCurrentDC(),
CL_GL_CONTEXT_KHR, (intptr_t) wglGetCurrentContext(),
0
};
platform_.getDevices(CL_DEVICE_TYPE_GPU, &devices_);
context_ = cl::Context(devices_, properties, &CL::cl_error_callback, nullptr, &err);
err = clGetGLContextInfoKHR(properties, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR, sizeof(device_id), &device_id, NULL);
context_device_ = cl::Device(device_id);
queue_ = cl::CommandQueue(context_, context_device_, 0, &err);
My problem is that the CL and GL memory in a shared buffer is not the same. I print them out (by memory mapping) and I notice that they differ. Changing the data in the memory works in both CL and GL, but only changes that memory, not both (that is both buffers seems intact, but not shared).
Also, clGetGLObjectInfo on the cl-buffer returns the correct gl buffer.
Update: I have found that if I create the opencl-context on the cpu it works. This seems weird, as I'm not using integrated graphics, and I don't belive the cpu is handling opengl. I'm using SDL to create the window, could that have something to do with this?
I have now confirmed that the opengl context is running on the gpu, so the problem lies elsewhere.
Update 2: Ok, so this is weird. I tried again today, and suddenly it works. As far as I know I didn't install any new drivers before I shut down the computer yesterday, so I don't know what could have brought this about.
Update 3: Right, I noticed that changing the number of particles caused this to work. When I allocated so many particles that the shared buffer is slightly above one MB it suddenly starts to work.

I solved the problem.
OpenGL buffer object must be created "after" OpenCL context was created.
If "before", we can't share the OpenGL data.
I use RadeonHD5670 ATI Catalyst 12.10
Maybe, ATI driver's problem because Nvidia-Computing-SDK samples don't depend on the order.

Related

OpenCL inter-context buffer aliasing

Suppose I have 2 OpenCL-capable devices on my machine (not including CPUs); and suppose that an evil colleague of mine creates a different context for each of them, which I have to work with.
I know I can't share buffers between contexts - not properly and officially, at least. But suppose that I create two OpenCL buffers, one in each context, and pass to each of them the same region of host memory, with the CL_MEM_USE_HOST_PTR flag. e.g.:
enum { size = 1234 };
//...
context_1 = clCreateContext(NULL, 1, &some_device_id, NULL, NULL, NULL);
context_2 = clCreateContext(NULL, 1, &another_device_id, NULL, NULL, NULL);
void* host_mem = malloc(size);
assert(host_mem != NULL);
buff_1 = clCreateBuffer(context_1, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, host_mem, NULL);
buff_2 = clCreateBuffer(context_2, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, host_mem, NULL);
I realize that, officially,
The result of OpenCL commands that operate on multiple buffer objects created with the same host_ptr or overlapping host regions is considered to be undefined.
But what will actually happen if I copy to this buffer from one device, and from this buffer to another device? I'm specifically interested in the case of (relatively-recent) AMD and NVIDIA GPUs.
If your OpenCL implementation's vendor guarantees some kind of specific behaviour that goes beyond the standard, then go with that and make sure to follow any instructions about limitations to the letter.
If it doesn't, then you have to assume what the standard says.
I know I can't share buffers between contexts
It's not the contexts that are the problem. It's platforms. There are essentially two cases:
1) you want to share buffers between devices from the same platform. In that case, simply create a single context with all devices, don't complicate your life, and let the platform handle it.
2) you need to share buffer between devices from different platforms. In that case, you're on your own.
waiting "for the split-context bug report to assigned and handled" isn't going to get you anywhere, because if it's contexts from same platform they'll tell you what i said in 1), and if it's contexts from different platforms they'll tell you it's impossible to support in any sane way.
"what will actually happen" ... depends (on a gajillion things). Some platforms will try to map the memory pointer (if it's properly aligned, for some definition of "properly") to the device address space. Some platforms will just silently copy it to device memory. Some platforms will also update the contents of the host memory after every enqueued command (which could mean a huge slowdown), while others will only update it at some specific "synchronization points".
My personal experience is to avoid CL_MEM_USE_HOST_PTR unless i know i'm working with iGPU or a CPU implementation (and have properly aligned pointers).
If you have AMD and NVIDIA gpus in the same machine, i'm not aware of any official way they can share buffers efficiently, which means you'll have to go through host memory anyway... in which case i'd avoid any games with CL_MEM_USE_HOST_PTR and just rely on clMap/Unmap or clRead/Write.

LoadLibrary() fails with error 8 (ERROR_NOT_ENOUGH_MEMORY)

Later edit: After more investigation, the Windows Updates and the OpenGL DLL were red herrings. The cause of these symptoms was a LoadLibrary() call failing with GetLastError() == ERROR_NOT_ENOUGH_MEMORY. See my answer for how to solve such issues. Below is the original question for historical interest. /edit
A map viewer I wrote in Python/wxPython for Windows with a C++ backend suddenly
stopped working, without any code changes or even recompiling. The very same
executables had been working for weeks before (same Python, same DLLs, ...).
Now, when querying Windows for a pixel format to use with OpenGL (with
ChoosePixelFormat()), I get a MessageBox saying:
LoadLibrary failed with error 8:
Not enough storage is available to process this command
The error message is displayed when executing the following code fragment:
void DevContext::SetPixelFormat() {
PIXELFORMATDESCRIPTOR pfd;
memset(&pfd, 0, sizeof(pfd));
pfd.nSize = sizeof(pfd);
pfd.nVersion = 1;
pfd.dwFlags = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL;
pfd.iPixelType = PFD_TYPE_RGBA;
pfd.cColorBits = 32;
int pf = ChoosePixelFormat(m_hdc, &pfd); // <-- ERROR OCCURS IN HERE
if (pf == 0) {
throw std::runtime_error("No suitable pixel format.");
}
if (::SetPixelFormat(m_hdc, pf, &pfd) == FALSE) {
throw std::runtime_error("Cannot set pixel format.");
}
}
It's actually an ATI GL driver DLL showing the message box. The relevant part of the call stack is this:
... More MessageBox stuff
0027e860 770cfcf1 USER32!MessageBoxTimeoutA+0x76
0027e880 770cfd36 USER32!MessageBoxExA+0x1b
*** ERROR: Symbol file not found. Defaulted to export symbols for C:\Windows\SysWOW64\atiglpxx.dll -
0027e89c 58471df1 USER32!MessageBoxA+0x18
0027e9d4 58472065 atiglpxx+0x1df1
0027e9dc 57acaf0b atiglpxx!DrvValidateVersion+0x13
0027ea00 57acb0f3 OPENGL32!wglSwapMultipleBuffers+0xc5e
0027edf0 57acb1a9 OPENGL32!wglSwapMultipleBuffers+0xe46
0027edf8 57acc6a4 OPENGL32!wglSwapMultipleBuffers+0xefc
0027ee0c 57ad5658 OPENGL32!wglGetProcAddress+0x45f
0027ee28 57ad5dd4 OPENGL32!wglGetPixelFormat+0x70
0027eec8 57ad6559 OPENGL32!wglDescribePixelFormat+0xa2
0027ef48 751c5ac7 OPENGL32!wglChoosePixelFormat+0x3e
0027ef60 57c78491 GDI32!ChoosePixelFormat+0x28
0027f0b0 57c7867a OutdoorMapper!DevContext::SetPixelFormat+0x71 [winwrap.cpp # 42]
0027f1a0 57ce3120 OutdoorMapper!OGLContext::OGLContext+0x6a [winwrap.cpp # 61]
0027f224 1e0acdf2 maplib_sip!func_CreateOGLDisplay+0xc0 [maps.sip # 96]
0027f240 1e0fac79 python33!PyCFunction_Call+0x52
... More Python stuff
I did a Windows Update two weeks ago and noticed some glitches (e.g. when
resizing the window), but my program still worked mostly OK. Just now I
rebooted, Windows installed 1 more update, and I don't get past
ChoosePixelFormat() any more. However, the last installed update was
KB2998527, a Russia timezone update?!
Things that I already checked:
Recompiling doesn't make it work.
Rebooting and running without other programs running doesn't work.
Memory consumption of my program is only 67 MB, I'm not out of memory.
Plenty of diskspace free (~50 GB).
The HDC m_hdc is obtained from the display panel's HWND and seems to be valid.
Changing my linker commandline doesn't work.
Should I update my graphics drivers or roll back the updates? Any other ideas?
System data dump: Windows 7 Ultimate SP1 x64, 4GB RAM; HP EliteBook 8470p; Python 3.3, wxPython 3.0.1.dev76673 msw (phoenix); access to C++ data structures via SIP 4.15.4; C++ code compiled with Visual Studio 2010 Express, Debug build with /MDd.
I was running out of virtual address space.
By default, LibTIFF reads TIF images by memory-mapping them (mmap() or CreateFileMapping()). This is fine for pictures of your wife, but it turns out it's a bad idea for gigabytes worth of topographic raster-maps of the Alps.
This was difficult to diagnose, because LibTIFF silently fell back to read() if the memory mapping failed, so there never was an explicit error before. Further, mapped memory is not accounted as working memory by Windows, so the Task-Manager was showing 67MB, when in fact nearly all virtual address space used up.
This blew up now because I added more TIF images to my database recently. LoadLibrary() started failing because it couldn't find any address space to put the new library. GetLastError() returned 8, which is ERROR_NOT_ENOUGH_MEMORY. That this happened within ATI's OpenGL library was just coincidence.
The solution was to pass "m" as flag to TiffOpen() to disable memory mapped IO.
Diagnosing this is easy with the Windows SysInternals tool VMMap (documentation link), which shows you how much of the virtual address space of a process is taken up by code/heap/stack/mapped files/shareable data/etc.
This should be the first thing to check if LoadLibrary() or CreateFileMapping() fails with ERROR_NOT_ENOUGH_MEMORY.

OpenGL core profile incredible slowdown on OS X

I added a new GL renderer to my engine, which uses the core profile. While it runs fine on Windows and/or nvidia cards, it is like 10 times slower on OS X (3 fps instead of 30). The weird thing is, that my compatibility profile renderer runs fine.
I collected some traces with Instruments and the GL profiler:
https://www.dropbox.com/sh/311fg9wu0zrarzm/31CGvUcf2q
It shows that the application spends its time in glDrawRangeElements.
I tried the following things:
use glDrawElements instead (no effect)
flip culling (no effect on speed)
disable some GL_DYNAMIC_DRAW buffers (no effect)
bind index buffer after VAO when drawing (no effect)
converted indices to 4 byte (no effect)
use GL_BGRA textures (no effect)
What I didn't try is to align my vertices to 16 byte boundary and/or convert indices to 4 byte, but seriously, if that would be the issue then why the hell does the standard allow it?
I'm creating the context like this:
NSOpenGLPixelFormatAttribute attributes[] =
{
NSOpenGLPFAColorSize, 24,
NSOpenGLPFAAlphaSize, 8,
NSOpenGLPFADepthSize, 24,
NSOpenGLPFAStencilSize, 8,
NSOpenGLPFADoubleBuffer,
NSOpenGLPFAAccelerated,
NSOpenGLPFANoRecovery,
NSOpenGLPFAOpenGLProfile, NSOpenGLProfileVersion3_2Core,
0
};
NSOpenGLPixelFormat* format = [[NSOpenGLPixelFormat alloc] initWithAttributes:attributes];
NSOpenGLContext* context = [[NSOpenGLContext alloc] initWithFormat:format shareContext:nil];
[self.view setOpenGLContext:context];
[context makeCurrentContext];
Tried on the following specs:
radeon 6630M, OS X 10.7.5
radeon 6750M, OS X 10.7.5
geforce GT 330M, OS X 10.8.3
Do you have any ideas what I might do wrong? Again, it works fine with the compatibility profile (not using VAOs though).
UPDATE: reported to Apple.
UPDATE: Apple doesn't give a damn to the problem...anyway I created a small test program which is actually good. Now I compared the call stack with Instruments, and found out that when using the engine, glDrawRangeElements does two calls:
gleDrawArraysOrElements_ExecCore
gleDrawArraysOrElements_Entries_Body
while in the test program it calls only the second. Now the first call does something like an immediate mode render (gleFlushPrimitivesTCLFunc, gleRunVertexSubmitterImmediate), so obviously casues the slowdown.
Finally, I was able to reproduce the slowdown. This is just crazy... It is clearly caused by glBindAttribLocation being called on the "my_Position" attribute. Now I did some testing:
1 is default (as returned by glGetAttribLocation)
if I set it to zero, theres no problem
if I set it to 1, the rendering becomes slow
if I set it to any larger number, it is slow again
Obviously I relink the program (check code). It is not a problem in the implementation, I tested it with "normal" values too.
Test program:
https://www.dropbox.com/s/dgg48g1fwgyc5h0/SLOWDOWN_REPRO.zip
How to repro:
open with XCode
open common/glext.h (don't be disturbed by the name)
modify the GLDECLUSAGE_POSITION constant from 0 to 1
compile and run => slow
changing back to zero => good
I have managed to get myself the same problem in the following circumstance under
OS X Mavericks:
Instanced rendering using array buffers to give each instance its own modelToWorld and inverseNormal matrices; attribute locations are being specified through layout rather than using glGetAttribLocation
leaving one of these array buffers unused in the shader, where location is declared but the attribute isn't actually used for anything in the glsl code
In this case, a call to glDrawElementsInstanced takes up a LOT of CPU time (under normal circumstances, this call uses nearly zero CPU even when drawing several thousand instances).
You can tell that you're getting this specific problem if almost all of the CPU time used within glDrawElementsInstanced is spent in gleDrawArraysOrElements_ExecCore. Making sure that all of the array buffers are actually referenced in your shader code fixes the CPU time back to (nearly) zero.
I suspect that this is one of the situations where leaving a variable out of your main() in glsl confuses the compiler in to deleting all reference to that variable, leaving you with a dangling reference to an attribute or uniform.

Directx Texture interface to existing memory

I'm writing a rendering app that communicates with an image processor as a sort of virtual camera, and I'm trying to figure out the fastest way to write the texture data from one process to the awaiting image buffer in the other.
Theoretically I think it should be possible with 1 DirectX copy from VRAM directly to the area of memory I want it in, but I can't figure out how to specify a region of memory for a texture to occupy, and thus must perform an additional memcpy. DX9 or DX11 solutions would be welcome.
So far, the docs here: http://msdn.microsoft.com/en-us/library/windows/desktop/bb174363(v=vs.85).aspx have held the most promise.
"In Windows Vista CreateTexture can create a texture from a system memory pointer allowing the application more flexibility over the use, allocation and deletion of the system memory"
I'm running on Windows 7 with the June 2010 Directx SDK, However, whenever I try and use the function in the way it specifies, I the function fails with an invalid arguments error code. Here is the call I tried as a test:
static char s_TextureBuffer[640*480*4]; //larger than needed
void* p = (void*)s_TextureBuffer;
HRESULT res = g_D3D9Device->CreateTexture(640,480,1,0, D3DFORMAT::D3DFMT_L8, D3DPOOL::D3DPOOL_SYSTEMMEM, &g_ReadTexture, (void**)p);
I tried with several different texture formats, but with no luck. I've begun looking into DX11 solutions, it's going slowly since I'm used to DX9. Thanks!

API to get the graphics or video memory

I want to get the adpater RAM or graphics RAM which you can see in Display settings or Device manager using API. I am in C++ application.
I have tried seraching on net and as per my RnD I have come to conclusion that we can get the graphics memory info from
1. DirectX SDK structure called DXGI_ADAPTER_DESC. But what if I dont want to use DirectX API.
2. Win32_videocontroller : But this class does not always give you adapterRAM info if availability of video controller is offline. I have checked it on vista.
Is there any other way to get the graphics RAM?
There is NO way to directly get graphics RAM on windows, windows prevents you doing this as it maintains control over what is displayed.
You CAN, however, create a DirectX device. Get the back buffer surface and then lock it. After locking you can fill it with whatever you want and then unlock and call present. This is slow, though, as you have to copy the video memory back across the bus into main memory. Some cards also use "swizzled" formats that it has to un-swizzle as it copies. This adds further time to doing it and some cards will even ban you from doing it.
In general you want to avoid directly accessing the video card and letting windows/DirectX do the drawing for you. Under D3D1x Im' pretty sure you can do it via an IDXGIOutput though. It really is something to try and avoid though ...
You can write to a linear array via standard win32 (This example assumes C) but its quite involved.
First you need the linear array.
unsigned int* pBits = malloc( width * height );
Then you need to create a bitmap and select it to the DC.
HBITMAP hBitmap = ::CreateBitmap( width, height, 1, 32, NULL );
SelectObject( hDC, (HGDIOBJ)hBitmap );
You can then fill the pBits array as you please. When you've finished you can then set the bitmap's bits.
::SetBitmapBits( hBitmap, width * height * 4, (void*)pBits )
When you've finished using your bitmap don't forget to delete it (Using DeleteObject) AND free your linear array!
Edit: There is only one way to reliably get the video ram and that is to go through the DX Diag interfaces. Have a look at IDxDiagProvider and IDxDiagContainer in the DX SDK.
Win32_videocontroller is your best course to get the amount of gfx memory. That's how its done in Doom3 source.
You say "..availability of video controller is offline. I have checked it on vista." Under what circumstances would the video controller be offline?
Incidentally, you can find the Doom3 source here. The function you're looking for is called Sys_GetVideoRam and it's in a file called win_shared.cpp, although if you do a solution wide search it'll turn it up for you.
User mode threads cannot access memory regions and I/O mapped from hardware devices, including the framebuffer. Anyway, what you would want to do that? Suppose the case you can access the framebuffer directly: now you must handle a LOT of possible pixel formats in the framebuffer. You can assume a 32-bit RGBA or ARGB organization. There is the possibility of 15/16/24-bit displays (RGBA555, RGBA5551, RGBA4444, RGBA565, RGBA888...). That's if you don't want to also support the video-surface formats (overlays) such as YUV-based.
So let the display driver and/or the subjacent APIs to do that effort.
If you want to write to a display surface (which not equals exactly to framebuffer memory, altough it's conceptually almost the same) there are a lot of options. DX, Win32, or you may try the SDL library (libsdl).

Resources