OpenCL inter-context buffer aliasing - memory-management

Suppose I have 2 OpenCL-capable devices on my machine (not including CPUs); and suppose that an evil colleague of mine creates a different context for each of them, which I have to work with.
I know I can't share buffers between contexts - not properly and officially, at least. But suppose that I create two OpenCL buffers, one in each context, and pass to each of them the same region of host memory, with the CL_MEM_USE_HOST_PTR flag. e.g.:
enum { size = 1234 };
//...
context_1 = clCreateContext(NULL, 1, &some_device_id, NULL, NULL, NULL);
context_2 = clCreateContext(NULL, 1, &another_device_id, NULL, NULL, NULL);
void* host_mem = malloc(size);
assert(host_mem != NULL);
buff_1 = clCreateBuffer(context_1, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, host_mem, NULL);
buff_2 = clCreateBuffer(context_2, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, host_mem, NULL);
I realize that, officially,
The result of OpenCL commands that operate on multiple buffer objects created with the same host_ptr or overlapping host regions is considered to be undefined.
But what will actually happen if I copy to this buffer from one device, and from this buffer to another device? I'm specifically interested in the case of (relatively-recent) AMD and NVIDIA GPUs.

If your OpenCL implementation's vendor guarantees some kind of specific behaviour that goes beyond the standard, then go with that and make sure to follow any instructions about limitations to the letter.
If it doesn't, then you can only assume what the standard says.

I know I can't share buffers between contexts
It's not the contexts that are the problem. It's platforms. There are essentially two cases:
1) you want to share buffers between devices from the same platform. In that case, simply create a single context with all devices (as sketched just below), don't complicate your life, and let the platform handle it.
2) you need to share buffer between devices from different platforms. In that case, you're on your own.
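For case 1), creating the single shared context is just one clCreateContext call covering both device IDs from the question (a sketch; both devices must belong to the same platform):

cl_device_id devices[2] = { some_device_id, another_device_id };
cl_context shared_context = clCreateContext(NULL, 2, devices, NULL, NULL, NULL);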
waiting "for the split-context bug report to assigned and handled" isn't going to get you anywhere, because if it's contexts from same platform they'll tell you what i said in 1), and if it's contexts from different platforms they'll tell you it's impossible to support in any sane way.
"what will actually happen" ... depends (on a gajillion things). Some platforms will try to map the memory pointer (if it's properly aligned, for some definition of "properly") to the device address space. Some platforms will just silently copy it to device memory. Some platforms will also update the contents of the host memory after every enqueued command (which could mean a huge slowdown), while others will only update it at some specific "synchronization points".
In my personal experience, it's best to avoid CL_MEM_USE_HOST_PTR unless I know I'm working with an iGPU or a CPU implementation (and have properly aligned pointers).
If you have AMD and NVIDIA GPUs in the same machine, I'm not aware of any official way they can share buffers efficiently, which means you'll have to go through host memory anyway... in which case I'd avoid any games with CL_MEM_USE_HOST_PTR and just rely on clEnqueueMapBuffer/clEnqueueUnmapMemObject or clEnqueueReadBuffer/clEnqueueWriteBuffer.
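To make that concrete, here's a minimal sketch of moving data between the two contexts through a plain host staging buffer, i.e., the read/write route (queue_1/queue_2 are assumed to be command queues on the two contexts, buff_1/buff_2 ordinary device buffers; error checking omitted):

#include <CL/cl.h>
#include <stdlib.h>

/* Copy `size` bytes from buff_1 (context 1) to buff_2 (context 2)
   through host memory. Blocking calls keep the ordering simple. */
void transfer_between_contexts(cl_command_queue queue_1, cl_mem buff_1,
                               cl_command_queue queue_2, cl_mem buff_2,
                               size_t size)
{
    void *staging = malloc(size);

    /* Blocking read from device 1 into host memory... */
    clEnqueueReadBuffer(queue_1, buff_1, CL_TRUE, 0, size, staging,
                        0, NULL, NULL);
    /* ...then blocking write from host memory to device 2. */
    clEnqueueWriteBuffer(queue_2, buff_2, CL_TRUE, 0, size, staging,
                         0, NULL, NULL);

    free(staging);
}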

Related

Catching and avoiding memory corruption at fixed offset in physical memory

We have a 4-byte memory corruption that always occurs at a fixed offset in the physical memory.
The physical frame number is 0x00a4d and the offset ends in 0xdc0.
Question 1) Based on this information, can we say the physical address of the corruption is 0x00a4d * PAGE_SIZE (4096) + 0xdc0 = 0x00A4DDC0? Programmatically, what is the best way to confirm the physical address? Ours is a ppc64-based system.
Question 2) What would be the best way to track down this memory corruption? The more I read, the more I get lost in the plethora of options. Should I use KASAN, the CONFIG_DEBUG_PAGEALLOC (debug_guardpage_minorder) option, or a HW breakpoint?
Question 3) Since we know the corruption is at a fixed offset, if we were to reserve/block that page, what again is the best option? The two I came across are memmap and Reserved memory regions.
Thanks
1.) You are right about the physical address.
2.) A HW breakpoint is best if you have that possibility. Do you have an appropriate device (T32 or whatever) / debug port, and can it place a HW breakpoint on a physical address?
Here is a more generic and dumb approach that needs no HW support:
If I remember right from your previous post, you suspect kernel code as the cause of the corruption.
If you have read anything about KASAN, you have probably noticed that the GCC part places hooks on kernel loads and stores. The kernel part provides the kasan_store_bla_bla_bla hook, which checks the correctness of each store. Most likely the default functionality won't help you, but you can integrate your own code into this KASAN store hook, which would:
2.1) Take the virtual address passed to the KASAN store hook.
2.2) Find the corresponding physical address by walking the page tables like this (a more convenient API exists, but I don't remember the function name):
pgd_t *pgd = pgd_offset(mm, addr);
pud_t *pud = pud_offset(pgd, addr);
pmd_t *pmd = pmd_offset(pud, addr);
...
As I remember from your previous post, you get the crash in a userspace app, so you will need to check the mm of every process on the task list.
2.3) Compare the found physical address to the given one, and check that the written value is zero (as I remember from your previous post).
2.4) On a match, print backtraces for all cores and stop execution.
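Putting 2.1)-2.4) together, a rough sketch of such a hook might look like the following. This is illustrative only: the page-table API differs across kernel versions (p4d_offset exists since 4.11; on older kernels the convenient helper alluded to above may be follow_pfn(), but treat that as a guess), and TARGET_PHYS is the corrupted address from the question:

#include <linux/mm.h>
#include <linux/nmi.h>

#define TARGET_PHYS 0x00A4DDC0UL  /* physical address of the corruption */

/* Walk the page tables of one mm and return the physical address
   backing `addr`, or 0 if it is not mapped. */
static phys_addr_t walk_to_phys(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd = pgd_offset(mm, addr);
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    phys_addr_t phys = 0;

    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return 0;
    p4d = p4d_offset(pgd, addr);
    if (p4d_none(*p4d) || p4d_bad(*p4d))
        return 0;
    pud = pud_offset(p4d, addr);
    if (pud_none(*pud) || pud_bad(*pud))
        return 0;
    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return 0;
    pte = pte_offset_map(pmd, addr);
    if (!pte)
        return 0;
    if (pte_present(*pte))
        phys = ((phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT) | (addr & ~PAGE_MASK);
    pte_unmap(pte);
    return phys;
}

/* Called from the KASAN store hook: trip when a store hits the bad address. */
static void check_store(struct mm_struct *mm, unsigned long addr, u32 value)
{
    if (walk_to_phys(mm, addr) == TARGET_PHYS && value == 0) {
        trigger_all_cpu_backtrace();  /* backtraces for all cores */
        BUG();                        /* stop execution at the culprit */
    }
}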

OpenCL-GL Interop memory not in sync

I'm having trouble with OpenCL-GL shared memory.
I have an application that works on both Linux and Windows. The CL-GL sharing works on Linux, but not on Windows.
The Windows driver says that it supports sharing, and the examples from AMD work, so it should work. My code for creating the context on Windows is:
cl_context_properties properties[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform_(),
    CL_WGL_HDC_KHR, (intptr_t) wglGetCurrentDC(),
    CL_GL_CONTEXT_KHR, (intptr_t) wglGetCurrentContext(),
    0
};
platform_.getDevices(CL_DEVICE_TYPE_GPU, &devices_);
context_ = cl::Context(devices_, properties, &CL::cl_error_callback, nullptr, &err);
err = clGetGLContextInfoKHR(properties, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR, sizeof(device_id), &device_id, NULL);
context_device_ = cl::Device(device_id);
queue_ = cl::CommandQueue(context_, context_device_, 0, &err);
My problem is that the CL and GL memory in a shared buffer is not the same. I print them out (by memory mapping) and I notice that they differ. Changing the data in the memory works in both CL and GL, but only changes that memory, not both (that is, both buffers seem intact, but not shared).
Also, clGetGLObjectInfo on the cl-buffer returns the correct gl buffer.
Update: I have found that if I create the OpenCL context on the CPU it works. This seems weird, as I'm not using integrated graphics, and I don't believe the CPU is handling OpenGL. I'm using SDL to create the window; could that have something to do with this?
I have now confirmed that the opengl context is running on the gpu, so the problem lies elsewhere.
Update 2: Ok, so this is weird. I tried again today, and suddenly it works. As far as I know I didn't install any new drivers before I shut down the computer yesterday, so I don't know what could have brought this about.
Update 3: Right, I noticed that changing the number of particles caused this to work. When I allocate so many particles that the shared buffer is slightly above one MB, it suddenly starts to work.
I solved the problem.
The OpenGL buffer object must be created after the OpenCL context has been created; if it is created before, the OpenGL data cannot be shared.
I'm using a Radeon HD 5670 with ATI Catalyst 12.10.
This may be an ATI driver problem, because the NVIDIA Computing SDK samples don't depend on this order.
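In other words, the ordering that worked looks roughly like this. A minimal sketch using the C API, reusing the sharing-enabled `properties` array from the question; the helper name and error handling are illustrative, and on Windows glGenBuffers and friends need an extension loader such as GLEW:

#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>

/* Create the GL buffer only AFTER the CL context exists, then wrap it. */
cl_mem create_shared_vbo(cl_context_properties *properties,
                         cl_device_id device, size_t buf_size,
                         GLuint *vbo_out)
{
    cl_int err;

    /* 1. GL context is current; create the CL context with sharing. */
    cl_context ctx = clCreateContext(properties, 1, &device, NULL, NULL, &err);

    /* 2. Only now create the GL buffer object. */
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, buf_size, NULL, GL_DYNAMIC_DRAW);

    /* 3. Wrap it for OpenCL. */
    cl_mem cl_buf = clCreateFromGLBuffer(ctx, CL_MEM_READ_WRITE, vbo, &err);

    /* 4. All later CL work on cl_buf must be bracketed by
       glFinish() + clEnqueueAcquireGLObjects(...) before, and
       clEnqueueReleaseGLObjects(...) + clFinish(queue) after. */
    *vbo_out = vbo;
    return cl_buf;
}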

Is it possible to save some data permanently in an AVR microcontroller?

Well, the question says it all.
What I would like to do is this: every time the microcontroller powers up, it should load some previously saved data and use it. It should not use any external flash chip.
If possible, please give a code snippet that I can use in AVR Studio 4. For example, if I save 8 uint16_t values, it should load them into an array of uint16_t.
You have to burn the data into the program memory of the chip if you don't need to update it programmatically, or, if you want read-write support, you should use the built-in EEPROM.
Pgmem example:
#include <avr/pgmspace.h>

// stored in flash (program memory); const is required by recent avr-gcc
const uint16_t data[] PROGMEM = { 0, 1, 2, 3 };

int main(void)
{
    uint16_t x = pgm_read_word_near(data + 1); // read 2nd element from flash
    (void)x; // silence unused-variable warning
    return 0;
}
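For the read-write EEPROM case, avr-libc's <avr/eeprom.h> API covers the exact scenario in the question (8 uint16_t values reloaded at power-up). A minimal sketch; the variable names are illustrative, and eeprom_update_block needs a reasonably recent avr-libc (use eeprom_write_block on older versions):

#include <avr/eeprom.h>
#include <stdint.h>

/* Reserve 8 words in the on-chip EEPROM address space. */
uint16_t EEMEM saved_data[8];

int main(void)
{
    uint16_t values[8];

    /* On power-up, load the stored words into RAM. */
    eeprom_read_block(values, saved_data, sizeof(values));

    /* ... use or modify values ... */

    /* Write back only when something changed; EEPROM endurance is
       limited (typically on the order of 100,000 erase/write cycles). */
    eeprom_update_block(values, saved_data, sizeof(values));

    return 0;
}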
You need to get the datasheet for the part you are using. Microcontrollers like these typically contain at least one flash, and sometimes multiple banks of flash to allow for different bootloaders while making it easy to erase one whole flash without affecting another. Likewise, some have EEPROM. This is all internal, not external. Especially since you say you need to save programmatically, this should work (remember how easy it is to wear out flash, so don't save unless you need to). Either EEPROM or flash will meet the requirement of having the information there when you power up (non-volatile), as well as being able to save it programmatically. Googling will find a number of examples of how to do this, in addition to the datasheet you apparently have not read, as well as the app notes that also contain this information (and that you should have read). If you are looking for some sort of one-time-programmable fuse-blowing thing, there may be OTP versions of the AVR; you will have to read the datasheets, programmer's references, and app notes on how to program that memory, and they should tell you whether OTP parts can be written programmatically or are treated differently.
How to read the data is shown in the memory map in the datasheet: write code that reads those addresses. Writing is described in the datasheet (programmer's reference manual, user's guide, whatever Atmel calls it) as well, and there are many examples on the net.

Looking for an explanation of kernel driver I/O interface capability

I am looking at ways of interfacing with specific hardware I/O addresses from various Windows versions, from 32-bit XP up to 64-bit Win7 and beyond. There seem to be various solutions published with varying degrees of capability under different Windows versions, and I am trying to understand the possibilities for creating my own kernel driver. The most basic kernel I/O read/write capability seems to be direct I/O operations such as READ_PORT_UCHAR and WRITE_PORT_UCHAR (and their word and long derivatives). I have also seen the technique below, which I don't understand; it appears to be some memory-mapping capability of which I have no experience and for which I can find little readable documentation. Could someone comment on the suitability/compatibility of READ_PORT_UCHAR / WRITE_PORT_UCHAR versus the mapping technique I reproduce below, please?
Thanks in advance.
case IOCTL_PHYMEM_MAP:
    if (dwInBufLen == sizeof(PHYMEM_MEM) && dwOutBufLen == sizeof(PVOID))
    {
        PHYSICAL_ADDRESS phyAddr;
        PVOID pvk, pvu;

        phyAddr.QuadPart = (ULONGLONG)pMem->pvAddr;
        //get mapped kernel address
        pvk = MmMapIoSpace(phyAddr, pMem->dwSize, MmNonCached);
        if (pvk)
        {
            //allocate mdl for the mapped kernel address
            PMDL pMdl = IoAllocateMdl(pvk, pMem->dwSize, FALSE, FALSE, NULL);
            if (pMdl)
            {
                PMAPINFO pMapInfo;

                //build mdl and map to user space
                MmBuildMdlForNonPagedPool(pMdl);
                pvu = MmMapLockedPages(pMdl, UserMode);
                //insert mapping information into the list
                pMapInfo = (PMAPINFO)ExAllocatePool(NonPagedPool, sizeof(MAPINFO));
                pMapInfo->pMdl = pMdl;
                pMapInfo->pvk = pvk;
                pMapInfo->pvu = pvu;
                pMapInfo->memSize = pMem->dwSize;
                PushEntryList(&lstMapInfo, &pMapInfo->link);
                DebugPrint("Map physical 0x%x to virtual 0x%x, size %u",
                           pMem->pvAddr, pvu, pMem->dwSize);
                RtlCopyMemory(pSysBuf, &pvu, sizeof(PVOID));
                irp->IoStatus.Information = sizeof(PVOID);
            }
            else
            {
                //allocate mdl error, unmap the mapped physical memory
                MmUnmapIoSpace(pvk, pMem->dwSize);
                irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
            }
        }
        else
            irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
    }
    else
        irp->IoStatus.Status = STATUS_INVALID_PARAMETER;
    break;
What are these I/O ports that you're trying to access? It's generally a Really Bad Idea to go partying on ports that you don't own because you have no way of synchronizing access to those ports with the driver that owns them, the O/S, or the BIOS (it's possible to take an SMI and have the BIOS start talking to ports that it thinks it owns).
The code snippet provided is also a horribly bad idea and should be burned. Basically, all it's doing is mapping a device register into a kernel virtual address (MmMapIoSpace) and then doing the work to map that device register into user mode (MmMapLockedPages). There are two obvious problems with it:
1) You don't know the caching attributes of the memory, so randomly specifying MmNonCached can hang the system
2) Same as with I/O ports, you can't just arbitrarily access a device's registers. You can't properly synchronize yourself with the driver that owns them, so you're doomed to eventually bork your system.
-scott
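For reference, the port-I/O routines the question asks about are straightforward to use from a driver. A minimal sketch; the port address 0x378 (legacy parallel port) is purely illustrative, and as stressed above, you must actually own the port:

#include <ntddk.h>

/* Read one byte from an I/O port and write a modified value back.
   0x378 is just an example address; accessing ports you don't own
   is a Really Bad Idea, as explained above. */
VOID ExamplePortReadWrite(VOID)
{
    PUCHAR port = (PUCHAR)(ULONG_PTR)0x378;
    UCHAR value = READ_PORT_UCHAR(port);
    WRITE_PORT_UCHAR(port, (UCHAR)(value | 0x01));
}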

API to get the graphics or video memory

I want to get the adapter RAM (graphics RAM) that you can see in Display settings or Device Manager, using an API from a C++ application.
I have tried searching the net, and as per my R&D I have come to the conclusion that we can get the graphics memory info from:
1. The DirectX SDK structure called DXGI_ADAPTER_DESC. But what if I don't want to use the DirectX API?
2. Win32_VideoController: but this class does not always give you AdapterRAM info if the availability of the video controller is offline. I have checked this on Vista.
Is there any other way to get the graphics RAM?
There is NO way to directly access graphics RAM on Windows; Windows prevents you from doing this because it maintains control over what is displayed.
You CAN, however, create a DirectX device, get the back-buffer surface, and then lock it. After locking you can fill it with whatever you want, then unlock and call Present. This is slow, though, as you have to copy the video memory back across the bus into main memory. Some cards also use "swizzled" formats that have to be un-swizzled as they are copied, which adds further time, and some cards will even ban you from doing it.
In general you want to avoid directly accessing the video card and let Windows/DirectX do the drawing for you. Under D3D1x I'm pretty sure you can do it via an IDXGIOutput, though. It really is something to try and avoid, though...
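For what it's worth, the DXGI_ADAPTER_DESC route mentioned in the question looks roughly like this in plain C (a sketch assuming the Windows SDK; link against dxgi.lib and dxguid.lib):

#define COBJMACROS
#include <dxgi.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    IDXGIFactory *factory = NULL;
    IDXGIAdapter *adapter = NULL;
    UINT i;

    if (FAILED(CreateDXGIFactory(&IID_IDXGIFactory, (void **)&factory)))
        return 1;

    /* Enumerate adapters and print their dedicated video memory. */
    for (i = 0; IDXGIFactory_EnumAdapters(factory, i, &adapter)
                != DXGI_ERROR_NOT_FOUND; ++i) {
        DXGI_ADAPTER_DESC desc;
        if (SUCCEEDED(IDXGIAdapter_GetDesc(adapter, &desc)))
            wprintf(L"%s: %u MB dedicated VRAM\n", desc.Description,
                    (unsigned)(desc.DedicatedVideoMemory / (1024 * 1024)));
        IDXGIAdapter_Release(adapter);
    }
    IDXGIFactory_Release(factory);
    return 0;
}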
You can write to a linear array via standard Win32 (this example assumes C), but it's quite involved.
First you need the linear array.
unsigned int* pBits = malloc( width * height * sizeof(unsigned int) ); // 4 bytes per 32-bit pixel
Then you need to create a bitmap and select it into the DC.
HBITMAP hBitmap = ::CreateBitmap( width, height, 1, 32, NULL );
SelectObject( hDC, (HGDIOBJ)hBitmap );
You can then fill the pBits array as you please. When you've finished, you can set the bitmap's bits:
::SetBitmapBits( hBitmap, width * height * 4, (void*)pBits );
When you've finished using your bitmap, don't forget to delete it (using DeleteObject) AND free your linear array!
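Put together, a minimal self-contained version of the above fragments might look like this (a sketch; a memory DC stands in for the unexplained hDC, and the fill color is arbitrary):

#include <windows.h>
#include <stdlib.h>

void fill_bitmap(int width, int height)
{
    unsigned int *pBits = malloc(width * height * sizeof(unsigned int));
    HDC hDC = CreateCompatibleDC(NULL);
    HBITMAP hBitmap = CreateBitmap(width, height, 1, 32, NULL);
    HGDIOBJ old = SelectObject(hDC, (HGDIOBJ)hBitmap);
    int i;

    for (i = 0; i < width * height; ++i)
        pBits[i] = 0x00FF0000;  /* 0x00RRGGBB: plain red */

    SetBitmapBits(hBitmap, width * height * 4, (void *)pBits);

    /* ... BitBlt from hDC to a window DC here ... */

    SelectObject(hDC, old);
    DeleteObject(hBitmap);  /* delete the bitmap... */
    DeleteDC(hDC);
    free(pBits);            /* ...AND free the linear array */
}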
Edit: There is only one way to reliably get the video RAM, and that is to go through the DxDiag interfaces. Have a look at IDxDiagProvider and IDxDiagContainer in the DX SDK.
Win32_VideoController is your best course for getting the amount of graphics memory. That's how it's done in the Doom 3 source.
You say "...availability of video controller is offline. I have checked it on Vista." Under what circumstances would the video controller be offline?
Incidentally, you can find the Doom 3 source here. The function you're looking for is called Sys_GetVideoRam and it's in a file called win_shared.cpp, although a solution-wide search will turn it up for you.
User-mode threads cannot access memory regions and I/O mapped from hardware devices, including the framebuffer. Anyway, why would you want to do that? Suppose you could access the framebuffer directly: now you must handle a LOT of possible pixel formats. You can't just assume a 32-bit RGBA or ARGB organization; there is the possibility of 15/16/24-bit displays (RGBA5551, RGBA4444, RGB565, RGB888...), and that's if you don't also want to support video-surface formats (overlays) such as YUV-based ones.
So let the display driver and/or the underlying APIs do that work.
If you want to write to a display surface (which is not exactly the same as framebuffer memory, although conceptually it's almost the same) there are a lot of options: DX, Win32, or you might try the SDL library (libsdl).
