What are CUDA resources and CUDA devices?

I read the CUDA API documentation but could not understand what the two mean. I want to know what CUdevice and cuGraphicsResource are. I have a rough understanding that CUdevice refers to one GPU device, but I still have no idea what cuGraphicsResource means.

I have a rough understanding that CUdevice refers to one GPU device
Correct. A CUdevice is the CUDA driver API's way of referring to a device, for those API calls that need one, such as cuCtxCreate(). For that API call, it indicates which device is the intended target.
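For illustration, a minimal driver API sketch (not part of the original answer) that obtains the CUdevice for device ordinal 0 and creates a context on it; error handling is reduced to a simple check macro:

    // Minimal CUDA driver API sketch: obtain a CUdevice and create a context on it.
    #include <cuda.h>
    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
        fprintf(stderr, "CUDA error %d at %s:%d\n", r, __FILE__, __LINE__); exit(1); } } while (0)

    int main() {
        CHECK(cuInit(0));                      // initialize the driver API

        int count = 0;
        CHECK(cuDeviceGetCount(&count));       // how many GPUs are visible

        CUdevice dev;
        CHECK(cuDeviceGet(&dev, 0));           // CUdevice handle for device ordinal 0

        char name[256];
        CHECK(cuDeviceGetName(name, sizeof(name), dev));
        printf("Found %d device(s); device 0 is %s\n", count, name);

        CUcontext ctx;
        CHECK(cuCtxCreate(&ctx, 0, dev));      // the CUdevice tells cuCtxCreate which GPU to target

        CHECK(cuCtxDestroy(ctx));
        return 0;
    }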
what cuGraphicsResource means
This handle is used in CUDA/Graphics interop. It is one of the elements used to exchange data between the GPU operating as a graphics processor (e.g. OpenGL), and the GPU acting as a compute device (e.g. CUDA).
A graphics resource refers to a graphics entity, typically something like a texture, a Frame Buffer Object, a Pixel Buffer Object, or a Vertex Buffer Object. Each of those graphics entities can have data associated with it, and the cuGraphicsResource is a handle (in the driver API) used to refer to that data, or to a container ("resource") that includes that data.
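As a hedged sketch (not part of the original answer), this is roughly how such a handle is obtained for an existing OpenGL buffer object and mapped so CUDA can see its memory; the buffer object vbo is assumed to have been created elsewhere in a current GL context, and error checking is omitted:

    // Sketch of CUDA/OpenGL interop with the driver API.
    #include <cuda.h>
    #include <cudaGL.h>     // driver API GL interop declarations
    #include <GL/gl.h>

    void map_gl_buffer_into_cuda(GLuint vbo)
    {
        CUgraphicsResource resource;

        // Register the GL buffer object with CUDA; 'resource' now names it on the CUDA side.
        cuGraphicsGLRegisterBuffer(&resource, vbo, CU_GRAPHICS_REGISTER_FLAGS_NONE);

        // Map it so CUDA may access the underlying memory (GL must not touch it while mapped).
        cuGraphicsMapResources(1, &resource, 0);

        CUdeviceptr dptr = 0;
        size_t bytes = 0;
        cuGraphicsResourceGetMappedPointer(&dptr, &bytes, resource);
        // 'dptr' can now be passed to CUDA kernels / cuMemcpy* like any device pointer.

        // Hand the buffer back to OpenGL and drop the registration.
        cuGraphicsUnmapResources(1, &resource, 0);
        cuGraphicsUnregisterResource(resource);
    }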

Related

OpenGL rendering & display in different processes [duplicate]

Let's say I have an application A which is responsible for painting stuff on-screen via the OpenGL library. For tight integration purposes I would like to let this application A do its job, but render into an FBO or directly into a render buffer, and allow an application B to have read-only access to this buffer to handle the on-screen display (basically rendering it as a 2D texture).
It seems FBOs belong to OpenGL contexts, and contexts are not shareable between processes. I definitely understand that allowing several processes to mess with the same context is evil. But in my particular case, I think it's reasonable to think it could be pretty safe.
EDIT:
Render size is near full screen; I was thinking of a 2048x2048 32-bit buffer (I don't use the alpha channel for now, but why not later).
Framebuffer Objects can not be shared between OpenGL contexts, be it that they belong to the same process or not. But textures can be shared, and textures can be used as color buffer attachments to framebuffer objects.
Sharing OpenGL contexts between processes is actually possible if the graphics system provides the API for this job. In the case of X11/GLX it is possible to share indirect rendering contexts between multiple processes. It may be possible in Windows by employing a few really, really crude hacks. On MacOS X, I have no idea how to do this.
So probably the easiest thing to do is to use a Pixel Buffer Object to gain performant access to the rendered picture, then send it over to the other application through shared memory and upload it into a texture there (again through a pixel buffer object).
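A rough sketch of that readback path on the producer side, assuming POSIX shared memory and an RGBA8 2048x2048 frame; the segment name and dimensions are made up for illustration, and inter-process synchronization is left out:

    // Producer side: read the rendered frame into a PBO, then copy it into
    // POSIX shared memory for the consumer process to upload into its own texture.
    // Assumes a current OpenGL context (and GLEW initialized); error checks omitted.
    #include <GL/glew.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>

    static const char* SHM_NAME = "/shared_frame";   // hypothetical segment name
    static const int W = 2048, H = 2048;

    void publish_frame()
    {
        const size_t bytes = size_t(W) * H * 4;      // RGBA8

        // Asynchronous readback into a pixel buffer object.
        GLuint pbo;
        glGenBuffers(1, &pbo);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        glBufferData(GL_PIXEL_PACK_BUFFER, bytes, nullptr, GL_STREAM_READ);
        glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);  // goes into the PBO

        // Map the PBO and copy the pixels into the shared memory segment.
        void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);

        int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, bytes);
        void* shm = mmap(nullptr, bytes, PROT_WRITE, MAP_SHARED, fd, 0);
        std::memcpy(shm, pixels, bytes);             // consumer uploads this via its own PBO

        munmap(shm, bytes);
        close(fd);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        glDeleteBuffers(1, &pbo);
    }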
On macOS, you can use IOSurface to share a framebuffer between two applications.
In my understanding, you won't be able to share the objects between processes under Windows, unless it's a kernel-mode object. Even shared textures and contexts can create performance hits, and they also give you the additional responsibility of syncing the SwapBuffer() calls. The OpenGL implementation is notorious on the Windows platform in particular.
In my opinion, you can rely on inter-process communication mechanisms like events, mutexes, window messages and pipes to sync the rendering, but just realize that there's a performance consideration in approaching it this way. Kernel-mode objects are good, but each transition to the kernel has a cost of 100ms, which is damn costly for a high-performance rendering application. In my opinion you have to reconsider the multi-process rendering design.
On Linux, a solution is to use DMABUF, as explained in this blog: https://blaztinn.gitlab.io/post/dmabuf-texture-sharing/

How do I copy 2D CUDA arrays/textures between contexts?

Suppose I want to copy some memory between different CUDA contexts (possibly on different devices). The CUDA Driver API offers me:
cuMemcpyPeer - for plain old device global memory
cuMemcpy3DPeer - for 3D arrays/textures
But there doesn't seem to be a similar API function for 2D arrays. Why? And - what do I do? Should I go through plain global memory buffers in both contexts?
PS - Same question for asynchronous copies; we have the plain and 3D cases covered, but no 2D.
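One common workaround (not stated in the question itself) is to describe the 2D array copy as a degenerate 3D copy with Depth = 1; a hedged sketch, assuming srcArray/dstArray are existing CUDA arrays belonging to contexts srcCtx/dstCtx:

    // Sketch: copy a 2D CUDA array between contexts by issuing a 3D peer copy
    // with Depth = 1. Handles and sizes are assumed to exist; error handling omitted.
    #include <cuda.h>
    #include <cstring>

    void copy_2d_array_peer(CUarray dstArray, CUcontext dstCtx,
                            CUarray srcArray, CUcontext srcCtx,
                            size_t widthInBytes, size_t height)
    {
        CUDA_MEMCPY3D_PEER p;
        std::memset(&p, 0, sizeof(p));

        p.srcMemoryType = CU_MEMORYTYPE_ARRAY;
        p.srcArray      = srcArray;
        p.srcContext    = srcCtx;

        p.dstMemoryType = CU_MEMORYTYPE_ARRAY;
        p.dstArray      = dstArray;
        p.dstContext    = dstCtx;

        p.WidthInBytes  = widthInBytes;
        p.Height        = height;
        p.Depth         = 1;            // a 2D copy expressed as a one-slice 3D copy

        cuMemcpy3DPeer(&p);             // cuMemcpy3DPeerAsync(&p, stream) for the async case
    }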

Linux device driver for display | Framebuffer

I am studying the display device driver for Linux that runs a TFT display; the framebuffer stores all the data that is to be displayed.
Question: does the display driver have an equivalent buffer of its own to handle the framebuffer coming from the kernel?
My concern is that the processor has to take the output from the GPU and produce a framebuffer to be sent out to the display driver, but depending on the display there might be some latencies and other issues, so does the display driver access the framebuffer directly, or does it use its own buffer as well?
This is a rabbit-hole question; it seems simple on the surface, but a factual answer is bound to end up in fractal complexity.
It's literally impossible to give a generalized answer.
The cliff notes version is: GPUs have their own memory, which is directly visible to the CPU in the form of a memory mapping (you can query the actual range of physical addresses from e.g. /sys/class/drm/card0/device/resource). Somewhere in there, there's also the memory used for the display scanout buffer. When using GPU accelerated graphics, the GPU will write directly to those scanout buffers – possibly to memory that's on a different graphics card (that's how e.g. Hybrid graphics work).
My concern is that the processor has to take the output from the GPU and produce a framebuffer to be sent out to the display driver
Usually that's not the case. However even if there's a copy involved, these days bus bandwidths are large enough for that copy operation not to matter.
I am studying the display device driver for linux that runs TFT display
If this is a TFT display connected via SPI or a parallel bus made from GPIOs, then yes, there'll be some memory reserved for the image to reside in. Strictly speaking this can be in the RAM of the CPU, or in the VRAM of a GPU, if there is one. However, as far as latencies go, the copy operations for scanout don't really matter these days.
20 years ago, yes, and even back then with clever scheduling you could avoid the latencies.
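To make that reserved image memory concrete, here is a hedged sketch (not part of the original answer) of mapping the kernel's framebuffer through the fbdev interface and writing pixels into the very memory the driver scans out; it assumes /dev/fb0 exists and exposes a 32 bpp mode:

    // Sketch: map the kernel framebuffer device and write pixels directly into the
    // memory the display driver scans out from.
    #include <linux/fb.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        int fd = open("/dev/fb0", O_RDWR);
        if (fd < 0) { perror("open /dev/fb0"); return 1; }

        fb_var_screeninfo var;   // resolution, bits per pixel
        fb_fix_screeninfo fix;   // physical layout: line length, buffer size
        ioctl(fd, FBIOGET_VSCREENINFO, &var);
        ioctl(fd, FBIOGET_FSCREENINFO, &fix);

        // This mapping *is* the buffer the driver hands to the hardware (or copies
        // out over SPI / a parallel bus for simple TFT controllers).
        uint8_t* fb = static_cast<uint8_t*>(
            mmap(nullptr, fix.smem_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

        // Fill the visible area with a solid grey, assuming 32 bits per pixel.
        for (uint32_t y = 0; y < var.yres; ++y) {
            uint32_t* row = reinterpret_cast<uint32_t*>(fb + y * fix.line_length);
            for (uint32_t x = 0; x < var.xres; ++x)
                row[x] = 0xFF808080;
        }

        munmap(fb, fix.smem_len);
        close(fd);
        return 0;
    }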

How to use OpenGL SL for general computing

I know OpenCL and CUDA, but these are not supported on mobile devices. Most of them do support OpenGL ES, though, so I want to learn to use the OpenGL ES shading language for general computing, like OpenCL or CUDA.
How many kinds of buffers can I use? What are they?
How do I manipulate these buffers?
As far as I know, I can create vertex and fragment shaders so far.
Which buffers can I manipulate when I use a fragment shader?
Which buffers can I manipulate when I use a vertex shader?
Are there any synchronization functions on the GPU? (I mean synchronization on the GPU, like thread synchronization within a block in OpenCL or CUDA.)
PS:
I read a paper Using Mobile GPU for General-Purpose Computing. Their experiments were performed on an Nvidia Tegra SoC with the following specifications:
1GHz dual-core ARM Cortex-A9 CPU,
1GB of RAM
an Nvidia ultra-low-power GeForce GPU running at 333MHz, and 512MB of Flash memory
It can get a 3X speedup on an FFT (128*128). I think these results are good. Do you think it's worth doing? So the main bottleneck is the memory access, right?
As many people have said, it's not worth doing general-purpose computing in OpenGL ES. So it's not worth expecting mobile devices to support OpenCL either, right? In my opinion, OpenGL ES is the foundation of OpenCL.
Some platforms don't support any floating point formats. Some platforms (PowerVR, Tegra, Adreno) support half-float (16-bit float) surfaces, which can be used both as a render target and as a texture. Full float support exists on some platforms (Adreno, and I believe the latest PowerVR), but is rather rare.
So it depends a lot on what kind of calculation you're expecting to do, what kind of precision is acceptable for you, as well as what your target platform is.
Also take into account the fact that the current generation of OpenGL ES (2.0) does not have full IEEE float requirements, so the results may vary.
In the end, whether it is worth it depends a lot on your batch sizes; accessing the results (i.e. reading pixels back from the render target) may be so slow that it negates the performance gain.
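As a hedged illustration (not from the original answer), one way to probe an ES 2.0 context for these formats is to check the extension string; the extension names below are the standard OES/EXT ones, and the plain substring check is only good enough for a sketch:

    // Sketch: query an OpenGL ES 2.0 context for float texture / render target support.
    // Assumes a current ES 2.0 context.
    #include <GLES2/gl2.h>
    #include <cstring>
    #include <cstdio>

    static bool has_ext(const char* name)
    {
        // Simple substring check against the space-separated extension list.
        const char* exts = reinterpret_cast<const char*>(glGetString(GL_EXTENSIONS));
        return exts && std::strstr(exts, name) != nullptr;
    }

    void report_float_support()
    {
        printf("half-float textures: %s\n", has_ext("GL_OES_texture_half_float") ? "yes" : "no");
        printf("float textures:      %s\n", has_ext("GL_OES_texture_float") ? "yes" : "no");
        printf("half-float render:   %s\n", has_ext("GL_EXT_color_buffer_half_float") ? "yes" : "no");
    }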
To address your bullet-points one by one:
How many kinds of buffers can I use? What are they?
You can create a texture and build an FBO out of it. Additionally you can feed data to the shaders as constants (uniforms) or per-vertex data streams (varyings/attribs).
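A hedged sketch of that texture-backed FBO setup on ES 2.0 (the size and RGBA8 format are placeholders; a half-float format would need the extensions mentioned earlier):

    // Sketch: create a texture and attach it to a framebuffer object so a fragment
    // shader can render ("compute") into it. ES 2.0 calls; error checks minimal.
    #include <GLES2/gl2.h>

    GLuint make_compute_target(int width, int height)
    {
        GLuint tex, fbo;

        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        // RGBA8 for portability; use a half-float format here if the platform supports it.
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                               GL_TEXTURE_2D, tex, 0);

        if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
            return 0;   // this texture format is not renderable on this platform

        return fbo;     // render into this, then read results via the texture / glReadPixels
    }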
How do I manipulate these buffers?
You can write to a texture using the normal texture handling functions.
Which buffers can I manipulate when I use a fragment shader?
When an FBO is bound, you can write to it using the fragment shader. Later, you can access the results by reading from the texture you bound to the FBO.
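For instance, an illustrative pair of GLSL ES shaders (the attribute and uniform names are made up) in which every fragment reads one texel of input and writes one computed texel into the bound FBO:

    // "Compute by rendering a full-screen quad": each fragment reads one input texel
    // and writes one output texel to the FBO's color attachment.
    static const char* kVertexSrc = R"(
        attribute vec2 a_pos;            // full-screen quad corners in clip space
        varying vec2 v_uv;
        void main() {
            v_uv = a_pos * 0.5 + 0.5;    // map [-1,1] to [0,1] texture coordinates
            gl_Position = vec4(a_pos, 0.0, 1.0);
        }
    )";

    static const char* kFragmentSrc = R"(
        precision mediump float;
        uniform sampler2D u_input;       // input data packed into a texture
        varying vec2 v_uv;
        void main() {
            vec4 x = texture2D(u_input, v_uv);
            gl_FragColor = x * x;        // the "computation": square each element
        }
    )";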
Which buffers can I manipulate when I use a vertex shader?
None.
Are there any synchronization functions on the GPU?
You can flush the pipeline using glFinish(). The drivers should cause a pipeline flush implicitly if you try to access the texture data, though.
Before my forays into OpenCL, when I had large quantities of numeric data I would feed it to the GPU as pixel data via an RGBA image and manipulate it there. It's a nice, fast way to approach large number sets for mathematical manipulation, although you then have to copy from the buffer back to the CPU to extract the changes. So whether it is worth it depends on how much data you need to manipulate this way, as well as how much graphics RAM you have available, the number of cores, and so on.
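A hedged sketch of that packing step, assuming float textures are available (GL_OES_texture_float on ES 2.0); four consecutive numbers go into the R, G, B and A channels of each texel:

    // Sketch: upload N floats as an RGBA float texture (4 values per texel) so a
    // fragment shader can treat the texture as a flat array of numbers.
    #include <GLES2/gl2.h>
    #include <algorithm>
    #include <vector>

    GLuint upload_numbers(const std::vector<float>& data, int texWidth)
    {
        // Each texel stores 4 consecutive floats in its R, G, B, A channels.
        int texels    = (int(data.size()) + 3) / 4;
        int texHeight = (texels + texWidth - 1) / texWidth;

        std::vector<float> padded(size_t(texWidth) * texHeight * 4, 0.0f);
        std::copy(data.begin(), data.end(), padded.begin());

        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, texWidth, texHeight, 0,
                     GL_RGBA, GL_FLOAT, padded.data());   // needs float texture support
        return tex;
    }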

Report Direct3D memory usage

I have a Direct3D 9 application and I would like to monitor the memory usage.
Is there a tool to know how much system and video memory is used by Direct3D?
Ideally, it would also report how much is allocated for textures, vertex buffers, index buffers...
You can use the old DirectDraw interface to query the total and available memory.
The numbers you get that way are not reliable, though.
The free memory may change at any instant, and the available memory often takes the AGP memory into account (which is strictly not video memory). You can use the numbers to make a good guess about the default texture resolutions and detail level of your application/game, but that's it.
You may wonder why there is no way to get better numbers; after all, it can't be too hard to track the resource usage.
From an application point of view this is correct. You may think that the video memory just contains surfaces, textures, index- and vertex buffers and some shader-programs, but that's not true on the low-level side.
There are lots of other resources as well. All these are created and managed by the Direct3D driver to make the rendering as fast as possible. Among others there are hierarchical z-buffer acceleration structures and pre-compiled command lists (e.g. the data required to render something in the format understood by the GPU). The driver may also queue rendering commands for multiple frames in advance to even out the frame rate and increase parallelism between the GPU and CPU.
The driver also does a lot of work under the hood for you. Heuristics are used to detect draw calls with static geometry and constant rendering settings. A driver may decide to optimize the geometry in these cases for better cache usage. This all happens in parallel and under the control of the driver. All this stuff needs space as well, so the free memory may change at any time.
However, the driver also does caching for your resources, so you don't really need to know the resource usage in the first place.
If you need more space than is available, that's no problem. The driver will move the data between system RAM, AGP memory and video RAM for you. In practice you never have to worry about running out of video memory. Sure - once you need more video memory than is available the performance will suffer, but that's life :-)
Two suggestions:
You can call GetAvailableTextureMem at various times to obtain a (rough) estimate of overall memory usage progression (see the sketch after this list).
Assuming you develop on Nvidia hardware, PerfHUD includes a graphical representation of consumed AGP/VID memory (separated).
You probably won't be able to obtain a nice clean matrix of memory consumers (vertex buffers etc.) vs. memory location (AGP, VID, system), as -
(1) the driver has a lot of freedom in transferring resources between memory types, and
(2) the actual variety of memory consumers is far greater than the exposed D3D interfaces.
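A minimal sketch of the first suggestion, assuming an existing IDirect3DDevice9* called device (the helper name is made up):

    // Sketch: log a rough Direct3D 9 memory estimate at interesting points in the frame.
    #include <d3d9.h>
    #include <cstdio>

    void log_memory_estimate(IDirect3DDevice9* device, const char* label)
    {
        // Rounded to the nearest MB by the runtime and includes AGP/shared memory,
        // so treat it as a trend indicator rather than an exact figure.
        UINT bytes = device->GetAvailableTextureMem();
        printf("[%s] approx. available texture memory: %u MB\n", label, bytes / (1024 * 1024));
    }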
