How do I copy 2D CUDA arrays/textures between contexts? - multi-gpu

Suppose I want to copy some memory between different CUDA contexts (possibly on different devices). The CUDA Driver API offers me:
cuMemcpyPeer - for plain old device global memory
cuMemcpy3DPeer - for 3D arrays/textures
But there doesn't seem to be a similar API function for 2D arrays. Why? And - what do I do? Should I go through plain global memory buffers in both contexts?
PS - Same question for asynchronous copies; we have the plain and 3D cases covered, but no 2D.
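For reference, cuMemcpy3DPeer takes a CUDA_MEMCPY3D_PEER descriptor, and a 2D copy can be expressed as a 3D copy that is one slice deep. A rough sketch of that workaround (untested here; the array handles, context handles, and sizes are placeholders supplied by the caller):

    #include <cuda.h>

    // Copy a 2D CUDA array from one context to another by describing it as a
    // 3D copy that is one slice deep. All handles and sizes come from the caller.
    CUresult copyArray2DPeer(CUarray dstArray, CUcontext dstCtx,
                             CUarray srcArray, CUcontext srcCtx,
                             size_t widthInBytes, size_t height)
    {
        CUDA_MEMCPY3D_PEER p = {};          // zero-init: offsets, LOD, pitches start at 0

        p.srcMemoryType = CU_MEMORYTYPE_ARRAY;
        p.srcArray      = srcArray;
        p.srcContext    = srcCtx;

        p.dstMemoryType = CU_MEMORYTYPE_ARRAY;
        p.dstArray      = dstArray;
        p.dstContext    = dstCtx;

        p.WidthInBytes  = widthInBytes;     // extent of one row, in bytes
        p.Height        = height;           // number of rows
        p.Depth         = 1;                // a 2D copy is a 3D copy one slice deep

        return cuMemcpy3DPeer(&p);          // cuMemcpy3DPeerAsync(&p, stream) for the async case
    }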

Related

What are cuda resources and cuda devices?

I read the CUDA API documentation but could not understand what the two mean. I want to know what CUdevice and cuGraphicsResource are. I have a rough understanding that CUdevice refers to one GPU device but still have no idea what cuGraphicsResource means.
I have a rough understanding that CUdevice refers to one GPU device
Correct. A CUdevice is the CUDA driver API's handle for referring to a device, used by those API calls that need one, such as cuCtxCreate(). For that API call, it indicates which device is the intended target.
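A minimal sketch of how the handle flows through the driver API (device ordinal 0 assumed, error checking omitted):

    #include <cuda.h>

    int main()
    {
        // The CUdevice obtained from cuDeviceGet() tells cuCtxCreate()
        // which GPU to create the context on.
        CUdevice  dev;
        CUcontext ctx;

        cuInit(0);
        cuDeviceGet(&dev, 0);        // device ordinal 0
        cuCtxCreate(&ctx, 0, dev);   // flags = 0
        // ... work in this context ...
        cuCtxDestroy(ctx);
        return 0;
    }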
what cuGraphicsResource means
This handle is used in CUDA/Graphics interop. It is one of the elements used to exchange data between the GPU operating as a graphics processor (e.g. OpenGL), and the GPU acting as a compute device (e.g. CUDA).
A graphics resource refers to a graphics entity, typically including such things as textures, Frame Buffer Objects, Pixel Buffer Objects, and Vertex Buffer Objects. Each one of those graphics entities can have data associated with it, and the cuGraphicsResource is a named handle (in the driver API) to use to refer to that data or a container ("resource") that includes that data.
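To make the handle's role concrete, here is a rough sketch of the usual driver-API interop flow for an OpenGL buffer object; vbo is assumed to be an existing GL buffer created in a current GL context, and GL header/loader setup and error checking are omitted:

    #include <cuda.h>
    #include <cudaGL.h>   // CUDA driver API / OpenGL interop declarations

    // Map an existing OpenGL buffer object into CUDA, work on it, hand it back.
    void processGLBufferWithCuda(unsigned int vbo)
    {
        CUgraphicsResource res;
        cuGraphicsGLRegisterBuffer(&res, vbo, CU_GRAPHICS_REGISTER_FLAGS_NONE);

        // Make the resource accessible to CUDA (the GL side must not touch it now).
        cuGraphicsMapResources(1, &res, 0);

        CUdeviceptr dptr;
        size_t      nbytes;
        cuGraphicsResourceGetMappedPointer(&dptr, &nbytes, res);
        // ... launch kernels that read/write dptr ...

        // Hand the buffer back to OpenGL.
        cuGraphicsUnmapResources(1, &res, 0);
        cuGraphicsUnregisterResource(res);
    }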

OpenGL rendering & display in different processes [duplicate]

Let's say I have an application A which is responsible for painting stuff on-screen via OpenGL library. For tight integration purposes I would like to let this application A do its job, but render in a FBO or directly in a render buffer and allow an application B to have read-only access to this buffer to handle the display on-screen (basically rendering it as a 2D texture).
It seems FBOs belong to OpenGL contexts, and contexts are not shareable between processes. I definitely understand that allowing several processes to mess with the same context is evil. But in my particular case, I think it's reasonable to think it could be pretty safe.
EDIT:
Render size is near full screen; I was thinking of a 2048x2048 32-bit buffer (I don't use the alpha channel for now, but why not later).
Framebuffer Objects cannot be shared between OpenGL contexts, whether or not they belong to the same process. But textures can be shared, and textures can be used as color buffer attachments to framebuffer objects.
Sharing OpenGL contexts between processes is actually possible if the graphics system provides the API for this job. In the case of X11/GLX, it is possible to share indirect rendering contexts between multiple processes. It may be possible on Windows by employing a few really, really crude hacks. On MacOS X, I have no idea how to do this.
So probably the easiest thing to do is to use a Pixel Buffer Object to get performant access to the rendered picture, then send it over to the other application through shared memory and upload it into a texture there (again through a pixel buffer object).
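A rough sketch of the readback side under those assumptions (desktop GL with a pixel-pack PBO; the PBO, a GL function loader, and the mapped shared-memory region are all assumed to exist already):

    #include <cstring>   // memcpy

    // Read the current framebuffer's contents through a pixel-pack PBO and copy
    // the pixels into 'sharedMem', an already-mapped shared memory region of at
    // least width*height*4 bytes. The PBO is assumed to have been created with
    // glBufferData(GL_PIXEL_PACK_BUFFER, width*height*4, nullptr, GL_STREAM_READ).
    void readbackToSharedMemory(GLuint pbo, int width, int height, void* sharedMem)
    {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);

        // Start an asynchronous readback into the PBO (offset 0 within the buffer).
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

        // Mapping forces completion; to hide the latency, map the PBO filled on the
        // previous frame instead of the one just issued.
        void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (pixels) {
            std::memcpy(sharedMem, pixels, static_cast<size_t>(width) * height * 4);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        // The receiving process does the reverse with GL_PIXEL_UNPACK_BUFFER
        // and glTexSubImage2D.
    }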
On macOS, you can use IOSurface to share a framebuffer between two applications.
In my understanding, you won't be able to share objects between processes under Windows unless they are kernel-mode objects. Even shared textures and contexts can create performance hits, and they give you the additional responsibility of syncing the SwapBuffer() calls. The OpenGL implementation on Windows in particular is notorious for this.
In my opinion, you can rely on inter-process communication mechanisms like events, mutexes, window messages, and pipes to sync the rendering, but just realize that there is a performance cost to this approach. Kernel-mode objects are good, but every transition to kernel mode has a cost, which is expensive for a high-performance rendering application. In my opinion, you have to reconsider the multi-process rendering design.
On Linux, a solution is to use DMABUF, as explained in this blog: https://blaztinn.gitlab.io/post/dmabuf-texture-sharing/

When to use Metal instead of Accelerate API on Apple macOs

I'm currently writing a desktop audio processing application. Its purpose is to do a lot of signal processing so I'm really concerned about the performance and reliability.
I've already used the Audio Toolbox / Core Audio APIs, but for the custom audio processing I was wondering which would be the better fit between Metal and Accelerate. Does anyone know about their differences, or have benchmarks? I didn't find anything really useful with Google...
Metal shaders use the GPU. Accelerate APIs use the CPU. So it really depends on what kind of GPU the system you are using provides, and whether your custom processing kernels can use the GPU efficiently. Examples might include algorithms that are massively parallel, such as convolution of large 2D arrays of data (much larger than 1D real-time audio buffers).
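As a rough illustration of the Accelerate (CPU) side, per-buffer work such as applying a gain maps naturally onto vDSP; the function below is an assumed example for illustration, not something from the question:

    #include <Accelerate/Accelerate.h>

    // Apply a gain to an audio buffer on the CPU using vDSP (part of Accelerate).
    // vDSP vectorizes this with SIMD; no device transfer or kernel launch involved.
    void apply_gain(const float* in, float* out, float gain, vDSP_Length n)
    {
        vDSP_vsmul(in, 1, &gain, out, 1, n);   // out[i] = in[i] * gain
    }

Per-callback work on small audio buffers like this is usually a poor fit for the GPU, because kernel launch and data transfer latency dominate; Metal starts to pay off for large, massively parallel batches.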

How to use OpenGL SL for general computing

I know OpenCL and CUDA, but they are not supported on mobile devices. Most mobile devices do support OpenGL ES, though, so I want to learn to use the OpenGL ES shading language for general-purpose computing, the way I would use OpenCL or CUDA.
How many kinds of buffers can I use? What are they?
How do I manipulate these buffers?
As far as I know, I can only create vertex and fragment shaders.
Which buffers can I manipulate when I use a fragment shader?
Which buffers can I manipulate when I use a vertex shader?
Are there any synchronization functions on the GPU (I mean synchronization within the GPU, like thread synchronization within a block in OpenCL or CUDA)?
PS:
I read a paper Using Mobile GPU for General-Purpose Computing. Their experiments were performed on an Nvidia Tegra SoC with the following specifications:
1GHz dual-core ARM Cortex-A9 CPU,
1GB of RAM
an Nvidia ultra-low-power GeForce GPU running at 333MHz, and 512MB of Flash memory
It can get a 3X speedup on a 128x128 FFT. I think these results are good. Do you think it is worth doing? And is the main bottleneck the memory access?
As many people have said, it is not worth doing general-purpose computing on OpenGL ES. So it is not worth expecting mobile devices to support OpenCL either, right? In my opinion, OpenGL ES is the foundation of OpenCL.
Some platforms don't support any floating point formats. Some platforms (powervr, tegra, adreno) support half-float (16bit float) surfaces, which can be used both as a render target and as a texture. Full float support exists on some platforms (adreno, and I believe the latest powervr), but is rather rare.
So it depends a lot on what kind of calculation you're expecting to do, what kind of precision is acceptable for you, as well as what your target platform is.
Also take into account the fact that current-gen opengl es (2.0) does not have full IEEE float requirements, so the results may vary.
In the end, whether it is worth it depends a lot on your batch sizes though; accessing the results (i.e, reading pixels back from the render target) may be so slow that it negates the performance gain.
To address your bullet-points one by one:
How many kinds of buffer can I use? what are they?
You can create a texture and build an FBO out of it. Additionally, you can feed data to the shaders as constants (uniforms) or per-vertex data streams (varyings/attribs).
How to manipulate these buffers
You can write to a texture using the normal texture handling functions.
Which buffer can I manipulate when I use fragment shader
When an FBO is bound, you can write to it using the fragment shader. Later, you can access the results by reading from the texture you bound to the FBO (see the sketch after these points).
Which buffer can I manipulate when I use vertex shader
None.
Are there any synchronized function in GPU
You can flush the pipeline using glFinish(). The drivers should cause a pipeline flush implicitly if you try to access the texture data, though.
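Putting those points together, a minimal render-to-texture sketch in OpenGL ES 2.0 might look like the following; computeProgram (a linked program whose fragment shader does the per-pixel computation), the current GL context, and shader compilation are all assumed, and error checking is omitted:

    #include <GLES2/gl2.h>
    #include <vector>

    void runComputePass(GLuint computeProgram, int width, int height,
                        std::vector<unsigned char>& result)
    {
        // 1. Output buffer: a texture wrapped in an FBO.
        GLuint tex, fbo;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, nullptr);   // 8-bit; use half-float where supported

        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);
        // glCheckFramebufferStatus(GL_FRAMEBUFFER) should return GL_FRAMEBUFFER_COMPLETE here.

        // 2. "Compute": draw a full-screen quad; the fragment shader writes one result
        //    per pixel. Inputs arrive as uniforms and input textures.
        glViewport(0, 0, width, height);
        glUseProgram(computeProgram);
        // ... bind input textures, set uniforms, draw two triangles covering the viewport ...

        // 3. Read the results back to the CPU (often the slow part).
        result.resize(static_cast<size_t>(width) * height * 4);
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, result.data());

        glBindFramebuffer(GL_FRAMEBUFFER, 0);
        glDeleteFramebuffers(1, &fbo);
        glDeleteTextures(1, &tex);
    }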
Before my forays into OpenCL, when I had large quantities of numeric data I would feed it to the GPU as pixel data via an RGBA image and then manipulate it there. It's a nice, fast way to approach large number sets for mathematical manipulation, although you then have to copy from the buffer back to the CPU to extract the changes. So whether it is worth it depends on how much data you need to manipulate this way, as well as how much graphics RAM you have available, the number of cores, and so on.

Why don't GPU libraries support automated function composition?

Intel's Integrated Performance Primitives (IPP) library has a feature called Deferred Mode Image Processing (DMIP). It lets you specify a sequence of functions, composes the functions, and applies the composed function to an array via cache-friendly tiled processing. This gives better performance than naively iterating through the whole array for each function.
It seems like this technique would benefit code running on a GPU as well. There are many GPU libraries available, such as NVIDIA Performance Primitives (NPP), but none seem to have a feature like DMIP. Am I missing something? Or is there a reason that GPU libraries would not benefit from automated function composition?
GPU programming has concepts similar to DMIP function composition on the CPU.
Although it is not easy to automate on the GPU (some third-party library may be able to do it), doing it manually is easier than on the CPU (see the Thrust example below).
Two main features of DMIP:
1. processing the image by fragments, so the data can fit into cache;
2. parallel processing of different fragments, or execution of different independent branches of a graph.
When applying a sequence of basic operations to a large image, feature 1 avoids RAM reads/writes between the basic operations (all the reads/writes happen in cache), and feature 2 can utilize a multi-core CPU.
The similar concept of DMIP feature 1 for GPGPU is kernel fusion. Instead of applying multiple kernels of the basic operation to image data, one could combine basic operations in one kernel to avoid multiple GPU global memory read/write.
A manual kernel fusion example using Thrust can be found on page 26 of these slides.
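The classic illustration of this idea in Thrust is a fused SAXPY; the sketch below follows that pattern (it is not copied from the slides):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // Fused SAXPY: y <- a*x + y. Done as two separate transforms (a multiply,
    // then an add) it would read and write global memory twice; fusing both
    // operations into one functor keeps the intermediate value in registers.
    struct saxpy_functor
    {
        float a;
        explicit saxpy_functor(float a_) : a(a_) {}

        __host__ __device__
        float operator()(float x, float y) const { return a * x + y; }
    };

    void saxpy(float a,
               const thrust::device_vector<float>& x,
               thrust::device_vector<float>&       y)
    {
        // One kernel launch, one pass over x and y.
        thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(a));
    }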
It seems the library ArrayFire has made notable efforts toward automatic kernel fusion.
The similar concept of DMIP feature 2 for GPGPU is concurrent kernel execution.
This feature increases the bandwidth requirement, but most GPGPU programs are already bandwidth bound, so concurrent kernel execution is not likely to be used very often.
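For completeness, independent branches map onto CUDA streams roughly like this (the two kernels are made-up stand-ins for independent branches of the graph):

    // Illustrative kernels standing in for two independent branches.
    __global__ void branchA(float* d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }
    __global__ void branchB(float* d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }

    void runBranchesConcurrently(float* dA, float* dB, int n)
    {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        int block = 256, grid = (n + block - 1) / block;
        // Independent work issued to different streams may overlap on the device.
        branchA<<<grid, block, 0, s0>>>(dA, n);
        branchB<<<grid, block, 0, s1>>>(dB, n);

        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }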
CPU cache vs. GPGPU shared mem/cache
The CPU cache avoids the RAM reads/writes in DMIP, while for a fused GPGPU kernel, registers do the same thing. A CPU thread in DMIP processes a small image fragment, whereas a GPGPU thread often processes only one pixel, so a few registers are enough to buffer the data of a GPU thread.
For image processing, GPGPU shared mem/cache is often used when the result pixel depends on surrounding pixels. Image smoothing/filtering is a typical example requiring GPGPU shared mem/cache.
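A rough sketch of that pattern for a 1D three-point smoothing filter (simplified boundary handling; launch with blockDim.x == TILE):

    #define TILE 256

    // Each block stages TILE elements plus a one-element halo on each side in
    // shared memory, so the three reads per output element are served from
    // shared memory instead of global memory.
    __global__ void smooth3(const float* in, float* out, int n)
    {
        __shared__ float tile[TILE + 2];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        int lid = threadIdx.x + 1;                        // local index, leaving room for the halo

        if (gid < n) {
            tile[lid] = in[gid];
            if (threadIdx.x == 0)                               // left halo (clamped at the edge)
                tile[0] = (gid > 0) ? in[gid - 1] : in[gid];
            if (threadIdx.x == blockDim.x - 1 || gid == n - 1)  // right halo (clamped at the edge)
                tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : in[gid];
        }
        __syncthreads();  // all threads reach this; the tile is now populated

        if (gid < n)
            out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }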
