How to use the OpenGL ES shading language for general computing - opengl-es

I know OpenCL and CUDA, but these are not supported on most mobile devices. Most mobile devices do support OpenGL ES, however, so I want to learn to use the OpenGL ES shading language for general-purpose computing, like OpenCL or CUDA but in GLSL.
How many kinds of buffers can I use, and what are they?
How do I manipulate these buffers?
As far as I know, I can only create vertex and fragment shaders so far.
Which buffers can I manipulate when I use a fragment shader?
Which buffers can I manipulate when I use a vertex shader?
Are there any synchronization functions on the GPU? (I mean synchronization within the GPU, like thread synchronization within a block in OpenCL or CUDA.)
PS:
I read a paper Using Mobile GPU for General-Purpose Computing. Their experiments were performed on an Nvidia Tegra SoC with the following specifications:
1GHz dual-core ARM Cortex-A9 CPU,
1GB of RAM
an Nvidia ultra-low-power GeForce GPU running at 333MHz, and 512MB of Flash memory
They achieved a 3x speedup on a 128x128 FFT. I think these results are good. Do you think it's worth doing? So the main bottleneck is memory access, right?
Many people have said it's not worth doing general-purpose computing with OpenGL ES. Does that mean it's not worth expecting mobile devices to support OpenCL either? In my opinion, OpenGL ES is the foundation of OpenCL.

Some platforms don't support any floating point formats. Some platforms (powervr, tegra, adreno) support half-float (16bit float) surfaces, which can be used both as a render target and as a texture. Full float support exists on some platforms (adreno, and I believe the latest powervr), but is rather rare.
So it depends a lot on what kind of calculation you're expecting to do, what kind of precision is acceptable for you, as well as what your target platform is.
Also take into account that the current generation of OpenGL ES (2.0) does not have full IEEE 754 float requirements, so results may vary.
In the end, whether it is worth it depends a lot on your batch sizes though; accessing the results (i.e, reading pixels back from the render target) may be so slow that it negates the performance gain.
To address your bullet-points one by one:
How many kinds of buffers can I use, and what are they?
You can create a texture and build an FBO (framebuffer object) out of it. Additionally, you can feed data to the shaders as constants (uniforms) or per-vertex data streams (varyings/attributes).
How do I manipulate these buffers?
You can write to a texture using the normal texture-handling functions (e.g. glTexImage2D, glTexSubImage2D).
Which buffers can I manipulate when I use a fragment shader?
When an FBO is bound, you can write to it from the fragment shader. Later, you can access the results by reading from the texture you attached to the FBO.
Which buffers can I manipulate when I use a vertex shader?
None.
Are there any synchronization functions on the GPU?
You can flush the pipeline using glFinish(). The drivers should cause a pipeline flush implicitly if you try to access the texture data, though.

Before my forays into OpenCL, when I had large quantities of numeric data I would feed it to the GPU as pixel data via an RGBA image and then manipulate it there. It's a nice, fast way to approach large number sets for mathematical manipulation. However, you then have to copy the buffer back to the CPU to extract the changes, so whether it is worth it depends on how much data you need to manipulate this way, as well as how much graphics RAM you have available, the number of cores, and so on.
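The packing trick described above can be sketched in plain Python (used here only for brevity): 32-bit floats are stored byte-for-byte in what would be an RGBA8 texture, one texel per float, with the `struct` module standing in for the actual texture upload/readback.

```python
import struct

def floats_to_rgba(values):
    """Pack each 32-bit float into one RGBA8 texel (4 bytes per value)."""
    return struct.pack("<%df" % len(values), *values)

def rgba_to_floats(data):
    """Recover the floats after reading the pixel data back."""
    return list(struct.unpack("<%df" % (len(data) // 4), data))

pixels = floats_to_rgba([1.5, -2.25, 3.0])
assert len(pixels) == 12                       # 3 texels * 4 channels
assert rgba_to_floats(pixels) == [1.5, -2.25, 3.0]
```

The round trip is exact because the bytes are never reinterpreted as colors; a real implementation would also need a GL context and a texture of suitable dimensions.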

Related

Linux device driver for display | Framebuffer

I am studying a Linux display device driver that runs a TFT display. The framebuffer stores all the data that is to be displayed.
Question: does the display driver have an equivalent buffer of its own to handle the framebuffer from the kernel?
My concern is that the processor has to take the output from the GPU and produce a framebuffer to be sent out to the display driver. But depending on the display there might be latencies and other issues, so does the display driver access the framebuffer directly, or does it use a buffer of its own as well?
This is a rabbit-hole question; it seems simple on the surface, but a factual answer is bound to end up in fractal complexity.
It's literally impossible to give a generalized answer.
The cliff notes version is: GPUs have their own memory, which is directly visible to the CPU in the form of a memory mapping (you can query the actual range of physical addresses from e.g. /sys/class/drm/card0/device/resource). Somewhere in there, there's also the memory used for the display scanout buffer. When using GPU accelerated graphics, the GPU will write directly to those scanout buffers – possibly to memory that's on a different graphics card (that's how e.g. Hybrid graphics work).
My concern is that the processor has to take the output from the GPU and produce a framebuffer to be sent out to the display driver
Usually that's not the case. However, even if there's a copy involved, these days bus bandwidths are large enough for that copy operation not to matter.
I am studying the display device driver for linux that runs TFT display
If this is a TFT display connected via SPI or a parallel bus made from GPIOs, then yes, there'll be some memory reserved for the image to reside in. Strictly speaking this can be in the CPU's RAM, or in the VRAM of a GPU, if there is one. However, as far as latencies go, the copy operations for scanout don't really matter these days.
20 years ago they did, yes, but even back then you could avoid the latencies with clever scheduling.
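As a rough illustration of why the bus can matter for such displays, here is a back-of-envelope bandwidth estimate in Python, assuming a hypothetical 320x240 RGB565 panel refreshed at 60 Hz (the panel dimensions and bus speed are illustrative assumptions, not from the question):

```python
def scanout_bandwidth(width, height, bits_per_pixel, fps):
    """Bytes per second needed to refresh the panel continuously."""
    return width * height * (bits_per_pixel // 8) * fps

# Hypothetical small TFT: 320x240 pixels, RGB565 (16 bpp), 60 Hz refresh.
needed = scanout_bandwidth(320, 240, 16, 60)
print(needed)   # 9216000 bytes/s, i.e. about 9.2 MB/s
```

For comparison, a 40 MHz SPI link moves at most about 5 MB/s, so such a panel could not be refreshed at a full 60 Hz over that bus; a parallel bus or partial updates would be needed.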

When to use Metal instead of Accelerate API on Apple macOs

I'm currently writing a desktop audio processing application. Its purpose is to do a lot of signal processing so I'm really concerned about the performance and reliability.
I've already used the Audio Toolbox / Core Audio APIs, but for the custom audio processing I was wondering which would be the better fit between Metal and Accelerate. Does anyone know about their differences, or have benchmarks? I didn't find anything really useful with Google...
Metal shaders use the GPU. Accelerate APIs use the CPU. So it really depends on what kind of GPU the system you are using provides, and whether your custom processing kernels can use the GPU efficiently. Examples might include algorithms that are massively parallel, such as convolution of large 2D arrays of data (much larger than 1D real-time audio buffers).

Why don't GPU libraries support automated function composition?

Intel's Integrated Performance Primitives (IPP) library has a feature called Deferred Mode Image Processing (DMIP). It lets you specify a sequence of functions, composes the functions, and applies the composed function to an array via cache-friendly tiled processing. This gives better performance than naively iterating through the whole array for each function.
It seems like this technique would benefit code running on a GPU as well. There are many GPU libraries available, such as NVIDIA Performance Primitives (NPP), but none seem to have a feature like DMIP. Am I missing something? Or is there a reason that GPU libraries would not benefit from automated function composition?
GPU programming has concepts similar to DMIP's function composition on the CPU.
Although it's not easy to automate on the GPU (some third-party libraries may be able to do it), doing it manually is easier than on the CPU (see the Thrust example below).
Two main features of DMIP:
processing the image in fragments, so the data fits into cache;
processing different fragments in parallel, or executing different independent branches of a graph.
When applying a sequence of basic operations to a large image, feature 1 avoids RAM reads/writes between the basic operations: all of the reads/writes happen in cache. Feature 2 can utilize a multi-core CPU.
The analogous concept to DMIP feature 1 for GPGPU is kernel fusion: instead of applying multiple basic-operation kernels to the image data, one combines the basic operations into a single kernel to avoid multiple reads/writes of GPU global memory.
A manual kernel fusion example using Thrust can be found on page 26 of these slides.
It seems the library ArrayFire has made notable efforts on automatic kernel fusion.
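The payoff of fusion can be sketched with a toy example in plain Python, standing in for GPU kernels; the intermediate list built by the unfused version plays the role of a global-memory round trip that the fused version avoids:

```python
def unfused(data):
    """Two passes: the intermediate list 'tmp' is fully written and then
    fully re-read, like a round trip through GPU global memory."""
    tmp = [x * 2.0 for x in data]       # kernel 1: scale
    return [x + 1.0 for x in tmp]       # kernel 2: offset

def fused(data):
    """One pass: both operations applied per element, no intermediate."""
    return [x * 2.0 + 1.0 for x in data]

data = [0.0, 1.0, 2.0]
assert unfused(data) == fused(data) == [1.0, 3.0, 5.0]
```

On a real GPU the fused kernel halves the global-memory traffic for this pipeline; in Python the two versions merely produce identical results.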
The analogous concept to DMIP feature 2 for GPGPU is concurrent kernel execution.
This feature increases the bandwidth requirement, but most GPGPU programs are already bandwidth-bound, so concurrent kernel execution is not likely to be used very often.
CPU cache vs. GPGPU shared mem/cache
The CPU cache avoids the RAM reads/writes in DMIP, while for a fused GPGPU kernel, registers do the same thing. A CPU thread in DMIP processes a small image fragment, but a GPGPU thread often processes only one pixel, so a few registers are large enough to buffer the data for a GPU thread.
For image processing, GPGPU shared mem/cache is often used when the result pixel depends on surrounding pixels. Image smoothing/filtering is a typical example requiring GPGPU shared mem/cache.

What is the overhead of constantly uploading new Textures to the GPU in OpenGL?

What is the overhead of continually uploading textures to the GPU (and replacing old ones). I'm working on a new cross-platform 3D windowing system that uses OpenGL, and am planning on uploading a single Bitmap for each window (containing the UI elements). That Bitmap would be updated in sync with the GPU (using the VSync). I was wondering if this is a good idea, or if constantly writing bitmaps would incur too much of a performance overhead. Thanks!
Well, something like Nvidia's GeForce 460M has 60 GB/s of bandwidth to local memory.
PCI express 2.0 x16 can manage 8GB/sec.
As such, if you try to transfer too many textures over the PCIe bus, you can expect to run into memory bandwidth problems. That gives you about 136 MB per frame at 60 Hz; an uncompressed 24-bit 1920x1080 frame is roughly 6 MB. Suffice to say, you could upload a fair few frames of video per frame on a 16x graphics card.
Sure, it's not as simple as that: there is PCIe overhead of around 20%, and all draw commands must be uploaded over that link too.
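The arithmetic above can be double-checked quickly; this sketch reads the 8 GB/s figure as GiB/s (which is how the 136 MB figure falls out) and ignores the ~20% overhead:

```python
# Back-of-envelope check of the per-frame upload budget.
bus_bytes_per_sec = 8 * 2**30              # PCIe 2.0 x16, ~8 GiB/s
budget_per_frame = bus_bytes_per_sec / 60  # budget at 60 Hz
frame_bytes = 1920 * 1080 * 3              # uncompressed 24-bit 1080p

print(budget_per_frame / 2**20)   # ~136.5 MiB available per frame
print(frame_bytes / 2**20)        # ~5.9 MiB per full-screen upload
```

So even with the 20% overhead factored back in, roughly 18 full-screen uploads would fit in one frame's budget, which matches the "fair few frames of video" claim.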
In general, though, you should be fine provided you don't overdo it. Bear in mind that it would be sensible to upload a texture in one frame that you aren't expecting to use until the next (or even later). This way you don't create a bottleneck where rendering is halted waiting for a PCIe upload to complete.
Ultimately, your answer is going to be profiling. However, some early optimizations you can make are to avoid updating a texture if nothing has changed. Depending on the size of the textures and the pixel format, this could easily be prohibitively expensive.
Profile with a simpler setup that simulates the kind of usage you expect. I suspect the performance overhead (without the optimization I mentioned, at least) will be unusable if you have more than a handful of windows, depending on their size.

Report Direct3D memory usage

I have a Direct3D 9 application and I would like to monitor the memory usage.
Is there a tool to know how much system and video memory is used by Direct3D?
Ideally, it would also report how much is allocated for textures, vertex buffers, index buffers...
You can use the old DirectDraw interface to query the total and available memory.
The numbers you get that way are not reliable though.
The free memory may change at any instant and the available memory often takes the AGP-memory into account (which is strictly not video-memory). You can use the numbers to do a good guess about the default texture-resolutions and detail-level of your application/game, but that's it.
You may wonder why there is no way to get better numbers; after all, it can't be too hard to track resource usage.
From an application point of view this is correct. You may think that the video memory just contains surfaces, textures, index- and vertex buffers and some shader-programs, but that's not true on the low-level side.
There are lots of other resources as well. All of these are created and managed by the Direct3D driver to make rendering as fast as possible. Among others, there are hierarchical z-buffer acceleration structures and pre-compiled command lists (i.e. the data required to render something, in the format understood by the GPU). The driver may also queue rendering commands for multiple frames in advance to even out the frame rate and increase parallelism between the GPU and CPU.
The driver also does a lot of work under the hood for you. Heuristics are used to detect draw calls with static geometry and constant rendering settings. A driver may decide to optimize the geometry in these cases for better cache usage. This all happens in parallel and under the control of the driver. All this stuff needs space as well, so the free memory may change at any time.
However, the driver also does caching for your resources, so you don't really need to know the resource usage in the first place.
If you need more space than is available, that's no problem. The driver will move the data between system RAM, AGP memory and video RAM for you. In practice you never have to worry about running out of video memory. Sure, once you need more video memory than is available the performance will suffer, but that's life :-)
Two suggestions:
You can call GetAvailableTextureMem at various times to obtain a (rough) estimate of how overall memory usage progresses.
Assuming you develop on Nvidia hardware, PerfHUD includes a graphical representation of consumed AGP/video memory (shown separately).
You probably won't be able to obtain a nice clean matrix of memory consumers (vertex buffers etc.) vs. memory location (AGP, VID, system), as -
(1) the driver has a lot of freedom in transferring resources between memory types, and
(2) the actual variety of memory consumers is far greater than the exposed D3D interfaces.