When to use Metal instead of the Accelerate API on Apple macOS

I'm currently writing a desktop audio processing application. Its purpose is to do a lot of signal processing, so I'm really concerned about performance and reliability.
I already use the Audio Toolbox / Core Audio APIs, but for the custom audio processing I was wondering which would be the better fit: Metal or Accelerate. Does anyone know about their differences, or have benchmarks? I didn't find anything really useful with Google...

Metal shaders use the GPU. Accelerate APIs use the CPU. So it really depends on what kind of GPU the system you are using provides, and whether your custom processing kernels can use the GPU efficiently. Examples might include algorithms that are massively parallel, such as convolution of large 2D arrays of data (much larger than 1D real-time audio buffers).
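For the CPU route, here is a minimal sketch of what a per-buffer operation looks like with Accelerate's vDSP; the vDSP calls are real, while the buffer size, gain value and RMS read-out are purely illustrative:

```cpp
// Minimal sketch: per-buffer DSP with Accelerate/vDSP on the CPU.
// Buffer size and signal contents are illustrative only.
#include <Accelerate/Accelerate.h>
#include <vector>
#include <cstdio>

int main() {
    const vDSP_Length n = 512;               // one audio buffer of 512 frames
    std::vector<float> in(n, 0.25f), out(n);
    float gain = 0.5f;

    // out = in * gain (vectorised scalar multiply)
    vDSP_vsmul(in.data(), 1, &gain, out.data(), 1, n);

    // RMS of the processed buffer
    float rms = 0.0f;
    vDSP_rmsqv(out.data(), 1, &rms, n);

    std::printf("rms = %f\n", rms);
    return 0;
}
```

The Metal path would instead mean encoding a compute command buffer per batch and paying dispatch and readback latency, which tends to outweigh any gain for small real-time audio buffers; it starts to pay off for the large, massively parallel workloads the answer describes.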

Related

Which areas can be accelerated by FPGA and GPU

I'm trying to accelerate my software using an FPGA or a GPU, and I'm a little confused about choosing between the two. Which areas are suitable for FPGAs and which for GPUs (for example, image processing is suitable for GPUs)? It would also be good to know which areas can be accelerated by more than 20x. I'm more interested in GPUs, as they are cheap and programming them is easier compared to FPGAs.
The main difference between an FPGA and a GPU is that, in fact, today's GPGPU is like a CPU. It can easily handle pointers and functions, and programming it is easy, because CUDA/OpenCL work with a subset of C/C++ (for example, OpenCL uses C99 plus some special functions).
An FPGA is more hardware oriented. You define the gates and the whole logic, which is faster and still parallel, but the parallelism is achieved by different means.
An FPGA is much better at serial operations, when it has a constant stream of data (streaming encryption, video decoding, ...) and when reprogramming is infrequent. You can close an FPGA in a box and let it do its job with nothing but input and output cables and a power supply connected.
A GPGPU is always connected through PCI Express, and sending programs to it is common (games use sets of shaders (GPU programs) that are switched quickly), so it is more of a batch-handling device. Today's GPGPUs have large amounts of RAM and many cores/multiprocessors, so they really are more like a CPU than an FPGA.
There is one thing (maybe more, but I can't remember others) at which an FPGA will be a lot faster: huge amounts of varied bit operations. I don't know of many uses for this other than encryption/decryption. GPUs are specialized for working with floats and ints (32-bit floating-point and integer values), but they are quite slow when you have to do binary bit-twiddling. Thanks to the FPGA architecture, that bit-twiddling can be done in parallel in a single clock tick.
On a GPU, you have to break the work into individual binary operations (AND, OR, XOR, ...) and work out the order in which they have to be executed.
TL;DR: If you don't have a specific need for an FPGA, choose a GPGPU.

Struggling to find a way to monitor CPU and GPU, either third-party or using code

I'm currently working on a way to evaluate some graphics programming techniques in DirectX 10, specifically custom shader files and instancing, but I need a method of measuring just how efficient it is to use them. I've been trying to find a way to evaluate this using draw speed, CPU load and GPU load, since in theory draw speed should increase and CPU & GPU load should drop as the program becomes more efficient.
My question: is there a decent third-party tool for monitoring GPU & CPU, or is it better to code it manually? I'm currently using the RasterTek framework.
DirectX has profiling tools already available.
i.e. here, according to Google ;-)
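If you would rather measure from code, a hedged sketch of GPU timing with Direct3D 10 timestamp queries follows; the query types and the GetData polling pattern are standard D3D10, while the TimeGpu helper name and the DrawFn callback are made up for illustration:

```cpp
// Sketch: timing a block of draw calls on the GPU with D3D10 timestamp queries.
// Assumes you already have an ID3D10Device* from your framework (e.g. RasterTek's).
// Return-value checks are omitted in this sketch.
#include <d3d10.h>

// Returns the GPU time in milliseconds spent between the two timestamps,
// or a negative value if the result was disjoint (e.g. the clock changed).
template <typename DrawFn>
double TimeGpu(ID3D10Device* device, DrawFn draw)
{
    D3D10_QUERY_DESC tsDesc = { D3D10_QUERY_TIMESTAMP, 0 };
    D3D10_QUERY_DESC djDesc = { D3D10_QUERY_TIMESTAMP_DISJOINT, 0 };
    ID3D10Query *tsBegin = nullptr, *tsEnd = nullptr, *disjoint = nullptr;
    device->CreateQuery(&tsDesc, &tsBegin);
    device->CreateQuery(&tsDesc, &tsEnd);
    device->CreateQuery(&djDesc, &disjoint);

    disjoint->Begin();
    tsBegin->End();            // timestamp queries only use End()
    draw();                    // the draw calls you want to measure
    tsEnd->End();
    disjoint->End();

    // Spin until the results are ready (fine for profiling, not for shipping code).
    UINT64 t0 = 0, t1 = 0;
    D3D10_QUERY_DATA_TIMESTAMP_DISJOINT dj = {};
    while (tsBegin->GetData(&t0, sizeof(t0), 0) != S_OK) {}
    while (tsEnd->GetData(&t1, sizeof(t1), 0) != S_OK) {}
    while (disjoint->GetData(&dj, sizeof(dj), 0) != S_OK) {}

    tsBegin->Release(); tsEnd->Release(); disjoint->Release();

    if (dj.Disjoint) return -1.0;
    return 1000.0 * double(t1 - t0) / double(dj.Frequency);
}
```

Pairing that GPU time with an ordinary CPU timer (e.g. QueryPerformanceCounter) around the same calls gives you both trends frame to frame.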

OpenCL - resources for multiple command queues

I'm working on an application where I process a video feed in real time on my GPU, and once in a while I need to do some resource-intensive calculations on the GPU besides that. My problem is that I want to keep the video processing at real-time speed while doing the extra work in parallel whenever it comes up.
The way I think this should be done is with two command queues, one for the real-time video processing and one for the extensive calculations. However, I have no idea how this will play out with the GPU's compute resources: will equally many workers be assigned to each command queue during parallel execution (so I could expect a slowdown of about 50% in my real-time computations)? Or is it device dependent?
The OpenCL specification leaves it up to the vendor to decide how to balance execution resources between multiple command queues. So a vendor could implement OpenCL in such a way that causes the GPU to work on only one kernel at a time. That would be a legal implementation, in my opinion.
If you really want to solve your problem in a device-independent way, I think you need to figure out how to break up your large non-real-time computation into smaller computations.
AMD has some extensions (some of which I think got adopted in OpenCL 1.2) for device fission, which means you can reserve some portion of the device for one context and use the rest for others.
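For reference, creating the two queues is the easy part; how the device shares execution resources between them is exactly the vendor-specific bit described above. A minimal sketch (error handling trimmed, and it simply takes the first GPU device it finds):

```cpp
// Sketch: one context, two command queues on the same GPU device.
// How the hardware shares execution resources between them is vendor-specific.
#include <CL/cl.h>

int main() {
    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

    // Queue 1: real-time video processing; Queue 2: occasional heavy work.
    // (clCreateCommandQueueWithProperties replaces this call in OpenCL 2.0+.)
    cl_command_queue videoQ = clCreateCommandQueue(ctx, device, 0, &err);
    cl_command_queue heavyQ = clCreateCommandQueue(ctx, device, 0, &err);

    // Enqueue kernels on videoQ every frame, and on heavyQ only when needed.

    clReleaseCommandQueue(heavyQ);
    clReleaseCommandQueue(videoQ);
    clReleaseContext(ctx);
    return 0;
}
```

If the two queues ever touch the same buffers, order them explicitly with cl_event dependencies (or clFinish on one queue); the implementation won't synchronise across queues for you.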

How to use the OpenGL ES shading language for general computing

I know OpenCL and CUDA, but these are not supported on mobile devices. Most mobile devices do support OpenGL ES, though, so I want to learn to use the OpenGL ES shading language for general-purpose computing, the way I would use OpenCL or CUDA.
How many kinds of buffers can I use? What are they?
How do I manipulate these buffers?
As far as I know, I can only create vertex and fragment shaders so far.
Which buffers can I manipulate when I use a fragment shader?
Which buffers can I manipulate when I use a vertex shader?
Are there any synchronization functions on the GPU (I mean synchronization within the GPU, like the thread synchronization within a block/work-group in OpenCL or CUDA)?
PS:
I read a paper Using Mobile GPU for General-Purpose Computing. Their experiments were performed on an Nvidia Tegra SoC with the following specifications:
1GHz dual-core ARM Cortex-A9 CPU,
1GB of RAM
an Nvidia ultra-low-power GeForce GPU running at 333MHz, and 512MB of Flash memory
They got a 3X speedup on an FFT (128×128). I think these results are good. Do you think it's worth doing? And the main bottleneck is memory access, right?
As many people have said, it's not worth doing general-purpose computing on OpenGL ES. So it's not worth expecting mobile devices to support OpenCL either, right? In my opinion, OpenGL ES is the foundation of OpenCL.
Some platforms don't support any floating-point formats. Some platforms (PowerVR, Tegra, Adreno) support half-float (16-bit float) surfaces, which can be used both as a render target and as a texture. Full float support exists on some platforms (Adreno, and I believe the latest PowerVR), but is rather rare.
So it depends a lot on what kind of calculation you're expecting to do, what kind of precision is acceptable for you, as well as what your target platform is.
Also take into account the fact that the current generation of OpenGL ES (2.0) does not have full IEEE float requirements, so the results may vary.
In the end, whether it is worth it depends a lot on your batch sizes though; accessing the results (i.e, reading pixels back from the render target) may be so slow that it negates the performance gain.
To address your bullet-points one by one:
How many kinds of buffers can I use? What are they?
You can create a texture and build an FBO out of it. Additionally, you can feed data to the shaders as constants (uniforms) or per-vertex data streams (attributes, with varyings carrying interpolated values on to the fragment shader).
How do I manipulate these buffers?
You can write to a texture using the normal texture handling functions.
Which buffers can I manipulate when I use a fragment shader?
When an FBO is bound, you can write to it from the fragment shader. Later, you can access the results by reading from the texture you attached to the FBO (see the sketch after this list).
Which buffers can I manipulate when I use a vertex shader?
None.
Are there any synchronization functions on the GPU?
You can wait for all submitted commands to complete using glFinish(). The driver should trigger an implicit pipeline flush anyway if you try to access the texture data.
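Putting those pieces together, here is a condensed, hedged sketch of the render-to-texture round trip: data in as a texture, one fragment-shader "kernel" invocation per output pixel, results back via glReadPixels. It assumes a current OpenGL ES 2.0 context (EGL setup omitted), skips shader-compile error checking, and the doubling "kernel" is purely illustrative:

```cpp
// Condensed sketch of GPGPU via render-to-texture on OpenGL ES 2.0.
// Assumes a current GL ES 2.0 context already exists (EGL setup omitted).
#include <GLES2/gl2.h>
#include <vector>

static const char* kVS =
    "attribute vec2 pos;\n"
    "varying vec2 uv;\n"
    "void main() { uv = pos * 0.5 + 0.5; gl_Position = vec4(pos, 0.0, 1.0); }\n";

static const char* kFS =
    "precision mediump float;\n"
    "uniform sampler2D src;\n"
    "varying vec2 uv;\n"
    "void main() { gl_FragColor = texture2D(src, uv) * 2.0; }\n"; // the 'kernel'

static GLuint Compile(GLenum type, const char* src) {
    GLuint s = glCreateShader(type);
    glShaderSource(s, 1, &src, nullptr);
    glCompileShader(s);                      // error checks omitted in this sketch
    return s;
}

void RunOnePass(const std::vector<unsigned char>& input, int w, int h,
                std::vector<unsigned char>& output) {
    // Input data lives in a texture; RGBA8 is the portable choice on ES 2.0.
    GLuint inTex, outTex, fbo;
    glGenTextures(1, &inTex);
    glBindTexture(GL_TEXTURE_2D, inTex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, input.data());

    // Output texture attached to an FBO: this is what the fragment shader writes to.
    glGenTextures(1, &outTex);
    glBindTexture(GL_TEXTURE_2D, outTex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, outTex, 0);

    // Program: one full-screen quad, one "kernel" in the fragment shader.
    GLuint prog = glCreateProgram();
    glAttachShader(prog, Compile(GL_VERTEX_SHADER, kVS));
    glAttachShader(prog, Compile(GL_FRAGMENT_SHADER, kFS));
    glLinkProgram(prog);
    glUseProgram(prog);

    const GLfloat quad[] = { -1,-1,  1,-1,  -1,1,  1,1 };
    GLint loc = glGetAttribLocation(prog, "pos");
    glEnableVertexAttribArray((GLuint)loc);
    glVertexAttribPointer((GLuint)loc, 2, GL_FLOAT, GL_FALSE, 0, quad);

    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, inTex);
    glUniform1i(glGetUniformLocation(prog, "src"), 0);

    glViewport(0, 0, w, h);
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);   // one fragment per output element

    // Reading back is the expensive part the answers warn about.
    output.resize(size_t(w) * h * 4);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, output.data());
    // Cleanup of GL objects omitted for brevity.
}
```

As the answers above note, the glReadPixels at the end is typically the dominant cost, so batch as much work as you can into each pass.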
Before my forays into OpenCL, when I had large quantities of numeric data, I would feed it to the GPU as pixel data via an RGBA image and then manipulate it there. It's a nice, fast way to approach large number sets for mathematical manipulation, although you then have to copy from the buffer back to the CPU to extract the changes. So whether it's worth it depends on how much data you need to manipulate this way, as well as how much graphics RAM you have available, how many cores, etc.

What is the overhead of constantly uploading new Textures to the GPU in OpenGL?

What is the overhead of continually uploading textures to the GPU (and replacing old ones)? I'm working on a new cross-platform 3D windowing system that uses OpenGL, and I'm planning on uploading a single bitmap for each window (containing the UI elements). That bitmap would be updated in sync with the GPU (using VSync). I was wondering if this is a good idea, or if constantly writing bitmaps would incur too much of a performance overhead. Thanks!
Well, something like Nvidia's GeForce 460M has 60 GB/s of bandwidth to local memory.
PCI Express 2.0 x16 can manage 8 GB/s.
As such, if you try to transfer too many textures over the PCIe bus, you can expect to run into memory bandwidth problems. That budget gives you about 136 MB per frame at 60 Hz. An uncompressed 24-bit 1920x1080 frame is roughly 6 MB, so suffice to say you could upload a fair few frames of video per rendered frame on an x16 graphics card.
Sure, it's not as simple as that: there is PCIe overhead of around 20%, and all draw commands must be uploaded over that link too.
In general, though, you should be fine provided you don't overdo it. Bear in mind that it is sensible to upload a texture in one frame that you don't expect to use until the next (or even later). That way you don't create a bottleneck where rendering is halted while it waits for a PCIe upload to complete.
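On desktop OpenGL, the usual way to get that "upload now, sample next frame" behaviour is a pair of pixel buffer objects used round-robin. A hedged sketch follows; the GL calls are standard (GL 2.1 / GL_ARB_pixel_buffer_object), while the StreamedTexture struct, BGRA layout and GLEW loader are just assumptions for illustration:

```cpp
// Sketch: double-buffered pixel-buffer-object (PBO) uploads so the bitmap
// written this frame is only sampled next frame, keeping the PCIe copy async.
// Assumes a current desktop OpenGL context with GL_ARB_pixel_buffer_object.
#include <GL/glew.h>
#include <cstring>

struct StreamedTexture {
    GLuint tex = 0;
    GLuint pbo[2] = {0, 0};
    int w = 0, h = 0, frame = 0;

    void Init(int width, int height) {
        w = width; h = height;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
        glGenBuffers(2, pbo);
        for (int i = 0; i < 2; ++i) {
            glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[i]);
            glBufferData(GL_PIXEL_UNPACK_BUFFER, w * h * 4, nullptr, GL_STREAM_DRAW);
        }
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    }

    // Call once per frame with the freshly drawn window bitmap (BGRA, w*h*4 bytes).
    void Upload(const void* pixels) {
        int writeIdx = frame % 2;        // PBO we fill this frame
        int readIdx  = (frame + 1) % 2;  // PBO filled last frame, copied to the texture now

        // Start the texture update from last frame's PBO (GPU-side copy, no stall).
        glBindTexture(GL_TEXTURE_2D, tex);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[readIdx]);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);

        // Refill the other PBO with this frame's bitmap.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[writeIdx]);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, w * h * 4, nullptr, GL_STREAM_DRAW); // orphan
        void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
        if (dst) {
            std::memcpy(dst, pixels, size_t(w) * h * 4);
            glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
        ++frame;
    }
};
```

Note that the very first frame samples a PBO that hasn't been filled yet; a real implementation would skip the glTexSubImage2D until both PBOs contain data.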
Ultimately, your answer is going to come from profiling. However, one early optimization you can make is to avoid updating a texture if nothing has changed, because depending on the size of the textures and the pixel format, uploads could easily be prohibitively expensive.
Profile with a simpler situation that simulates the kind of usage you expect. I suspect the performance overhead (without the optimization I mentioned, at least) will become unusable once you have more than a handful of windows, depending on the size of those windows.
