Using Alea GPU DeviceFunction when the routine is processed on the CPU

I have a routine that is designed to be called under any of three processing modes: SingleCpuThread, ParallelCpuThreads, and ParallelGpuThreads.
Within the routine, the math is performed using Alea.DeviceFunction in order to be compliant with Alea when the routine is called under the ParallelGpuThreads mode.
Question: When the same routine is called under the other two modes, and the math is performed using DeviceFunction, is that using the GPU and incurring the overhead, marshaling, etc.? And if so (which would be bad), what's the best way to let the same routine use .NET's Math functions instead of DeviceFunction, without having to duplicate the whole routine into separate CPU-friendly and GPU-friendly versions?

As the term device function says, these functions run on the GPU, assuming all the data is already there. Hence there is no marshaling overhead.
To simplify CPU/GPU code reuse, most device functions are implemented to run on the CPU as well. Some device functions, however, just don't make sense on the CPU, such as the ballot function. That means you can simply use device functions throughout and know that Alea GPU will run them at full speed on the GPU. The compiler also automatically maps some of the .NET math functions to GPU device functions.

Related

How to use the function "getdata" (imaqtool) to transfer data directly to the GPU

I am currently using the function "getdata" from the imaqtool library to get my camera data and do some post-processing on my GPU.
Hence, I would like the data to be transferred directly from the CPU buffer memory to my GPU memory.
It is my understanding that "getdata" moves data from CPU memory (the buffer) to CPU memory. Hence, it should be trivial to transfer this data to my GPU directly.
However, I cannot find anything about it.
Any help is appreciated.
In short: MATLAB is not the right tool for your needs. MATLAB provides quite an easy interface, but that means you don't have full control over some things, and the main one is memory allocation and management. This is generally a good thing, as it is non-trivial to handle memory, but in your case, full control is exactly what you are asking for.
If you want to make a fast acquisition system where the memory is fully controlled by you, you will need to use low level languages such as C++/CUDA, and play with asynchronous operations and threads.
In MATLAB, the most flexibility you can get is using gpuArray(captured_data) once the data is on the CPU.

does wglGetCurrentContext sync GPU and CPU?

When programming with OpenGL, glGet functions should be avoided because they force the GPU and CPU to synchronize. Does this also apply to the WGL function wglGetCurrentContext, which obtains a handle to the current OpenGL context? If not, are there any other performance problems around wglGetCurrentContext?
There is some misconception implied in your question:
glGet functions should be avoided because they force the GPU and CPU to synchronize
The first part is true. The second part is not. Most glGet*() calls do not force a GPU/CPU synchronization. They only get state stored in the driver code, and do not involve the GPU at all.
There are some exceptions, which include the glGet*() calls that actually get data produced by the GPU. Typical examples include:
glGetBufferSubData(): Has to block if data is produced by the GPU (e.g. using transform feedback).
glGetTexImage(): Blocks if texture data is produced by the GPU (e.g. if used as a render target).
glGetQueryObjectiv(..., GL_QUERY_RESULT, ...): Blocks if query has not finished.
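If the blocking read of GL_QUERY_RESULT above is a concern, a common pattern is to poll GL_QUERY_RESULT_AVAILABLE first and only fetch the result once it is ready. A minimal C++ sketch, assuming a query object named query whose scope has already been ended with glEndQuery():

    GLint available = 0;
    glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available); // never blocks
    if (available) {
        GLint samples = 0;
        glGetQueryObjectiv(query, GL_QUERY_RESULT, &samples);         // safe: result is ready
        // ... use samples ...
    } else {
        // Result not ready yet; do other work and poll again later
        // instead of stalling the CPU waiting on the GPU.
    }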
Now, it's still true that you should avoid glGet*() calls where possible. Mainly for two reasons:
Most of them are unnecessary, since they query state that you set yourself, so you should already know what it is. And any unnecessary call is a waste.
They may cause synchronization between threads in multi-threaded driver implementations. So they may result in synchronization, but not with the GPU.
There are of course some good uses for glGet*() calls, for example:
To get implementation limits, like maximum texture sizes, etc. Call once during startup.
Calls like glGetUniformLocation() and glGetAttribLocation(). Make sure that you only call them once after shader linkage. Even though these are also avoidable by using qualifiers in the shader code at least in recent GLSL versions.
glGetError(), during debugging only.
As for wglGetCurrentContext(), I doubt that it would be very expensive. The current context is typically stored in some kind of thread local storage, which can be accessed very efficiently.
Still, I don't see why calling it would be necessary. If you need the context again later, you can store it away when you create it. And if that's not possible for some reason, you can call wglGetCurrentContext() once, and store the result. There definitely shouldn't be a need to call it repeatedly.
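A minimal sketch of that caching idea (Win32/WGL; the function names here are illustrative, not from any particular codebase):

    #include <windows.h>

    static HGLRC g_renderContext = nullptr;

    void RememberCurrentContext()
    {
        // Call once, e.g. right after wglCreateContext()/wglMakeCurrent().
        g_renderContext = wglGetCurrentContext();
    }

    HGLRC GetRenderContext()
    {
        // Cheap accessor: no WGL call, no driver involvement.
        return g_renderContext;
    }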
All of the WGL functions vary widely in their effects on performance, depending on the vendor and driver.
I don't expect that wglGetCurrentContext would be an especially performance-hogging call (unless you do it a ton of times), since the WGL functions are generally divorced from the GL context's state vector.
That being said, SETTING the current context will cause all manner of syncing between contexts, often in deeply undocumented ways. I've dealt with a couple of AMD and Intel driver bugs where some things that were supposed to be synchronized via other means could ONLY be synchronized with a redundant MakeCurrent call every frame.

OpenCL: work group concept

I don't really understand the purpose of Work-Groups in OpenCL.
I understand that they are a group of Work Items (supposedly, hardware threads), which get executed in parallel.
However, why is there this need of coarser subdivision ? Wouldn't it be OK to have only the grid of threads (and, de facto, only one W-G)?
Should a Work-Group map exactly to a physical core? For example, the Tesla C1060 card is said to have 240 cores. How would the Work-Groups map to this?
Also, as far as I understand, work-items inside a work group can be synchronized thanks to memory fences. Can work-groups synchronize or is that even needed ? Do they talk to each other via shared memory or is this only for work items (not sure on this one)?
Part of the confusion here, I think, comes down to terminology. What GPU people often call cores aren't really cores, and what GPU people often call threads are threads only in a certain sense.
Cores
A core, in GPU marketing terms may refer to something like a CPU core, or it may refer to a single lane of a SIMD unit - in effect a single core x86 CPU would be four cores of this simpler type. This is why GPU core counts can be so high. It isn't really a fair comparison, you have to divide by 16, 32 or a similar number to get a more directly comparable core count.
Work-items
Each work-item in OpenCL is a thread in terms of its control flow, and its memory model. The hardware may run multiple work-items on a single thread, and you can easily picture this by imagining four OpenCL work-items operating on the separate lanes of an SSE vector. It would simply be compiler trickery that achieves that, and on GPUs it tends to be a mixture of compiler trickery and hardware assistance. OpenCL 2.0 actually exposes this underlying hardware thread concept through sub-groups, so there is another level of hierarchy to deal with.
Work-groups
Each work-group contains a set of work-items that must be able to make progress in the presence of barriers. In practice this means that it is a set, all of whose state is able to exist at the same time, such that when a synchronization primitive is encountered there is little overhead in switching between them and there is a guarantee that the switch is possible.
A work-group must map to a single compute unit, which realistically means an entire work-group fits on a single entity that CPU people would call a core - CUDA would call it a multiprocessor (depending on the generation), AMD a compute unit and others have different names. This locality of execution leads to more efficient synchronization, but it also means that the set of work-items can have access to locally constructed memory units. They are expected to communicate frequently, or barriers wouldn't be used, and to make this communication efficient there may be local caches (similar to a CPU L1) or scratchpad memories (local memory in OpenCL).
As long as barriers are used, work-groups can synchronize internally, between work-items, using local memory, or by using global memory. Work-groups cannot synchronize with each other and the standard makes no guarantees on forward progress of work-groups relative to each other, which makes building portable locking and synchronization primitives effectively impossible.
A lot of this is due to history rather than design. GPU hardware has long been designed to construct vector threads and assign them to execution units in a fashion that optimally processes triangles. OpenCL falls out of generalising that hardware to be useful for other things, but not generalising it so much that it becomes inefficient to implement.
There are already a lot of good answers. For a further understanding of OpenCL terminology, this paper ("An Introduction to the OpenCL Programming Model" by Jonathan Tompson and Kristofer Schlachter) describes all the concepts very well.
Using work-groups gives the kernel compiler more room for optimization, because data is not transferred between work-groups. Depending on the OpenCL device used, there may be caches that can be used for local variables, resulting in faster data access. If there were only one work-group, local variables would effectively be the same as global variables, which would lead to slower data access.
Also, OpenCL devices usually use Single Instruction Multiple Data (SIMD) extensions to achieve good parallelism. A single work-group can be run in parallel using SIMD extensions.
Should a Work-Group exactly map to a physical core ?
I think the only way to find the fastest work-group size is to try different work-group sizes. It is also possible to query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE from the device with clGetKernelWorkGroupInfo. The fastest size should be a multiple of that.
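As a rough C++ sketch of that query (assuming the kernel and device have already been created; error handling is minimal):

    #include <CL/cl.h>
    #include <cstdio>

    void printPreferredMultiple(cl_kernel kernel, cl_device_id device)
    {
        size_t preferredMultiple = 0;
        cl_int err = clGetKernelWorkGroupInfo(
            kernel, device,
            CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
            sizeof(preferredMultiple), &preferredMultiple, nullptr);
        if (err == CL_SUCCESS)
            std::printf("preferred work-group size multiple: %zu\n", preferredMultiple);
        // Benchmark local sizes that are multiples of this value.
    }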
Can work-groups synchronize or is that even needed ?
Work-groups cannot be synchronized. That way there are no data dependencies between them, and they can even be run sequentially if that is considered the fastest way to run them. To achieve the same result as synchronization between work-groups, the kernel needs to be split into multiple kernels. Variables can be passed between the kernels via buffers.
One benefit of work-groups is that they enable using shared local memory as a programmer-defined cache. A value read from global memory can be stored in shared work-group local memory and then accessed quickly by any work-item in the work-group. A good example is the game of life: each cell depends on itself and the 8 cells around it. If each work-item read this information, you'd have 9x global memory reads. By using work-groups and shared local memory, you can approach 1x global memory reads (only approach it, since there are redundant reads at the edges).
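As a hedged illustration of that pattern (an OpenCL C kernel held in a C++ string; a 1D three-point stencil is used here instead of the full 2D game of life to keep the halo handling short, but the idea is the same):

    // Each work-item stages one element into work-group local memory, a barrier
    // makes the tile visible to the whole group, and neighbouring values are
    // then read from local memory instead of global memory.
    static const char* kStencilKernel = R"CLC(
    __kernel void stencil3(__global const float* in,
                           __global float* out,
                           __local float* tile,   // local size + 2 halo elements
                           const int n)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        tile[lid + 1] = (gid < n) ? in[gid] : 0.0f;             // own element
        if (lid == 0)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;           // left halo
        if (lid == lsz - 1)
            tile[lsz + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f; // right halo

        barrier(CLK_LOCAL_MEM_FENCE);  // whole work-group sees the filled tile

        if (gid < n)
            out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
    }
    )CLC";

The __local argument is sized on the host with clSetKernelArg(kernel, 2, (localSize + 2) * sizeof(float), NULL).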

Why don't GPU libraries support automated function composition?

Intel's Integrated Performance Primitives (IPP) library has a feature called Deferred Mode Image Processing (DMIP). It lets you specify a sequence of functions, composes the functions, and applies the composed function to an array via cache-friendly tiled processing. This gives better performance than naively iterating through the whole array for each function.
It seems like this technique would benefit code running on a GPU as well. There are many GPU libraries available, such as NVIDIA Performance Primitives (NPP), but none seem to have a feature like DMIP. Am I missing something? Or is there a reason that GPU libraries would not benefit from automated function composition?
GPU programming has similar concepts to DMIP function composition on CPU.
Although it's not easy to automate on the GPU (some third-party libraries may be able to do it), doing it manually is easier than on the CPU (see the Thrust example below).
Two main features of DMIP:
processing the image in fragments so that data can fit into cache;
parallel processing of different fragments, or execution of different independent branches of the graph.
When applying a sequence of basic operations to a large image,
feature 1 avoids RAM reads/writes between the basic operations; all the reads/writes are done in cache. Feature 2 can utilize a multi-core CPU.
The GPGPU concept similar to DMIP feature 1 is kernel fusion. Instead of applying multiple basic-operation kernels to the image data, one can combine the basic operations into a single kernel to avoid multiple GPU global memory reads/writes.
A manual kernel fusion example using Thrust can be found on page 26 of these slides.
It seems the library ArrayFire has made notable efforts on automatic kernel fusion.
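To make the manual fusion idea concrete, here is a hedged Thrust sketch (the functor and names are illustrative, not taken from the slides referenced above): two conceptually separate basic operations are combined into one functor, so the data makes a single round trip through GPU global memory instead of two.

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    struct scale_then_clamp
    {
        float a, b;
        __host__ __device__
        float operator()(float x) const
        {
            float y = a * x + b;          // basic operation 1
            return y < 0.0f ? 0.0f : y;   // basic operation 2, fused in place
        }
    };

    void run(thrust::device_vector<float>& data)
    {
        // One fused kernel launch; an unfused version would launch two kernels
        // and read/write the whole vector in global memory twice.
        thrust::transform(data.begin(), data.end(), data.begin(),
                          scale_then_clamp{2.0f, 1.0f});
    }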
The GPGPU concept similar to DMIP feature 2 is concurrent kernel execution.
This feature increases the bandwidth requirement, but most GPGPU programs are already bandwidth bound, so concurrent kernel execution is not likely to be used very often.
CPU cache vs. GPGPU shared mem/cache
The CPU cache avoids RAM reads/writes in DMIP, while for a fused GPGPU kernel, registers do the same thing. A CPU thread in DMIP processes a small image fragment, but a GPGPU thread often processes only one pixel, so a few registers are large enough to buffer the data for a GPU thread.
For image processing, GPGPU shared mem/cache is often used when the result pixel depends on surrounding pixels. Image smoothing/filtering is a typical example requiring GPGPU shared mem/cache.

What types of code domains is OpenCL suited to?

I read the OpenCL overview, and it states it is suitable for code that runs on CPUs, GPGPUs, DSPs, etc. However, from looking through the command reference, it seems to be all math and image-type operations. I didn't see anything for, say, strings.
This makes me wonder what would you run on a CPU via OpenCL?
Further, I know OpenCL can be used to perform sorting on GPGPUs. But would one ever use it (or, for that matter, a current GPGPU) to perform string processing such as pattern matching, metaphone extraction, dictionary lookup, or anything else that requires the processing of arrays of strings.
EDIT
I noticed that Intel's upcoming Ivy Bridge is touted as "OpenCL compliant" with reference to its graphics units. Does this imply that the CPU cores are not OpenCL compliant, or is there no such implication?
EDIT
In the interests of non-debate and constructiveness, I would appreciate if anyone could point me to official references that would answer my question.
You can think of OpenCL as a combination of a runtime (for device discovery and queueing) and a C-based programming language. This programming language has native vector types and built-in functions and operations for doing all sorts of fun stuff to these vectors. This is nice in that you can write a vectorized kernel in OpenCL, and it is the responsibility of the implementation to map that to the actual vector ISA of your hardware.
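As a hedged illustration of those vector types (an OpenCL C kernel held in a C++ string; names are illustrative), a SAXPY-style kernel can operate on four floats per work-item with a single expression, and the implementation maps it to SSE/AVX, NEON, or GPU vector hardware:

    static const char* kSaxpy4Kernel = R"CLC(
    __kernel void saxpy4(__global const float4* x,
                         __global float4* y,
                         const float a)
    {
        size_t i = get_global_id(0);
        // fma() is one of the built-in functions defined for vector types;
        // the scalar a is broadcast across all four lanes.
        y[i] = fma((float4)(a), x[i], y[i]);
    }
    )CLC";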
From this 4/2011 article, which might vanish:
There are two major CPU architectures out there, x86 and ARM, both of which should soon run OpenCL code.
If you write an OpenCL application that targets both of these architectures, you wouldn't have to worry about writing two versions, one SSE and one NEON. Just write OpenCL C and be done with it. Yes, I know. This assumes the vendor has done his job and written a solid implementation that fully utilizes the underlying ISA. But if he doesn't, complain!
In addition, some CL implementations offer auto-vectorization of scalar kernels, which are usually easier to write. A good auto-vectorizer would give you a solid performance increase for no effort. Since CL kernels are compiled "online," obtaining such a benefit wouldn't require shipping rebuilt code.
No links, but I would assume this is because algorithms that use strings may do a lot of dynamic memory allocation and branching, neither of which GPGPUs are well suited for. GPGPUs also have a lot in common with vector processing, so doing units of work on different-sized blocks of memory (which string algorithms generally involve, since you usually don't have a homogeneous group of strings) yields poorer performance and is hard to program.
GPUs were designed to do the same work, with little to no branching, on a homogeneous group of data (such as per-vector or per-pixel operations). Algorithms that can mimic this type of behavior are great on GPUs.
This makes me wonder what would you run on a CPU via OpenCL?
I prefer to use OpenCL to offload work from the CPU to my graphics hardware. Sometimes there is a limitation with my video card, so I like having a backup kernel for CPU use. Such limitations can be memory size, memory bottlenecks, low clock speed, or the PCI-e bus getting in the way.
I say I like using a separate kernel for the CPU because I think all kernels should be tweaked to run on their target hardware. I even like to have an OpenMP backup plan, as most algorithms I use get tested out in this manner ahead of time.
I suppose it is best practice to test out a GPU kernel on the CPU to make sure it runs as expected. If a user of your software has OpenCL installed, but only a CPU (or a low-end GPU), it's nice to be able to execute the same code on the different devices.
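A hedged host-side sketch of that fallback idea (only the first platform is considered and error handling is minimal):

    #include <CL/cl.h>

    cl_device_id pickDevice()
    {
        cl_platform_id platform = nullptr;
        if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS)
            return nullptr;

        cl_device_id device = nullptr;
        // Prefer a GPU device...
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) == CL_SUCCESS)
            return device;
        // ...but fall back to the CPU device so the same kernels still run.
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, nullptr) == CL_SUCCESS)
            return device;
        return nullptr;
    }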
