CUDA reduction using thrust inside kernel

I want to do a parallel reduction inside my kernel, on data that is in shared memory. Is this possible with the Thrust library?
Something like
int sum = thrust::reduce(myIntArray, myIntArray+numberOfItems, (int) 0, thrust::maximum<int>());
But this doesn't work inside a kernel. Is it possible? Thank you.

No, thrust::reduce() is a host function that results in the execution of CUDA kernels if the data is on the GPU.
You would have to dig into the thrust source and find the __device__ functions it uses for reduction. Those would be callable from your kernel. If the logic for reduction is contained in other __global__ kernels, you'll have to piece it together manually in order to use it.
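For reference, the reduction that would have to be pieced together by hand is usually a tree reduction over the shared array. The following is a minimal sketch in plain C, not CUDA and not Thrust's implementation: the array stands in for the __shared__ buffer, each iteration of the inner loop plays the role of one thread, and the function name block_max is made up for illustration. It assumes the element count is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C model of an in-block tree reduction (maximum).
 * `shared` stands in for a __shared__ array; n must be a power of two.
 * Each iteration of the inner loop plays the role of one thread. */
static int block_max(int *shared, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);   /* power-of-two sizes only */
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        for (size_t tid = 0; tid < stride; tid++) {   /* "threads" 0..stride-1 */
            if (shared[tid + stride] > shared[tid])
                shared[tid] = shared[tid + stride];
        }
        /* in a real kernel: __syncthreads() here, before the next stride */
    }
    return shared[0];
}
```

In an actual kernel the inner loop disappears (every thread with tid < stride does one comparison) and the barrier between strides becomes essential.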


Are function calls in Metal Shaders expensive?

I have code that's shared between different compute shaders, located in different #include files. It ranges from custom data types to utility functions.
I'm wondering whether these functions could become a performance issue as the project gets bigger and more of them need to be called?
Are functions automatically inlined when appropriate?
The Metal shader compiler should flatten all shader code into a single function, so you should not need to be concerned about inlining. The more important thing is that your code is structured to take advantage of parallel processing and coalesced reads and writes.

Is batching same functions with SIMD instruction possible?

I have a scenario in which many instances of exactly the same function (for simplicity, let's just consider C/C++ and Python here) are executed at the same time on my machine. Intuitively, I would use multi-threading and treat each instance of the function as a thread to exploit the parallelism. The instances do not contend for the same resources, but they do contain many branches (e.g., for loops). However, since they are actually the same function, I'm thinking about batching them using SIMD instructions, e.g., AVX-512. Of course, it should be automatic so that users do not have to modify their code.
The reason? Every thread/process/container/VM occupies resources, but AVX only needs one instruction. So I can serve more users with the same hardware.
Most articles I find online focus on using AVX instructions inside the function, for example, to accelerate the stream data processing, or deal with some large calculation. None of them mentions batching different instances of same function.
I know there are some challenges, such as different execution path caused by different input, and it is not easy to turn a normal function into a batched version automatically, but I think it is indeed possible technically.
Here are my questions
1. Is it hard (or even possible) to automatically change a normal function into a batched version?
2. If 1 is no, what restrictions should I put on the function to make it possible? For example, what if the function only has one path regardless of the data?
3. Are there other technologies that solve the problem better? I don't think a GPU is a good option for me, because a GPU cannot support I/O or branch instructions, although its SIMT model fits my goal perfectly.
SSE/AVX is basically a vector unit: it allows simple operations (such as + - * /, and, or, xor) on multiple array elements at once. AVX1 and AVX2 have 256-bit registers, so you can process e.g. eight 32-bit singles or four doubles at once. AVX-512 is coming but is still quite rare at the moment.
So if your functions are all operations on arrays of basic types, it is a natural fit. Rewriting your function using AVX intrinsics is doable if the operations are very simple. Complex cases (such as data that does not match the vector width), or doing it in assembler, are a challenge though.
If your function is not operating on vectors then it becomes difficult, and the possibilities are mostly theoretical. Autovectorizing compilers can sometimes do this, but it's rare, limited, and extremely complex.
There are two ways to fix this: vectorization (SIMD) and parallelization (threads).
GCC can already do the SIMD vectorization you want provided that the function is inlined, and the types and operations are compatible (and it will automatically inline smallish functions without you asking it to).
static inline void func (int i) {
    /* one independent element per call, so calls can be batched */
    somearray[i] = someotherarray[i] * athirdarray[i];
}

for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
Vectorization and inlining are enabled at -O3.
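To make the "batching" idea concrete, here is a small sketch of turning a scalar function with a data-dependent branch into a batched loop. The function names clamp_scale and clamp_scale_batch are invented for illustration. The branch is written as a select so that every element follows the same instruction stream, which is the form an auto-vectorizer can typically handle. Whether a given compiler actually vectorizes it depends on the compiler version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar version: one call per request, with a data-dependent branch. */
static float clamp_scale(float x)
{
    if (x < 0.0f)
        return 0.0f;
    return 2.0f * x;
}

/* Batched version: the same logic over n independent inputs.
 * The branch becomes a select, so all lanes execute the same instructions. */
static void clamp_scale_batch(const float *restrict in, float *restrict out, size_t n)
{
    assert(in != out);   /* restrict: input and output must not alias */
    for (size_t i = 0; i < n; i++) {
        float scaled = 2.0f * in[i];
        out[i] = (in[i] < 0.0f) ? 0.0f : scaled;
    }
}
```

Functions with loops whose trip count depends on the input, or with I/O, do not reduce to a select this easily, which is why automatic batching of arbitrary functions is hard.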
If the functions are too complex, and/or GCC doesn't vectorize it yet, then you can use OpenMP or OpenACC to parallelize it.
OpenMP uses special markup to tell the compiler where to spawn threads.
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
And yes, you can do that on a GPU too! You do have to do a bit more typing to get the data copied in and out correctly. Only the marked up areas run on the GPU. Everything else runs on the CPU, so I/O etc. is not a problem.
#pragma omp target map(somearray,someotherarray,athirdarray)
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
OpenACC is a similar idea, but more specialized towards GPUs.
You can find OpenMP and OpenACC compilers in many places. Both GCC and LLVM support NVIDIA GPUs. LLVM has some support for AMD GPUs, and there are unofficial GCC builds available too (with official support coming soon).

Is a reduction or atomic operation on mat/vec types with OpenGL compute shader possible?

Is it possible to do reduction/update or atomic operations in a compute shader on e.g. mat3 or vec3 data types?
Like this scheme:
some_type mat3 A;
void main() {
    A += mat3(1);
}
I have tried using shader storage buffer objects (SSBOs), but it seems the update is not atomic (at least I get wrong results when I read back the buffer).
Does anyone have an idea to realize this? Maybe creating a tiny 3x3 image2D and store the result by imageAtomicAdd in there?
There are buffer-based atomics in GLES 3.1.
Section 7.7.
As for creating a tiny 3x3 image2D and storing the result in it with imageAtomicAdd: image atomics are not core in GLES 3.1 and require an extension.
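One way to use buffer atomics for a mat3 accumulator is to keep nine integers and add fixed-point values component by component, since the core atomicAdd on buffer variables works on int and uint rather than on floats or matrices. The sketch below models that idea in C11 with <stdatomic.h>. It is not shader code, and the type atomic_mat3, the helper names, and the 16.16 scale are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

#define FIXED_SCALE 65536   /* assumed 16.16 fixed-point scale */

typedef struct { atomic_int m[9]; } atomic_mat3;   /* stand-in for an int[9] in an SSBO */

static void atomic_mat3_clear(atomic_mat3 *acc)
{
    for (int k = 0; k < 9; k++)
        atomic_init(&acc->m[k], 0);
}

/* Adds a float 3x3 delta, one atomic integer add per component. */
static void atomic_mat3_add(atomic_mat3 *acc, const float delta[9])
{
    for (int k = 0; k < 9; k++) {
        float scaled = delta[k] * FIXED_SCALE;
        int q = (int)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f); /* round to nearest */
        atomic_fetch_add(&acc->m[k], q);
    }
}

/* Reads one component back as a float. */
static float atomic_mat3_get(atomic_mat3 *acc, int k)
{
    assert(k >= 0 && k < 9);
    return (float)atomic_load(&acc->m[k]) / FIXED_SCALE;
}
```

The matrix as a whole is not updated atomically, only each component is. For a pure sum that is enough, because the per-component additions commute, but the fixed-point scale limits both range and precision.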
Thank you for the links. I forgot to mention that I work with ARM Mali GPUs and as such they do not expose TLP and do not have warps/wave fronts as Nvidia or AMD. That is, I might have to figure out another quick way.
The techniques proposed in the comments for your post (in particular the log(N) divisor approach where you fold the top half of the results down) still work fine on Mali. The technique doesn't rely on warps/wavefronts - as the original poster said, you just need synchronization (e.g. use a barrier() rather than relying on the implicit barrier which wavefronts would give you).
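The fold-down scheme can be sketched as follows. This is a sequential C model rather than GLSL: the mat3 struct stands in for the shader type, the function name fold_sum is made up, and each iteration of the loop over i corresponds to one shader invocation. It rounds the half size up so that counts that are not a power of two also work.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float m[9]; } mat3;   /* stand-in for a GLSL mat3 */

/* Sums `count` matrices in place; the total ends up in items[0].
 * Each pass adds the top half of the live range onto the bottom half. */
static void fold_sum(mat3 *items, size_t count)
{
    assert(count > 0);
    while (count > 1) {
        size_t half = (count + 1) / 2;              /* rounds up for odd counts */
        for (size_t i = 0; i + half < count; i++)   /* one invocation per i */
            for (int k = 0; k < 9; k++)
                items[i].m[k] += items[i + half].m[k];
        /* in a shader: barrier() here so all adds finish before the next pass */
        count = half;
    }
}
```

With N inputs this takes about log2(N) passes, and no invocation ever writes an element that another invocation reads in the same pass, so a barrier between passes is the only synchronization needed.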

Alternative for dynamic parallelism for CUDA

I am very new to the CUDA programming model and to programming in general, I suppose. I'm attempting to parallelize an expectation maximization algorithm. I am working on a GTX 480, which has compute capability 2.0. At first, I sort of assumed that there's no reason for the device to launch its own threads, but of course, I was sadly mistaken. I came across this pdf.
Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5. Without diving into too many specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive ability. I'm not sure if my question makes sense, so please ask if you need clarification.
Thank you!
As indicated by @JackOLantern, dynamic parallelism can be described in a nutshell as the ability to call a kernel (i.e. a __global__ function) from device code (a __global__ or __device__ function).
Since the kernel call is the principal method by which the machine spins up multiple threads in response to a single function call, there is really no direct alternative that provides all the capability of dynamic parallelism in a device that does not support it (i.e. pre-cc 3.5 devices).
Without dynamic parallelism, your overall code will almost certainly involve more synchronization and communication between CPU code and GPU code.
The principal method would be to identify some unit of your code as parallelizable, convert it to a kernel, and work through your code in an essentially non-nested fashion. Repetitive functions might be handled by looping in the kernel, or else by looping in the host code that calls the kernel.
For a pictorial example of what I am trying to describe, please refer to slide 14 of this deck which introduces some of the new features of CUDA 5 including dynamic parallelism. The code architecture on the right is an algorithm realized with dynamic parallelism. The architecture on the left is the same function realized without dynamic parallelism.
I have checked your algorithm in Wikipedia and I'm not sure you need dynamic parallelism at all.
You do the expectation step in your kernel, __syncthreads(), do the maximization step, and __syncthreads() again. From this distance, the expectation looks like a reduction primitive, and the maximization is a filter one.
If it doesn't work, and you need real task parallelism, a GPU may not be the best choice. While the Kepler GPUs can do that to some degree, this is not what the architecture is designed for. In that case you might be better off using a multi-CPU system, such as an office grid, a supercomputer, or a Xeon Phi accelerator. You should also look at OpenMP and MPI, which are commonly used for task-parallel programming (OpenMP is just a handful of pragmas in most cases).


I'm wondering about NVIDIA's cuBLAS library. Does anybody have experience with it? For example, if I write a C program using BLAS, will I be able to replace the calls to BLAS with calls to cuBLAS? Or, even better, implement a mechanism which lets the user choose at runtime?
What about if I use the BLAS Library provided by Boost with C++?
The answer by janneb is incorrect, cuBLAS is not a drop-in replacement for a CPU BLAS. It assumes data is already on the device, and the function signatures have an extra parameter to keep track of a cuBLAS context.
However, coming in CUDA 6.0 is a new library called NVBLAS which provides exactly this "drop-in" functionality. It intercepts Level 3 BLAS calls (GEMM, TRSM, etc.) and automatically sends them to the GPU, effectively tiling the PCIe transfer with on-GPU computation.
There is some information here:, and CUDA 6.0 is available to CUDA registered developers today.
Full docs will be online once CUDA 6.0 is released to the general public.
CUBLAS does not wrap around BLAS.
CUBLAS also accesses matrices in column-major ordering, like Fortran codes and BLAS.
I am more used to writing code in C, even for CUDA.
Code written with CBLAS (which is a C wrapper of BLAS) can easily be changed into CUDA code.
Be aware that Fortran codes that use BLAS are quite different from C/C++ codes that use CBLAS.
Fortran and BLAS normally store matrices or double arrays in column-major ordering, but C/C++ normally use row-major ordering.
I normally handle this problem by storing the matrices in 1D arrays and using #define to write a macro to access the element i,j of a matrix as:
/* define macro to access Aij (1-based i,j) in the row-wise array A[M*N] */
#define indrow(ii,jj,N) (((ii)-1)*(N)+(jj)-1) /* does not depend on rows M */
/* define macro to access Aij (1-based i,j) in the col-wise array A[M*N] */
#define indcol(ii,jj,M) (((jj)-1)*(M)+(ii)-1) /* parenthesized so expressions are safe as arguments */
The CBLAS library has well-organized parameters and conventions (const enum variables) to tell each function the ordering of the matrix.
Beware that the storage of matrices also varies: a row-wise banded matrix is not stored the same way as a column-wise banded matrix.
I don't think there is a mechanism that allows the user to choose between BLAS and CUBLAS without writing the code twice.
CUBLAS also has a "handle" variable on most function calls, which does not appear in BLAS.
I thought of using #define to change the name at each function call, but this might not work.
I've been porting BLAS code to CUBLAS. The BLAS library I use is ATLAS, so what I say may be correct only up to choice of BLAS library.
ATLAS BLAS requires you to specify whether you are using column-major or row-major ordering, and I chose column-major ordering since I was using CLAPACK, which uses column-major ordering. LAPACKE, on the other hand, can use row-major ordering. CUBLAS uses column-major ordering. You may need to adjust accordingly.
Even if ordering is not an issue, porting to CUBLAS was by no means a drop-in replacement. The largest issue is that you must move the data into and out of the GPU's memory space. That memory is set up using cudaMalloc() and released with cudaFree(), which act as one might expect. You move data into GPU memory using cudaMemcpy(). The time this takes will be a large factor in determining whether it's worthwhile to move from CPU to GPU.
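As an illustration of what "adjusting" the ordering can mean, here is a small C sketch that converts a row-major matrix into column-major storage before handing it to a column-major library. The helper names idx_row, idx_col and row_to_col are made up for this example, and indices are 0-based.

```c
#include <assert.h>
#include <stddef.h>

/* 0-based offsets into a flat array holding an M x N matrix. */
static size_t idx_row(size_t i, size_t j, size_t N) { return i * N + j; } /* row-major    */
static size_t idx_col(size_t i, size_t j, size_t M) { return j * M + i; } /* column-major */

/* Copies a row-major M x N matrix into column-major storage. */
static void row_to_col(const double *src, double *dst, size_t M, size_t N)
{
    assert(src != dst);   /* this is a copy, not an in-place transpose */
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++)
            dst[idx_col(i, j, M)] = src[idx_row(i, j, N)];
}
```

Note that idx_row(i, j, N) equals idx_col(j, i, N): a row-major M x N array has the same memory layout as the column-major N x M transpose. In some cases that lets you skip the copy and swap the transpose flags and dimensions in the call instead.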
Once that's done however, the calls are fairly similar. CblasNoTrans becomes CUBLAS_OP_N and CblasTrans becomes CUBLAS_OP_T. If your BLAS library (as ATLAS does) allows you to pass scalars by value you will have to convert that to pass by reference (as is normal for FORTRAN).
Given this, any switch that allows for a choice of CPU or GPU is most easily placed at a higher level than the function that uses BLAS. In my case I have CPU and GPU variants of the algorithm and choose between them at a higher level depending on the size of the problem.
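A higher-level switch of this kind could look roughly like the following C sketch. Everything here is hypothetical: the names multiply_cpu, multiply_gpu and choose_multiply, and the size threshold, are invented for illustration, and the GPU variant is only a stub that falls back to the CPU so the example stays self-contained. A real GPU variant would do the cudaMalloc()/cudaMemcpy() transfers and the CUBLAS call internally.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*multiply_fn)(const double *a, const double *b, double *c, size_t n);

/* CPU variant: naive column-major n x n matrix multiply, c = a * b. */
static void multiply_cpu(const double *a, const double *b, double *c, size_t n)
{
    assert(a && b && c);
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[k * n + i] * b[j * n + k];
            c[j * n + i] = sum;
        }
}

/* Hypothetical GPU variant: a real build would copy to the device and call
 * CUBLAS; this stub falls back to the CPU so the sketch stays runnable. */
static void multiply_gpu(const double *a, const double *b, double *c, size_t n)
{
    multiply_cpu(a, b, c, n);
}

/* Picks a variant from the problem size. The threshold is an assumption
 * and would have to be measured on the actual hardware. */
static multiply_fn choose_multiply(size_t n, size_t gpu_threshold)
{
    return (n >= gpu_threshold) ? multiply_gpu : multiply_cpu;
}
```

The caller only sees the function pointer, so the transfer cost and the extra CUBLAS handle stay hidden inside the GPU variant, and the choice can be made at runtime from the problem size or from a user setting.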
