Vulkan compute shader for parallel sum reduction - gpgpu

I want to implement this algorithm
https://dournac.org/info/gpu_sum_reduction
in Vulkan's compute shader. In OpenCL it would be easy, because I can explicitly declare which buffers are __local and which are __global. Unfortunately, I can't seem to find any such mechanism in Vulkan. Could somebody more experienced show me an example of how to get this working in Vulkan?

Subgroups in Vulkan offer roughly equivalent functionality: shader invocations within a subgroup can cooperate and share data.
Maybe something like:
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable

void main() {
    // add up the values read by all invocations in this subgroup
    int partial_sum = subgroupAdd(arr[gl_GlobalInvocationID.x]);
    // exactly one invocation per subgroup pushes the partial sum to the global result
    if (subgroupElect()) {
        atomicAdd(mem, partial_sum);
    }
}
You could study the Subgroup Tutorial.
Then again you could just try going about it the "normal" way by simply:
void main() {
    // let only the first invocation of each workgroup do the (serial) summing
    if (gl_LocalInvocationID.x == 0) {
        int partial_sum = 0;
        for (int i = 0; i < LOCAL_SIZE; ++i) {
            partial_sum += arr[gl_WorkGroupID.x * gl_WorkGroupSize.x + i];
        }
        atomicAdd(mem, partial_sum); // or write per-workgroup sums out and reduce them recursively instead of using atomics
    }
}
It is not so much "parallel", but then again needs no barriers. It is just a matter of measuring performance to find what works best, and also might depend how large do you assume the input arrays are.
Disclaimer: I have not tried the shaders, so assume they are kind of a pseudocode and can have bugs.
It should also be possible to implement your linked algorithm nearly verbatim. The equivalent to __local in compute GLSL is shared. The workgroup barrier in GLSL is memoryBarrierShared().

Related

PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis of a multi-dimensional array. I stumbled upon this code, which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
__kernel void reduce(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);

    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    for(uint s = gs/2; s > 0; s >>= 1) {
        if(lid < s) {
            b[lid] += b[lid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) r[wid] = b[lid];
}
I don't understand the for loop part. I get that uint s = gs/2 means that we split the array in half, but then it is a complete mystery. Without understanding it, I can't really implement another version, for instance one that takes the maximum of an array, let alone one for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be rerun another time if "N is bigger than the number of cores in a single unit".
Could you give me further explanation of that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py
Your first question about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
It means that you start with s equal to half the local size gs and keep halving it (the shift s >>= 1 is equivalent to s = s/2) until s reaches 0, i.e. the last pass runs with s = 1. On each pass, the first s work-items each add the element s positions ahead into their own slot (b[lid] += b[lid+s]), so the portion of the array still holding partial sums shrinks by half; after log2(gs) passes, b[0] holds the sum of the whole workgroup's chunk. This algorithm depends on the workgroup size being a power of 2; otherwise you have to deal with the leftover elements separately, or pad the array with the reduction's neutral value (0 for a sum, -INFINITY for a max) up to the next power of 2. For a max reduction, the combining step simply becomes b[lid] = max(b[lid], b[lid+s]).
About your second concern, you are right: the kernel produces one partial result per workgroup (written to r), so when N is bigger than what fits in a single workgroup you have to run the reduction again over those partial results (or finish them on the host) until a single value remains.
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you are reducing along an axis.
If you think that the GPU would give you an advantage, try first using pyopencl's Multidimensional Array functionality, e.g. max.
If the reduction is more math intensive, try using pyopencl's Parallel Algorithms, e.g. reduction.
I think that the whole point of using pyopencl is to avoid dealing with the underlying GPU's architecture. Otherwise, it is easier to deal with CUDA or HIP directly instead of OpenCL.

Eigen: Reduction operations very slow with custom scalar type

This is an issue that came up when profiling my convex optimization library. It uses a non-primitive custom Scalar type. We found that reduction operations like sum() and squaredNorm() are very slow compared to raw loops.
Here's a minimal example to reproduce the issue:
#include "epigraph.hpp"
size_t T = 200;
cvx::OptimizationProblem qp;
cvx::MatrixX u = qp.addVariable("u", 1, T);
// this takes very long
cvx::Scalar u_sum = u.sum();
// a raw loop is about 5x faster
cvx::Scalar sum = 0.;
for (int i = 0; i < u.size(); i++)
{
    sum += u(i);
}
I did some profiling but could not get to the root of the issue, since the reduction implementations are somewhat convoluted. I suspect that this has to do with internal optimizations that turn out to be slower in this case and lead to a lot of allocations. Is there a way to turn those optimizations off?
Profiler Output

Perform both multiplication and addition in parallel using cuda

I have two vectors and I want to calculate the dot product of those two vectors in parallel. I was able to do the multiplication of each element in parallel and, after that, the addition in parallel. Following is the code which I have tried.
But I want to do both multiplication and addition in parallel. That means elements which have finished their multiplication should start being added even if other elements haven't been multiplied yet. I hope that makes sense.
#include<stdio.h>
#include<cuda.h>
__global__ void dotproduct(int *a, int *b, int *c, int N)
{
    int k = N;
    int i = threadIdx.x;
    c[i] = a[i] * b[i];
    if (N % 2 == 1)
        N = N + 1;
    __syncthreads();
    while (i < (N / 2))
    {
        if ((i + 1) * 2 <= k)
        {
            c[i] = c[i * 2] + c[i * 2 + 1];
        }
        else
        {
            c[i] = c[k - 1];
        }
        k = N / 2;
        N = N / 2;
        if (N % 2 == 1 && N != 1)
            N = N + 1;
        __syncthreads(); // wait for all the threads to synchronize
    }
}

int main()
{
    int N = 10; // vector size
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    for (int i = 0; i < N; i++)
    {
        a[i] = rand() % 10;
        b[i] = rand() % 10;
    }
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    dotproduct<<<1, N>>>(dev_a, dev_b, dev_c, N);
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
    {
        printf("%d,%d\n", a[i], b[i]);
    }
    printf("the answer is : %d in GPU\n", c[0]);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    cudaThreadExit();
    return 0;
}
I don't think it makes sense to do multiplication and addition in parallel: all the multiplications take the same time, and trying to run different instructions at the same time can reduce performance. But the part in which you sum the multiplication results can be optimized.
You may need to use atomics or shuffle instructions; read this for a good explanation: https://devblogs.nvidia.com/faster-parallel-reductions-kepler/
And if it's not an exercise but a real task, I suggest you use cuBLAS, which has this functionality built in: https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-dot
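For reference, here is a rough sketch of the shuffle-based approach from the blog post linked above, applied to summing the products already stored in c (the kernel and function names are my own, and result is assumed to be zeroed before launch):
__inline__ __device__ int warpReduceSum(int val)
{
    // fold the upper half of the warp onto the lower half until lane 0 holds the sum
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void sumKernel(const int *c, int *result, int n)
{
    int sum = 0;
    // grid-stride loop so any launch configuration covers the whole array
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        sum += c[i];

    sum = warpReduceSum(sum);

    // lane 0 of each warp adds its partial sum to the global total
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(result, sum);
}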
Take your favorite reduction (= adding-up of the values of a vector), and modify it so that instead of every read from the single input vector, you perform two reads, with the same index, from two input vectors, and multiply the results.
This will maintain as much parallelism as you were able to effect using the reduction kernel; and if your reads were memory-coalesced before, they will also be coalesced now. Your throughput in terms of output elements per time unit should be almost exactly half what it was before, for very long vectors.
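A minimal sketch of that idea, using the classic shared-memory tree reduction with only the load step changed to read both vectors and multiply (the kernel name and BLOCK size are placeholders; it assumes a power-of-two block size and result initialized to zero):
#define BLOCK 256

__global__ void dotKernel(const int *a, const int *b, int *result, int n)
{
    __shared__ int cache[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // the only change versus a plain sum reduction: two coalesced reads and a multiply
    cache[tid] = (i < n) ? a[i] * b[i] : 0;
    __syncthreads();

    // tree reduction in shared memory, exactly as in a normal sum reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }

    // one atomic per block combines the per-block partial sums
    if (tid == 0)
        atomicAdd(result, cache[0]);
}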
I think there is also a special PTX ISA instruction, dp4a.atype.btype d, a, b, c;, which computes a four-way 8-bit dot product with a 32-bit accumulate in a single instruction. With little effort you could wrap it in a small inline PTX assembly function. Check the documentation.

Using the boost random number generator with OpenMP

I would like to parallelize my Boost random number generator code in C++ with OpenMP. I'd like to do it in a way that is both efficient and thread safe. Can someone give me pointers on how this is done? I am currently enclosing what I have below; this is clearly not thread safe, since the static variable in the SampleNormal function is likely to give a race condition. The number of samples (nsamples) is much bigger than n.
#pragma omp parallel for private(i,j)
for (i = 0; i < nsamples; i++) {
    for (j = 0; j < n; j++) {
        randomMatrix[i + nsamples*j] = SampleNormal(0.0, 1.0);
    }
}

double SampleNormal (double mean, double sigma)
{
    // Create a Mersenne twister random number generator
    static mt19937 rng(static_cast<unsigned> (std::time(0)));
    // select Gaussian probability distribution
    normal_distribution<double> norm_dist(mean, sigma);
    // bind random number generator to distribution
    variate_generator<mt19937&, normal_distribution<double> > normal_sampler(rng, norm_dist);
    // sample from the distribution
    return normal_sampler();
}
Do you just need something that's thread-safe, or something that scales well? If you don't need very high performance from your PRNG, you can just wrap a lock around uses of the rng object. For higher performance, you need to find or write a parallel pseudorandom number generator; http://www.cs.berkeley.edu/~mhoemmen/cs194/Tutorials/prng.pdf has a tutorial on them. One option is to put your mt19937 objects in thread-local storage, making sure to seed different threads with different seeds; that makes reproducing the same results across runs difficult, if that is important to you.
"find or write a parallel pseudorandom number generator" use TRNG "TINAS random number generator". Its a parallel random number generator library designed to be run on multicore clusters. Much better than Boost. There's an introduction here http://www.lindonslog.com/programming/parallel-random-number-generation-trng/

Random Number Generator in CUDA

I've struggled with this all day. I am trying to get a random number generator for threads in my CUDA code. I have looked through all the forums, and yes this topic comes up a fair bit, but I've spent hours trying to unravel all sorts of code to no avail. If anyone knows of a simple method, probably a device function that can be called to return a random float between 0 and 1, or an integer that I can transform, I would be most grateful.
Again, I hope to use the random number in the kernel, just like rand() for instance.
Thanks in advance
For anyone interested, you can now do it via cuRAND.
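For example, with cuRAND's device API each thread keeps its own generator state; the seed and sequence numbers below are arbitrary:
#include <curand_kernel.h>

__global__ void setup_states(curandState *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // same seed, different sequence number per thread gives independent streams
    curand_init(seed, id, 0, &states[id]);
}

__global__ void use_random(curandState *states, float *out)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = states[id];       // work on a register copy of the state
    out[id] = curand_uniform(&local);     // uniform float in (0, 1]
    states[id] = local;                   // save the state for the next call
}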
I'm not sure I understand why you need anything special. Any traditional PRNG should port more or less directly. A linear congruential should work fine. Do you have some special properties you're trying to establish?
The best way to do this is to write your own device function; here is one:
__device__ void RNG()
{
    // George Marsaglia's multiply-with-carry generator
    unsigned int m_w = 150;
    unsigned int m_z = 40;

    for (int i = 0; i < 100; i++)
    {
        m_z = 36969 * (m_z & 65535) + (m_z >> 16);
        m_w = 18000 * (m_w & 65535) + (m_w >> 16);
        printf("%u\n", (m_z << 16) + m_w); /* 32-bit result */
    }
}
It'll give you 100 random numbers, each a 32-bit result (std::cout is not available in device code, hence printf).
If you want random numbers in a smaller range, say 0 to 999, you can also take the result modulo 1000, either at the point of consumption or at the point of generation:
((m_z << 16) + m_w) % 1000
Changing the m_w and m_z starting values (in the example, 150 and 40) allows you to get different results each time. You can use threadIdx.x as one of them, which should give each thread a different pseudorandom sequence.
I wanted to add that it runs about 2 times faster than the rand() function, and works great ;)
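For illustration, a rough sketch of that suggestion as a reusable per-thread generator (the function and kernel names are made up; the constants are the ones from the answer above):
__device__ unsigned int mwc_rand(unsigned int &m_w, unsigned int &m_z)
{
    // one step of the multiply-with-carry generator from the answer above
    m_z = 36969 * (m_z & 65535) + (m_z >> 16);
    m_w = 18000 * (m_w & 65535) + (m_w >> 16);
    return (m_z << 16) + m_w;
}

__global__ void fill_random(unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // per-thread state, seeded with the thread index as suggested above
    unsigned int m_w = 150 + i;
    unsigned int m_z = 40;
    out[i] = mwc_rand(m_w, m_z) % 1000;   // e.g. a value in [0, 999]
}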
I think any discussion of this question needs to answer Zenna's original request, and that is for a thread-level implementation: specifically a device function that can be called from within a kernel or thread. Sorry if I overdid the "in bold" phrases, but I really think the answers so far are not quite addressing what is being sought here.
The cuRAND library is your best bet. I appreciate that people want to reinvent the wheel (it makes one appreciate and more properly use third-party libraries), but high-performance, high-quality number generators are plentiful and well tested. The best info I can recommend is the GSL library's documentation on the different generator algorithms: http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html
For any serious code it is best to use one of the main algorithms that mathematicians and computer scientists have pounded on over and over looking for systemic weaknesses. The Mersenne Twister has a period (repeat loop) on the order of 10^6000 (the MT19937 algorithm takes its name from its period of 2^19937 - 1) and has been specially adapted for Nvidia GPUs to run at thread level within threads of the same warp, using thread id calls as seeds. See the paper here: http://developer.download.nvidia.com/compute/cuda/2_2/sdk/website/projects/MersenneTwister/doc/MersenneTwister.pdf. I am actually working on implementing something using this library, and if I get it to work properly I will post my code. Nvidia has some examples at their documentation site for the current CUDA toolkit.
NOTE: Just for the record I do not work for Nvidia, but I will admit their documentation and abstraction design for CUDA is something I have so far been impressed with.
Depending on your application, you should be wary of using LCGs without considering whether the streams (one stream per thread) will overlap. You could implement a leapfrog with an LCG, but then you would need an LCG with a sufficiently long period to ensure that the sequence doesn't repeat.
An example leapfrog could be:
template <typename ValueType>
__device__ void leapfrog(unsigned long &a, unsigned long &c, int leap)
{
    // advance the LCG parameters by 'leap' steps: a -> a^leap, c -> c*(a^leap - 1)/(a - 1)
    unsigned long an = a;
    for (int i = 1; i < leap; i++)
        an *= a;
    c = c * ((an - 1) / (a - 1));
    a = an;
}

template <typename ValueType>
__device__ ValueType quickrand(unsigned long &seed, const unsigned long a, const unsigned long c)
{
    seed = seed * a + c; // one LCG step
    return seed;
}

template <typename ValueType>
__global__ void mykernel(unsigned long *d_seeds)
{
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;

    // RNG parameters
    unsigned long a = 1664525L;
    unsigned long c = 1013904223L;
    unsigned long ainit = a;
    unsigned long cinit = c;
    unsigned long seed;

    // Generate local seed
    seed = d_seeds[bid];
    leapfrog<ValueType>(ainit, cinit, tid);
    quickrand<ValueType>(seed, ainit, cinit);
    leapfrog<ValueType>(a, c, blockDim.x);
    ...
}
But then the period of that generator is probably insufficient in most cases.
To be honest, I'd look at using a third party library such as NAG. There are some batch generators in the SDK too, but that's probably not what you're looking for in this case.
EDIT
Since this just got up-voted, I figure it's worth updating to mention that cuRAND, as mentioned by more recent answers to this question, is available and provides a number of generators and distributions. That's definitely the easiest place to start.
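For instance, cuRAND's host API can fill a device buffer with uniform floats without writing any kernel code (error checking omitted; link against the curand library):
#include <cuda_runtime.h>
#include <curand.h>

int main()
{
    const size_t n = 1024;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);   // XORWOW by default
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_data, n);                    // n uniform floats in (0, 1]

    curandDestroyGenerator(gen);
    cudaFree(d_data);
    return 0;
}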
There's an MDGPU package (GPL) which includes an implementation of the GNU rand48() function for CUDA here.
I found it (quite easily, using Google, which I assume you tried :-) on the NVidia forums here.
I haven't found a good parallel number generator for CUDA, however I did find a parallel random number generator based on academic research here: http://sprng.cs.fsu.edu/
You could try out Mersenne Twister for GPUs
It is based on the SIMD-oriented Fast Mersenne Twister (SFMT), which is a quite fast and reliable random number generator. It passes Marsaglia's DIEHARD tests for random number generators.
In case you're using cuda.jit in Numba for Python, this random number generator is useful.
