GLSL Vulkan compute shader to efficiently prune zeros from a shared array to make a compressed array - algorithm

I have a Vulkan compute shader with an array shared inside the local work group, and I want to perform the following transform: remove/prune all the zeros, so that for example [3, 0, 0, 5, 0, 2] becomes [3, 5, 2]. Is there a fast or parallel method to do this?
I tried to do this serially as follows:
shared int arraySize;
shared int array[256];
shared int compressed_array[256];
/* ... prepare array in parallel ... */
// run serially on the first invocation
if (gl_LocalInvocationID.x == 0 && gl_LocalInvocationID.y == 0) {
    arraySize = 0; // initialize arraySize
    for (int i = 0; i < 256; i++) {
        if (array[i] > 0) // incrementally search for non-zeroes
        {
            compressed_array[arraySize] = array[i];
            arraySize = arraySize + 1;
        }
    }
}
but it seems to take 1-2 ms for a 256-element array on my GPU. Is there any faster way to do this? A parallel algorithm would presumably be faster; does such an algorithm exist?

Thanks to Yuri Kilochek's answer, I was able to find a solution via parallel prefix sums (scan).
Suppose a 'flag' array is added that holds 1 where the corresponding cell of array is non-zero and 0 otherwise. A parallel exclusive scan over the flag array then yields, for each invocation with an active flag, the index in compressed_array to which it should write its element of array (a scatter operation).
In Vulkan this can be implemented efficiently using subgroups.
However, a subgroup can typically only scan 32 (NVIDIA) or 64 (AMD) elements at a time, while the local work group may be several times larger. To scan across the whole work group, a layered approach of scans as described here and coded here is necessary.
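For comparison, here is a hedged CUDA sketch of the same flag/scan/scatter idea at warp level; the warp intrinsics correspond one-to-one to the Vulkan subgroup operations (__ballot_sync ~ subgroupBallot, the masked popcount ~ subgroupBallotExclusiveBitCount, __shfl_sync ~ subgroupBroadcastFirst). Note that output order across warps is not preserved here, because each warp reserves its output slots with an atomic.

__global__ void compact_nonzero(const int* in, int* out, int* outCount, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;
    int v = (i < n) ? in[i] : 0;                              // out-of-range lanes carry a zero flag
    bool keep = (v != 0);
    unsigned mask = __ballot_sync(0xffffffffu, keep);         // one bit per lane holding a non-zero
    int rank = __popc(mask & ((1u << lane) - 1));             // exclusive scan of the flags
    int base = 0;
    if (lane == 0) base = atomicAdd(outCount, __popc(mask));  // reserve the warp's output slots
    base = __shfl_sync(0xffffffffu, base, 0);                 // broadcast the base index
    if (keep) out[base + rank] = v;                           // scatter into the compressed array
}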

Yes. This problem is called parallel stream compaction.

Related

CSR x CSR Matrix Multiplication for Finding Out Cycles

I am trying to find the number of cycles of a specified length k containing vertex u, for each vertex u in an undirected graph. To do so, I am computing the k-th power of the adjacency matrix. I created a CSR representation of the graph from the edge list, which works really fast, but the CSR x CSR multiplication part is really slow (it seems to take 50 minutes for a 500k x 500k input matrix). I am curious about a better solution. Is there a more efficient approach given that this is an adjacency matrix? Or is there a better CSR x CSR matrix multiplication algorithm I could look at? I could not find any CSR x CSR matrix multiplication example as an algorithm or a C++ implementation.
void multiply_matrix(std::vector<int> &adj, std::vector<int> &xadj, std::vector<int> &values,
                     std::vector<int> &adj2, std::vector<int> &xadj2, std::vector<int> &values2, int size)
{
    std::vector<int> result_adj;
    std::vector<int> result_xadj(size + 1, 0);
    std::vector<int> result_value;
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            // entry (i, j) of the result: sparse dot product of row i of A with row j of B
            // (rows of B stand in for its columns, since an adjacency matrix is symmetric)
            int result = 0;
            int startIndex = xadj[i];
            int endIndex = xadj[i + 1];
            for (int index = startIndex; index < endIndex; index++)
            {
                int currentValRow = values[index]; // value of A(i, adj[index])
                int matchIndex = -1;
                for (int colIndex = xadj2[j]; colIndex < xadj2[j + 1]; colIndex++)
                {
                    if (adj[index] == adj2[colIndex])
                    {
                        matchIndex = colIndex; // B(j, adj[index]) is non-zero
                        break;
                    }
                }
                if (matchIndex == -1)
                    continue;
                int currentValCol = values2[matchIndex]; // value of B(j, adj[index])
                result += currentValCol * currentValRow;
            }
            if (result != 0)
            {
                result_xadj[i + 1]++;
                if (i + 2 < (int)result_xadj.size())
                    result_xadj[i + 2] = result_xadj[i + 1];
                result_adj.push_back(j);
                result_value.push_back(result); // keep values aligned with result_adj
            }
        }
    }
}
I solved my problem and wanted to share with those who also lack the required terminology to find resources on the topic. When you google "sparse matrix multiplication" it is hard to find sparse matrix x sparse matrix multiplication, which is called SpGEMM. There are lots of informative papers about the process.
The pseudocode of the algorithm I used is shown in the original post's figure, "General SpGEMM algorithm".
I modified the algorithm a little bit to produce CSR output. The challenge with that is allocating the result arrays that hold the CSR data (values, index array, etc.), since the number of non-zeros is not known in advance. There are different methods used to solve that issue, such as:
1. Allocating the arrays as big as the upper bound, which may be a problem if your matrices are too big. If you decide to go this way you can look into: https://math.stackexchange.com/questions/1042096/bounds-of-sparse-matrix-multiplication.
2. Doing the multiplication once without writing any output, just to count the non-zeros of the result. Since no memory writes happen in this pass it completes really fast, so the memory for the result arrays can be allocated after this "dummy run".
3. Allocating a pre-determined amount and, when it is not sufficient, allocating a new, bigger array and copying the content over.
I implemented the function for both the CPU (using OpenMP) and the GPU (using CUDA). In the OpenMP version I used a method similar to option 3 above: separate vectors for the result of each row, merged at the end. The vector approach may be slower than doing the re-allocation manually, but it was easier and is fast enough (the 500k x 500k test matrix multiplies in around 1.3 seconds with 60 threads on my test machine). For the GPU version I used option 2: first a counting pass, then the actual operation. A sketch of the row-by-row scheme follows the edit below.
Edit: Also note that this method finds "walks" rather than paths, so there might be repeated vertices.
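A minimal single-threaded C++ sketch of the row-by-row (Gustavson-style) SpGEMM referred to above, using the question's CSR naming convention (xadj = row pointers, adj = column indices, values). The dense acc/marker scratch arrays are illustrative additions, not from the original post; they let each output row be accumulated without searching columns.

#include <vector>

void spgemm_csr(const std::vector<int>& xadjA, const std::vector<int>& adjA,
                const std::vector<int>& valA,
                const std::vector<int>& xadjB, const std::vector<int>& adjB,
                const std::vector<int>& valB, int n,
                std::vector<int>& xadjC, std::vector<int>& adjC, std::vector<int>& valC)
{
    xadjC.assign(n + 1, 0);
    std::vector<int> acc(n, 0);      // dense accumulator for the current output row
    std::vector<int> marker(n, -1);  // marker[j] == i  <=>  column j already present in row i
    for (int i = 0; i < n; ++i) {
        size_t rowStart = adjC.size();
        for (int p = xadjA[i]; p < xadjA[i + 1]; ++p) {
            int k = adjA[p];                      // A(i, k) is non-zero
            for (int q = xadjB[k]; q < xadjB[k + 1]; ++q) {
                int j = adjB[q];                  // B(k, j) is non-zero
                if (marker[j] != i) {             // first contribution to C(i, j)
                    marker[j] = i;
                    acc[j] = 0;
                    adjC.push_back(j);
                }
                acc[j] += valA[p] * valB[q];
            }
        }
        for (size_t p = rowStart; p < adjC.size(); ++p)
            valC.push_back(acc[adjC[p]]);         // emit values in column-discovery order
        xadjC[i + 1] = (int)adjC.size();
    }
}

Each output row then costs time proportional to the multiplications it actually needs, instead of the O(size^2) column searches of the original loop nest.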

PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis of a multi-dimensional array. I stumbled upon the code below, which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
__kernel void reduce(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);
    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    for(uint s = gs/2; s > 0; s >>= 1) {
        if(lid < s) {
            b[lid] += b[lid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) r[wid] = b[lid];
}
I don't understand the for loop part. I get that uint s = gs/2 means that we split the array in half, but then it is a complete mystery. Without understanding it, I can't implement another version, for instance one taking the maximum of an array, much less one for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be re-run another time if "N is bigger than the number of cores in a single unit".
Could you give me further explanations on that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py
Your first question about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
It means that you divide the local size gs by 2, and keep halving (the shift s >>= 1 is equivalent to s = s/2) while s > 0, in other words until s = 1. At each pass, each of the first s work items folds element lid + s into element lid, so the live portion of the array halves every iteration; after the last pass, the sum of the whole work group sits in b[0]. This algorithm depends on your array's size being a power of 2; otherwise you'd have to deal with the excess over a power of 2, or pad the array up to a power-of-2 size with values that are neutral for the reduction (0 for a sum, negative infinity for a max).
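To make the pattern concrete, here is the same tree reduction adapted from sum to max, sketched in CUDA (my translation, not the blog's code; the structure maps one-to-one onto the OpenCL kernel above: get_local_id becomes threadIdx.x, barrier becomes __syncthreads, and the only algorithmic change is fmaxf in place of +=):

// sketch: assumes the block size is a power of two, one input element per thread
__global__ void reduce_max(const float* a, float* r)
{
    extern __shared__ float b[];                 // one float per thread, sized at launch
    unsigned lid = threadIdx.x;
    b[lid] = a[blockIdx.x * blockDim.x + lid];   // stage this block's slice in shared memory
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (lid < s)
            b[lid] = fmaxf(b[lid], b[lid + s]);  // fold the upper half into the lower half
        __syncthreads();                         // every thread must reach the barrier
    }
    if (lid == 0)
        r[blockIdx.x] = b[0];                    // one partial max per block
}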
Your second concern, when N is bigger than the capacity of your GPU: you are right, you have to run your reduction in portions that fit and then merge the results.
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you are dealing with reduction along an axis.
If you think the GPU would give you an advantage, first try pyopencl's multidimensional Array functionality, e.g. max.
If the reduction is more math-intensive, try pyopencl's Parallel Algorithms, e.g. reduction.
I think the whole point of using pyopencl is to avoid having to deal with the underlying GPU architecture; otherwise it is easier to work with CUDA or HIP directly rather than OpenCL.

Fill device array consecutively in CUDA

(This might be more of a theoretical parallel optimization problem than a CUDA-specific problem per se. I'm very new to parallel programming in general, so this may just be personal ignorance.)
I have a workload that consists of 64-bit binary numbers upon which I run analysis. If the analysis completes successfully then that binary number is a "valid solution". If the analysis breaks midway then the number is "invalid". The end goal is to get a list of all the valid solutions.
Now there are many trillions of 64-bit binary numbers I am analyzing, but only ~5% or fewer will be valid solutions, and they usually come in bunches (i.e. every consecutive 1000 numbers are valid and then every random billion or so are invalid). I can't find a pattern to the space between bunches, so I can't ignore the large chunks of invalid solutions.
Currently, every thread in a kernel call analyzes just one number. If the number is valid it denotes it as such in its respective place in a device array; ditto if it's invalid. So basically I generate a data point for every value analyzed, regardless of whether it's valid. Then once the array is full I copy it to the host, but only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the number of threads in the grid.
But copying memory to and from the GPU is expensive time-wise. So what I would like to do is copy data over only when necessary: I want to fill a device array with only valid solutions and then copy it to the host once the array is full. But how do you fill an array consecutively in a parallel environment? Or am I approaching this problem the wrong way?
EDIT 1
This is the kernel I initially developed. As you can see, I generate 1 byte of data for each value analyzed. I really only need each 64-bit number which is valid; if need be I can make a new kernel. As suggested by some of the commenters, I am currently looking into stream compaction.
__global__ void kValid(unsigned long long *kInfo, unsigned char *values, char *solutionFound) {
    //a 64 bit binary value to be evaluated is called a kValue
    unsigned long long int kStart, kEnd, kRoot, kSize, curK;
    //kRoot is the kValue at the start of the device array; this is used if the device array is larger than the total threads in the grid
    //kStart is the kValue to start this kernel call on
    //kEnd is the last kValue to validate
    //kSize is how many bits long a kValue is (we don't necessarily use all 64 bits, but this value stays constant over the entire chunk of values defined on the host)
    //curK is the current kValue represented as a 64 bit unsigned integer
    int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges;
    kStart = kInfo[0];
    kEnd = kInfo[1];
    kRoot = kInfo[2];
    nodes = kInfo[3];
    edges = kInfo[4];
    kSize = kInfo[5];
    curK = (unsigned long long)blockIdx.x*blockDim.x + threadIdx.x + kStart;
    if (curK > kEnd) {//check to make sure you don't overshoot the end value
        return;
    }
    kBitLocation = 1;//assuming the first bit in the kValue has position 1
    for (row = 0; row < nodes; row++) {
        rowCount = 0;
        kMirrorBitLocation = row;//the bit position for the mirrored kValues always starts at the row value (assuming the first row has a position of 0)
        for (col = 0; col < nodes; col++) {
            if (col > row) {
                if (curK & (1ULL << (kSize - kBitLocation))) {//1ULL, not 1, so the shift happens in 64 bits
                    rowCount++;
                }
                kBitLocation++;
            }
            if (col < row) {
                if (col > 0) {
                    kMirrorBitLocation += (nodes - 2) - (col - 1);
                }
                if (curK & (1ULL << (kSize - kMirrorBitLocation))) {//if bit is set
                    rowCount++;
                }
            }
        }
        if (rowCount != edges) {
            //set the ith byte to zero
            values[curK - kRoot] = 0;
            return;
        }
    }
    //set the ith byte to one
    values[curK - kRoot] = 1;
    *solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread
}
(This answer assumes output order is inconsequential and so are the positions of the valid values.)
Conceptually, your analysis produces a set of valid values. The implementation you described uses a dense representation of this set: one entry (currently a byte) for every potential value. Yet you've indicated that the data is quite sparse (either 5e-2 or 1000/10^9 = 1e-6); moreover, copying data across PCI Express is quite a pain.
Well, then, why not consider a sparse representation? The simplest one would be merely an unordered sequence of the valid values. Of course, writing that requires some synchronization across threads - perhaps even across blocks. Roughly, you can have warps collect their valid values in shared memory; then synchronize at the block level to collect the block's valid values (for a given chunk of the input it has analyzed); and finally use atomics to collect the data from all the blocks.
Oh, also - have each thread analyze multiple values, so you don't have to do that much synchronization.
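A hedged CUDA sketch of the block-level collection described above (names such as analyze, validOut, and validCount are illustrative placeholders, not from the question; the staging buffer is sized for 256-thread blocks):

__device__ bool analyze(unsigned long long k);   // stands in for the question's validity test

__global__ void collect_valid(unsigned long long kStart,
                              unsigned long long *validOut, unsigned int *validCount)
{
    __shared__ unsigned long long stage[256];    // staging buffer, one slot per thread
    __shared__ unsigned int nStage, base;
    if (threadIdx.x == 0) nStage = 0;
    __syncthreads();
    unsigned long long curK = kStart + (unsigned long long)blockIdx.x*blockDim.x + threadIdx.x;
    if (analyze(curK))
        stage[atomicAdd(&nStage, 1u)] = curK;    // cheap shared-memory atomic
    __syncthreads();
    if (threadIdx.x == 0)
        base = atomicAdd(validCount, nStage);    // one global atomic per block
    __syncthreads();
    if (threadIdx.x < nStage)
        validOut[base + threadIdx.x] = stage[threadIdx.x];  // coalesced write of only the valid values
}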
So, you would want to have each thread analyze multiple numbers (thousands or millions) before returning from the computation. If you analyze a million numbers in your thread, you will only need 5% of that amount of space to possibly hold the results of that computation.
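For instance, a grid-stride loop (again a sketch with placeholder names) lets a fixed-size grid walk an arbitrarily large range of candidates, so each thread amortizes launch and synchronization overhead over many values:

__device__ bool analyze(unsigned long long k);   // the per-value analysis, as above

__global__ void analyze_range(unsigned long long kStart, unsigned long long kEnd,
                              unsigned long long *validOut, unsigned int *validCount)
{
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    for (unsigned long long k = kStart + (unsigned long long)blockIdx.x*blockDim.x + threadIdx.x;
         k <= kEnd; k += stride)                      // each thread covers many candidate values
        if (analyze(k))
            validOut[atomicAdd(validCount, 1u)] = k;  // append only the rare valid values
}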

How to parallelise a nested loop with cross element dependencies in cuda?

I'm a beginner at CUDA and am having some difficulties with it.
If I have an input vector A and a result vector B, both of size N, where B[i] depends on all elements of A except A[i], how can I code this without calling a kernel multiple times inside a serial for loop? I can't think of a way to parallelise both the outer and inner loop simultaneously.
Edit: I have a device with cc 2.0.
example:
// a = some stuff
int i;
int j;
double result = 0;
for (i = 0; i < 1000; i++) {
    double ai = a[i];
    for (j = 0; j < 1000; j++) {
        double aj = a[j];
        if (i == j)
            continue;
        result += ai - aj;
    }
}
I have this at the moment:
// in host
int i;
for (i = 0; i < 1000; i++) {
    kernelFunc<<<2, 500>>>(i, d_a);
}
Is there a way to eliminate the serial loop?
Something like this should work, I think:
__global__ void my_diffs(const double *a, double *b, const int length){
    unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < length){
        double my_a = a[idx];
        double result = 0.0;
        for (int j = 0; j < length; j++)
            result += my_a - a[j];
        b[idx] = result;
    }
}
(written in browser, not tested)
This can possibly be further optimized in a couple of ways; however, for cc 2.0 and newer devices that have an L1 cache, the benefits of these optimizations might be small:
Use shared memory: we can reduce the number of global loads to one per element per block. However, the initial loads will be cached in L1, and your data set is quite small (1000 double elements?), so the benefits might be limited.
Create an offset indexing scheme, so each thread uses a different element of the cache line, to create coalesced access (i.e. modify the j index for each thread). Again, for cc 2.0 and newer devices this may not help much, due to the L1 cache as well as the ability to broadcast warp global reads.
If you must use a cc 1.x device, then you'll get significant mileage out of one or more optimizations -- the code I've shown here will run noticeably slower in that case.
Note that I've chosen not to bother with the special case where we subtract a[i] from itself, as that is exactly zero anyway and should not disturb your results. If you're concerned about that, you can special-case it out easily enough.
You'll also get more performance if you increase the blocks and reduce the threads per block, perhaps something like this:
my_diffs<<<8,128>>>(d_a, d_b, len);
The reason for this is that many GPUs have more than 1 or 2 SMs. To maximize perf on these GPUs with such a small data set, we want to try and get at least one block launched on each SM. Having more blocks in the grid makes this more likely.
If you want to fully parallelize the computation, the approach would be to create a 2D matrix (let's call it c[...]) in GPU memory, of square dimension equal to the length of your vector. I would then create a 2D grid of threads and have each thread perform the subtraction (a[row] - a[col]) and store its result in c[row*len+col]. I would then launch a second (1D) kernel that sums each row of c (each thread has a loop to sum one row) to create the result vector b. However, I'm not sure this would be any faster than the approach I've outlined above. Such a "more fully parallelized" approach also wouldn't lend itself as easily to the optimizations I discussed.
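A hedged sketch of that two-kernel variant (my code, illustrative only):

// Kernel 1: a 2D grid fills the len x len matrix c with every pairwise difference.
__global__ void fill_diffs(const double *a, double *c, int len)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < len && col < len)
        c[row * len + col] = a[row] - a[col];
}

// Kernel 2: one thread per output element sums its row of c to produce b.
__global__ void sum_rows(const double *c, double *b, int len)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < len) {
        double s = 0.0;
        for (int col = 0; col < len; ++col)
            s += c[row * len + col];    // serial sum of one row per thread
        b[row] = s;
    }
}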

Fast way to find the maximum of a float array in OpenCL

I'm having trouble with the simple task of finding the maximum of an array in OpenCL.
__kernel void ndft(/* lots of stuff*/)
{
    size_t thread_id = get_global_id(0); // thread_id = [0 .. spectrum_size[
    /* MATH MAGIC */
    // Now I have float spectrum_abs[spectrum_size] and
    // I want the maximum as well as the index holding the maximum
    barrier();
    // this is the old, sequential code:
    if (*current_max_value < spectrum_abs[i])
    {
        *current_max_value = spectrum_abs[i];
        *current_max_freq = i;
    }
}
Now I could add if (thread_id == 0) and loop through the entire array as I would on a single-core system, but since performance is a critical issue (otherwise I wouldn't be doing spectrum calculations on a GPU), is there a faster way to do that?
Returning to the CPU at the end of the kernel above is not an option, because the kernel actually continues after that.
You will need to write a parallel reduction: split your "large" array into small pieces (of a size a single work group can process effectively) and compute the min/max in each.
Do this iteratively (it involves both host and device code) until you are left with only one set of min/max values.
Note that you might need to write a separate kernel that does this, unless the current work distribution works for this piece of the kernel (see my question to you above).
An alternative, if your current work distribution is amenable, is to find the min/max inside each work group and write it to a buffer in global memory (one entry per group). After a barrier(), simply have the invocation with thread_id == 0 loop across the reduced results and find the overall max. This will not be the optimal solution, but it might be one that fits inside your current kernel.
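In CUDA terms, the iterative scheme looks roughly like the sketch below, reusing the per-block reduce_max kernel sketched in the PyOpenCL answer above. It assumes every pass's element count is a multiple of the block size; real code would pad the tail with -FLT_MAX.

#include <utility>

__global__ void reduce_max(const float *a, float *r);  // per-block partial max, as sketched earlier

float device_max(float *d_in, float *d_scratch, int n, int block = 256)
{
    while (n > 1) {
        int blocks = n / block;                  // exact under the padding assumption
        reduce_max<<<blocks, block, block * sizeof(float)>>>(d_in, d_scratch);
        std::swap(d_in, d_scratch);              // this pass's partial maxima feed the next pass
        n = blocks;
    }
    float result;
    cudaMemcpy(&result, d_in, sizeof(float), cudaMemcpyDeviceToHost);
    return result;
}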
