What is the right way to implement a stack push on OpenCL 1.2?

Question
Suppose multiple work-items want to append to a global stack:
__kernel void my_kernel(__global int* stack) {
    // ... do stuff ...
    push(stack, value);
    // ... do stuff ...
}
It is desirable that, after the kernel runs, stack contains every value pushed to it. Order does not matter. What is the proper way to do it in OpenCL 1.2?
What I've tried
An obvious idea would be to use atomic_inc to get the length and just write to it:
void push(__global int* stack, int val) {
    // atomic_inc returns the old value; element 0 of the stack holds the length
    int idx = atomic_inc(stack) + 1;
    stack[idx] = val;
}
But I speculate having all work-items call atomic_inc separately on the same memory position ruins the parallelism. A separate idea would be to just write to a temporary array larger than the number of work items:
void push(__global int* stack, int val) {
    stack[get_global_id(0)] = val;
}
That'd leave us with a sparse array of values:
[0, 0, 0, 7, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 9, 0, 0, ...]
Which could then be compacted using "stream compaction". I thus wonder which of these ideas is the most efficient, and whether there is a third option I'm not aware of.

I can't give you a definite answer here, but I can make a few suggestions of things to try - if you have the resources, try to implement more than one of them and profile their performance on all the different types of OpenCL implementation you're planning to deploy on. You might find that different solutions perform differently on different hardware/software.
Create a stack per work-group in local memory (either explicitly, or by compacting after all values have been generated), atomically add the per-work-group count to the global stack just once, and copy the whole local stack into the global one. This means you only have one global atomic add per work-group. It works better for large groups, of course. (A rough sketch of this option is given below, after the other suggestions.)
Your biggest source of atomic contention in the naive approach will be from items in the same work-group. So you could create as many stacks as there are items per work-group, and have each item in the group submit to its "own" stack. You'll still need a compaction step after this to combine it all into one list. Vary the group size if you try this. I'm not sure to what extent current GPUs suffer from false sharing (atomics locking a whole cache line, not just that word), so you'll want to check that and/or experiment with different gaps between the stack counters in memory.
Write all results to fixed offsets (based on global id) into an array large enough to catch the worst case, and queue a separate compaction kernel that post-processes the result into a contiguous array.
Don't bother with a compact representation of the result. Instead, use the sparse array as the input for the next stage of computation. This next stage's work group can compact a fixed subset of the sparse array into local memory. When that's done, each work item then works on one item of the compacted array. Iterate inside the kernel until all have been processed. How well this works will depend on how predictable the statistical distribution of the sparse items in the array is, and your choice of work group size and how much of the sparse array each work group processes. This version also avoids the round trip to the host processor.
On Intel IGPs specifically, I have heard that DirectX/OpenGL/Vulkan geometry shaders with variable number of outputs perform exceptionally well. If you can write your algorithm in the format of a geometry shader, this might be worth a try if you're targeting those devices. For nvidia/AMD, don't bother with this.
There are probably other options, but those should give you some ideas.
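For the first suggestion (the per-work-group local stack), a rough OpenCL 1.2 sketch might look like this; the names, the fixed LOCAL_CAPACITY, and the assumption that each work-item pushes at most once are illustrative, and stack[0] is used as the global count as in the question:

#define LOCAL_CAPACITY 256   // assumed upper bound on pushes per work-group

void push_local(__local int* lstack, __local int* lcount, int val) {
    int idx = atomic_inc(lcount);   // contention stays within the work-group
    lstack[idx] = val;
}

__kernel void my_kernel(__global int* stack) {   // stack[0] holds the global count
    __local int lstack[LOCAL_CAPACITY];
    __local int lcount;
    __local int base;

    if (get_local_id(0) == 0)
        lcount = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... do stuff ...
    int value = (int)get_global_id(0);   // whatever this work-item wants to push
    push_local(lstack, &lcount, value);
    // ... do stuff ...

    barrier(CLK_LOCAL_MEM_FENCE);

    // One work-item reserves a contiguous range in the global stack ...
    if (get_local_id(0) == 0)
        base = atomic_add(&stack[0], lcount) + 1;   // +1 because element 0 is the count
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... and the whole group copies the local stack out cooperatively.
    for (int i = get_local_id(0); i < lcount; i += get_local_size(0))
        stack[base + i] = lstack[i];
}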

Related

Is there any efficient way to compact a sparse array in OpenCL / CUDA? [duplicate]

I'm trying to construct a parallel algorithm with CUDA that takes an array of integers and removes all of the 0's with or without keeping the order.
Example:
Global Memory: {0, 0, 0, 0, 14, 0, 0, 17, 0, 0, 0, 0, 13}
Host Memory Result: {17, 13, 14, 0, 0, ...}
The simplest way is to use the host to remove the 0's in O(n) time. But considering I have around 1000 elements, it probably will be faster to leave everything on the GPU and condense it first, before sending it.
The preferred method would be to create an on-device stack, such that each thread can pop and push (in any order) onto or off of the stack. However, I don't think CUDA has an implementation of this.
An equivalent (but much slower) method would be to keep attempting to write, until all threads have finished writing:
__global__ void kernelRemoveSpacing(int* array, int* outArray, int arraySize) {
    int val = array[threadIdx.x];
    if (val == 0)
        return;

    for (int i = 0; i < arraySize; i++) {
        outArray[i] = val;
        __threadfence();

        // If we were the lucky thread whose write stuck, we won!
        // Kill the thread and continue re-reincarnated in a different thread.
        if (outArray[i] == val)
            return;
    }
}
The only benefit of this method is that it runs in O(f(x)) time, where f(x) is the average number of non-zero values in the array (f(x) ~= ln(n) for my implementation, thus O(ln(n)) time, but with a high constant factor).
Finally, a sort algorithm such as quicksort or mergesort would also solve the problem, and does in fact run in O(ln(n)) relative time. I think there might be an algorithm faster than this even, as we do not need to waste time ordering (swapping) zero-zero element pairs, and non-zero non-zero element pairs (the order does not need to be kept).
So I'm not quite sure which method would be the fastest, and I still
think there's a better way of handling this. Any suggestions?
What you are asking for is a classic parallel algorithm called stream compaction (1).
If Thrust is an option, you may simply use thrust::copy_if. This is a stable algorithm; it preserves the relative order of all elements.
Rough sketch:
#include <thrust/copy.h>

template <typename T>
struct is_non_zero {
    __host__ __device__
    auto operator()(T x) const -> bool {
        return x != 0;
    }
};

// ... your input and output vectors here
thrust::copy_if(input.begin(), input.end(), output.begin(), is_non_zero<int>());
If Thrust is not an option, you may implement stream compaction yourself (there is plenty of literature on the topic). It's a fun and reasonably simple exercise, while also being a basic building block for more complex parallel primitives.
(1) Strictly speaking, it's not exactly stream compaction in the traditional sense, as stream compaction is traditionally a stable algorithm but your requirements do not include stability. This relaxed requirement could perhaps lead to a more efficient implementation?
Stream compaction is a well-known problem for which a lot of code has been written (Thrust and Chagg, to cite two libraries that implement stream compaction on CUDA).
If you have a relatively new CUDA-capable device that supports intrinsic functions such as __ballot (compute capability >= 3.0), it is worth trying a small CUDA procedure that performs stream compaction much faster than Thrust.
You can find the code and minimal documentation here:
https://github.com/knotman90/cuStreamComp
It uses a balloting function in a single-kernel fashion to perform the compaction.
Edit:
I wrote an article explaining the inner workings of this approach. You can find it here if you are interested.
With this answer, I'm only trying to provide more details to Davide Spataro's approach.
As you mentioned, stream compaction consists of removing undesired elements in a collection depending on a predicate. For example, considering an array of integers and the predicate p(x)=x>5, the array A={6,3,2,11,4,5,3,7,5,77,94,0} is compacted to B={6,11,7,77,94}.
The general idea of stream compaction approaches is that each computational thread is assigned a different element of the array to be compacted. Each such thread must decide whether to write its element to the output array, depending on whether it satisfies the relevant predicate. The main problem of stream compaction is thus letting each thread know at which position its element must be written in the output array.
The approach in [1,2] is an alternative to Thrust's copy_if mentioned above and consists of three steps:
Step #1. Let P be the number of launched threads and N, with N > P, the size of the vector to be compacted. The input vector is divided into sub-vectors of size S equal to the block size. The __syncthreads_count(pred) block intrinsic is exploited, which counts the number of threads in a block satisfying the predicate pred. As a result of the first step, each element of the array d_BlockCounts, which has size N/P, contains the number of elements meeting the predicate pred in the corresponding block.
Step #2. An exclusive scan operation is performed on the array d_BlockCounts. As a result of the second step, each thread knows how many threads in the previous blocks write an element. Accordingly, it knows the position at which to write its corresponding element, up to an offset related to its own block.
Step #3. Each thread computes the mentioned offset using warp intrinsic functions and eventually writes to the output array. It should be noted that the execution of step #3 is related to warp scheduling. As a consequence, the order of the elements in the output array does not necessarily reflect the order of the elements in the input array.
Of the three steps above, the second is performed by CUDA Thrust’s exclusive_scan primitive and is computationally significantly less demanding than the other two.
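As an illustration of the structure (not the authors' actual code, which is linked above), steps #1 and #3 might be sketched as follows, assuming one thread per element and a block size that is a multiple of 32:

// Step #1: one count per block.
__global__ void blockCountsKernel(const int* d_in, int* d_blockCounts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (d_in[i] != 0);
    // Every thread in the block receives the number of threads with pred true.
    int count = __syncthreads_count(pred);
    if (threadIdx.x == 0)
        d_blockCounts[blockIdx.x] = count;
}

// Step #2 (host side): thrust::exclusive_scan(thrust::device, d_blockCounts, d_blockCounts + numBlocks, d_blockOffsets);

// Step #3: each block scatters its surviving elements starting at its offset.
__global__ void compactKernel(const int* d_in, int* d_out, const int* d_blockOffsets, int n)
{
    __shared__ int warpBase[32];              // per-warp base offsets within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (d_in[i] != 0);
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;

    unsigned ballot = __ballot_sync(0xffffffffu, pred);    // __ballot(pred) on pre-CUDA-9 toolkits
    int withinWarp = __popc(ballot & ((1u << lane) - 1));  // surviving lanes before this one
    if (lane == 0)
        warpBase[warp] = __popc(ballot);                   // total survivors in this warp
    __syncthreads();

    if (threadIdx.x == 0) {                                // tiny serial exclusive scan over the warps
        int running = 0;
        for (int w = 0; w < blockDim.x / 32; ++w) {
            int c = warpBase[w];
            warpBase[w] = running;
            running += c;
        }
    }
    __syncthreads();

    if (pred)
        d_out[d_blockOffsets[blockIdx.x] + warpBase[warp] + withinWarp] = d_in[i];
}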
For an array of 2097152 elements, the mentioned approach executed in 0.38 ms on an NVIDIA GTX 960 card, in contrast to 1.0 ms for CUDA Thrust’s copy_if. The mentioned approach appears to be faster for two reasons:
1) it is specifically tailored to cards supporting warp intrinsic functions;
2) the approach does not guarantee the output ordering.
It should be noted that we have also tested the approach against the code available at inkc.sourceforge.net. Although the latter code is arranged in a single kernel call (it does not employ any CUDA Thrust primitive), it does not perform better than the three-kernel version.
The full code is available here and is slightly optimized as compared to the original Davide Spataro's routine.
[1] M. Billeter, O. Olsson, U. Assarsson, “Efficient stream compaction on wide SIMD many-core architectures,” Proc. of the Conf. on High Performance Graphics, New Orleans, LA, Aug. 1-3, 2009, pp. 159-166.
[2] D.M. Hughes, I.S. Lim, M.W. Jones, A. Knoll, B. Spencer, “InK-Compact: in-kernel stream compaction and its application to multi-kernel data visualization on General-Purpose GPUs,” Computer Graphics Forum, vol. 32, no. 6, pp. 178-188, 2013.

Stack overflow solution with O(n) runtime

I have a problem related to runtime for push and pop in a stack.
Here, I implemented a stack using an array.
I want to avoid overflow in the stack when I insert a new element into a full stack, so when the stack is full I do the following (pseudo-code):
(I consider a stack as an array)
Generate a new array with the size of double of the origin array.
Copy all the elements in the origin stack to the new array in the same order.
Now, I know that a single push onto a stack of size n executes in O(n) in the worst case.
I want to show that the runtime of n pushes onto an empty stack is also O(n) in the worst case.
Also, how can I modify this algorithm so that every push executes in constant time in the worst case?
Amortized constant time is often just as good in practice as, if not better than, worst-case constant-time alternatives.
Generate a new array with the size of double of the origin array.
Copy all the elements in the origin stack to the new array in the same order.
This is actually a very decent and respectable solution for a stack implementation because it has good locality of reference and the cost of reallocating and copying is amortized to the point of being almost negligible. Most generalized solutions to "growable arrays" like ArrayList in Java or std::vector in C++ rely on this type of solution, though they might not exactly double in size (lots of std::vector implementations increase their size by something closer to 1.5 than 2.0).
One of the reasons this is much better than it sounds is because our hardware is super fast at copying bits and bytes sequentially. After all, we often rely on millions of pixels being blitted many times a second in our daily software. That's a copying operation from one image to another (or frame buffer). If the data is contiguous and just sequentially processed, our hardware can do it very quickly.
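To see why n pushes cost O(n) overall: with doubling, the total number of elements ever copied across n pushes is 1 + 2 + 4 + ... < 2n, so the copies add only O(n) work and each push is amortized O(1). A minimal sketch of such a doubling-array stack (int elements, no error handling, copying of the stack object itself not handled, names of my own choosing):

#include <cstddef>

class IntStack {
public:
    void push(int value) {
        if (size_ == capacity_)
            grow();                     // occasional O(size) step
        data_[size_++] = value;         // usual O(1) step
    }
    int pop() { return data_[--size_]; }
    std::size_t size() const { return size_; }
    ~IntStack() { delete[] data_; }

private:
    void grow() {
        std::size_t newCap = capacity_ ? capacity_ * 2 : 1;
        int* bigger = new int[newCap];
        for (std::size_t i = 0; i < size_; ++i)   // copy in the same order
            bigger[i] = data_[i];
        delete[] data_;
        data_ = bigger;
        capacity_ = newCap;
    }
    int* data_ = nullptr;
    std::size_t size_ = 0;
    std::size_t capacity_ = 0;
};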
Also, how can I modify this algorithm so that every push executes in constant time in the worst case?
I have come up with stack solutions in C++ that are ever-so-slightly faster than std::vector for pushing and popping a hundred million elements and that meet your requirements, but only for pushing and popping in a LIFO pattern. We're talking about something like 0.22 secs for vector as opposed to 0.19 secs for my stack. It relies on allocating the stack as a chain of small fixed-capacity blocks (each typically holding far more than a handful of elements).
Each block stores an array of contiguous data, but when it fills up, it links to a next block. The blocks are linked (each storing a previous link only), and each one might store, say, 512 bytes worth of data with 64-byte alignment. That allows constant-time pushes and pops without the need to reallocate/copy. When a block fills up, it just links a new block to the previous block and starts filling that up. When you pop, you just pop until the block becomes empty; once it's empty, you traverse its previous link to get to the previous block and start popping from that (you can also free the now-empty block at this point).
Here's your basic pseudo-C++ example of the data structure:
template <class T>
struct UnrolledNode
{
    // Points to the previous block. We use this to get
    // back to a former block when this one becomes empty.
    UnrolledNode* prev;

    // Stores the number of elements in the block. If
    // this becomes full with, say, 256 elements, we
    // allocate a new block and link it to this one.
    // If this reaches zero, we deallocate this block
    // and begin popping from the previous block.
    size_t num;

    // Array of the elements. This has a fixed capacity,
    // say 256 elements, though determined at runtime
    // based on sizeof(T). The structure is a VLS to
    // allow node and data to be allocated together.
    T data[];
};

template <class T>
struct UnrolledStack
{
    // Stores the tail end of the list (the last
    // block we're going to be using to push to and
    // pop from).
    UnrolledNode<T>* tail;
};
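To illustrate how the blocks get used, here is a rough sketch of push and pop for that structure; BLOCK_CAPACITY, the malloc-based allocation of the variable-length node, and the restriction to trivially copyable T are illustrative assumptions rather than part of the answer above:

#include <cstddef>
#include <cstdlib>

static const std::size_t BLOCK_CAPACITY = 256;   // illustrative; the answer suggests ~512 bytes per block

template <class T>
void push(UnrolledStack<T>& s, const T& value)
{
    if (!s.tail || s.tail->num == BLOCK_CAPACITY)
    {
        // Tail is missing or full: link a fresh block in front of it.
        UnrolledNode<T>* node = static_cast<UnrolledNode<T>*>(
            std::malloc(sizeof(UnrolledNode<T>) + BLOCK_CAPACITY * sizeof(T)));
        node->prev = s.tail;
        node->num = 0;
        s.tail = node;
    }
    s.tail->data[s.tail->num++] = value;          // constant time, no reallocation/copy
}

template <class T>
T pop(UnrolledStack<T>& s)
{
    T value = s.tail->data[--s.tail->num];
    if (s.tail->num == 0)
    {
        // Block emptied: free it and fall back to the previous block.
        UnrolledNode<T>* prev = s.tail->prev;
        std::free(s.tail);
        s.tail = prev;
    }
    return value;
}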
That said, I actually recommend your solution instead for performance since mine barely has a performance edge over the simple reallocate and copy solutions and yours would have a slight edge when it comes to traversal since it can just traverse the array in a straightforward sequential fashion (as well as straightforward random-access if you need it). I didn't actually implement mine for performance reasons. I implemented it to prevent pointers from being invalidated when you push things to the container (the actual thing is a memory allocator in C) and, again, in spite of achieving true constant-time push backs and pop backs, it's still barely any faster than the amortized constant-time solution involving reallocation and memory copying.


C++ - need suggestion on 2-d Data structure - size of 1-D is not fixed

I wish to have a 2-D data structure to store SAT formulas. What I want is similar to 2-D arrays. But I wish to do dynamic memory allocation.
I was thinking in terms of an array of vectors of the following form:
typedef vector<int> intVector;

intVector* clause;
clause = new intVector[numClause + 1];
Thus clause[0] will be one vector, clause[1] will be another and so on. And each vector may have a different size. But I am unsure if this is the right thing to do in terms of memory allocation.
Can an array of vectors be made, so that the vectors have different sizes? How bad is it in terms of memory management?
Thanks.
The memory management will be fine using the STL (dynamic memory allocation); performance will go down because it is dynamic and doesn't do direct access.
If you're in complete despair you may use a vector of pairs, with the first element being a static array and the second the count of used elements. If some array grows bigger, you may reallocate it with more memory and increase the count. This will be a big mess, but it may work. You'll lose less memory and get direct access to the second level of arrays, but it is ugly and dirty.
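For what it's worth, a rough sketch of that vector-of-pairs idea might look like the following (the names, the starting capacity, and the lack of a per-clause capacity field are illustrative simplifications):

#include <utility>
#include <vector>

// Each clause: a pointer to a plain array of literals plus the count of used slots.
// A real version would also track each array's capacity so it knows when to reallocate.
typedef std::pair<int*, int> Clause;

int main() {
    const int numClause = 100;        // assumed for the example
    const int initialCapacity = 8;    // room for 8 literals per clause to start with

    std::vector<Clause> formula(numClause + 1);
    for (int i = 0; i < (int)formula.size(); ++i)
        formula[i] = Clause(new int[initialCapacity], 0);

    // Append a literal to clause 0; when the count reaches the capacity,
    // reallocate that clause's array with more memory, as suggested above.
    formula[0].first[formula[0].second++] = 42;

    for (int i = 0; i < (int)formula.size(); ++i)
        delete[] formula[i].first;
    return 0;
}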

Sum reduction of binary sequence

Consider a binary sequence:
11000111
I have to find the sum of this sequence (actually in parallel):
Sum = 1 + 1 + 0 + 0 + 0 + 1 + 1 + 1 = 5
Adding the 0s is a waste of resources; why invest time in adding them?
Is there any clever way to sum this sequence so I can avoid the unnecessary additions?
Operate at the byte level rather than the bit level. Use a small LUT to convert a byte to a population count. That way you're only doing one lookup and one add per 8 bits. Unless your data is likely to be very sparse this should be quite efficient.
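For example, a byte-level LUT popcount might be sketched like this (the table-building code and the names are illustrative):

#include <cstddef>

// Precompute the population count of every possible byte value once.
static unsigned char popcount_table[256];

static void init_popcount_table() {
    for (int v = 0; v < 256; ++v) {
        int c = 0;
        for (int b = v; b; b >>= 1)
            c += b & 1;
        popcount_table[v] = (unsigned char)c;
    }
}

// One lookup and one add per 8 bits of input.
static std::size_t count_set_bits(const unsigned char* data, std::size_t len) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < len; ++i)
        sum += popcount_table[data[i]];
    return sum;
}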
Well it depends on how you store your bitset.
If it's an array, then you can't do more than a plain for loop. If you want to do this in parallel, just split the array into chunks and process them concurrently.
If we are talking about a bitset (storing the bits in a native (32/64-bit) integer type), then the simplest way to count bits would be this one:
int bitset;   // the word whose set bits we want to count
int s = 0;
for (; bitset; s++)
    bitset &= bitset - 1;
This clears the lowest set bit at every step, so you have O(s).
Of course, you can combine these two methods if you need more than 32/64 bits.
I don't know why people are answering without even looking at the link in the first comment on the question. You can easily get it under O(size_of_bitset), at least when it comes to the constant factor.
You could use this method (found in the link posted by J.F. Sebastian):
#include <stdio.h>

#define N 1024   /* number of integers in the array (illustrative) */

static inline int count_bits(int num) {
    int sum = 0;
    for (; num; sum++)
        num &= num - 1;   /* clear the lowest set bit */
    return sum;
}

int main(void) {
    int array[N] = {0};   /* fill with your bitset words */
    int total_sum = 0;

    #pragma omp parallel for reduction(+:total_sum)
    for (int i = 0; i < N; i++) {
        total_sum += count_bits(array[i]);
    }

    printf("%d\n", total_sum);
    return 0;
}
This will count the number of set bits in the memory range of the array in parallel. The inline is important to avoid unnecessary copying; the compiler should also optimize it much better.
You can swap count_bits for anything better at counting the bits in an integer to get it faster, if you find anything. This version has a complexity of O(bits_set) (not the size of the bit set!).
Invoking the parallel construct introduces quite a lot of overhead compared to a single summation, so the input needs to be quite large to compensate.
The parallelism is done via OpenMP. The partial sum of each thread is summed at the end of the parallel loop and stored in total_sum. Note that total_sum will be private to each thread inside the loop due to the reduction clause.
You could alter the code to make it count the bits set in an arbitrary memory region, but it is quite important for the region to be memory-aligned when you perform operations at such a low level.
As far as I can see, it would be wasteful to try to handle the zeros specially. As @bdares said, addition is really cheap. At a minimum, you'll need to execute N instructions to sum up an N-bit sequence, and that's if you unconditionally sum every bit. If you add a test to see whether the bit is a 0 or 1, that's another instruction that needs to be executed for each bit. Even if there's no branch penalty, you're executing a minimum of 1 instruction for every bit (the conditional test), and then you're also executing the original instruction (the add) for any bits that are equal to 1. So even without a branch penalty, this takes more time to execute.
@bdares mentions that the compiler will optimize out the branches, but that's only if the value of each bit is known at compile time, and if you know the values of the bits at compile time, you should just add them up yourself in advance.
There might be some cute things you can do with bit twiddling. For instance, if you take the bits two at a time you're adding up values of 0, 1, 2, or 3, and only have half as many additions to do. There may be something you can then do with the result to convert it into the value you want, but I haven't actually thought about how to do that.
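For reference, taking that "two bits at a time" idea to its conclusion gives the classic SWAR population count for a 32-bit word, which is a well-known bit-twiddling routine rather than something from the answers above:

#include <stdint.h>

static uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  /* pairwise sums: each 2-bit field holds 0..2 */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit fields hold 0..4 */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit fields hold 0..8 */
    return (x * 0x01010101u) >> 24;                    /* sum the four byte fields */
}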
