CUDA stream compaction algorithm

I'm trying to construct a parallel algorithm with CUDA that takes an array of integers and removes all of the 0's with or without keeping the order.
Example:
Global Memory: {0, 0, 0, 0, 14, 0, 0, 17, 0, 0, 0, 0, 13}
Host Memory Result: {17, 13, 14, 0, 0, ...}
The simplest way is to use the host to remove the 0's in O(n) time. But since I have around 1000 elements, it will probably be faster to leave everything on the GPU and condense it there before sending it back.
The preferred method would be to create an on-device stack, such that each thread can pop and push (in any order) onto or off of the stack. However, I don't think CUDA has an implementation of this.
An equivalent (but much slower) method would be to keep attempting to write, until all threads have finished writing:
__global__ void kernelRemoveSpacing(int *array, int *outArray, int arraySize) {
    int val = array[threadIdx.x];
    if (val == 0)
        return;
    for (int i = 0; i < arraySize; i++) {
        outArray[i] = val;
        __threadfence();
        // If we were the lucky thread we won!
        // Kill the thread and continue re-incarnated in a different thread.
        if (outArray[i] == val)
            return;
    }
}
The only benefit of this method is that it would perform in O(f(x)) time, where f(x) is the average number of non-zero values in the array (f(x) ~= ln(n) for my implementation, thus roughly O(ln(n)) time, but with a high constant factor).
Finally, a sort algorithm such as quicksort or mergesort would also solve the problem, and does in fact run in O(ln(n)) relative time. I think there might be an even faster algorithm than this, as we do not need to waste time ordering (swapping) zero-zero element pairs and non-zero/non-zero element pairs (the order does not need to be kept).
So I'm not quite sure which method would be the fastest, and I still
think there's a better way of handling this. Any suggestions?

What you are asking for is a classic parallel algorithm called stream compaction (1).
If Thrust is an option, you may simply use thrust::copy_if. This is a stable algorithm: it preserves the relative order of all elements.
Rough sketch:
#include <thrust/copy.h>

template<typename T>
struct is_non_zero {
    __host__ __device__
    auto operator()(T x) const -> bool {
        return x != 0;
    }
};

// ... your input and output vectors here
thrust::copy_if(input.begin(), input.end(), output.begin(), is_non_zero<int>());
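For completeness, here is a hedged end-to-end usage sketch (the vector names and the helper function are illustrative, not from the question). copy_if returns an iterator past the last element written, which gives you the compacted size:

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Illustrative usage only: compact non-zero values and report how many were kept.
// Assumes the is_non_zero functor defined above.
int compact_non_zero(const thrust::host_vector<int>& h_data,
                     thrust::device_vector<int>& output)
{
    thrust::device_vector<int> input = h_data;   // host -> device copy
    output.resize(input.size());                 // worst case: nothing removed

    auto end = thrust::copy_if(input.begin(), input.end(),
                               output.begin(), is_non_zero<int>());
    return static_cast<int>(end - output.begin());   // number of non-zero elements
}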
If Thrust is not an option, you may implement stream compaction yourself (there is plenty of literature on the topic). It's a fun and reasonably simple exercise, while also being a basic building block for more complex parallel primitives.
(1) Strictly speaking, it's not exactly stream compaction in the traditional sense, as stream compaction is traditionally a stable algorithm but your requirements do not include stability. This relaxed requirement could perhaps lead to a more efficient implementation?

Stream compaction is a well-known problem for which a lot of code has been written (Thrust and Chagg, to cite two libraries that implement stream compaction on CUDA).
If you have a relatively new CUDA-capable device that supports intrinsic functions such as __ballot (compute capability >= 3.0), it is worth trying a small CUDA procedure that performs stream compaction much faster than Thrust.
The code and minimal documentation can be found here:
https://github.com/knotman90/cuStreamComp
It uses the ballot intrinsic in a single-kernel fashion to perform the compaction.
Edit:
I wrote an article explaining the inner workings of this approach. You can find it here if you are interested.
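To give a flavor of the warp-ballot technique, here is a minimal, hedged sketch of the general idea (this is not the repository's code; the kernel and variable names are mine, it uses the *_sync intrinsic variants of newer CUDA toolkits, and it assumes the block size is a multiple of 32 and that *outCount is zeroed before launch):

__global__ void compactNonZero(const int *in, int *out, int *outCount, int n)
{
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;

    int  val  = (idx < n) ? in[idx] : 0;
    bool keep = (val != 0);

    unsigned mask  = __ballot_sync(0xFFFFFFFF, keep);    // bit i set if lane i keeps its value
    int warpTotal  = __popc(mask);                        // elements kept by this warp
    int laneOffset = __popc(mask & ((1u << lane) - 1));   // elements kept by lower lanes

    int warpBase = 0;
    if (lane == 0 && warpTotal > 0)
        warpBase = atomicAdd(outCount, warpTotal);        // reserve a contiguous output range
    warpBase = __shfl_sync(0xFFFFFFFF, warpBase, 0);      // broadcast the base to the warp

    if (keep)
        out[warpBase + laneOffset] = val;                 // order across warps is not preserved
}

Each warp issues at most one global atomic, and the intra-warp offsets come entirely from __ballot_sync/__popc; as with the approach described below, the output ordering is not preserved.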

With this answer, I'm only trying to provide more details on Davide Spataro's approach.
As you mentioned, stream compaction consists of removing undesired elements in a collection depending on a predicate. For example, considering an array of integers and the predicate p(x)=x>5, the array A={6,3,2,11,4,5,3,7,5,77,94,0} is compacted to B={6,11,7,77,94}.
The general idea of stream compaction approaches is that each computational thread is assigned a different element of the array to be compacted. Each such thread must decide whether to write its element to the output array, depending on whether it satisfies the relevant predicate. The main problem of stream compaction is thus letting each thread know the position at which its element must be written in the output array.
The approach in [1,2] is an alternative to Thrust's copy_if mentioned above and consists of three steps:
Step #1. Let P be the number of launched threads and N, with N > P, the size of the vector to be compacted. The input vector is divided into sub-vectors of size S equal to the block size. The __syncthreads_count(pred) block intrinsic is exploited, which counts the number of threads in a block satisfying the predicate pred. As a result of the first step, each element of the array d_BlockCounts, which has size N/P, contains the number of elements meeting the predicate pred in the corresponding block.
Step #2. An exclusive scan operation is performed on the array d_BlockCounts. As a result of the second step, each thread knows how many threads in the previous blocks wrote an element. Accordingly, it knows the position where to write its own element, up to an offset related to its own block.
Step #3. Each thread computes the mentioned offset using warp intrinsic functions and eventually writes to the output array. It should be noted that the execution of step #3 is related to warp scheduling. As a consequence, the element order in the output array does not necessarily reflect the element order in the input array.
Of the three steps above, the second is performed by CUDA Thrust’s exclusive_scan primitive and is computationally significantly less demanding than the other two.
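As an illustration only, here is a hedged sketch of steps #1 and #2 (the kernel and array names are mine, not taken from the cited code): step #1 counts the valid elements per block with __syncthreads_count, and step #2 exclusive-scans those counts to give each block its base offset in the output.

__global__ void countValidPerBlock(const int *d_in, int *d_BlockCounts, int n)
{
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (idx < n) && (d_in[idx] != 0);

    // Every thread of the block reaches this call; it returns the number of
    // threads in the block for which pred is non-zero.
    int blockCount = __syncthreads_count(pred);

    if (threadIdx.x == 0)
        d_BlockCounts[blockIdx.x] = blockCount;
}

// Step #2, e.g. with Thrust (needs <thrust/scan.h> and <thrust/execution_policy.h>):
//   thrust::exclusive_scan(thrust::device,
//                          d_BlockCounts, d_BlockCounts + numBlocks,
//                          d_BlockOffsets);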
For an array of 2097152 elements, the mentioned approach executed in 0.38 ms on an NVIDIA GTX 960 card, in contrast to 1.0 ms for CUDA Thrust's copy_if. The mentioned approach appears to be faster for two reasons:
1) It is specifically tailored to cards supporting warp intrinsic functions;
2) The approach does not guarantee the output ordering.
It should be noted that we have also tested the approach against the code available at inkc.sourceforge.net. Although the latter code is arranged as a single kernel call (it does not employ any CUDA Thrust primitive), it does not perform better than the three-kernel version.
The full code is available here and is slightly optimized compared to Davide Spataro's original routine.
[1] M. Billeter, O. Olsson, U. Assarsson, “Efficient stream compaction on wide SIMD many-core architectures,” Proc. of the Conf. on High Performance Graphics, New Orleans, LA, Aug. 01-03, 2009, pp. 159-166.
[2] D.M. Hughes, I.S. Lim, M.W. Jones, A. Knoll, B. Spencer, “InK-Compact: in-kernel stream compaction and its application to multi-kernel data visualization on General-Purpose GPUs,” Computer Graphics Forum, vol. 32, no. 6, pp. 178-188, 2013.

Related

What is the right way to implement a stack push on OpenCL 1.2?

Question
Suppose multiple work-items want to append to a global stack:
kernel void my_kernel(__global int* stack) {
    // ... do stuff ...
    push(stack, value);
    // ... do stuff ...
}
It is desirable that, after the kernel runs, stack contains every value pushed to it. Order does not matter. What is the proper way to do it in OpenCL 1.2?
What I've tried
An obvious idea would be to use atomic_inc to get the length and just write to it:
void push(__global int* stack, int val) {
    int idx = atomic_inc(stack) + 1; // first element is the stack length
    stack[idx] = val;
}
But I speculate having all work-items call atomic_inc separately on the same memory position ruins the parallelism. A separate idea would be to just write to a temporary array larger than the number of work items:
void push(__global int* stack, int val) {
    stack[get_global_id(0)] = val;
}
That'd leave us with a sparse array of values:
[0, 0, 0, 7, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 9, 0, 0, ...]
Which could then be compacted using "stream compaction". I thus wonder which of those ideas is the most efficient, and whether there is perhaps a third option I'm not aware of.
I can't give you a definite answer here, but I can make a few suggestions of things to try - if you have the resources, try to implement more than one of them and profile their performance on all the different types of OpenCL implementation you're planning to deploy on. You might find that different solutions perform differently on different hardware/software.
Create a stack per work-group in local memory (either explicitly or by compacting after all values have been generated), increment the global stack counter only by the per-work-group count, and copy the whole local stack into the global one. This means you only have one global atomic add per work-group. It works better for large groups, of course. (A sketch of this idea appears after the last option below.)
Your biggest source of atomic contention in the naive approach will be from items on the same work-group. So you could create as many stacks as items per work group, and have each item in the group submit to its "own" stack. You'll still need a compaction step after this to combine it all into one list. Vary group size if you try this. I'm not sure to what extent current GPUs suffer from false sharing (atomics locking a whole cache line, not just that word) so you'll want to check that and/or experiment with different gaps between stack counters in memory.
Write all results at fixed offsets (based on global id) into an array large enough to cover the worst case, and queue a separate compaction kernel that post-processes the result into a contiguous array.
Don't bother with a compact representation of the result. Instead, use the sparse array as the input for the next stage of computation. This next stage's work group can compact a fixed subset of the sparse array into local memory. When that's done, each work item then works on one item of the compacted array. Iterate inside the kernel until all have been processed. How well this works will depend on how predictable the statistical distribution of the sparse items in the array is, and your choice of work group size and how much of the sparse array each work group processes. This version also avoids the round trip to the host processor.
On Intel IGPs specifically, I have heard that DirectX/OpenGL/Vulkan geometry shaders with variable number of outputs perform exceptionally well. If you can write your algorithm in the format of a geometry shader, this might be worth a try if you're targeting those devices. For nvidia/AMD, don't bother with this.
There are probably other options, but those should give you some ideas.
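As a rough illustration of the first option, here is a hedged OpenCL 1.2 sketch (the kernel name, predicate, and argument layout are mine; the host is assumed to zero stack[0], which holds the element count, and to pass __local buffers sized to the work-group):

kernel void push_per_group(__global int* stack,
                           __global const int* input,
                           __local  int* local_vals,   // one slot per work-item
                           __local  int* local_count)  // a single int
{
    __local int base;
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    if (lid == 0)
        *local_count = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    int value = input[gid];                     // "... do stuff ..." would go here
    if (value != 0) {                           // example predicate: keep non-zero values
        int pos = atomic_inc(local_count);      // local atomics: cheap contention
        local_vals[pos] = value;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0)
        base = atomic_add(&stack[0], *local_count) + 1;  // one global atomic per group
    barrier(CLK_LOCAL_MEM_FENCE);

    // Cooperative copy of the local stack into the reserved global range.
    for (int i = (int)lid; i < *local_count; i += (int)get_local_size(0))
        stack[base + i] = local_vals[i];
}

Contention is now one global atomic per work-group instead of one per work-item, at the cost of a local compaction pass.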


Is there any probabilistic data structure that reduces the space complexity of a large number of counters?

Basically I need to keep track of a large number of counters. I can increment or decrement each counter by name. The simplest way to do so is to use a hash table, using counter_name as key and its corresponding count as the value for that key.
The counters don't need to be 100% accurate, approximate values for count are fine. So I'm wondering if there is any probabilistic data structure that can reduce the space complexity of N counters to lower than O(N), kinda similar to how HyperLogLog reduces the memory requirement of counting N items by giving only an approximate result. Any ideas?
In my opinion, the thing you are looking for is Count-min sketch.
Reading a stream of elements a1, a2, a3, ..., an, where there can be a lot of repeated elements, it can at any time answer the following question: how many ai elements have you seen so far?
Basically, your unique elements can be bijected onto your counters. The count-min sketch allows you to adjust its parameters to trade memory for accuracy.
P.S. I described some other popular probabilistic data structures here.
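For concreteness, here is a minimal, hedged C++ sketch of a count-min sketch (the class layout and the cheap per-row hash are illustrative, not from any particular library). Each update touches one cell per row, and the estimate is the minimum over those cells, so it can only over-count; decrements work the same way as long as the true counts never go negative:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class CountMinSketch {
public:
    CountMinSketch(std::size_t width, std::size_t depth)
        : w_(width), d_(depth), table_(depth, std::vector<uint64_t>(width, 0)) {}

    void increment(const std::string& key, uint64_t by = 1) {
        for (std::size_t i = 0; i < d_; ++i)
            table_[i][index(key, i)] += by;        // one cell per row
    }

    uint64_t estimate(const std::string& key) const {
        uint64_t best = UINT64_MAX;
        for (std::size_t i = 0; i < d_; ++i)
            best = std::min(best, table_[i][index(key, i)]);
        return best;                               // minimum over the d cells
    }

private:
    // One cheap per-row hash: mix the row index into std::hash's output.
    std::size_t index(const std::string& key, std::size_t row) const {
        std::size_t h = std::hash<std::string>{}(key);
        return (h ^ (row * 0x9E3779B97F4A7C15ull)) % w_;
    }

    std::size_t w_, d_;
    std::vector<std::vector<uint64_t>> table_;
};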
Stefan Haustein is correct that the names are likely to take more space than the counters, and you may be able to prioritise certain names as he suggests, but failing that you can consider how best to store the names. If they're fairly short (e.g. 8 characters or less), you might consider using a closed hashing table that stores them directly in the buckets. If they're long, you could store them contiguously (NUL terminated) in a block of memory, and in the hash table store the offset of their first character into that block.
For the counter itself, you can save space by using a probabilistic approach as follows:
#include <cstdlib>   // rand()

// Probabilistic counter: n_ stores (roughly) log2 of the number of increments.
template <typename T, typename Q = unsigned>
class Approx_Counter
{
  public:
    Approx_Counter() : n_(0) { }

    Approx_Counter& operator++()
    {
        if (n_ < 2 || rand() % (operator Q()) == 0)
            ++n_;
        return *this;
    }

    operator Q() const { return n_ < 2 ? n_ : 1 << n_; }

  private:
    T n_;
};
Then you can use e.g. Approx_Counter<unsigned char, unsigned long>. Swap out rand() for a C++11 generator if you care.
The idea's simple:
when n_ is 0, ++ has definitely not been invoked
when n_ is 1, ++ has definitely been invoked exactly once
when n_ >= 2, it indicates ++ has probably been invoked about 2^n_ times
To keep that last implication in line with the number of ++ invocations actually made, each invocation has a 1 in 2^n_ chance of actually incrementing n_ again.
Just make sure your rand() or substitute returns values much larger than the largest counter value you want to track, otherwise you'll get rand() % (operator Q()) == 0 too often and increment inappropriately.
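A hedged usage sketch (it assumes the class above plus an extra <iostream> include; the printed estimate will only be in the right ballpark, with large variance):

#include <iostream>

int main()
{
    Approx_Counter<unsigned char, unsigned long> c;      // one byte of counter state
    for (int i = 0; i < 1000000; ++i)
        ++c;
    std::cout << static_cast<unsigned long>(c) << "\n";  // roughly 10^6, not exact
    return 0;
}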
That said, having a smaller counter doesn't help much if you have pointers or offsets to it, so you'll want to squeeze the counter into the bucket too, another reason to prefer your own closed hashing implementation if you genuinely need to tighten up memory usage but want to stick with a hash table (a trie is another possibility).
The above is still O(N) in counter space, just with a smaller constant. For genuinely sub-O(N) options, you need to consider whether and how keys are related, such that incrementing one counter might reasonably impact multiple keys. You've given us no insights in your question to date.
The names probably take up more space than the counters.
How about having a fixed number of counters and only keep the ones with the highest counts, plus some kind of LRU mechanism to allow new counters to rise to the top? I guess it really depends on your use case...
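To make that concrete, one classical scheme along these lines (a hedged sketch, not necessarily what is meant above) is a Misra-Gries style "frequent items" table: keep at most K named counters, and when a new name arrives while the table is full, decrement every counter and evict the ones that reach zero. Estimates never exceed the true count and are off by roughly total/K at most:

#include <cstddef>
#include <string>
#include <unordered_map>

class FrequentCounters {
public:
    explicit FrequentCounters(std::size_t capacity) : k_(capacity) {}

    void increment(const std::string& name) {
        auto it = counts_.find(name);
        if (it != counts_.end()) {
            ++it->second;                          // already tracked
        } else if (counts_.size() < k_) {
            counts_[name] = 1;                     // room for a new counter
        } else {
            // Table full: decrement everyone, evicting counters that hit zero.
            for (auto i = counts_.begin(); i != counts_.end(); ) {
                if (--(i->second) == 0)
                    i = counts_.erase(i);
                else
                    ++i;
            }
        }
    }

    long long estimate(const std::string& name) const {
        auto it = counts_.find(name);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::size_t k_;
    std::unordered_map<std::string, long long> counts_;
};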

How is a prefix sum a bulk-synchronous algorithmic primitive?

Concerning NVIDIA GPUs, the author of the paper High Performance and Scalable GPU Graph Traversal says:
1. A sequence of kernel invocations is bulk-synchronous: each kernel is initially presented with a consistent view of the results from the previous.
2. Prefix-sum is a bulk-synchronous algorithmic primitive.
I am unable to understand these two points (I know GPU-based prefix sums, though). Can someone help me with this concept?
1. A sequence of kernel invocations is bulk-synchronous: each kernel is initially presented with a consistent view of the results from the previous.
This is about the parallel computation model (bulk-synchronous parallel, BSP): each processor has its own memory which is fast (like a cache in a CPU) and performs computation using values stored there without any synchronization. Then a non-blocking communication step takes place - each processor puts the data it has computed so far and gets data from its neighbours. Then there's another synchronization - a barrier, which makes all of them wait for each other.
2. Prefix-sum is a bulk-synchronous algorithmic primitive.
I believe that's about the second step of BSP model - synchronization. That's the way processors store and get data for the next step.
The name of the model implies that it is highly concurrent (many, many processes that work synchronously relative to each other). And this is how we get to the second point.
Since we want to live up to the name (be highly concurrent), we want to get rid of sequential parts where possible. We can achieve that with a prefix sum.
Consider the prefix-sum associative operator +. A scan of the set [5 2 0 3 1] returns [0 5 7 7 10] (exclusive scan) or [5 7 7 10 11] (inclusive scan). So, now we can replace this sequential pseudocode:
foreach i = 1...n
    foo[i] = foo[i-1] + bar(i);
with this pseudocode, which now can be parallel(!):
foreach(i)
    baz[i] = bar(i);
scan(foo, baz);
That is a very naive version, but it's okay for an explanation.
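A hedged, concrete version of that rewrite using CUDA Thrust (bar is a placeholder per-element function, here squaring; the names are mine):

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

struct bar_op {
    __host__ __device__ int operator()(int i) const { return i * i; }  // placeholder for bar(i)
};

void prefix_sum_example(int n)
{
    thrust::device_vector<int> idx(n), baz(n), foo(n);
    thrust::sequence(idx.begin(), idx.end());                           // idx = 0, 1, ..., n-1
    thrust::transform(idx.begin(), idx.end(), baz.begin(), bar_op());   // baz[i] = bar(i), fully parallel
    thrust::exclusive_scan(baz.begin(), baz.end(), foo.begin());        // foo[i] = baz[0] + ... + baz[i-1]
}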

Adaptive IO Optimization Problem

Here is an interesting optimization problem that I think about for some days now:
In a system I read data from a slow IO device. I don't know beforehand how much data I need. The exact length is only known once I have read an entire package (think of it as it has some kind of end-symbol). Reading more data than required is not a problem except that it wastes time in IO.
Two constraints also come into play: reads are very slow, and each byte I read costs. Also, each read request has a constant setup cost regardless of the number of bytes read, which makes reading byte by byte costly. As a rule of thumb, the setup cost is roughly as expensive as reading 5 bytes.
The packages I read are usually between 9 and 64 bytes, but there are rare occurrences of larger or smaller packages. The entire range is between 1 and 120 bytes.
Of course I know a little bit of my data: Packages come in sequences of identical sizes. I can classify three patterns here:
Sequences of reads with identical sizes:
A A A A A ...
Alternating sequences:
A B A B A B A B ...
And sequences of triples:
A B C A B C A B C ...
The special case of degenerated triples exist as well:
A A B A A B A A B ...
(A, B and C denote some package size between 1 and 120 here).
Question:
Based on the size of the previous packages, how do I predict the size of the next read request? I need something that adapts fast, uses little storage (let's say below 500 bytes) and is fast from a computational point of view as well.
Oh - and pre-generating some tables won't work because the statistics of read sizes can vary a lot between the different devices I read from.
Any ideas?
You need to read at least 3 packages and at most 4 packages to identify the pattern.
Read 3 packages. If they are all same size, then the pattern is AAAAAA...
If they are not all the same size, read the 4th package. If the 1st and 3rd sizes match and the 2nd and 4th sizes match, the pattern is ABAB...; otherwise, the pattern is ABCABC...
With that outline, it is probably a good idea to do a speculative read of 3 package sizes (something like 3*64 bytes at a single go).
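A hedged sketch of that decision in C (the function names are mine; s[0..3] holds the last package sizes read, oldest first):

enum Pattern { PATTERN_AAAA, PATTERN_ABAB, PATTERN_ABCABC };

enum Pattern classify(const int s[4])
{
    if (s[0] == s[1] && s[1] == s[2])
        return PATTERN_AAAA;              // three identical sizes
    if (s[0] == s[2] && s[1] == s[3])
        return PATTERN_ABAB;              // 1st matches 3rd and 2nd matches 4th
    return PATTERN_ABCABC;
}

// Predict the size of the next read from the detected pattern.
int predict_next(enum Pattern p, const int s[4])
{
    switch (p) {
    case PATTERN_AAAA:   return s[3];     // same size again
    case PATTERN_ABAB:   return s[2];     // period two: two packages back
    case PATTERN_ABCABC: return s[1];     // period three: three packages back
    }
    return s[3];
}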
I don't see a problem here... But first, several questions:
1) Can you read the input asynchronously (e.g. in a separate thread, an interrupt routine, etc.)?
2) Do you have some free memory for a buffer?
3) If you've commanded a longer read, are you able to obtain the first byte(s) before the whole packet is read?
If so (and I think in most cases this can be implemented), then you can just have a separate thread that reads the input at the highest possible speed and stores it in a buffer, stalling when the buffer gets full, so that your normal process can use a synchronous getc() on that buffer.
EDIT: I see.. it's because of CRC or encryption? Well, then you could use some ideas from data compression:
Consider a simple adaptive algorithm of order N for M possible symbols:
// M and DECAY_LIMIT are compile-time constants tuned to your data.
// Order-2 example (N = 2): freqs[a][b][c] counts how often outcome "c"
// followed the two previous values "a" and "b".
int freqs[M][M][M];
int prev[2]; // some history

int predict() {
    int prediction = 0;
    for (int i = 1; i < M; i++)
        if (freqs[prev[0]][prev[1]][i] > freqs[prev[0]][prev[1]][prediction])
            prediction = i;
    return prediction;
}

void add_outcome(int val) {
    if (freqs[prev[0]][prev[1]][val]++ > DECAY_LIMIT) {
        for (int i = 0; i < M; i++)
            freqs[prev[0]][prev[1]][i] >>= 1;
    }
    prev[0] = prev[1];
    prev[1] = val;
}
freqs has to be an array of order N+1, and you have to remember the N previous values. N and DECAY_LIMIT have to be adjusted according to the statistics of the input. However, even they can be made adaptive (for example, if the predictor produces too many misses, the decay limit can be shortened).
The last problem would be the alphabet. Depending on the context, if there are only a few distinct sizes, you can create a one-to-one mapping to your symbols. If there are more, you can use quantization to limit the number of symbols. The whole algorithm can be written with pointer arithmetic, so that N and M won't be hardcoded.
Since reading is so slow, I suppose you can throw some CPU power at it so you can try to make an educated guess of how much to read.
That would basically be a predictor with a model based on probabilities. It would generate a sample of predictions of the upcoming message size, and the cost of each, then pick the message size that has the best expected cost.
Then, when you find out the actual message size, use Bayes' rule to update the model probabilities and do it again.
Maybe this sounds complicated, but if the probabilities are stored as fixed-point fractions you won't have to deal with floating point, so it may not be much code. I would use something like a Metropolis-Hastings algorithm as my basic simulator and Bayesian update framework. (This is just an initial stab at thinking about it.)
