Num Threads trade-off in non-parallelizable work - gpgpu

I've been a good boy and parallelized my compute shader to execute 955 threads for 20 iterations
[numthreads(955, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
...
for (uint i = 0; i < 20; i++)
{
GroupMemoryBarrierWithGroupSync();
//read from and write to groupshared memory
}
}
But this isn't going to work out (because the parallelization introduces a realtime delay) so I have to do it a less parallel way. The easy way to approach the problem is to have 20 threads doing 955 iterations each
[numthreads(20, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
...
for (uint i = 0; i < 955; i++)
{
GroupMemoryBarrierWithGroupSync();
//read from and write to groupshared memory
}
}
However, I can't reason about how this is going to perform (probably terribly).
I under this new approach I must keep the number iterations the same, but can trade off the frequency which I call the compute shader with the number of threads. Which gives me two options:
Increase 20 -> 32 to have a full warp.
Increase 20 -> 32 * n to have warps running in parallel.
Maybe accessing groupshared memory is very cheap and so I don't have a performance problem in the first place.
Maybe I should try to optimize this on the cpu (I've already tried unoptimized and the performance was less than desired).

Someone commented on this answer
To be specific, a single-thread group will generally cap utilization to around 3-6%. Dispatching only one group compounds the issue, capping utilization to well under 1%. Sticking to 256 threads with power-of-two dimension sizes is a good rule of thumb, and you should dispatch at least 2048 or so threads total to keep the hardware busy.
and I decided that doing this work on the gpu is a stupid thing to do. It's always best to look for robust solutions.
The rubust solution for my problem is to use SIMD, which I will have to now learn the hard way.

Related

CUDA Parallel Cross Product

Disclaimer: I am fairly new to CUDA and parallel programming - so if you're not going to bother to answer my question, just ignore this, or at least point me to the right resources so I can find the answer myself.
Here's the particular problem I'm looking to solve using parallel programming. I have some 1D arrays that store 3D vectors in this format -> [v0x, v0y, v0z, ... vnx, vny, vnz], where n is the vector, and x, y, z are the respective components.
Suppose I want to find the cross product between vectors [v0, v1, ... vn] in one array and their corresponding vectors [v0, v1, ... vn] in another array.
The calculation is pretty straightforward without parallelization:
result[x] = vec1[y]*vec2[z] - vec1[z]*vec2[y];
result[y] = vec1[z]*vec2[x] - vec1[x]*vec2[z];
result[z] = vec1[x]*vec2[y] - vec1[y]*vec2[x];
The problem I'm having is understanding how to implement CUDA parallelization for the arrays I currently have. Since each value in the result vector is a separate calculation, I can effectively run the above calculation for each vector in parallel. Since each component of the resulting cross product is a separate calculation, those too could run in parallel. How would I go about setting up the blocks and threads/ go about thinking about setting up the threads for such a problem?
The top 2 optimization priorities for any CUDA programmer are to use memory efficiently, and expose enough parallelism to hide latency. We'll use those to guide our algorithmic choices.
A very simple thread strategy (the thread strategy answers the question, "what will each thread do or be responsible for?") in any transformation (as opposed to reduction) type problem is to have each thread be responsible for 1 output value. Your problem fits the description of transformation - the output data set size is on the order of the input data set size(s).
I'll assume that you intended to have two equal length vectors containing your 3D vectors, and that you want to take the cross product of the first 3D vectors in each and the 2nd 3D vectors in each, and so on.
If we choose a thread strategy of 1 output point per thread (i.e. result[x] or result[y] or result[z], all together would be 3 output points), then we will need 3 threads to compute the output of each vector cross product. If we have enough vectors to multiply, then we will have enough threads to keep our machine "busy" and do a good job of hiding latency. As a rule of thumb, your problem will start to become interesting on GPUs if the number of threads is 10000 or more, so this means we would want your 1D vectors to consist of about 3000 3D vectors or more. Let's assume that is the case.
In order to tackle the memory efficiency objective, our first task is to load your vector data from global memory. We will want this ideally to be coalesced, which roughly means adjacent threads access adjacent elements in memory. We'll want the output stores to be coalesced also, and our thread strategy of choosing one output point/one vector component per thread will work nicely to support that.
For efficient memory usage, we'd like to ideally load each item from global memory only once. Your algorithm naturally involves a small amount of data reuse. The data reuse is evident since the computation of result[y] depends on vec2[z] and the computation of result[x] also depends on vec2[z] to pick just one example. Therefore a typical strategy when there is data reuse is to load the data first into CUDA shared memory, and then allow the threads to perform their computations based on the data in shared memory. As we will see, this makes it fairly easy/convenient for us to arrange for coalesced loads from global memory, since the global data load arrangement is no longer tightly coupled to the threads or the usage of the data for computation.
The last challenge is to figure out an indexing pattern so that each thread will select the proper elements from shared memory to multiply together. If we look at your calculation pattern that you have depicted in your question, we see that the first load from vec1 follows an offset pattern of +1(modulo 3) from the index that the result is being computed for. So x->y, y->z, and z -> x. Likewise we see a +2(modulo 3) for the next load from vec2, another +2(modulo 3) pattern for the next load from vec1 and another +1(modulo 3) pattern for the final load from vec2.
If we combine all these ideas, we can then write a kernel that should have generally efficient characteristics:
$ cat t1003.cu
#include <stdio.h>
#define TV1 1
#define TV2 2
const size_t N = 4096; // number of 3D vectors
const int blksize = 192; // choose as multiple of 3 and 32, and less than 1024
typedef float mytype;
//pairwise vector cross product
template <typename T>
__global__ void vcp(const T * __restrict__ vec1, const T * __restrict__ vec2, T * __restrict__ res, const size_t n){
__shared__ T sv1[blksize];
__shared__ T sv2[blksize];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
while (idx < 3*n){ // grid-stride loop
// load shared memory using coalesced pattern to global memory
sv1[threadIdx.x] = vec1[idx];
sv2[threadIdx.x] = vec2[idx];
// compute modulo/offset indexing for thread loads of shared data from vec1, vec2
int my_mod = threadIdx.x%3; // costly, but possibly hidden by global load latency
int off1 = my_mod+1;
if (off1 > 2) off1 -= 3;
int off2 = my_mod+2;
if (off2 > 2) off2 -= 3;
__syncthreads();
// each thread loads its computation elements from shared memory
T t1 = sv1[threadIdx.x-my_mod+off1];
T t2 = sv2[threadIdx.x-my_mod+off2];
T t3 = sv1[threadIdx.x-my_mod+off2];
T t4 = sv2[threadIdx.x-my_mod+off1];
// compute result, and store using coalesced pattern, to global memory
res[idx] = t1*t2-t3*t4;
idx += gridDim.x*blockDim.x;} // for grid-stride loop
}
int main(){
mytype *h_v1, *h_v2, *d_v1, *d_v2, *h_res, *d_res;
h_v1 = (mytype *)malloc(N*3*sizeof(mytype));
h_v2 = (mytype *)malloc(N*3*sizeof(mytype));
h_res = (mytype *)malloc(N*3*sizeof(mytype));
cudaMalloc(&d_v1, N*3*sizeof(mytype));
cudaMalloc(&d_v2, N*3*sizeof(mytype));
cudaMalloc(&d_res, N*3*sizeof(mytype));
for (int i = 0; i<N; i++){
h_v1[3*i] = TV1;
h_v1[3*i+1] = 0;
h_v1[3*i+2] = 0;
h_v2[3*i] = 0;
h_v2[3*i+1] = TV2;
h_v2[3*i+2] = 0;
h_res[3*i] = 0;
h_res[3*i+1] = 0;
h_res[3*i+2] = 0;}
cudaMemcpy(d_v1, h_v1, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_v2, h_v2, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N);
cudaMemcpy(h_res, d_res, N*3*sizeof(mytype), cudaMemcpyDeviceToHost);
// verification
for (int i = 0; i < N; i++) if ((h_res[3*i] != 0) || (h_res[3*i+1] != 0) || (h_res[3*i+2] != TV1*TV2)) { printf("mismatch at %d, was: %f, %f, %f, should be: %f, %f, %f\n", i, h_res[3*i], h_res[3*i+1], h_res[3*i+2], (float)0, (float)0, (float)(TV1*TV2)); return -1;}
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
return 0;
}
$ nvcc t1003.cu -o t1003
$ cuda-memcheck ./t1003
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$
Note that I've chosen to write the kernel using a grid-stride loop. This isn't terribly important to this discussion, and not that relevant for this problem, because I've chosen a grid size equal to the problem size (4096*3). However for much larger problem sizes, you might choose a smaller grid size than the overall problem size, for some possible small efficiency gain.
For such a simple problem as this, it's fairly easy to define "optimality". The optimal scenario would be however long it takes to load the input data (just once) and write the output data. If we consider a larger version of the test code above, changing N to 40960 (and making no other changes), then the total data read and written would be 40960*3*4*3 bytes. If we profile that code and then compare to bandwidthTest as a proxy for peak achievable memory bandwidth, we observe:
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1003
==27861== NVPROF is profiling process 27861, command: ./t1003
no error
==27861== Profiling application: ./t1003
==27861== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 65.97% 162.22us 2 81.109us 77.733us 84.485us [CUDA memcpy HtoD]
30.04% 73.860us 1 73.860us 73.860us 73.860us [CUDA memcpy DtoH]
4.00% 9.8240us 1 9.8240us 9.8240us 9.8240us void vcp<float>(float const *, float const *, float*, unsigned long)
API calls: 99.10% 249.79ms 3 83.263ms 6.8890us 249.52ms cudaMalloc
0.46% 1.1518ms 96 11.998us 374ns 454.09us cuDeviceGetAttribute
0.25% 640.18us 3 213.39us 186.99us 229.86us cudaMemcpy
0.10% 255.00us 1 255.00us 255.00us 255.00us cuDeviceTotalMem
0.05% 133.16us 1 133.16us 133.16us 133.16us cuDeviceGetName
0.03% 71.903us 1 71.903us 71.903us 71.903us cudaLaunchKernel
0.01% 15.156us 1 15.156us 15.156us 15.156us cuDeviceGetPCIBusId
0.00% 7.0920us 3 2.3640us 711ns 4.6520us cuDeviceGetCount
0.00% 2.7780us 2 1.3890us 612ns 2.1660us cuDeviceGet
0.00% 1.9670us 1 1.9670us 1.9670us 1.9670us cudaGetLastError
0.00% 361ns 1 361ns 361ns 361ns cudaGetErrorString
$ CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20Xm
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6375.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6554.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 171220.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
The kernel takes 9.8240us to execute, and in that time loads or stores a total of 40960*3*4*3 bytes of data. Therefore the achieved memory bandwidth by the kernel is 40960*3*4*3/0.000009824 or 150 GB/s. The proxy measurement for peak achievable on this GPU is 171 GB/s, so this kernel achieves 88% of the optimal throughput. With more careful benchmarking to run the kernel twice in a row, the 2nd execution requires only 8.99us to execute. This brings the achieved bandwidth in this case up to 96% of peak achievable throughput.

Cache friendly offline random read

Consider this function in C++:
void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
while (b1 != b2) {
// assert(0 <= *b1 && *b1 < a2 - a1)
*o++ = a1[*b1++];
}
}
Its purpose should be clear enough. Unfortunately, b1 contains random data and trash the cache, making foo the bottleneck of my program. Is there anyway I can optimize it?
This is an SSCCE that should resemble my actual code:
#include <iostream>
#include <chrono>
#include <algorithm>
#include <numeric>
namespace {
void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
while (b1 != b2) {
// assert(0 <= *b1 && *b1 < a2 - a1)
*o++ = a1[*b1++];
}
}
constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;
uint32_t data[max_n], index[max_q], result[max_q];
}
int main() {
uint32_t seed = 0;
auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
std::generate_n(data, max_n, rng);
std::generate_n(index, max_q, [rng]() { return rng() % max_n; });
auto t1 = std::chrono::high_resolution_clock::now();
foo(data, data + max_n, index, index + max_q, result);
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
uint32_t hash = 0;
for (unsigned i = 0; i < max_q; i++)
hash += result[i] ^ (i << 8) ^ i;
std::cout << hash << std::endl;
}
This is not Cache-friendly copying of an array with readjustment by known index, gather, scatter, which asks about random writes and assumes b is a permutation.
First, let's take a look at the actual performance of the code above:
$ sudo perf stat ./offline-read
0.123023
1451229184
Performance counter stats for './offline-read':
184.661547 task-clock (msec) # 0.997 CPUs utilized
3 context-switches # 0.016 K/sec
0 cpu-migrations # 0.000 K/sec
717 page-faults # 0.004 M/sec
623,638,834 cycles # 3.377 GHz
419,309,952 instructions # 0.67 insn per cycle
70,803,672 branches # 383.424 M/sec
16,895 branch-misses # 0.02% of all branches
0.185129552 seconds time elapsed
We are getting a low IPC of 0.67, probably caused almost entirely by load-misses to DRAM5. Let's confirm:
sudo ../pmu-tools/ocperf.py stat -e cycles,LLC-load-misses,cycle_activity.stalls_l3_miss ./offline-read
perf stat -e cycles,LLC-load-misses,cpu/event=0xa3,umask=0x6,cmask=6,name=cycle_activity_stalls_l3_miss/ ./offline-read
0.123979
1451229184
Performance counter stats for './offline-read':
622,661,371 cycles
16,114,063 LLC-load-misses
368,395,404 cycle_activity_stalls_l3_miss
0.184045411 seconds time elapsed
So ~370k cycles out of 620k are straight-up stalled on outstanding misses. In fact, the portion of cycles stalled this way in foo() is much higher, close to 90% since perf is also measuring the init and accumulate code which takes about a third of the runtime (but doesn't have significant L3 misses).
This is nothing unexpected, since we knew the random-read pattern a1[*b1++] was going to have essentially zero locality. In fact, the number of LLC-load-misses is 16 million1, corresponding almost exactly to the 16 million random reads of a1.2
If we just assume 100% of foo() is spending waiting on memory access, we can get an idea of the total cost of each miss: 0.123 sec / 16,114,063 misses == 7.63 ns/miss. On my box, the memory latency is around 60 ns in the best case, so less than 8 ns per miss means we are already extracting a lot of memory-level parallelism (MLP): about 8 misses would have to be overlapped and in-flight on average to achieve this (even totally ignoring the additional traffic from the streaming load of b1 and streaming write of o).
So I don't think there are many tweaks you can apply to the simple loop to do much better. Still, two possibilities are:
Non-temporal stores for the writes to o, if your platform supports them. This would cut out the reads implied by RFO for normal stores. It should be a straight win since o is never read again (inside the timed portion!).
Software prefetching. Carefully tuned prefetching of a1 or b1 could potentially help a bit. The impact is going to be fairly limited, however, since we are already approaching the limits of MLP as described above. Also, we expect the linear reads of b1 to be almost perfectly prefetched by the hardware prefetchers. The random reads of a1 seem like they could be amenable to prefetching, but in practice the ILP in the loop leads to enough MLP though out-of-order processing (at least on big OoO processors like recent x86).
In the comments user harold already mentioned that he tried prefetching with only a small effect.
So since the simple tweaks aren't likely to bear much fruit, you are left with transforming the loop. One "obvious" transformation is to sort the indexes b1 (along with the index element's original position) and then do the reads from a1 in sorted order. This transforms the reads of a1 from completely random, to almost3 linear, but now the writes are all random, which is no better.
Sort and then unsort
The key problem is that the reads of a1 under control of b1 are random, and a1 is large you get a miss-to-DRAM for essentially every read. We can fix that by sorting b1, and then reading a1 in order to get a permuted result. Now you need to "un-permute" the result a1 to get the result in the final order, which is simply another sort, this time on the "output index".
Here's a worked example with the given input array a, index array b and output array o, and i which is the (implicit) position of each element:
i = 0 1 2 3
a = [00, 10, 20, 30]
b = [ 3, 1, 0, 1]
o = [30, 10, 00, 10] (desired result)
First, sort array b, with the original array position i as secondary data (alternately you may see this as sorting tuples (b[0], 0), (b[1], 1), ...), this gives you the sorted b array b' and the sorted index list i' as shown:
i' = [ 2, 1, 3, 0]
b' = [ 0, 1, 1, 3]
Now you can read the permuted result array o' from a under the control of b'. This read is strictly increasing in order, and should be able to operate at close to memcpy speeds. In fact you may be able to take advantage of wide contiguous SIMD reads and some shuffles to do several reads and once and move the 4-byte elements into the right place (duplicating some elements and skipping others):
a = [00, 10, 20, 30]
b' = [ 0, 1, 1, 3]
o' = [00, 10, 10, 30]
Finally, you de-permute o' to get o, conceptually simply by sorting o' on the permuted indexes i':
i' = [ 2, 1, 3, 0]
o' = [00, 10, 10, 30]
i = [ 0, 1, 2, 3]
o = [30, 10, 00, 10]
Finished!
Now this is the simplest idea of the technique and isn't particularly cache-friendly (each pass conceptually iterates over one or more 2^26-byte arrays), but it at least fully uses every cache line it reads (unlike the original loop which only reads a single element from a cache line, which is why you have 16 million misses even though the data only occupies 1 million cache lines!). All of the reads are more or less linear, so hardware prefetching will help a lot.
How much speedup you get probably large depends on how will you implement the sorts: they need to be fast and cache sensitive. Almost certainly some type of cache-aware radix sort will work best.
Here are some notes on ways to improve this further:
Optimize the amount of sorting
You don't actually need to fully sort b. You just want to sort it "enough" such that the subsequent reads of a under the control of b' are more or less linear. For example, 16 elements fit in a cache line, so you don't need to sort based on the last 4 bits at all: the same linear sequence of cache lines will be read anyways. You could also sort on even fewer bits: e.g., if you ignored the 5 least-significant bits, you'd read cache lines in an "almost linear" way, sometimes swapping two cache lines from the perfectly linear pattern like: 0, 1, 3, 2, 5, 4, 6, 7. Here, you'll still get the full benefit of the L1 cache (subsequent reads to a cache line will always hit), and I suspect such a pattern would still be prefetched well and if not you can always help it with software prefetching.
You can test on your system what the optimal number of ignored bits is. Ignoring bits has two benefits:
Less work to do in the radix search, either from fewer passes needed or needing fewer buckets in one or more passes (which helps caching).
Potentially less work to do to "undo" the permutation in the last step: if the undo by examining the original index array b, ignoring bits means that you get the same savings when undoing the search.
Cache block the work
The above description lays out everything in several sequential, disjoint passes that each work on the entire data set. In practice, you'd probably want to interleave them to get better caching behavior. For example, assuming you use an MSD radix-256 sort, you might do the first pass, sorting the data into 256 buckets of approximately 256K elements each.
Then rather than doing the full second pass, you might finish sorting only the first (or first few) buckets, and proceed to do the read of a based on the resulting block of b'. You are guaranteed that this block is contiguous (i.e., a suffix of the final sorted sequence) so you don't give up any locality in the read, and your reads will generally be cached. You may also do the first pass of de-permuting o' since the block of o' is also hot in the cache (and perhaps you can combine the latter two phases into one loop).
Smart De-permutation
One area for optimization is how exactly the de-permutation of o' is implemented. In the description above, we assume some index array i initially with values [0, 1, 2, ..., max_q] which is sorted along with b. That's conceptually how it works, but you may not need to actually materialize i right away and sort it as auxillary data. In the first pass of the radix sort, for example, the value of i is implicitly known (since you are iterating through the data), so it could be calculated for free4 and written out during the first pass without every having appeared in sorted order.
There may also be more efficient ways to do the "unsort" operation than maintaining the full index. For example, the original unsorted b array conceptually has all the information needed to do the unsort, but it is clear to me how to use to efficiently unsort.
Is it be faster?
So will this actually be faster than the naive approach? It depends largely on implementation details especially including the efficiency of the implemented sort. On my hardware, the naive approach is processing about ~140 million elements per second. Online descriptions of cache-aware radix sorts seem to vary from perhaps 200 to 600 million elements/s, and since you need two of those, the opportunity for a big speedup would seem limited if you believe those numbers. On other hand, those numbers are from older hardware, and for slightly more general searches (e.g,. for all 32 bits of the key, while we may be able to use as few as 16 bits).
Only a careful implementation will determine if it is feasible, and feasibility also depends on the hardware. For example, on hardware that can't sustain as much MLP, the sorting-unsorting approach becomes relatively more favorable.
The best approach also depends on the relative values of max_n and max_q. For example, if max_n >> max_q, then the reads will be "sparse" even with optimal sorting, so the naive approach would be better. On the other hand if max_n << max_q, then the same index will usually be read many times, so the sorting approach will have good read locality, the sorting steps will themselves have better locality, and further optimizations which handle duplicate reads explicitly may be possible.
Multiple Cores
It isn't clear from the question whether you are interested in parallelizing this. The naive solution for foo() already does admit a "straightforward" parallelization where you simply partition the a and b arrays into equal sized chunks, on for each thread, which would seem to provide a perfect speedup. Unfortunately, you'll probably find that you get much worse than linear scaling, because you'll be running into resource contention in the memory controller and associated uncore/offcore resources which are shared between all cores on a socket. So it isn't clear how much more throughput you'll get for a purely parallel random read load to memory as you add more cores6.
For the radix-sort version, most of the bottlenecks (store throughput, total instruction throughput) are in the core, so I expect it to scale reasonably with additional cores. As Peter mentioned in the comment, if you are using hyperthreading, the sort may have the additional benefit of good locality in the core local L1 and L2 caches, effectively letting each sibling thread use the entire cache, rather than cutting the effective capacity in half. Of course, that involves carefully managing your thread affinity so that sibling threads actually use nearby data, and not just letting the scheduler do whatever it does.
1 You might ask why the LLC-load-misses isn't say 32 or 48 million, given that we also have to read all 16 million elements of b1 and then the accumulate() call reads all of result. The answer is that LLC-load-misses only counts demand misses that actually miss in the L3. The other mentioned read patterns are totally linear, so the prefetchers will always be bringing the line into the L3 before it is needed. These don't count as "LLC misses" by the definition perf uses.
2 You might want to know how I know that the load misses all come from the reads of a1 in foo: I simply used perf record and perf mem to confirm that the misses were coming from the expected assembly instruction.
3 Almost linear because b1 is not a permutation of all indexes, so in principle there can be skipped and duplicate indexes. At the cache-line level, however, it is highly likely that every cache line will be read in-order since each element has a ~63% chance of being included, and a cache line has 16 4-byte elements, so there's only about a 1 in 10 million chance that any given cache has zero elements. So prefetching, which works at the cache line level, will work fine.
4 Here I mean that the calculation of the value comes for free or nearly so, but of course the write still costs. This is still much better than the "up-front materialization" approach, however, which first creates the i array [0, 1, 2, ...] needing max_q writes and then again needs another max_q writes to sort it in the first radix sort pass. The implicit materialization only incurs the second write.
5 In fact, the IPC of the actual timed section foo() is much lower: about 0.15 based on my calculations. The reported IPC of the entire process is an average of the IPC of the timed section and the initialization and accumulation code before and after which has a much higher IPC.
6 Notably, this is different from a how a dependent-load latency bound workflow scales: a load that is doing random read but can only have one load in progress because each load depends on the result of last scales very well to multiple cores because the serial nature of the loads doesn't use many downstream resources (but such loads can conceptually also be sped up even on a single core by changing the core loop to handle more than one dependent load stream in parallel).
You can partition indices into buckets where higher bits of indices are the same. Beware that if indices are not random the buckets will overflow.
#include <iostream>
#include <chrono>
#include <cassert>
#include <algorithm>
#include <numeric>
#include <vector>
namespace {
constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;
void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
while (b1 != b2) {
// assert(0 <= *b1 && *b1 < a2 - a1)
*o++ = a1[*b1++];
}
}
uint32_t* foo_fx(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, const uint32_t b_offset, uint32_t *o) {
while (b1 != b2) {
// assert(0 <= *b1 && *b1 < a2 - a1)
*o++ = a1[b_offset+(*b1++)];
}
return o;
}
uint32_t data[max_n], index[max_q], result[max_q];
std::pair<uint32_t, uint32_t[max_q / 8]>index_fx[16];
}
int main() {
uint32_t seed = 0;
auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
std::generate_n(data, max_n, rng);
//std::generate_n(index, max_q, [rng]() { return rng() % max_n; });
for (size_t i = 0; i < max_q;++i) {
const uint32_t idx = rng() % max_n;
const uint32_t bucket = idx >> 20;
assert(bucket < 16);
index_fx[bucket].second[index_fx[bucket].first] = idx % (1 << 20);
index_fx[bucket].first++;
assert((1 << 20)*bucket + index_fx[bucket].second[index_fx[bucket].first - 1] == idx);
}
auto t1 = std::chrono::high_resolution_clock::now();
//foo(data, data + max_n, index, index + max_q, result);
uint32_t* result_begin = result;
for (int i = 0; i < 16; ++i) {
result_begin = foo_fx(data, data + max_n, index_fx[i].second, index_fx[i].second + index_fx[i].first, (1<<20)*i, result_begin);
}
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
std::cout << std::accumulate(result, result + max_q, 0ull) << std::endl;
}

Strange under-performance of Titan X in memory-bound kernels (e.g. elementwise addition)

I've been noticing super-slow execution times for essentially all of my CUDA kernels on some machine (Fedora 24, GeForce Titan X maxwell), but not on others. Edit: I previously gave the CUDA vectorAdd sample as an MCVE, but due to doubts regarding whether that should really be memory-bottlenecked due to the low workload per thread; so, here's a hand-unrolling of that kernel:
enum { serialization_factor = 8 };
__global__ void vectorAdd(
const float* __restrict__ lhs,
const float* __restrict__ rhs,
float* __restrict__ result,
int length)
{
int pos = threadIdx.x + blockIdx.x * blockDim.x * serialization_factor;
if (length - pos >= blockDim.x * serialization_factor) {
#pragma unroll
for(int i = 0; i < serialization_factor; i++) {
result[pos] = lhs[pos] + rhs[pos];
pos += blockDim.x;
}
}
else {
for(; pos < length; pos += blockDim.x) {
result[pos] = lhs[pos] + rhs[pos];
}
}
}
... and suppose we run this for 5,000,000 elements; and launch the kernel twice, ignoring the first run.
Well, with my home GPU, a Geforce GTX 650 Ti Boost, I get 527 usec. This is a bit strange - I was expecting something like 555 usec, by bandwidth calculations: 3004 MHz clock * 192-bit bus = 72096 MB = 72 GB/sec , and 2 * 4 bytes per float * 5M of data. But it's pretty close so let's ignore the difference. The profiler tells me the "Global Load Throughput" is 72.355 GB/sec.
Now, on the Maxwell Titan X at work, I get 232 usec. That's about twice as fast - but the GPU's bandwidth is 5 times as high as my home GPU: ~336 GB/sec. I should be seeing something like 120 usec. And - the profiler tells me the "Global Load Throughput" is 343.271 GB/sec (!)
How could this be happening?
Notes:
If you think I've gotten something wrong with the kernel, please comment about that rather than writing an answer.
The Titan doesn't have ECC on.
Your bandwidth calculations are not fully accurate. The specified theoretical peak memory bandwidth of the GTX 650 Ti BOOST is twice as high (144.2 GB/s) as you calculated because of double data rate transfer (transfer of separate words on both the raising and the falling edge of the clock signal). The achieved bandwidth in the vector add example is 50% higher than you calculated, because writing back of results to memory also needs to be taken into account. This means your GTX 650 Ti BOOST measurements achieved ~79% of it's theoretical peak bandwidth.
The Titan X's specified peak memory bandwidth is 336.5 GB/s, so your test achieved ~77% of theoretical peak memory bandwidth.
This is about as good as it gets. The remaining discrepancy is due to overhead like memory refresh, the time needed to switch the transfer direction etc.
Adding some to tera's answer, your algorithm has a warm-up and cool-down phase: when having many requests in flight, latency gets hidden indeed, but at the cost of a warm-up, and a cool-down for the last iterations.
If your scheduling is good, you will have a work chunk of 2048 (max resident threads per sm) x 24 (number of sm on the GTX Titan X). Each of which will operate on 8 values. Hence, your work chunk is 393,216 entries.
For your 5,000,000 size sample, it results in 12.7 iterations (13 with the last being incomplete). The warm-up/cool-down cost is 1 iteration.
Depending on scheduling of threads (and this is not necessarily predictable), you may run 14 iterations total; for which you could have 5,111,808 entries at approximately same cost (still one warm-up/cool-down). That size would provide you the best performance I believe.
As a result, the incomplete iteration plus warm-up/cool-down could cost about 10% of performance, the achieved bandwidth being closer to 85% of peak if not more.
The minimal run time of a kernel should also be looked at as it might account for a few micros as well. Running on various data sizes should mitigate this point.
Last but not least, the memory frequency might be modifiable with nvidia-smi, as explained here.

GPU sorting vs CPU sorting

I made a very naive implementation of the mergesort algorithm, which i turned to work on CUDA with very minimal implementation changes, the algorith code follows:
//Merge for mergesort
__device__ void merge(int* aux,int* data,int l,int m,int r)
{
int i,j,k;
for(i=m+1;i>l;i--){
aux[i-1]=data[i-1];
}
//Copy in reverse order the second subarray
for(j=m;j<r;j++){
aux[r+m-j]=data[j+1];
}
//Merge
for(k=l;k<=r;k++){
if(aux[j]<aux[i] || i==(m+1))
data[k]=aux[j--];
else
data[k]=aux[i++];
}
}
//What this code do is performing a local merge
//of the array
__global__
void basic_merge(int* aux, int* data,int n)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
int tn = n / (blockDim.x*gridDim.x);
int l = i * tn;
int r = l + tn;
//printf("Thread %d: %d,%d: \n",i,l,r);
for(int i{1};i<=(tn/2)+1;i*=2)
for(int j{l+i};j<(r+1);j+=2*i)
{
merge(aux,data,j-i,j-1,j+i-1);
}
__syncthreads();
if(i==0){
//Complete the merge
do{
for(int i{tn};i<(n+1);i+=2*tn)
merge(aux,data,i-tn,i-1,i+tn-1);
tn*=2;
}while(tn<(n/2)+1);
}
}
The problem is that no matter how many threads i launch on my GTX 760, the sorting performance is always much much more worst than the same code on CPU running on 8 threads (My CPU have hardware support for up to 8 concurrent threads).
For example, sorting 150 million elements on CPU takes some hundred milliseconds, on GPU up to 10 minutes (even with 1024 threads per block)! Clearly i'm missing some important point here, can you please provide me with some comment? I strongly suspect the the problem is in the final merge operation performed by the first thread, at that point we have a certain amount of subarray (the exact amount depend on the number of threads) which are sorted and need to me merged, this is completed by just one thread (one tiny GPU thread).
I think i should use come kind of reduction here, so each thread perform in parallel further more merge, and the "Complete the merge" step just merge the last two sorted subarray..
I'm very new to CUDA.
EDIT (ADDENDUM):
Thanks for the link, I must admit I still need some time to learn better CUDA before taking full advantage of that material.. Anyway, I was able to rewrite the sorting function in order to take advantage as long as possible of multiple threads, my first implementation had a bottleneck in the last phase of the merge procedure, which was performed by only one multiprocessor.
Now after the first merge, I use each time up to (1/2)*(n/b) threads, where n is the amount of data to sort and b is the size of the chunk of data sorted by each threads.
The improvement in performance is surprising, using only 1024 threads it takes about ~10 seconds to sort 30 milion element.. Well, this is still a poor result unfortunately! The problem is in the threads syncronization, but first things first, let's see the code:
__global__
void basic_merge(int* aux, int* data,int n)
{
int k = blockIdx.x*blockDim.x + threadIdx.x;
int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1;
b = pow( (float)2, b);
int l=k*b;
int r=min(l+b-1,n-1);
__syncthreads();
for(int m{1};m<=(r-l);m=2*m)
{
for(int i{l};i<=r;i+=2*m)
{
merge(aux,data,i,min(r,i+m-1),min(r,i+2*m-1));
}
}
__syncthreads();
do{
if(k<=(n/b)*.5)
{
l=2*k*b;
r=min(l+2*b-1,n-1);
merge(aux,data,l,min(r,l+b-1),r);
}else break;
__syncthreads();
b*=2;
}while((r+1)<n);
}
The function 'merge' is the same as before. Now the problem is that I'm using only 1024 threads instead of the 65000 and more I can run on my CUDA device, the problem is that __syncthreads does not work as sync primitive at grid level, but only at block level!
So i can syncronize up to 1024 threads,that is the amount of threads supported per block. Without a proper syncronization each thread mess up the data of the other, and the merging procedure does not work.
In order to boost the performance I need some kind of syncronization between all the threads in the grid, seems that no API exist for this purpose, and i read about a solution which involve multiple kernel launch from the host code, using the host as barrier for all the threads.
I have a certain plan on how to implement this tehcnique in my mergesort function, I will provide you with the code in the near future. Did you have any suggestion on your own?
Thanks
It looks like all the work is being done in __global __ memory. Each write takes a long time and each read takes a long time making the function slow. I think it would help to maybe first copy your data to __shared __ memory first and then do the work in there and then when the sorting is completed(for that block) copy the results back to global memory.
Global memory takes about 400 clock cycles (or about 100 if the data happens to be in L2 cache). Shared memory on the other hand only takes 1-3 clock cycles to write and read.
The above would help with performance a lot. Some other super minor things you can try are..
(1) remove the first __syncthreads(); It is not really doing anything because no data is being past in between warps at that point.
(2) Move the "int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1; b = pow( (float)2, b);" outside the kernel and just pass in b instead. This is being calculated over and over when it really only needs to be calculated once.
I tried to follow along on your algorithm but was not able to. The variable names were hard to follow...or... your code is above my head and I cannot follow. =) Hope the above helps.

Generating threads based on a variable length array in cuda?

Fist, let me explain what I am implementing. The goal of my program is to generate all possible, non-distinct combinations of a given character set on a cuda enabled GPU. In order to parallelize the work, I am initializing each thread to a starting character.
For instance, consider the character set abcdefghijklmnopqrstuvwxyz. In this case, there will ideally be 26 threads: characterSet[threadIdx.x] = a for example (in practice, there would obviously be an offset to span the entire grid so that each thread has a unique identifier).
Here is my code thus far:
//Used to calculate grid dimensions
int* threads;
int* blocks;
int* tpb;
int charSetSize;
void calculate_grid_parameters(int length, int size, int* threads, int* blocks, int* tpb){
//Validate input
if(!threads || !blocks || ! tpb){
cout <<"An error has occured: Null pointer passed to function...\nPress enter to exit...";
getchar();
exit(1);
}
//Declarations
const int maxBlocks = 65535; //Does not change
int maxThreads = 512; //Limit in order to provide more portability
int dev = 0;
int maxCombinations;
cudaDeviceProp deviceProp;
//Query device
//cudaGetDeviceProperties(&deviceProp, dev);
//maxThreads = deviceProp.maxThreadsPerBlock;
//Determine total threads to spawn
//Length of password * size of character set
//Each thread will handle part of the total number of the combinations
if(length > 3) length = 3; //Max length is 3
maxCombinations = length * size;
assert(maxCombinations < (maxThreads * maxBlocks));
}
It is fairly basic.
I've limited length to 3 for a specific reason. The full character set, abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 !\"#$&'()*+-.:;<>=?#[]^_{}~| is, I believe, 92 characters. This means for a length of 3, there are 778,688 possible non-distinct combinations. If it were length 4, than it would be roughly 71 million, and the maximum number of threads for my GPU is about 69 million (in one dimension). Furthermore, these combinations have already been generated in a file that will be read into an array and then delegated a specific initializing thread.
This leads me to my problem.
The maximum number of blocks on a cuda GPU (for 1-d) is 65,535. Each of those blocks (on my gpu) can run 1024 threads in one dimension. I've limited it to 512 in my code for portability purposes (this may be unnecessary). Ideally, each block should run 32 threads or a multiple of 32 threads in order to be efficient. The issue I have is how many threads I need. Like I said above, if I am using a full character set of length 3 for the starting values, this necessitates 778,688 threads. This happens to be divisible by 32, yielding 24,334 blocks assuming each block runs 32 threads. However, if I run the same character set with length two, I am left with 264.5 blocks each running 32 threads.
Basically, my character set is variable and the length of the initializing combinations is variable from 1-3.
If I round up to the nearest whole number, my offset index, tid = threadIdx.x + .... will be accessing parts of the array that simply do not exist.
How can I handle this problem in such a way that is will still run efficiently and not spawn unnecessary threads that could potentially cause memory problems?
Any constructive input is appreciated.
The code you've posted doesn't seem to do anything significant and includes no cuda code.
Your question appears to be this:
How can I handle this problem in such a way that is will still run efficiently and not spawn unnecessary threads that could potentially cause memory problems?
It's common practice when launching a kernel to "round up" to the nearest increment of threads, perhaps 32, perhaps some multiple of 32, so that an integral number of blocks can be launched. In this case, it's common practice to include a thread check in the kernel code, such as:
__global__ void mykernel(.... int size){
int idx=threadIdx.x + blockDim.x*blockIdx.x;
if (idx < size){
//main body of kernel code here
}
}
In this case, size is your overall problem size (the number of threads that you actually want). The overhead of the additional threads that are doing nothing is normally not a significant performance issue.

Resources