How to leverage multichannel RAM?

My objective
I want to write a RAM bandwidth benchmark
My solution
U64 benchRead(U64 &io_minLoopTicks) // io_minLoopTicks initial value is U64_MAX
{
    const U32 elementsCount = 100'000'000;
    U64 accumulator = 0;
    std::vector<U64> tab;
    tab.resize(elementsCount); // 800 MB
    for (U8 i = 0; i < 10; ++i) // Loop several times to get a reliable value.
    {
        const U64 startTimestamp = profiler.start(); // profiler uses QueryPerformanceCounter under the hood
        for (const U64 j : tab)
            accumulator += j; // Do something with the RAM so the loop is not optimized away.
        const U64 loopTicks = profiler.end() - startTimestamp;
        if (loopTicks < io_minLoopTicks)
            io_minLoopTicks = loopTicks;
    }
    return accumulator;
}
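For completeness, this is roughly how the bandwidth figure can be derived from io_minLoopTicks (a sketch only; it assumes the profiler ticks really are QueryPerformanceCounter ticks, so QueryPerformanceFrequency gives the tick rate, and the helper below is not part of the benchmark itself):

// Requires <windows.h> for LARGE_INTEGER / QueryPerformanceFrequency.
double ticksToGBPerSecond(U64 loopTicks, U64 bytesRead)
{
    LARGE_INTEGER frequency; // ticks per second
    QueryPerformanceFrequency(&frequency);
    const double seconds = double(loopTicks) / double(frequency.QuadPart);
    return double(bytesRead) / seconds / 1e9; // decimal GB/s
}
// e.g. ticksToGBPerSecond(minLoopTicks, 100'000'000 * sizeof(U64)) for the loop above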
My problem
In theory I should get 57.6 GB/s (8 bytes (bus width) * 2 (channels) * 3.6 GT/s).
In practice I measure 27.7 GB/s (9700K @ 4.8 GHz, 4 single-rank 3600 MT/s DDR4 DIMMs).
So my number seems to reflect the performance of only one channel.
My guess is that, the way I coded it, the array sits in a region of RAM that can only be accessed by one channel. I am not at all familiar with multi-channel memory, so I don't really understand the limits of that technology.
My RAM is definitely dual channel (CPU-Z confirmed it).
Performance analysis
I checked the output of Clang 14 and it uses SSE (-O3 with no -march=native). I could be wrong (I'm not an assembly expert), but it seems it unrolled the loop and processes 128 bytes per iteration. One iteration takes 4.5 cycles on Coffee Lake according to uiCA, which works out to 139.38 GB/s (128 bytes every 4.5 cycles at the core clock), way above the theoretical RAM speed. So we should be RAM limited.
My questions
Is my array stored in a single bank, and is that why it can only be accessed by one channel?
How do we write code to leverage the benefits of multichannel RAM (here dual channel)?
References
Godbolt

Related

Why is `_mm_stream_si128` much slower than `_mm_storeu_si128` on Skylake-Xeon when writing parts of 2 cache lines? But less effect on Haswell

I have code that looks like this (simple load, modify, store) (I've simplified it to make it more readable):
__asm__ __volatile__ ( "vzeroupper" : : : );
while (...) {
    __m128i in = _mm_loadu_si128(inptr);
    __m128i out = in; // real code does more than this, but I've simplified it
    _mm_stream_si128(outptr, out);
    inptr += 12;
    outptr += 16;
}
This code runs about 5 times faster on our older Haswell hardware compared to our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.
We upgraded to the latest microcode on the Skylake, and also stuck in vzeroupper commands to avoid any AVX issues. Neither fix had any effect.
outptr is aligned to 16 bytes, so the stream command should be writing to aligned addresses. (I put in checks to verify this statement.) inptr is not aligned by design. Commenting out the loads has no effect; the limiting commands are the stores. outptr and inptr point to different memory regions; there is no overlap.
If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.
So the two questions are
1) Why is there such a big difference between Haswell and Skylake when writing using the _mm_stream_si128 intrinsic?
2) why does the _mm_storeu_si128 run 5x faster than the streaming equivalent?
I'm a newbie when it comes to intrinsics.
Addendum - test case
Here is the entire test case: https://godbolt.org/z/toM2lB
Here is a summary of the benchmarks I took on two different processors, E5-2680 v3 (Haswell) and 8180 (Skylake).
// icpc -std=c++14 -msse4.2 -O3 -DNDEBUG ../mre.cpp -o mre
// The following benchmark times were observed on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
// perf stat ./mre 100000
//
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 1.65 7.29
// _mm_storeu_si128 0.41 0.40
The ratio of stream to store is 4x or 18x, respectively.
I'm relying on the default new allocator to align my data to 16 bytes. I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure it is, as well as checks on the address, but I left that out of the example because I don't think it matters.
Second edit - 64B aligned output
The comment from @Mystical made me check that the outputs were all cache aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves were not 64-B aligned (only 16-B aligned).
So I changed my test code like this:
#if 0
std::vector<Tile> tiles(outputPixels/32);
#else
std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif
and now the numbers are quite different:
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 0.19 0.48
// _mm_storeu_si128 0.25 0.52
So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.
Third edit: purposeful misalignment
I tried the test suggested by @HaidBrais. I purposely allocated my vector class aligned to 64 bytes, then added 16 or 32 bytes inside the allocator such that the allocation was either 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, ran the test 3 times, and picked the smallest time.
perf stat ./mre1 1000000
To reiterate, an alignment of 2^N means it is NOT aligned to 2^(N+1) or 2^(N+2).
// STORER alignment time (seconds)
// byte E5-2680 8180
// ---------------------------------------------------
// _mm_storeu_si128 16 3.15 2.69
// _mm_storeu_si128 32 3.16 2.60
// _mm_storeu_si128 64 1.72 1.71
// _mm_stream_si128 16 14.31 72.14
// _mm_stream_si128 32 14.44 72.09
// _mm_stream_si128 64 1.43 3.38
So it is clear that cache alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.
For future use, here is the misaligned allocator I used (I did not templatize the misalignment, you'll have to edit the 32 and change it to 0 or 16 as needed):
template <class T>
struct Mallocator {
    typedef T value_type;
    Mallocator() = default;
    template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
        uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n+1)*sizeof(T)));
        if (!p1) throw std::bad_alloc();
        p1 += 32; // misalign on purpose
        return reinterpret_cast<T*>(p1);
    }
    void deallocate(T* p, std::size_t) noexcept {
        uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
        p1 -= 32;
        std::free(p1);
    }
};
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }
...
std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);
The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.
The actual loop from your godbolt code is:
while (count > 0)
{
    // std::cout << std::hex << (void*) ptr << " " << (void*) tile << std::endl;
    __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
    __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
    __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
    __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));
    __m128i tileVal0 = value0;
    __m128i tileVal1 = value1;
    __m128i tileVal2 = value2;
    __m128i tileVal3 = value3;
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

    ptr += diffBytes * 4;
    count -= diffBytes * 4;
    tile += diffPixels * 4;
    ipixel += diffPixels * 4;
    if (ipixel == 32)
    {
        // go to next tile
        ipixel = 0;
        tileIter++;
        tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
    }
}
Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that every write writes to only part of two different cache lines. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out the full line.
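For illustration, a minimal sketch of the full-line pattern (not taken from the benchmark): with a 64-byte-aligned destination, four consecutive 16-byte streaming stores complete one cache line, and a single fence at the end orders the non-temporal stores before anything that publishes the data:

#include <emmintrin.h> // _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#include <cstddef>

// dst must be 64-byte aligned and n128 a multiple of 4 (4 x 16 bytes = one cache line).
void copyNT(__m128i* dst, const __m128i* src, std::size_t n128)
{
    for (std::size_t i = 0; i < n128; i += 4)
    {
        // Write one full 64-byte line with four back-to-back streaming stores.
        _mm_stream_si128(dst + i + 0, _mm_loadu_si128(src + i + 0));
        _mm_stream_si128(dst + i + 1, _mm_loadu_si128(src + i + 1));
        _mm_stream_si128(dst + i + 2, _mm_loadu_si128(src + i + 2));
        _mm_stream_si128(dst + i + 3, _mm_loadu_si128(src + i + 3));
    }
    _mm_sfence(); // make the NT stores globally visible before any later flag store
}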
On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer for some time, but how long varies: on lots of client chips it seems to only occupy a buffer for about the L3 latency. I.e., once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latency. Especially multi-socket hosts.
Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.

On Skylake (SKL) why are there L2 writebacks in a read-only workload that exceeds the L3 size?

Consider the following simple code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <err.h>

int cpu_ms() {
    return (int)(clock() * 1000 / CLOCKS_PER_SEC);
}

int main(int argc, char** argv) {
    if (argc < 2) errx(EXIT_FAILURE, "provide the array size in KB on the command line");
    size_t size = atol(argv[1]) * 1024;
    unsigned char *p = malloc(size);
    if (!p) errx(EXIT_FAILURE, "malloc of %zu bytes failed", size);
    int fill = argv[2] ? argv[2][0] : 'x';
    memset(p, fill, size);

    int startms = cpu_ms();
    printf("allocated %zu bytes at %p and set it to %d in %d ms\n", size, p, fill, startms);

    // wait until 500ms has elapsed from start, so that perf gets the read phase
    while (cpu_ms() - startms < 500) {}
    startms = cpu_ms();

    // we start measuring with perf here
    unsigned char sum = 0;
    for (size_t off = 0; off < 64; off++) {
        for (size_t i = 0; i < size; i += 64) {
            sum += p[i + off];
        }
    }

    int delta = cpu_ms() - startms;
    printf("sum was %u in %d ms \n", sum, delta);

    return EXIT_SUCCESS;
}
This allocates an array of size bytes (which is passed in on the command line, in KiB), sets all bytes to the same value (the memset call), and finally loops over the array in a read-only manner, striding by one cache line (64 bytes), and repeats this 64 times so that each byte is accessed once.
If we turn prefetching off1, we expect this to hit 100% in a given level of cache if size fits in the cache, and to mostly miss at that level otherwise.
I'm interested in two events, l2_lines_out.silent and l2_lines_out.non_silent (and also l2_trans.l2_wb - but the values end up identical to non_silent), which count lines that are silently dropped from the L2 and those that are not.
If we run this from 16 KiB up through 1 GiB, and measure these two events (plus l2_lines_in.all) for the final loop only, we get:
The y-axis here is the number of events, normalized to the number of accesses in the loop. For example, the 16 KiB test allocates a 16 KiB region, and makes 16,384 accesses to that region, and so a value of 0.5 means that on average 0.5 counts of the given event occurred per access.
The l2_lines_in.all behaves almost as we'd expect. It starts off around zero and when the size exceeds the L2 size it goes up to 1.0 and stays there: every access brings in a line.
The other two lines behave weirdly. In the region where the test fits in the L3 (but not in the L2), the evictions are nearly all silent. However, as soon as the region spills out of the L3 into main memory, the evictions are all non-silent.
What explains this behavior? It's hard to understand why the evictions from L2 would depend on whether the underlying region fits in the L3 or not.
If you do stores instead of loads, almost everything is a non-silent writeback, as expected, since the updated values have to be propagated to the outer caches:
We can also take a look at what level of the cache the accesses are hitting in, using the mem_inst_retired.l1_hit and related events:
If you ignore the L1 hit counters, which seem impossibly high at a couple of points (more than 1 L1 hit per access?), the results look more or less as expected: mostly L2 hits when the region fits cleanly in L2, mostly L3 hits for the L3 region (up to 6 MiB on my CPU), and then misses to DRAM thereafter.
You can find the code on GitHub. The details on building and running can be found in the README file.
I observed this behavior on my Skylake client i7-6700HQ CPU. The same effect seems not to exist on Haswell2. On Skylake-X, the behavior is totally different, as expected, as the L3 cache design has changed to be something like a victim cache for the L2.
1 You can do it on recent Intel with wrmsr -a 0x1a4 "$((2#1111))". In fact, the graph is almost exactly the same with prefetch on, so turning it off is mostly just to eliminate a confounding factor.
2 See the comments for more details, but briefly l2_lines_out.(non_)silent doesn't exist there, but l2_lines_out.demand_(clean|dirty) does which seem to have a similar definition. More importantly, the l2_trans.l2_wb which mostly mirrors non_silent on Skylake exists also on Haswell and appears to mirror demand_dirty and it also does not exhibit the effect on Haswell.

Optimize CUDA kernel execution time

I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program that computes the difference between two images. I then compared the execution time of a classic CPU version in C against a GPU version in CUDA C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;

switch (computing_type)
{
case GPU:
    HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));
    HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));

    float time;
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );

    for (int m = 0; m < nb_loops; m++)
    {
        diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
    }

    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
    HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));

    printf("Time to generate: %4.4f ms \n", time/nb_loops);
    break;

case CPU:
    clock_t begin = clock(), diff;
    for (int z = 0; z < nb_loops; z++)
    {
        // Apply the difference between 2 images
        for (int i = 0; i < height; i++)
        {
            tmp = i*imgresult_pitch;
            for (int j = 0; j < width; j++)
            {
                imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
            }
        }
    }
    diff = clock() - begin;
    float msec = diff*1000/CLOCKS_PER_SEC;
    msec = msec/nb_loops;
    printf("Time taken %4.4f milliseconds", msec);
    break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1, unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;
    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}
I obtained these execution times for each one:
CPU: 1.3210 ms
GPU: 0.3229 ms
I wonder why the GPU result is not as low as it should be. I am a beginner in CUDA, so please be understanding if I have made some classic errors.
EDIT1:
Thank you for your feedback. I tried removing the 'if' condition from the kernel, but it didn't noticeably change my program's execution time.
However, after installing the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this kind of message, but it seems plausible, because my application is only 5 or 6 times faster on the GPU than on the CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the others. If you have an idea of what I am doing wrong, it would be helpful...
Flow.
Here are a few things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything at all already carries a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible" - but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
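As an illustration only (this is not code from the question), a sketch of a diff kernel where each thread handles 4 consecutive pixels, loading them as a single uchar4; it assumes the image buffers are 4-byte aligned and adds an nPixels parameter:

// Launch with roughly (width*height)/4 threads in total.
__global__ void diff4(const unsigned char *data1, const unsigned char *data2,
                      int *data_res, int nPixels)
{
    int base = 4 * (threadIdx.x + blockIdx.x * blockDim.x);
    if (base + 3 < nPixels)
    {
        // One 4-byte load per input image instead of four 1-byte loads.
        uchar4 a = *reinterpret_cast<const uchar4*>(data1 + base);
        uchar4 b = *reinterpret_cast<const uchar4*>(data2 + base);
        data_res[base + 0] = (int)b.x - (int)a.x;
        data_res[base + 1] = (int)b.y - (int)a.y;
        data_res[base + 2] = (int)b.z - (int)a.z;
        data_res[base + 3] = (int)b.w - (int)a.w;
    }
}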
2. "Restrict" your pointers
__restrict is not part of C++, but it is supported in CUDA. It tells the compiler that accesses through different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via non-coherent cache. Indeed, this happens with your kernel although I haven't measured the effects.
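Concretely, that just means declaring the kernel parameters along these lines (a sketch):

__global__ void diff(const unsigned char* __restrict__ data1,
                     const unsigned char* __restrict__ data2,
                     int*                 __restrict__ data_res);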
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ unsigned int __vsubss4(unsigned int a, unsigned int b)
which subtracts each signed byte value in b from its corresponding one in a, with signed saturation. If you can "live" with the result, rather than expecting a larger int variable, that could save you some work - and it goes very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.
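A sketch of what that might look like (not from the original kernel; the result is four packed signed bytes per 32-bit word rather than four separate ints, and the buffers are assumed to be 4-byte aligned):

// Each thread handles 4 pixels packed into one 32-bit word.
__global__ void diffPacked(const unsigned int *data1, const unsigned int *data2,
                           unsigned int *data_res, int nWords)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < nWords)
        data_res[i] = __vsubss4(data2[i], data1[i]); // per-byte data2 - data1, signed saturation
}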
I don't think you are measuring times correctly; memory copy is a time-consuming step on the GPU that you should take into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_W as constants; you may consider doing so using cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps, so block and number of threads per block work better as multiples of 8, but not larger than 512 threads per block unless your hardware supports it. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free the memory you allocated on the GPU and to destroy the timing events you created.
In your kernel function you can have a single variable int v = threadIdx.x + blockIdx.x * blockDim.x .
Have you tested, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() when working with arrays, due to padding.
Probably there are other issues with the code, but here's what I see: the following lines in __global__ void diff are not optimal:
if (row < MAX_H && col < MAX_W)
{
    data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel result in warp divergence. It means that if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, if evaluates to false only at borders. To avoid the divergence and needless computation, split your image in two parts:
Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area (sketched after this list); the if is unnecessary here.
Border areas that will use your diff kernel.
Obviously you'll have to modify the code that calls the kernels.
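For example, the central-area kernel could be as simple as this (a sketch; the launch configuration must guarantee that every index it computes lies inside the image):

// Covers only the interior, where the bounds check is known to hold.
__global__ void diffCentral(const unsigned char *data1, const unsigned char *data2, int *data_res)
{
    int v = threadIdx.x + blockIdx.x * blockDim.x; // always a valid pixel by construction
    data_res[v] = (int) data2[v] - (int) data1[v];
}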
And on a separate note:
A GPU has a throughput-oriented architecture, not a latency-oriented one like a CPU. That means the CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried using larger data sets?
The CUDA profiler is a very handy tool that will tell you where your code is not optimal.

Need help explaining some CUDA performance results

We have been experimenting with different histogramming algorithms on a CUDA GPU. I can explain most of the results, but we noticed some really weird features, and I have no clue what is causing them.
Kernels
The weird stuff happens in a data-parallel implementation. This means that the data is distributed over the threads. Each thread looks at a subset (ideally just 1) of the data, and adds its contribution to a histogram in global memory, which requires atomic operations.
__global__ void histogram1(float *data, uint *hist, uint n, float xMin, float binWidth, uint nBins)
{
    uint const nThreads = blockDim.x * gridDim.x;
    uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
    uint idx = tid;
    while (idx < n)
    {
        float x = data[idx];
        uint bin = (x - xMin) / binWidth;
        atomicAdd(hist + bin, 1);
        idx += nThreads;
    }
}
As a first optimization, each block first constructs a partial histogram in shared memory before doing a reduction of partial histograms to obtain the final result in global memory. The code is pretty straightforward, and I believe that it's very similar to that used in Cuda By Example.
__global__ void histogram2(float *data, uint *hist, uint n,
                           float xMin, float binWidth, uint nBins)
{
    extern __shared__ uint partialHist[]; // size = nBins * sizeof(uint)
    uint const nThreads = blockDim.x * gridDim.x;
    uint const tid = threadIdx.x + blockIdx.x * blockDim.x;

    // initialize shared memory to 0
    uint idx = threadIdx.x;
    while (idx < nBins)
    {
        partialHist[idx] = 0;
        idx += blockDim.x;
    }
    __syncthreads();

    // Calculate partial histogram (in shared mem)
    idx = tid;
    while (idx < n)
    {
        float x = data[idx];
        uint bin = (x - xMin) / binWidth;
        atomicAdd(partialHist + bin, 1);
        idx += nThreads;
    }
    __syncthreads();

    // Compute resulting total (global) histogram
    idx = threadIdx.x;
    while (idx < nBins)
    {
        atomicAdd(hist + idx, partialHist[idx]);
        idx += blockDim.x;
    }
}
Results
Speedup vs n
We benchmarked these two kernels to see how they behave as a function of n, which is the number of datapoints. The data was uniform randomly distributed. In the figure below, HIST_DP_1 is the unoptimized trivial version, whereas HIST_DP_2 is the one using shared memory to speed things up:
The timings have been taken relative to the CPU performance, and the weird stuff happens for very large datasets. The optimized version, instead of flattening out like the unoptimized version, starts to improve again (relatively). We'd expect that for large datasets the occupancy of our card will be near 100%, which would mean that from that point on the performance would scale linearly, like the CPU (and indeed the unoptimized blue curve).
The behavior could be due to the chance of two threads performing an atomic operation on the same bin in shared/global memory going to zero for large datasets, but in that case we would expect the drop to be in different places for different nBins. This is not what we observe, the drop is in all three panels at around 10^7 bins. What is happening here? Some complicated caching effect? Or is it something obvious that we missed?
Speedup vs nBins
To have a closer look at the behavior as a function of the number of bins, we fixed our dataset at 10^4 (10^5 in one case), and ran the algorithms for many different bin-numbers.
As a reference we also generated some non-random data. The red graph shows the results for perfectly sorted data, whereas the light-blue line corresponds to a dataset in which every value was identical (maximal congestion in the atomic operations). The question is obvious: what is the discontinuity doing there?
System Setup
NVidia Tesla M2075, driver 319.37
Cuda 5.5
Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz
Thanks for your help!
EDIT: Reproduction Case
As requested: a compiling, runnable reproduction case. The code is quite long, which is why I didn't include it in the first place. The snippet is available on snipplr. To make your life even easier, I'll include a little shell script to run it for the same settings I used, and an Octave script to produce the plots.
Shell script
#!/bin/bash
runs=100
# format: [n] [nBins] [t_cpu] [t_gpu1] [t_gpu2]
for nBins in 100 1000 10000
do
    for n in 10 50 100 200 500 1000 2000 5000 10000 50000 100000 500000 1000000 10000000 100000000
    do
        echo -n "$n $nBins "
        ./repro $n $nBins $runs
    done
done
Octave script
T = load('repro.txt');
bins = unique(T(:,2));
t = cell(1, numel(bins));
for i = 1:numel(bins)
    t{i} = T(T(:,2) == bins(i), :);
    subplot(2, numel(bins), i);
    loglog(t{i}(:,1), t{i}(:,3:5))
    title(sprintf("nBins = %d", bins(i)));
    legend("cpu", "gpu1", "gpu2");
    subplot(2, numel(bins), i + numel(bins));
    loglog(t{i}(:,1), t{i}(:,4)./t{i}(:,3), ...
           t{i}(:,1), t{i}(:,5)./t{i}(:,3));
    title("relative");
    legend("gpu1/cpu", "gpu2/cpu");
end
Absolute Timings
Absolute timings show that it's not the CPU slowing down. Instead, the GPU is speeding up relatively:
Regarding question 1:
This is not what we observe, the drop is in all three panels at around 10^7 bins. What is happening here? Some complicated caching effect? Or is it something obvious that we missed?
This drop is due to the limit you've set on the maximum number of blocks (1<<14 == 16384). At n=10^7, in gpuBench2 the limit has kicked in and each thread starts processing multiple elements. At n=10^8 each thread works on 12 (sometimes 11) elements. If you remove this cap you can see that your performance continues to flatline.
Why is this faster? Multiple elements per thread allows the latency of the load from data to be hidden much better, especially in the case with 10000 bins where you are only able to fit one block onto each SM due to the high shared memory usage. In this case, every thread in the block will reach the global load at around the same time, and none will be able to continue until it has completed its load. By having multiple elements we can pipeline these loads, getting many elements per thread for the latency of one.
(You don't see this in gpuBench1 as it is not latency bound, but bandwidth bound to L2. You can see this very quickly if you import the output of nvprof into the visual profiler.)
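To make the "multiple elements per thread" point concrete, here is a sketch (not the benchmark code) of histogram2's middle loop restructured so that each iteration issues four independent loads before any atomics, letting their latencies overlap; a one-element tail loop is still needed when n is not a multiple of 4*nThreads:

// Inside histogram2, with nThreads, tid, partialHist, xMin and binWidth as in the original kernel:
uint idx = tid;
while (idx + 3 * nThreads < n)
{
    // Four independent loads are issued before the atomicAdds, so the loads pipeline.
    float x0 = data[idx + 0 * nThreads];
    float x1 = data[idx + 1 * nThreads];
    float x2 = data[idx + 2 * nThreads];
    float x3 = data[idx + 3 * nThreads];
    atomicAdd(partialHist + (uint)((x0 - xMin) / binWidth), 1);
    atomicAdd(partialHist + (uint)((x1 - xMin) / binWidth), 1);
    atomicAdd(partialHist + (uint)((x2 - xMin) / binWidth), 1);
    atomicAdd(partialHist + (uint)((x3 - xMin) / binWidth), 1);
    idx += 4 * nThreads;
}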
Regarding question 2:
The question is obvious: what is the discontinuity doing there?
I don't have a Fermi to hand, and I can't reproduce this on my Kepler, so I'd assume it's something that is Fermi specific. That's the danger of answering questions with two parts, I suppose!

OpenCL very low GFLOPS, no data transfer bottleneck

I am trying to optimize an algorithm I am running on my GPU (AMD HD6850). I counted the number of floating point operations inside my kernel and measured its execution time. I found it achieves ~20 SP GFLOPS; however, according to the GPU's specs I should achieve ~1500 GFLOPS.
To find the bottleneck I created a very simple kernel:
kernel void test_gflops(const float d, global float* result)
{
    int gid = get_global_id(0);
    float cd;
    for (int i = 0; i < 100000; i++)
    {
        cd = d*i;
    }
    if (cd == -1.0f)
    {
        result[gid] = cd;
    }
}
Running this kernel I get ~5*10^5 work_items/sec. I count one floating point operation (not sure if that's right, what about incrementing i and comparing it to 100000?) per iteration of the loop.
==> 5*10^5 work_items/sec * 10^5 FLOPS = 50 GFLOPS.
Even if there are 3 or 4 operations going on in the loop, it's much slower than what the card should be able to do. What am I doing wrong?
The global work size is big enough (no speed change for 10k vs 100k work items).
Here are a couple of tricks:
GPUs don't like loops at all. Use #pragma unroll to unroll them.
Your GPU is good at vector operations. Stick to them; that will allow you to process multiple operands at once.
Use vector loads/stores whenever possible.
Measure the memory bandwidth - I'm almost sure that you are bandwidth-limited because of a poor access pattern.
In my opinion, kernel should look like this:
typedef union floats{
float16 vector;
float array[16];
} floats;
kernel void test_gflops(const float d, global float* result)
{
int gid = get_global_id(0);
floats cd;
cd.vector = vload16(gid, result);
cd.vector *= d;
#pragma unroll
for (int i=0; i<16; i++)
{
if(cd.array[i] == -1.0f){
result[gid] = cd;
}
}
Make your NDRange bigger to compensate for the difference between 16 and 100000 in the loop condition.

Resources