Analysing performance of transpose function - openmp

I've written a naive and an "optimized" transpose functions for order-3 tensors containing double-precision complex numbers and I would like to analyze their performance.
Approximate code for naive transpose function:
#pragma omp for schedule(static)
for (auto i2 = std::size_t(0); i2 < n2; ++i2)
{
for (auto i1 = std::size_t{}; i1 < n1; ++i1)
{
for (auto i3 = std::size_t{}; i3 < n3; ++i3)
{
tens_tr(i3, i2, i1) = tens(i1, i2, i3);
}
}
}
Approximate code for optimized transpose function (remainder loop not shown, assume divisibility):
#pragma omp for schedule(static)
for (auto i2 = std::size_t(0); i2 < n2; ++i2)
{
// blocked loop
for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
{
for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
{
for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
{
for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
{
cache_buffer[i3 * block_size + i1] = tens(bi1 + i1, i2, bi3 + i3);
}
}
for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
{
for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
{
tens_tr(bi3 + i1, i2, bi1 + i3) = cache_buffer[i1 * block_size + i3];
}
}
}
}
}
Assumption: I decided to use a streaming function as reference because I reasoned that the transpose function, in its perfect implementation, would closely resemble any bandwidth-saturating streaming function.
For this purpose, I chose the DAXPY loop as reference.
#pragma omp parallel for schedule(static)
for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
{
auto* slice_a = reinterpret_cast<double*>(tens_a_->get_slice_data(i1));
auto* slice_b = reinterpret_cast<double*>(tens_b_->get_slice_data(i1));
const auto slice_size = 2 * tens_a_->get_slice_size(); // 2 doubles for a complex
#pragma omp simd safelen(8)
for (auto index = std::size_t{}; index < slice_size; ++index)
{
slice_b[index] += lambda_ * slice_a[index]; // fp_count: 2, traffic: 2+1
}
}
Also, I used a simple copy kernel as a second reference.
#pragma omp parallel for schedule(static)
for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
{
const auto* op1_begin = reinterpret_cast<double*>(tens_a_->get_slice_data(index));
const auto* op1_end = op1_begin + 2 * tens_a_->get_slice_size(); // 2 doubles in a complex
auto* op2_iter = reinterpret_cast<double*>(tens_b_->get_slice_data(index));
#pragma omp simd safelen(8)
for (auto* iter = op1_begin; iter != op1_end; ++iter, ++op2_iter)
{
*op2_iter = *iter;
}
}
Hardware:
Intel(R) Xeon(X) Platinum 8168 (Skylake) with 24 cores # 2.70 GHz and L1, L2 and L3 caches sized 32 kB, 1 MB and 33 MB respectively.
Memory of 48 GiB # 2666 MHz. Intel Advisor's roof-line view says memory BW is 115 GB/s.
Benchmarking: 20 warm-up runs, 100 timed experiments, each with newly allocated data "touched" such that page-faults will not be measured.
Compiler and flags:
Intel compiler from OneAPI 2022.1.0, optimization flags -O3;-ffast-math;-march=native;-qopt-zmm-usage=high.
Results (sizes assumed to be adequately large):
Using 24 threads pinned on 24 cores (total size of both tensors ~10 GiB):
DAXPY 102 GB/s
Copy 101 GB/s
naive transpose 91 GB/s
optimized transpose 93 GB/s
Using 1 thread pinned on a single core (total size of both tensors ~10 GiB):
DAXPY 20 GB/s
Copy 20 GB/s
naive transpose 9.3 GB/s
optimized transpose 9.3 GB/s
Questions:
Why is my naive transpose function performing so well?
Why is the difference in performance between reference and transpose functions so high when using only 1 thread?
I'm glad to receive any kind of input for any of the above questions. Also, I will gladly provide additional information when required. Unfortunately, I cannot provide a minimum reproducer because of the size and complexity of each benchmark program. Thank you very much for you time and help in advance!
Updates:
Could it be that the Intel compiler performed loop-blocking for the naive transpose function as optimization?

Is the above-mentioned assumption valid? [asked before the edit]
Not really.
Transpositions of large arrays tends not to saturate the bandwidth of the RAM on some platforms. This can be due to cache effects like cache trashing. For more information about this, you can read this post for example. In your specific case, things works quite well though (see below).
On NUMA platforms, the data page distribution on NUMA nodes has can have a strong impact on the performance. This can be due to the (temporary) unbalanced page distribution, a non-uniform latency, a non-uniform throughput or even the (temporary) saturation of the RAM of some NUMA node. NUMA can be seen on recent AMD processors but also on some Intel ones (eg. since Skylake, see this post) regarding the system configuration.
Even assuming the above points do not apply in your case, considering the perfect case while the naive code may not behave like a perfect transposition can result in wrong interpretations. If this assumption is broken, results could overestimate the performance of the naive implementation for example.
Why is my naive transpose function performing so well?
A good throughput does not means the computation is fast. The computation can be slower with a higher throughput if more data needs to be transferred from the RAM. This is possible due to cache misses. More specifically, with a naive access pattern, cache lines can be replaced more frequently with a lower reuse (due to cache trashing) and thus the wall clock time should be higher. You need to measure the wall clock time. Metrics are good to understand what is going on but not to measure the performance of a kernel.
In this specific case, the chosen size (ie. 1050) should not cause too many conflict misses because it is not divisible by a large power of two. In the naive version, the tens_tr writes will fill many cache lines partially (1050) before they can be reused when i1 is increased (up to 8 subsequent increases are needed so to fill the cache lines). This means, 1050 * 64 ~= 66 KiB of cache is needed for the i1-i3-based transposition of one given i2 to complete. The cache lines cannot be reused with other i2 values so the cache do not need to be so huge for the transposition to be relatively efficient. That being said, one should also consider the tens reads (though it can be quite quickly evicted from the cache). In the end, the 16-way associative L2 cache of 1 MiB should be enough for that. Note that the naive implementation should perform poorly with significantly bigger arrays since the L2 cache should not be large enough so for cache lines to be fully reused (causing data to be reloaded many times from the memory hierarchy, typically from the L3 in sequential and the RAM in parallel). Also note that the naive transposition can also perform very poorly on processor with smaller caches (eg. x86-64 desktop processors except recent ones that often have bigger caches) or if you plan to change the size of the input array to something divisible by a large power of two.
While blocking enable a better use of the L1 cache, it is not so important in your specific case. Indeed, the naive computation does not benefit from the L1 cache but the effect is small since the transposition should be bounded by the L3 cache and the RAM anyway. That being said, a better L1 cache usage could help to reduce a bit the latency regarding the target processor architecture. You should see the effect mainly on significantly smaller arrays.
In parallel, the L3 cache is large enough so that the 24 cores can run in parallel without too many conflict misses. Even if the L3 performed poorly, the kernel would be mainly memory bound so the impact of the cache misses would be not much visible.
Why is the difference in performance between reference and transpose functions so high when using only 1 thread?
This is likely due to the latency of memory operations. Transpositions perform memory read/writes with huge strides and the hardware prefetchers may not be able to fully mitigate the huge latency of the L3 cache or the one of the main RAM. Indeed, the number of pending cache-line request per core is limited (to a dozen of them on Skylake), so the kernel is bound by the latency of the requests since there is not enough concurrency to fully overlap their latency.
For the DAXPY/copy, hardware prefetchers can better reduce the latency but the amount of concurrency is still too small compared to the latency on Xeon processor so to fully saturate the RAM with 1 thread. This is a quite reasonable architectural limitation since such processors are designed to execute applications scaling well on many cores.
With many threads, the per-core limitation vanishes and it is replaced by a stronger one: the practical RAM bandwidth.
Could it be that the Intel compiler performed loop-blocking for the naive transpose function as optimization?
This is theoretically possible since the Intel compiler (ICC) has such optimizer, but it is very unlikely for ICC to do that on a 3D transposition code (since it is a pretty complex relatively specific use-case). The best is to analyse the assembly code so to be sure.
Note on the efficiency of the optimized transposition
Due to the cache-line write allocation on x86-64 processors (like your Xeon processor), I expect the transposition to have a lower throughput assuming it do not take into account such effect. Indeed, the processor needs to read tens_tr cache lines so to fill them since it do not know if they will be completely filled ahead of time (it would be crazy for the naive transposition) and they may be evicted before (eg. during a context switch, by another running program).
There is several possible reasons to explain that:
The assumption is wrong and it means 1/3 of the bandwidth is wasted by reading cache lines meant to be actually written;
the DAXPY code also have the same issue and the reported maximum bandwidth is not really correct either (unlikely);
ICC succeed to rewrite the transposition so to use efficiently the caches and also generate non-temporal store instructions so to avoid this effect (unlikely).
Based on the possible reasons, I think the measured throughput already take into account write allocation and that the transposition implementation can be optimized further. Indeed, the optimized version doing the copy can use non-temporal store so to write the array back in memory without reading it. This is not possible with the naive implementation. With such optimization, the throughput may be the same, but the execution time can be about 33% lower (due to a better use of the memory bandwidth). This is a good example showing that the initial assumption is just wrong.

Related

Desired Compute-To-Memory-Ratio (OP/B) on GPU

I am trying to undertand the architecture of the GPUs and how we assess the performance of our programs on the GPU. I know that the application can be:
Compute-bound: performance limited by the FLOPS rate. The processor’s cores are fully utilized (always have work to do)
Memory-bound: performance limited by the memory
bandwidth. The processor’s cores are frequently idle because memory cannot supply data fast enough
The image below shows the FLOPS rate, peak memory bandwidth, and the Desired Compute to memory ratio, labeled by (OP/B), for each microarchitecture.
I also have an example of how to compute this OP/B metric. Example: Below is part of a CUDA code for applying matrix-matrix multiplication
for(unsigned int i = 0; i < N; ++i) {
sum += A[row*N + i]*B[i*N + col];
}
and the way to calculate OP/B for this matrix-matrix multiplication is as follows:
Matrix multiplication performs 0.25 OP/B
1 FP add and 1 FP mul for every 2 FP values (8B) loaded
Ignoring stores
and if we want to utilize this:
But matrix multiplication has high potential for reuse. For NxN matrices:
Data loaded: (2 input matrices)×(N^2 values)×(4 B) = 8N^2 B
Operations: (N^2 dot products)(N adds + N muls each) = 2N^3 OP
Potential compute-to-memory ratio: 0.25N OP/B
So if I understand this clearly well, I have the following questions:
It is always the case that the greater OP/B, the better ?
how do we know how much FP operations we have ? Is it the adds and the multiplications
how do we know how many bytes are loaded per FP operation ?
It is always the case that the greater OP/B, the better ?
Not always. The target value balances the load on compute pipe throughput and memory pipe throughput (i.e. that level of op/byte means that both pipes will be fully loaded). As you increase op/byte beyond that or some level, your code will switch from balanced to compute-bound. Once your code is compute bound, the performance will be dictated by the compute pipe that is the limiting factor. Additional op/byte increase beyond this point may have no effect on code performance.
how do we know how much FP operations we have ? Is it the adds and the multiplications
Yes, for the simple code you have shown, it is the adds and multiplies. Other more complicated codes may have other factors (e.g. sin, cos, etc.) which may also contribute.
As an alternative to "manually counting" the FP operations, the GPU profilers can indicate the number of FP ops that a code has executed.
how do we know how many bytes are loaded per FP operation ?
Similar to the previous question, for simple codes you can "manually count". For complex codes you may wish to try to use profiler capabilities to estimate. For the code you have shown:
sum += A[row*N + i]*B[i*N + col];
The values from A and B have to be loaded. If they are float quantities then they are 4 bytes each. That is a total of 8 bytes. That line of code will require 1 floating point multiplication (A * B) and one floating point add operation (sum +=). The compiler will fuse these into a single instruction (fused multiply-add) but the net effect is you are performing two floating point operations per 8 bytes. op/byte is 2/8 = 1/4. The loop does not change the ratio in this case. To increase this number, you would want to explore various optimization methods, such as a tiled shared-memory matrix multiply, or just use CUBLAS.
(Operations like row*N + i are integer arithmetic and don't contribute to the floating-point load, although its possible they may be significant, performance-wise.)

Does the using of SIMD load main CPU registers?

Let's imagine we have software developer that's goal is achieve absolute maximum of CPU's performance.
In today's CPUs we have many cores, we can load data in cache for faster processing and we also have SIMD instructions (AVX for example) that allow us to sum\multiply\do other ops with array of items (multiply 8 integers per one CPU clock). The disadvantage of this instruction is the cost of sending data & instructions to SIMD module + overhead of converting vector type to primitive types (sorry I familiar only with C#'s Vector) (We not looling on code complexety for now).
As far as I understand, while we using SIMD, main registers of CPU used only for sending and recieving data to this registers and main ALU blocks used for general purpose calculations are idle at this time.
And here is my question - will using of SIMD instructions load main CPU blocks? For example if we have huge amount of different calculations (let's imagine 40% of them are best to run on SIMD and 60% of them are better to run as a usual), will SIMD allow us to gain performance boost in this way: 100% of all cores performace + n% of SIMD's boost performance?
I'm asking this question because of for example with GPGPU we can use GPU for parallel calculations and CPU used in this case only for sending and recieving data, so it's idle all the time and we can utilize it's performance for sensitive for latency tasks.
Looks like this is a question about Out-Of-Order-Execution? Modern x64 have a number of execution ports on the CPU, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel SkyLake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.
So for example, you may be able to displatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will have to wait for the operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should able to keep each of those ports busy (or at least, that's the basic aim!).
Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)
Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)
#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps
void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
for(int i = 0; i < count; ++i)
{
a[i] = mul8f(a[i], b[i]);
b[i] = mul8f(b[i], c[i]);
}
}
Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).
The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i], d[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test if the count has been reached?
It just so happens that we are also making use of the SIMD units to do a couple of multiplications...
Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The same CPU hardware used to compute either float or float8 types is exactly the same. There are simply a couple of bits in the machine code encoding that specifies the choice between float/__m128/__m256.
i.e. https://godbolt.org/z/xTcLrf

Does data alignment really speed up execution by more than 5%?

Since ever I carefully consider alignment of data structures. It hurts letting the CPU shuffling bits before processing can be done. Gut feelings aside, I measured the costs of unaligned data: Write 64bit longs into some GB of memory and then read their values, checking correctness.
// c++ code
const long long MB = 1024 * 1024;
const long long GB = 1024 * MB;
void bench(int offset) // pass 0..7 for different alignments
{
int n = (1 * GB - 1024) / 8;
char* mem = (char*) malloc(1 * GB);
// benchmarked block
{
long long* p = (long long*) (mem + offset);
for (long i = 0; i < n; i++)
{
*p++ = i;
}
p = (long long*) (mem + offset);
for (long i = 0; i < n; i++)
{
if (*p++ != i) throw "wrong value";
}
}
free(mem);
}
The result surprised me:
1st run 2nd run %
i = 0 221 i = 0 217 100 %
i = 1 228 i = 1 227 105 %
i = 2 260 i = 2 228 105 %
i = 3 241 i = 3 228 105 %
i = 4 219 i = 4 215 99 %
i = 5 233 i = 5 228 105 %
i = 6 227 i = 6 229 106 %
i = 7 228 i = 7 228 105 %
The costs are just 5% (if we randomly store it at any memory location, costs would be 3,75% since 25% would land aligned). But storing data unaligned has the benefit of being a bit more compact, so the 3,75% benefit could even be compensated.
Tests run on Intel 3770 CPU. Did many variations of this benchmarks (eg using pointers instead of longs; random read access to change cache effects) all leading to similar results.
Question: Is data structure alignment still as important as we all thought it is?
I know there are atomicity aspects when 64bit values spread over cache lines, but that is not a strong argument either for alignment, because larger data structs (say 30, 200bytes or so) will often spread across them.
I always believed strongly in the speed argument as laid out nicely here for instance: Purpose of memory alignment and do not feel well disobeying the old rule. But : Can we measure the claimed performance boosts of proper alignment?
A good answer could provide a reasonable benchmark showing a boost of factor of > 1.25 for aligned vs unaligned data. Or demonstrate that commonly used other modern CPUs are much more affected by unalignment.
Thank you for your thoughts measurements.
edit: I am concerned about classical data structures where structs are held in memory. In contrast to special case scenarios like scientific number crunching scenarios.
update: insights from comments:
from http://www.agner.org/optimize/blog/read.php?i=142&v=t
Misaligned memory operands handled efficiently on Sandy Bridge
On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned.Store-to-load forwarding also works with misaligned operands in most cases.
http://danluu.com/3c-conflict/
Unaligned access might be faster(!) on Sandy Bridge due to cache organisation.
Yes, data alignment is an important prerequisite for vectorisation on architectures that only support SSE, which has strict data alignment requirements or on newer architectures such as Xeon PHI. Intel AVX, does support unaligned access, but aligning data is still considered a good practice, to avoid unnecessary performance hits:
Intel® AVX has relaxed some memory alignment requirements, so now
Intel AVX by default allows unaligned access; however, this access may
come at a performance slowdown, so the old rule of designing your data
to be memory aligned is still good practice (16-byte aligned for
128-bit access and 32-byte aligned for 256-bit access). The main
exceptions are the VEX-extended versions of the SSE instructions that
explicitly required memory-aligned data: These instructions still
require aligned data
On these architectures, codes where vectorisation is useful (e.g. scientific computing applications with heavy use of floating point) may benefit from meeting the respective alignment prerequisites; the speedup would be proportional to the number of vector lanes in the FPU (4, 8, 16X). You can measure the benefits of vectorisation yourself by comparing software such as Eigen or PetSC or any other scientific software with / without vectorisation (-xHost for icc, -march=native for gcc), you should easily get a 2X speedup.

Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

We've got a simple memory throughput benchmark. All it does is memcpy repeatedly for a large block of memory.
Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad-channel.
Any ideas why this might be happening? The code that follows is compiled in Release in VS2015, and reports average time to complete each memcpy at:
64-bit: 2.2ms for Skylake vs 4.5ms for Broadwell-E
32-bit: 2.2ms for Skylake vs 3.5ms for Broadwell-E.
We can get greater memory throughput on a quad-channel Broadwell-E build by utilizing multiple threads, and that's nice, but to see such a drastic difference for single-threaded memory access is frustrating. Any thoughts on why the difference is so pronounced?
We've also used various benchmarking software, and they validate what this simple example shows - single-threaded memory throughput is way better on Skylake.
#include <memory>
#include <Windows.h>
#include <iostream>
//Prevent the memcpy from being optimized out of the for loop
_declspec(noinline) void MemoryCopy(void *destinationMemoryBlock, void *sourceMemoryBlock, size_t size)
{
memcpy(destinationMemoryBlock, sourceMemoryBlock, size);
}
int main()
{
const int SIZE_OF_BLOCKS = 25000000;
const int NUMBER_ITERATIONS = 100;
void* sourceMemoryBlock = malloc(SIZE_OF_BLOCKS);
void* destinationMemoryBlock = malloc(SIZE_OF_BLOCKS);
LARGE_INTEGER Frequency;
QueryPerformanceFrequency(&Frequency);
while (true)
{
LONGLONG total = 0;
LONGLONG max = 0;
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
for (int i = 0; i < NUMBER_ITERATIONS; ++i)
{
QueryPerformanceCounter(&StartingTime);
MemoryCopy(destinationMemoryBlock, sourceMemoryBlock, SIZE_OF_BLOCKS);
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
total += ElapsedMicroseconds.QuadPart;
max = max(ElapsedMicroseconds.QuadPart, max);
}
std::cout << "Average is " << total*1.0 / NUMBER_ITERATIONS / 1000.0 << "ms" << std::endl;
std::cout << "Max is " << max / 1000.0 << "ms" << std::endl;
}
getchar();
}
Single-threaded memory bandwidth on modern CPUs is limited by max_concurrency / latency of the transfers from L1D to the rest of the system, not by DRAM-controller bottlenecks. Each core has 10 Line-Fill Buffers (LFBs) which track outstanding requests to/from L1D. (And 16 "superqueue" entries which track lines to/from L2).
(Update: experiments show that Skylake probably has 12 LFBs, up from 10 in Broadwell. e.g. Fig7 in the ZombieLoad paper, and other performance experiments including #BeeOnRope's testing of multiple store streams)
Intel's many-core chips have higher latency to L3 / memory than quad-core or dual-core desktop / laptop chips, so single-threaded memory bandwidth is actually much worse on a big Xeon, even though the max aggregate bandwidth with many threads is much better. They have many more hops on the ring bus that connects cores, memory controllers, and the System Agent (PCIe and so on).
SKX (Skylake-server / AVX512, including the i9 "high-end desktop" chips) is really bad for this: L3 / memory latency is significantly higher than for Broadwell-E / Broadwell-EP, so single-threaded bandwidth is even worse than on a Broadwell with a similar core count. (SKX uses a mesh instead of a ring bus because that scales better, see this for details on both. But apparently the constant factors are bad in the new design; maybe future generations will have better L3 bandwidth/latency for small / medium core counts. The private per-core L2 is bumped up to 1MiB though, so maybe L3 is intentionally slow to save power.)
(Skylake-client (SKL) like in the question, and later quad/hex-core desktop/laptop chips like Kaby Lake and Coffee Lake, still use the simpler ring-bus layout. Only the server chips changed. We don't yet know for sure what Ice Lake client will do.)
A quad or dual core chip only needs a couple threads (especially if the cores + uncore (L3) are clocked high) to saturate its memory bandwidth, and a Skylake with fast DDR4 dual channel has quite a lot of bandwidth.
For more about this, see the Latency-bound Platforms section of this answer about x86 memory bandwidth. (And read the other parts for memcpy/memset with SIMD loops vs. rep movs/rep stos, and NT stores vs. regular RFO stores, and more.)
Also related: What Every Programmer Should Know About Memory? (2017 update on what's still true and what's changed in that excellent article from 2007).
I finally got VTune (evalutation) up and running. It gives a DRAM bound score of .602 (between 0 and 1) on Broadwell-E and .324 on Skylake, with a huge part of the Broadwell-E delay coming from Memory Latency. Given that the memory sticks are the same speed (except dual-channel configured in Skylake and quad-channel in Broadwell-E), my best guess is that something about the memory controller in Skylake is just tremendously better.
It makes buying into the Broadwell-E architecture a much tougher call, and requires that you really need the extra cores to even consider it.
I also got L3/TLB miss counts. On Broadwell-E, TLB miss count was about 20% higher, and L3 miss count about 36% higher.
I don't think this is really an answer for "why" so I won't mark it as such, but is as close as I think I'll get to one for the time being. Thanks for all the helpful comments along the way.

Understanding Memory Replays and In-Flight Requests

I'm trying to understand how a matrix transpose can be faster reading naively from columns vs. rows. (example is from Professional CUDA C Programming) The matrix is in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2)
__global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) {
unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
if (ix < nx && iy < ny) {
out[iy*nx + ix] = in[ix*ny + iy]; //
// out[ix*ny + iy] = in[iy*nx + ix]; // for by row
}
}
This is what I don't understand: The load throughput for for transposeNaiveCol() is 642.33 GB/s and for tranposeNaiveRow() is 129.05 GB/s. The author says:
The results show that the highest load throughput is obtained with
cached, strided reads. In the case of cached reads, each memory
request is serviced with a 128-byte cache line. Reading data by
columns causes each memory request in a warp to replay 32 times
(because the stride is 2048 data elements), resulting in good latency
hiding from many in-flight global memory reads and then excellent L1
cache hit ratios once bytes are pre-fetched into L1 cache.
My question:
I thought that aligned/coalesced reads were ideal, but here it seems that strided reads improve performance.
Why is reading a cache line conducive to reduced performance in
this case?
Aren't replays in general a bad thing? It mentions here that it results in "good latency hiding".
Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will always have a lower effective load throughput than the equivalent, non coalesced kernel, but that alone says nothing about its execution time: in the end, the one metric that really matters is the wall clock time that your kernel takes to completion, of which the authors make no mention.
That being said, kernels usually fall into two categories:
Compute bound kernels, whose performance can be increased by trying to hide instruction latency: keep the pipeline full (maximize ILP).
I/O bound kernels, whose performance can be increased by trying to hide memory latency: keep data in flight (maximize bandwidth).
Matrix transpose being of very low compute intensity, it is therefore I/O bound, and as such to get better performance you should try to increase bandwidth usage.
Why is the column transpose better at maximizing bandwidth usage?
In the case of the row transpose, reads are coalesced: a single 128 bytes transaction is served per warp, that is 4 bytes per thread. Those 128 bytes are put in cache but are never reused, so the cache is effectively of no use in this case.
In the case of the column transpose, reads are not coalesced: each warp gets served 32 transactions of 128 bytes, all of which will get into L1 and will be reused for the next 31 replays (assuming they didn't get kicked out of cache). That is very low load efficiency for very high effective load throughput, and maximal cache usage.
You could of course get the same effect in the row transpose by simply requesting more data per thread (for example by loading 32 float, or 8 float4 per thread) or using CUDA's prefetch capabilities.

Resources