Inconsistent `perf annotate` memory load/store time reporting

I'm having a hard time interpreting Intel performance events reporting.
Consider the following simple program that mainly reads/writes memory:
#include <stdint.h>
#include <stdio.h>

volatile uint32_t a;
volatile uint32_t b;

int main() {
    printf("&a=%p\n&b=%p\n", (void *)&a, (void *)&b);
    for(size_t i = 0; i < 1000000000LL; i++) {
        a ^= (uint32_t) i;
        b += (uint32_t) i;
        b ^= a;
    }
    return 0;
}
I compile it with gcc -O2 and run under perf:
# gcc -g -O2 a.c
# perf stat -a ./a.out
&a=0x55a4bcf5f038
&b=0x55a4bcf5f034
Performance counter stats for 'system wide':
32,646.97 msec cpu-clock # 15.974 CPUs utilized
374 context-switches # 0.011 K/sec
1 cpu-migrations # 0.000 K/sec
1 page-faults # 0.000 K/sec
10,176,974,023 cycles # 0.312 GHz
13,010,322,410 instructions # 1.28 insn per cycle
1,002,214,919 branches # 30.699 M/sec
123,960 branch-misses # 0.01% of all branches
2.043727462 seconds time elapsed
# perf record -a ./a.out
&a=0x5589cc1fd038
&b=0x5589cc1fd034
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.997 MB perf.data (9269 samples) ]
# perf annotate
The result of perf annotate (annotated for memory loads/stores by me):
Percent│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ xor %eax,%eax
│ nop
│ a ^= (uint32_t) i;
│28: mov a,%edx // 32-bit load
│ xor %eax,%edx
9.74 │ mov %edx,a // 32-bit store
│ b += (uint32_t) i;
12.12 │ mov b,%edx // 32-bit load
8.79 │ add %eax,%edx
│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ add $0x1,%rax
│ b += (uint32_t) i;
18.69 │ mov %edx,b // 32-bit store
│ b ^= a;
0.04 │ mov a,%ecx // 32-bit load
22.39 │ mov b,%edx // 32-bit load
8.92 │ xor %ecx,%edx
19.31 │ mov %edx,b // 32-bit store
│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ cmp $0x3b9aca00,%rax
│ ↑ jne 28
│ }
│ return 0;
│ }
│ xor %eax,%eax
│ add $0x8,%rsp
│ ← retq
My observations:
From the 1.28 insn per cycle I conclude that the program is mainly memory-bound.
a and b appear to be located in the same cache line, adjacent to each other.
My Question:
Shouldn't CPU time be more consistent for the various memory loads and stores?
Why is CPU time of the 1st memory load (mov a,%edx) zero?
Why is the time of the 3rd load mov a,%ecx 0.04%, while the one just next to it mov b,%edx 22.39%?
Why do some instructions take 0 time? The loop consists of 14 instructions, so each instruction must contribute some observable time.
Notes:
OS: Linux 4.19.0-amd64, CPU: Intel Core i9-9900K, 100% idle system (also tested on i7-7700, same result).

Not exactly "memory" bound, but bound on latency of store-forwarding. i9-9900K and i7-7700 have exactly the same microarchitecture for each core so that's not surprising :P https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake. (Except possibly for improvement in hardware mitigation of Meltdown, and possibly fixing the loop buffer (LSD).)
Remember that when a perf event counter overflows and triggers a sample, the out-of-order superscalar CPU has to choose exactly one of the in-flight instructions to "blame" for this cycles event. Often this is the oldest un-retired instruction in the ROB, or the one after. Be very suspicious of cycles event samples over very small scales.
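(An aside, not in the original answer: on Intel CPUs you can reduce this skid by requesting a PEBS "precise" event with perf's :p/:pp modifiers, e.g.:

# perf record -e cycles:pp ./a.out
# perf annotate

Precise sampling pins samples much closer to the triggering instruction, though the "blame the waiter, not the slow load" effect described next remains.)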
Perf never blames a load that was slow to produce a result; it usually blames the instruction that was waiting for it (in this case an xor or add, and here sometimes the store consuming the result of that xor). These aren't cache-miss loads; store-forwarding latency is only about 3 to 5 cycles on Skylake (variable, and shorter if you don't try to reload too soon: see "Loop with function call faster than an empty loop"), so you do have loads completing at about 2 per 3 to 5 cycles.
You have two dependency chains through memory:
The longest one involves two RMWs of b. This is twice as long and will be the overall bottleneck for the loop (a rough cycle count below bears this out).
The other involves one RMW of a (with an extra read each iteration, which can happen in parallel with the read that's part of the next a ^= i;).
The dep chain for i only involves registers and can run far ahead of the others; it's no surprise that add $0x1,%rax has no counts. Its execution cost is totally hidden in the shadow of waiting for loads.
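A rough sanity check of the bottleneck claim (my arithmetic, using the perf stat numbers above, and assuming the mostly-idle system means essentially all counted cycles come from the one busy core):

10,176,974,023 cycles / 1,000,000,000 iterations ≈ 10.2 cycles per iteration
2 serialized RMWs of b × (≈4-cycle store-forward + 1-cycle ALU op) ≈ 10 cycles per iteration

so the measured throughput is consistent with the chain through b being the only real bottleneck.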
I'm a bit surprised there are significant counts for mov %edx,a. Perhaps it sometimes has to wait for the older store uops involving b to run on the CPU's single store-data port. (Uops are dispatched to ports according to oldest-ready first. See "How are x86 uops scheduled, exactly?")
Uops can't retire until all previous uops have executed, so it could just be getting some skew from the store at the bottom of the loop. Uops retire in groups of 4, so if the mov %edx,b does retire, the already-executed cmp/jcc, the mov load of a, and the xor %eax,%edx can retire with it. Those are not part of the dep chain that waits for b, so they're always going to be sitting in the ROB waiting to retire whenever the b store is ready to retire. (This is guesswork about how mov %edx,a could be getting counts, despite not being part of a real bottleneck.)
The store-address uops should all run far ahead of the loop because they don't have to wait for previous iterations: RIP-relative addressing (see footnote 1) is ready right away. And they can run on port 7, or compete with loads for ports 2 or 3. Same for the loads: they can execute right away and detect which store they're waiting for, with the load buffer monitoring it and ready to report when the data becomes ready after the store-data uop does eventually run.
Presumably the front-end will eventually bottleneck on allocating load buffer entries, and that's what will limit how many uops can be in the back-end, not ROB or RS size.
Footnote 1: Your annotated output only shows a, not a(%rip), so that's odd; it doesn't matter whether you somehow got it to use a 32-bit absolute address, or whether it's just a disassembly quirk failing to show the RIP-relative form.
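(For comparison, not from the original answer, the two forms would disassemble as:

mov %edx,a(%rip)   # RIP-relative: address = RIP + displacement
mov %edx,a         # 32-bit absolute address (what the annotate output shows)

so objdump -d output normally makes it obvious which one the compiler emitted.)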

Related

Why does the same OpenCL code have different outputs from Intel Xeon CPU and NVIDIA GTX 1080 Ti GPU?

I am trying to parallelize Monte Carlo simulation by using OpenCL. I use the MWC64X as a uniform random number generator. The code runs well on different Intel CPUs, since the output of parallel computation is very close to the sequential one.
Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
Literal influence running time: 0.029048 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:4781 influence0:0.478100 sum2:4781 influence2:0.478100 sum6:0 influence6:0.000000 sum10:0 sum12:0 influence12:0.000000 sum7:0 influence7:0.000000 influence10:0.000000 sum4:5962 influence4:0.596200 sum8:7971 influence8:0.797100 sum1:4781 influence1:0.478100 sum3:4781 influence3:0.478100 sum13:0 influence13:0.000000 sum11:1261 influence11:0.126100 sum9:0 influence9:0.000000 sum14:0 influence14:0.000000 sum5:0 influence5:0.000000 sum15:0 influence15:0.000000 sum16:0 influence16:0.000000
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971
However, when I run the code on GeForce GTX 1080 Ti, with NVIDIA-SMI 430.40 and CUDA 10.1 and OpenCL 1.2 CUDA installed, the output is as below:
Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:10000 sum1:10000 sum2:10000 sum3:10000 sum4:10000 sum5:0 sum6:0 sum7:0 sum8:10000 sum9:0 sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0
Parallel influence running time: 0.193581 seconds
The "Influence" value equals sum*1.0/10000, so the parallel influence consists only of 1s and 0s, which is incorrect (in the GPU runs) and doesn't happen when parallelizing on an Intel CPU.
When I check the output of the random number generator with if(flag==0) printf("randint=%u",randint);, it seems the outputs are all zero on the GPU. Below are the clinfo output and the .cl code:
Device Name GeForce GTX 1080 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 430.40
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 68:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 28
Max clock frequency 1721MHz
Compute Capability (NV) 6.1
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 11720130560 (10.92GiB)
Error Correction support No
Max memory allocation 2930032640 (2.729GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 458752 (448KiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL

//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;

enum{ MWC64X_A = 4294883355U };
enum{ MWC64X_M = 18446383549859758079UL };

void MWC64X_Step(mwc64x_state_t *s)
{
    uint X=s->x, C=s->c;
    uint Xn=MWC64X_A*X+C;
    uint carry=(uint)(Xn<C); // The (Xn<C) will be zero or one for scalar
    uint Cn=mad_hi(MWC64X_A,X,carry);
    s->x=Xn;
    s->c=Cn;
}

//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
{
    uint res=s->x ^ s->c;
    MWC64X_Step(s);
    return res;
}

__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence){
    int flag=get_global_id(0);
    int sum=0;
    int count=10000;
    int assignment[N];
    //or try to get newlambda like original version does
    if(flag < literals){
        mwc64x_state_t rng;
        for(int i=0; i<count; i++){
            for(int j=0; j<size; j++){
                uint randint=MWC64X_NextUint(&rng);
                float rand=randint*1.0/BASE;
                //if(flag==0)
                //    printf("randint=%u",randint);
                if(lambdap[j]<rand)
                    assignment[lambdas[j]]=0;
                else
                    assignment[lambdas[j]]=1;
            }
            //the true case
            assignment[flag]=1;
            int valuet=0;
            int index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuet=1;
                    break;
                }
            }
            //the false case
            assignment[flag]=0;
            int valuef=0;
            index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuef=1;
                    break;
                }
            }
            sum += valuet-valuef;
        }
        influence[flag] = 1.0*sum/count;
        printf("sum%d:%d\t", flag, sum);
    }
}
What might be the problem when running the code on GPU? Is it MWC64X? According to its author, it can perform well on NVIDIA GPUs. If so, how can I fix it; if not, what might be the problem?
(This started out as a comment; it turns out this was the source of the problem, so I'm turning it into an answer.)
You're not initialising your mwc64x_state_t rng; variable before reading from it, so any results will be undefined:
mwc64x_state_t rng;
for(int i=0; i<count; i++){
for(int j=0; j<size; j++){
uint randint=MWC64X_NextUint(&rng);
Where MWC64X_NextUint() immediately reads from the rng state before updating it:
uint MWC64X_NextUint(mwc64x_state_t *s)
{
uint res=s->x ^ s->c;
Note that you will probably want to seed your RNG differently for each work-item, otherwise you will get nasty correlation artifacts in your results.
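For illustration, a minimal per-work-item initialisation might look like this (a sketch under assumptions, not code from the answer; the seed constants are arbitrary, and the MWC64X distribution also ships a MWC64X_SeedStreams helper — check its exact signature in the library — that splits the sequence into proper non-overlapping streams, which is the better fix):

mwc64x_state_t rng;
// Distinct x per work-item so the streams differ; any scheme works as long as
// the states stay valid ((x, c) not all zero, and c below MWC64X_A - 1).
rng.x = 0x12345678u ^ ((uint)get_global_id(0) * 73856093u);
rng.c = 362436u;   // small fixed carry, well below MWC64X_A

To make runs reproducible, derive these values from a seed passed in as a kernel argument rather than from hard-coded constants.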
All use cases of pseudo-random numbers are a next-level challenge on truly [PARALLEL] computing platforms (not languages, platforms).
Either there is some source of randomness, which gets us into trouble once massively parallel requests have to be fair-handled in a truly [PARALLEL] fashion (here, hardware resources may help, yet at the cost of not being able to reproduce the same behaviour "outside" of this very platform and moment in time, unless such a source is software-operated with some seed-injection feature that can set up the "just"-pseudo-random algorithm producing a purely [SERIAL] sequence of "just"-pseudo-random numbers).
Or there is some "shared" generator of pseudo-random numbers, which enjoys a higher system-wide level of entropy (good for the resulting "quality" of pseudo-randomness) but at the cost of purely serial dependence (no parallel execution possible; the serial sequence gets served one request after another) and of having close to zero chance of repeatable runs (a must for reproducible science) delivering the same sequences, which are needed for testing and for method-validation cases.
RESUME:
The code may employ work-item-"private" pseudo-random generating functions (privacy is a must both for parallel code execution and for the mutual independence, i.e. non-interference, of the generated streams), yet each instance must be a) independently initialised, so as to provide the expected level of randomness achievable in parallelised code runs, and b) initialised in a repeatably reproducible manner, for the sake of re-running the tests at different times, often on different OpenCL target computing platforms.
For __kernel-s that do not rely on hardware-specific sources of randomness, meeting conditions a) and b) will suffice for receiving repeatably reproducible (same) results when testing "in vitro", while still providing a reasonably random method for generating results during generic production-level code runs "in vivo".
The comparison of net run-times (benchmarked above) seems to show that Amdahl's-law add-on overhead costs, plus a tail-end effect of the atomicity of work, decided the outcome: the net run-time was ~3.6x faster on the XEON than on the GPU:
index1 = 17
size = 51
dim1_size = 6
sum0: 4781 influence0: 0.478100
sum2: 4781 influence2: 0.478100
sum6: 0 influence6: 0.000000
sum10: 0 influence10: 0.000000
sum12: 0 influence12: 0.000000
sum7: 0 influence7: 0.000000
sum4: 5962 influence4: 0.596200
sum8: 7971 influence8: 0.797100
sum1: 4781 influence1: 0.478100
sum3: 4781 influence3: 0.478100
sum13: 0 influence13: 0.000000
sum11: 1261 influence11: 0.126100
sum9: 0 influence9: 0.000000
sum14: 0 influence14: 0.000000
sum5: 0 influence5: 0.000000
sum15: 0 influence15: 0.000000
sum16: 0 influence16: 0.000000
Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 @ 1.80GHz using OpenCL
index1 = 17
size = 51
dim1_size = 6
sum0: 10000
sum1: 10000
sum2: 10000
sum3: 10000
sum4: 10000
sum5: 0
sum6: 0
sum7: 0
sum8: 10000
sum9: 0
sum10: 0
sum11: 0
sum12: 0
sum13: 0
sum14: 0
sum15: 0
sum16: 0
Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL

Understanding branch prediction efficiency

I tried to measure branch prediction cost, so I created a little program.
It creates a small buffer on the stack and fills it with random 0/1 values. I can set the size of the buffer with N. The code repeatedly branches on the same 1<<N random numbers.
Now, I expected that if 1<<N is sufficiently large (like >100), the branch predictor would not be effective (as it would have to predict >100 random numbers). However, these are the results (on a 5820K machine); as N grows, the program becomes slower:
N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9
For reference, if the buffer is initialized with zeros (use the commented-out init), time is more or less constant; it varies between 1.5 and 1.7 for N = 8..16.
My question is: can the branch predictor be effective at predicting such a large amount of random numbers? If not, then what's going on here?
(Some more explanation: the code executes 2^32 branches, regardless of N. So I expected the code to run at the same speed regardless of N, because the branch cannot be predicted at all. But it seems that if the buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)
Here's the code:
#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = { 314159165, 27182818 };
// volatile uint64_t init[2] = { 0, 0 };
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2]) {
    uint64_t s1 = s[0];
    uint64_t s0 = s[1];
    uint64_t result = s0 + s1;
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}

int main() {
    uint64_t s[2];
    s[0] = init[0];
    s[1] = init[1];
    uint64_t sum = 0;
#if 1
    const int N = 16;
    unsigned char buffer[1<<N];
    for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;
    for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) {
        for (int j=0; j<1<<N; j++) {
            if (buffer[j]) {
                sum += one;
            }
        }
    }
#else
    for (uint64_t i=0; i<uint64_t(1)<<32; i++) {
        if (next(s)&1) {
            sum += one;
        }
    }
#endif
    std::cout<<sum<<"\n";
}
(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)
Here's the inner loop disassembly (compiled with clang; it generates the same code for all N between 8 and 16, only the loop count differs; clang unrolled the loop by two):
401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>
Branch prediction can be that effective. As Peter Cordes suggested, I've checked branch-misses with perf stat. Here are the results:
N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9
Note: branch-misses (%) is calculated for 2^32 branches
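For reference, the sort of invocation that produces these counts (my reconstruction; the post doesn't show the exact command):

# perf stat -e task-clock,cycles,branches,branch-misses ./a.out

perf stat prints the raw counts plus the branch-miss percentage.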
As you can see, when N<=12, branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses starts to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.
The time taken can be approximated by looking at the time and branch-misses (%) columns: I've added the last column, approx-time, calculated as 2.2 + (12.9 - 2.2) * (branch-misses % / 50), 50% being the miss rate of the fully unpredictable case. As you can see, approx-time equals time (ignoring rounding error). So this effect can be explained perfectly by branch prediction.
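Checking one row by hand: for N=14, 2.2 + (12.9 - 2.2) * (34.31 / 50) ≈ 2.2 + 7.3 ≈ 9.5, matching the measured 9.5 seconds.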
The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):
(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.

What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?

Two different threads within a single process can share a common memory location by reading and/or writing to it.
Usually, such (intentional) sharing is implemented using atomic operations using the lock prefix on x86, which has fairly well-known costs both for the lock prefix itself (i.e., the uncontended cost) and also additional coherence costs when the cache line is actually shared (true or false sharing).
Here I'm interested in producer-consumer costs where a single thread P writes to a memory location, and another thread C reads from the memory location, both using plain reads and writes.
What are the latency and throughput of such an operation when performed on separate cores on the same socket, and in comparison when performed on sibling hyperthreads of the same physical core, on recent x86 CPUs?
In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.
Okay, I couldn't find any authoritative source, so I figured I'd give it a go myself.
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cstdint>
#include <iostream>

alignas(128) static uint64_t data[SIZE];
alignas(128) static std::atomic<unsigned> shared;
#ifdef EMPTY_PRODUCER
alignas(128) std::atomic<unsigned> unshared;
#endif
alignas(128) static std::atomic<bool> stop_producer;
alignas(128) static std::atomic<uint64_t> elapsed;

static inline uint64_t rdtsc()
{
    unsigned int l, h;
    __asm__ __volatile__ (
        "rdtsc"
        : "=a" (l), "=d" (h)
    );
    return ((uint64_t)h << 32) | l;
}

static void * consume(void *)
{
    uint64_t value = 0;
    uint64_t start = rdtsc();
    for (unsigned n = 0; n < LOOPS; ++n) {
        for (unsigned idx = 0; idx < SIZE; ++idx) {
            value += data[idx] + shared.load(std::memory_order_relaxed);
        }
    }
    elapsed = rdtsc() - start;
    return reinterpret_cast<void*>(value);
}

static void * produce(void *)
{
    do {
#ifdef EMPTY_PRODUCER
        unshared.store(0, std::memory_order_relaxed);
#else
        shared.store(0, std::memory_order_relaxed);
#endif
    } while (!stop_producer);
    return nullptr;
}

int main()
{
    pthread_t consumerId, producerId;
    pthread_attr_t consumerAttrs, producerAttrs;
    cpu_set_t cpuset;

    for (unsigned idx = 0; idx < SIZE; ++idx) { data[idx] = 1; }
    shared = 0;
    stop_producer = false;

    pthread_attr_init(&consumerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(CONSUMER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&consumerAttrs, sizeof(cpuset), &cpuset);

    pthread_attr_init(&producerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(PRODUCER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&producerAttrs, sizeof(cpuset), &cpuset);

    pthread_create(&consumerId, &consumerAttrs, consume, NULL);
    pthread_create(&producerId, &producerAttrs, produce, NULL);

    pthread_attr_destroy(&consumerAttrs);
    pthread_attr_destroy(&producerAttrs);

    pthread_join(consumerId, NULL);
    stop_producer = true;
    pthread_join(producerId, NULL);

    std::cout <<"Elapsed cycles: " <<elapsed <<std::endl;
    return 0;
}
Compile with the following command, replacing defines:
gcc -std=c++11 -DCONSUMER_CPU=3 -DPRODUCER_CPU=0 -DSIZE=131072 -DLOOPS=8000 timing.cxx -lstdc++ -lpthread -O2 -o timing
Where:
CONSUMER_CPU is the number of the cpu to run consumer thread on.
PRODUCER_CPU is the number of the cpu to run producer thread on.
SIZE is the size of the inner loop (matters for cache)
LOOPS is, well...
Here are the generated loops:
Consumer thread
400cc8: ba 80 24 60 00 mov $0x602480,%edx
400ccd: 0f 1f 00 nopl (%rax)
400cd0: 8b 05 2a 17 20 00 mov 0x20172a(%rip),%eax # 602400 <shared>
400cd6: 48 83 c2 08 add $0x8,%rdx
400cda: 48 03 42 f8 add -0x8(%rdx),%rax
400cde: 48 01 c1 add %rax,%rcx
400ce1: 48 81 fa 80 24 70 00 cmp $0x702480,%rdx
400ce8: 75 e6 jne 400cd0 <_ZL7consumePv+0x20>
400cea: 83 ee 01 sub $0x1,%esi
400ced: 75 d9 jne 400cc8 <_ZL7consumePv+0x18>
Producer thread, with empty loop (no writing to shared):
400c90: c7 05 e6 16 20 00 00 movl $0x0,0x2016e6(%rip) # 602380 <unshared>
400c97: 00 00 00
400c9a: 0f b6 05 5f 16 20 00 movzbl 0x20165f(%rip),%eax # 602300 <stop_producer>
400ca1: 84 c0 test %al,%al
400ca3: 74 eb je 400c90 <_ZL7producePv>
Producer thread, writing to shared:
400c90: c7 05 66 17 20 00 00 movl $0x0,0x201766(%rip) # 602400 <shared>
400c97: 00 00 00
400c9a: 0f b6 05 5f 16 20 00 movzbl 0x20165f(%rip),%eax # 602300 <stop_producer>
400ca1: 84 c0 test %al,%al
400ca3: 74 eb je 400c90 <_ZL7producePv>
The program counts the number of CPU cycles consumed, on consumer's core, to complete the whole loop. We compare the first producer, which does nothing but burn CPU cycles, to the second producer, which disrupts the consumer by repetitively writing to shared.
My system has an i5-4210U: that is, 2 cores, 2 threads per core, exposed by the kernel as Core#1 → cpu0,cpu2 and Core#2 → cpu1,cpu3.
Result without starting the producer at all:
CONSUMER PRODUCER cycles for 1M cycles for 128k
3 n/a 2.11G 1.80G
Results with empty producer. For 1G operations (either 1000*1M or 8000*128k).
CONSUMER PRODUCER cycles for 1M cycles for 128k
3 3 3.20G 3.26G # mono
3 2 2.10G 1.80G # other core
3 1 4.18G 3.24G # same core, HT
As expected, since both threads are cpu hogs and both get a fair share, the producer burning cycles slows down the consumer by about half. That's just cpu contention.
With the producer on cpu#2, as there is no interaction, the consumer runs with no impact from the producer running on another cpu.
With the producer on cpu#1, we see hyperthreading at work.
Results with disruptive producer:
CONSUMER PRODUCER cycles for 1M cycles for 128k
3 3 4.26G 3.24G # mono
3 2 22.1 G 19.2 G # other core
3 1 36.9 G 37.1 G # same core, HT
When we schedule both threads on the same logical CPU, there is no impact. Expected again, as the producer's writes remain local, incurring no synchronization cost.
I cannot really explain why I get much worse performance for hyperthreading than for two cores. Advice welcome.
The killer problem is that the core makes speculative reads, which means that each time a write to the speculatively read address (or more correctly to the same cache line) arrives before the read is "fulfilled", the CPU must undo the read (at least if you're on an x86), which effectively means it cancels all speculative instructions from that instruction and later.
At some point before the read is retired it gets "fulfilled", i.e. no instruction before it can fail and there is no longer any reason to reissue, and the CPU can act as if it had executed all instructions before it.
Other core example
These are playing cache ping pong in addition to cancelling instructions so this should be worse than the HT version.
Let's start at some point in the process where the cache line with the shared data has just been marked shared because the consumer has asked to read it.
The producer now wants to write to the shared data and sends out a request for exclusive ownership of the cache line.
The consumer still holds its cache line in shared state and happily reads the value.
The consumer continues to read the shared value until the exclusive request arrives.
At that point the consumer sends a shared request for the cache line.
At this point the consumer clears its instructions from the first unfulfilled load instruction of the shared value.
While the consumer waits for the data it runs ahead speculatively.
So the consumer can make progress in the window between getting the cache line in shared state and its being invalidated again. It is unclear how many reads can be fulfilled at the same time, most likely 2 as the CPU has 2 read ports. And it probably doesn't need to rerun them once the internal state of the CPU is satisfied they can't fail.
Same core HT
Here the two hyperthreads share the core and must share its resources.
The cache line should stay in the exclusive state all the time as they share the cache and therefore don't need the cache protocol.
Now why does it take so many cycles on the HT core? Let's start with the consumer just having read the shared value.
The next cycle, a write from the producer occurs.
The consumer thread detects the write and cancels all its instructions from the first unfulfilled read.
The consumer re-issues its instructions, taking ~5-14 cycles to run again.
Finally the first instruction, which is a read, is issued and executed, as it did not read a speculative value but a correct one, being at the front of the queue.
So for every read of the shared value the consumer is reset.
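(An aside, not from the original answer: this reset is a memory-ordering machine clear, and on Intel CPUs perf can count it directly, which is a good way to confirm the explanation:

# perf stat -e machine_clears.memory_ordering ./timing

One would expect a large count in the HT and cross-core runs with the disruptive producer, and a near-zero count otherwise.)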
Conclusion
The different-core version apparently advances so much between each cache ping-pong that it performs better than the HT one.
What would have happened if the CPU waited to see if the value had actually changed?
For the test code, the HT version would have run much faster, maybe even as fast as the private-write version. The different-core version would not have run faster, as the cache miss was covering the reissue latency.
But if the data had been different, the same problem would arise, except it would be worse for the different-core version, as it would then also have to wait for the cache line and then reissue.
So if the OP can change some of the roles, letting the timestamp producer read from the shared location and take the performance hit, it would be better.
Read more here

F# Performance Impact of Checked Calcs?

Is there a performance impact from using the Checked module? I've tested it out with sequences of type int and see no noticeable difference. Sometimes the checked version is faster and sometimes unchecked is faster, but generally not by much.
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:05.272, CPU: 00:00:05.272, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
open Checked
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:04.785, CPU: 00:00:04.773, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
Basically I'm trying to figure out if there would be any downside to always opening Checked. (I encountered an overflow that wasn't immediately obvious, so I'm now playing the role of the jilted lover who doesn't want another broken heart.) The only non-contrived reason I can come up with for not always using Checked is if there were some performance hit, but I haven't seen one yet.
When you measure performance it's usually not a good idea to include Seq, as Seq adds lots of overhead (at least compared to int operations), so you risk that most of the time is spent in Seq, not in the code you'd like to test.
I wrote a small test program for (+):
let clock =
    let sw = System.Diagnostics.Stopwatch ()
    sw.Start ()
    fun () ->
        sw.ElapsedMilliseconds

let dbreak () = System.Diagnostics.Debugger.Break ()

let time a =
    let b = clock ()
    let r = a ()
    let n = clock ()
    let d = n - b
    d, r

module Unchecked =
    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

module Checked =
    open Checked
    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

[<EntryPoint>]
let main argv =
    let count = 1000000000
    let testCases =
        [|
            "Unchecked" , Unchecked.run
            "Checked"   , Checked.run
        |]
    for nm, a in testCases do
        printfn "Running %s ..." nm
        let ms, r = time (a count)
        printfn "... it took %d ms, result is %A" ms r
    0
The performance results are as follows:
Running Unchecked ...
... it took 561 ms, result is 1000000000
Running Checked ...
... it took 1103 ms, result is 1000000000
So it seems some overhead is added by using Checked. The cost of an int add should be less than the loop overhead, so the overhead of Checked on the adds themselves is higher than the 2x the totals suggest, maybe closer to 4x.
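A back-of-the-envelope check (my arithmetic; the post doesn't state the clock speed): at an assumed ~3 GHz, 561 ms for 10^9 iterations is ≈1.7 cycles per iteration unchecked, versus ≈3.3 cycles checked, so the two jo instructions add ≈1.6 cycles per iteration to this loop.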
Out of curiosity we can check the IL code using tools like ILSpy:
Unchecked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
Checked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add.ovf
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add.ovf
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
The only difference is that Unchecked uses add and Checked uses add.ovf. add.ovf is add with overflow check.
We can dig even deeper by looking at the jitted x86_64 code.
Unchecked:
; if i < c then
00007FF926A611B3 cmp esi,ebx
00007FF926A611B5 jge 00007FF926A611BD
; i + 1
00007FF926A611B7 inc esi
; a + 1
00007FF926A611B9 inc edi
; loop (a + 1) (i + 1)
00007FF926A611BB jmp 00007FF926A611B3
Checked:
; if i < c then
00007FF926A62613 cmp esi,ebx
00007FF926A62615 jge 00007FF926A62623
; a + 1
00007FF926A62617 add edi,1
; Overflow?
00007FF926A6261A jo 00007FF926A6262D
; i + 1
00007FF926A6261C add esi,1
; Overflow?
00007FF926A6261F jo 00007FF926A6262D
; loop (a + 1) (i + 1)
00007FF926A62621 jmp 00007FF926A62613
Now the reason for the Checked overhead is visible. After each operation the jitter inserts the conditional instruction jo which jumps to code that raises OverflowException if the overflow flag is set.
This chart shows us that the cost of an integer add is less than 1 clock cycle. The reason it's less than 1 clock cycle is that modern CPUs can execute certain instructions in parallel.
The chart also shows us that a branch that was correctly predicted by the CPU takes around 1-2 clock cycles.
So, assuming a throughput of at least 2, the cost of the two integer additions in the Unchecked example should be 1 clock cycle.
In the Checked example we do add, jo, add, jo. Most likely the CPU can't parallelize these in the same way, so the cost should be around 4-6 clock cycles.
Another interesting difference is that the order of the additions changed. With checked additions the order of the operations matters, but with unchecked the jitter (and the CPU) has greater flexibility to move the operations around, possibly improving performance.
So, long story short: for cheap operations like (+) the overhead of Checked should be around 4x-6x compared to Unchecked.
This assumes no overflow exception. The cost of a .NET exception is probably around 100,000x more expensive than an integer addition.

CUDA profiler reports inefficient global memory access

I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses. My kernel code is:
__global__ void update_particles_kernel
(
    float4 *pos,
    float4 *vel,
    float4 *acc,
    float dt,
    int numParticles
)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = 0;

    while(index + offset < numParticles)
    {
        vel[index + offset].x += dt*acc[index + offset].x; // line 247
        vel[index + offset].y += dt*acc[index + offset].y;
        vel[index + offset].z += dt*acc[index + offset].z;

        pos[index + offset].x += dt*vel[index + offset].x; // line 251
        pos[index + offset].y += dt*vel[index + offset].y;
        pos[index + offset].z += dt*vel[index + offset].z;

        offset += blockDim.x * gridDim.x;
    }
}
In particular, the profiler reports 16 L2 transactions per access for these lines, where it says the ideal would be 8 (for line 247) or 4 (for the other lines).
From the CUDA best practices guide it says:
"For devices of compute capability 2.x, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp. By default, all accesses are cached through L1, which has 128-byte lines. For scattered access patterns, to reduce overfetch, it can sometimes be useful to cache only in L2, which caches shorter 32-byte segments (see the CUDA C Programming Guide).
For devices of compute capability 3.x, accesses to global memory are cached only in L2; L1 is reserved for local memory accesses. Some devices of compute capability 3.5, 3.7, or 5.2 allow opt-in caching of globals in L1 as well."
Now in my kernel, based on this information, I would expect that 16 transactions would be required to service a 32-thread warp, because float4 is 16 bytes and on my card (a 770M, compute capability 3.0) reads from the L2 cache are performed in 32-byte chunks (16 bytes * 32 threads / 32-byte cache lines = 16 transactions). Indeed, as you can see, the profiler reports that I am doing 16 accesses. What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247, and only 4 L2 transactions per access for the remaining lines. Can someone explain what I am missing here?
I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses.
To take one example, your float4 vel array is stored in memory like this:
0.x 0.y 0.z 0.w 1.x 1.y 1.z 1.w 2.x 2.y 2.z 2.w 3.x 3.y 3.z 3.w ...
^               ^               ^               ^
thread0         thread1         thread2         thread3
So when you do this:
vel[index + offset].x += ...; // line 247
you are accessing (storing) at the locations (.x) that I have marked above. The gaps in between each ^ mark indicate an inefficient access pattern, which the profiler is pointing out. (It does not matter that in the very next line of code, you are storing to the .y locations.)
There are at least 2 solutions, one of which would be a classical AoS -> SoA reorganization of your data, with appropriate code adjustments. This is well documented (e.g. here on the cuda tag and elsewhere) in terms of what it means, and how to do it, so I will let you look that up.
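For concreteness, a minimal sketch of what the SoA version of this kernel might look like (hypothetical names; this illustrates the general technique, it is not code from the question or answer):

// Structure-of-Arrays: one contiguous array per component, so the 32 threads
// of a warp read 32 adjacent floats (128 contiguous bytes) per access.
struct ParticlesSoA {
    float *pos_x, *pos_y, *pos_z;
    float *vel_x, *vel_y, *vel_z;
    float *acc_x, *acc_y, *acc_z;
};

__global__ void update_particles_soa(ParticlesSoA p, float dt, int numParticles)
{
    for (int i = threadIdx.x + blockIdx.x * blockDim.x;
         i < numParticles;
         i += blockDim.x * gridDim.x)
    {
        p.vel_x[i] += dt * p.acc_x[i]; // each load/store is fully coalesced
        p.vel_y[i] += dt * p.acc_y[i];
        p.vel_z[i] += dt * p.acc_z[i];
        p.pos_x[i] += dt * p.vel_x[i];
        p.pos_y[i] += dt * p.vel_y[i];
        p.pos_z[i] += dt * p.vel_z[i];
    }
}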
The other typical solution is to load a float4 quantity per thread, when you need it, and store a float4 quantity per thread, when you need to. Your code can be trivially reworked to do this, which should give improved profiling results:
//preceding code need not change
while(index + offset < numParticles)
{
    float4 my_vel = vel[index + offset];
    float4 my_acc = acc[index + offset];
    my_vel.x += dt*my_acc.x;
    my_vel.y += dt*my_acc.y;
    my_vel.z += dt*my_acc.z;
    vel[index + offset] = my_vel;

    float4 my_pos = pos[index + offset];
    my_pos.x += dt*my_vel.x;
    my_pos.y += dt*my_vel.y;
    my_pos.z += dt*my_vel.z;
    pos[index + offset] = my_pos;

    offset += blockDim.x * gridDim.x;
}
Even though you might think this code is "less efficient" than yours, because yours "appears" to load and store only .x, .y, and .z whereas mine "appears" to also load and store .w, in fact there is essentially no difference, due to the way a GPU loads from and stores to global memory. Although your code does not appear to touch .w, in the process of accessing the adjacent elements the GPU will load the .w elements from global memory anyway, and (eventually) store them back.
What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247
For line 247 in your original code, you are accessing one float quantity per thread for the load operation of acc.x, and one float quantity per thread for the load operation of vel.x. A float quantity per thread by itself should require 128 bytes for a warp, which is 4 32-byte L2 cachelines. Two loads together would require 8 L2 cacheline loads. This is the ideal case, which assumes that the quantities are packed together nicely (SoA). But that is not what you have (you have AoS).
