reduction is very slow in OpenMP

I am doing some optimization on my code. First I parallelized it with OpenMP, then I used the optimization flags provided by the GNU GCC compiler. I also included an SSE instruction to compute the inverse square root. But I finally realized that the problem is the last operation: when each thread writes its result into the reduction variable, it takes ~80% of the time. Here is the parallel loop:
time(&t5);
# pragma omp parallel for shared(NTOT) private(dx,dy,d,H,V,E,F,G,K) reduction(+:dU)
for(j = 1; j <= NTOT; j++){
    if(!(j-i)) continue;
    dx = (X[2*j-2]-X[2*i-2])*a;
    dy = (X[2*j-1]-X[2*i-1])*a;
    d = rsqrtSSE(dx*dx+dy*dy);
    H = D*d*d*d;
    V = dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1];
    E = dS[0]*dx+dS[1]*dy;
    F = spin[2*j-2]*dx+spin[2*j-1]*dy;
    G = -3*d*d*E*F;
    K = H*(V+G);
    dU += K;
}
time(&t6);
t_loop = difftime(t6, t5);
where rsqrtSSE() is a function based on the _mm_rsqrt_ps(__m128) intrinsic declared in xmmintrin.h.
Is there a solution to overcome this problem, or is this due to a bandwidth limitation?
I compile with gcc -o prog prog.c -lm -fopenmp -O3 -ffast-math -march=native
Here is some information about my computer:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
Stepping: 1
CPU MHz: 849.382
CPU max MHz: 2600.0000
CPU min MHz: 800.0000
BogoMIPS: 4589.17
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
and with Turbo Boost:
CPU Avg_MHz %Busy Bzy_MHz TSC_MHz
- 2294 99.97 2300 2295
0 2295 100.00 2300 2295
1 2295 100.00 2300 2295
2 2292 99.87 2300 2295
3 2295 100.00 2300 2295

Your measurement is flawed.
The first approach, removing the line altogether, allows the compiler to optimize away most of the computations, simply because the result is unused.
The second approach, if I understand it correctly, was to place timing instructions inside the loop itself, e.g. before/after dU += K. Unfortunately, this also has no hope of producing meaningful results. Any library timing call is orders of magnitude slower than this operation, so you basically measure the time it takes to get the time. You can check that by repeating multiple timing calls and comparing the differences.
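To get a feel for that overhead, here is a minimal sketch that times a run of back-to-back timing calls. The helper name clock_call_cost_ns is hypothetical, and it uses the standard clock() function purely for illustration; the same idea applies to whatever timing call was placed inside the loop.

```c
#include <time.h>

/* Hypothetical helper: average cost, in nanoseconds, of one clock() call,
   estimated by timing `calls` back-to-back invocations. */
double clock_call_cost_ns(long calls) {
    volatile clock_t sink;          /* keeps the calls from being discarded */
    clock_t start = clock();
    for (long i = 0; i < calls; i++)
        sink = clock();
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)calls;
}
```

If the value returned for, say, a million calls is far larger than the cost of a single floating-point add, timing dU += K directly mostly measures the timer.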
From what I have seen, I suspect that OpenMP implementations simply keep dU as a thread-private variable and perform the reduction with atomic/locked operations only after the loop has completed.
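The suspected behaviour can be written out by hand. This is only a sketch of the idea, not what any particular OpenMP runtime actually generates; the function name sum_with_private_partials is made up for illustration.

```c
#include <stddef.h>

/* Sketch: each thread accumulates into its own private variable and
   touches the shared total exactly once, after its share of the loop.
   Without -fopenmp the pragmas are ignored and this runs serially. */
double sum_with_private_partials(const double *x, size_t n) {
    double total = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;               /* private to each thread */
        #pragma omp for nowait
        for (long j = 0; j < (long)n; j++)
            local += x[j];
        #pragma omp atomic
        total += local;                   /* one synchronized update per thread */
    }
    return total;
}
```

With this structure the synchronized update happens once per thread rather than once per iteration, so it is unlikely to account for most of the loop time.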
A better approach to determine the performance of individual lines is to use sampling. Take a look at the perf tool for Linux; it can be very helpful. The results still may not always be perfectly reliable because there can be bias. Due to compiler optimizations (e.g. unrolling) and hardware optimizations (e.g. pipelining), multiple lines of code can be in the state of execution at the same time, so it becomes very difficult to say how much time a single line of code is taking.
One approach to see if you have a problem is to try to figure out the theoretical hardware performance. For _mm_rsqrt_ps that seems difficult. You can also take a look at hardware performance counters to see if you have a lot of cache misses etc. perf can also help you with that.

Related

Strange CPU binding/pinning result within OpenMPI

I have tried to evaluate an OpenMPI matrix multiplication program. The code scales very well on a machine in our laboratory with a single thread per core (close to ideal speedup with 48 and 64 cores). However, on some other machines that are hyperthreaded, the behavior is strange: as you can see in the screenshot from htop, the CPU utilization is different when I run the same experiment with the same command. I executed the program with
mpirun --bind-to hwthread --use-hwthread-cpus -n 2 ...
Here I bind the MPI workers to hwthreads, and with -n 2 I restrict the execution to two processors (here hwthreads). However, it seems to use another hwthread at roughly 50% utilization as well! I find this strange because there is no extra CPU utilization on the other machines. I repeated this experiment many times, and I'm sure this is not a temporary check or something else done by the OS but is due to the execution model of OpenMPI.
I would appreciate it if someone could explain this behavior and the extra CPU utilization when I execute this on the hyper-threaded machine.
The output of lscpu is as below:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The version of OpenMPI for all machines is the same 2.1.1.
Maybe hyperthreading is not the cause and I was misled by this, but the only big differences between these environments are 1) hyperthreading and 2) the processor clock frequency, which ranges from 2200 MHz to 4.8 GHz across the different CPUs.

GCC optimization flag -O2 makes code much slower than -O0 [duplicate]

I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
    char *s = calloc(1 << 20, 1);
    memset(s, 65, 1000000);
    clock_t start = clock();
    for (int i = 0; i < 128; ++i) {
        s[strlen(s)] = 'A';
    }
    clock_t end = clock();
    printf("%lld\n", (long long)(end - start));
    return 0;
}
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.

Testing your code on Godbolt's Compiler Explorer provides this explanation:
at -O0 or without optimisations, the generated code calls the C library function strlen;
at -O1 the generated code uses a simple inline expansion using a rep scasb instruction;
at -O2 and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows substantial variations from one run to another, but increasing the number of iterations shows that:
the -O1 code is much slower than the C library implementation: 32240 vs 3090
the -O2 code is faster than the -O1 but still substantially slower than the C library code: 8570 vs 3090.
This behavior is specific to gcc and the GNU libc. The same test on OS/X with clang and Apple's Libc does not show significant differences, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings in your benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I improved this benchmark and tested various string lengths. It appears from the benchmarks on Linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1 is always slower, by as much as a factor of 10 for moderately long strings, while -O2 is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen is quite efficient for most string lengths, at least on my specific hardware. Also keep in mind that caching has a major impact on benchmark measurements.
Here is the updated code:
#include <stdio.h>   /* printf */
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen) {
    char *s = malloc(maxlen + 1);
    memset(s, 'A', minlen);
    long long bytes = 0, calls = 0;
    clock_t clk = clock();
    for (int n = 0; n < repeat; n++) {
        for (int i = minlen; i < maxlen; ++i) {
            bytes += i + 1;
            calls += 1;
            s[i] = '\0';
            s[strlen(s)] = 'A';
        }
    }
    clk = clock() - clk;
    free(s);
    double avglen = (minlen + maxlen - 1) / 2.0;
    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/call\n",
           avglen, ns / bytes, ns / calls);
}
int main() {
    benchmark(10000000, 0, 1);
    benchmark(1000000, 0, 10);
    benchmark(1000000, 5, 15);
    benchmark(100000, 0, 100);
    benchmark(100000, 50, 150);
    benchmark(10000, 0, 1000);
    benchmark(10000, 500, 1500);
    benchmark(1000, 0, 10000);
    benchmark(1000, 5000, 15000);
    benchmark(100, 1000000 - 50, 1000000 + 50);
    return 0;
}
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call

GCC's inline strlen patterns are much slower than what it could do with SSE2 pcmpeqb / pmovmskb, and bsf, given the 16-byte alignment from calloc. This "optimization" is actually a pessimization.
My simple hand-written loop that takes advantage of 16-byte alignment is 5x faster than what gcc -O3 inlines for large buffers, and ~2x faster for short strings. (And faster than calling strlen for short strings). I've added a comment to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 to propose this for what gcc should inline at -O2 / -O3 when it's able. (With a suggestion for ramping up to 16-byte if we only know 4-byte alignment to start with.)
When gcc knows it has 4-byte alignment for the buffer (guaranteed by calloc), it chooses to inline strlen as a 4-byte-at-a-time scalar bithack using GP integer registers (-O2 and higher).
(Reading 4 bytes at a time is only safe if we know we can't cross into a page that doesn't contain any string bytes, and thus might be unmapped. See "Is it safe to read past the end of a buffer within the same page on x86 and x64?" TL;DR: yes, in asm it is, so compilers can emit code that does that even if doing so in the C source is UB. libc strlen implementations also take advantage of that. See my answer there for links to glibc strlen and a summary of how it runs so fast for large strings.)
At -O1, gcc always (even without known alignment) chooses to inline strlen as repnz scasb, which is very slow (about 1 byte per clock cycle on modern Intel CPUs). "Fast strings" only applies to rep stos and rep movs, not the repz/repnz instructions, unfortunately. Their microcode is just simple 1 byte at a time, but they still have some startup overhead. (https://agner.org/optimize/)
(We can test this by "hiding" the pointer from the compiler by storing / reloading s to a volatile void *tmp, for example. gcc has to make zero assumptions about the pointer value that's read back from a volatile, destroying any alignment info.)
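A minimal sketch of that pointer-hiding trick, assuming a helper named hide_alignment (a made-up name, not taken from the benchmark source). It uses a volatile-qualified pointer object so that the compiler has to reload the value and can no longer assume anything about its alignment.

```c
/* Sketch: round-trip the pointer through a volatile object. The value is
   unchanged, but the compiler must treat the reloaded pointer as unknown,
   so alignment information from calloc is lost. */
char *hide_alignment(char *p) {
    char *volatile tmp = p;   /* the pointer object itself is volatile */
    return tmp;               /* forces a reload from tmp */
}
```

Calling s = hide_alignment(s) before the loop should make gcc fall back to whatever it does for pointers of unknown alignment.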
GCC does have some x86 tuning options like -mstringop-strategy=libcall vs. unrolled_loop vs. rep_byte for inlining string operations in general (not just strlen; memcmp would be another major one that can be done with rep or a loop). I haven't checked what effect these have here.
The docs for another option also describe the current behaviour. We could get this inlining (with extra code for alignment-handling) even in cases where we wanted it on unaligned pointers. (This used to be an actual perf win, especially for small strings, on targets where the inline loop wasn't garbage compared to what the machine can do.)
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to least a 4-byte boundary. This enables more inlining and increases code size, but may improve performance of code that depends on fast memcpy, strlen, and memset for short lengths.
GCC also has per-function attributes you can apparently use to control this, like __attribute__((target("no-inline-all-stringops"))) void foo() { ... }, but I haven't played around with it. (That's the opposite of inline-all. It doesn't mean inline none, it just goes back to only inlining when 4-byte alignment is known.)
Both of gcc's inline strlen strategies fail to take advantage of 16-byte alignment, and are pretty bad for x86-64
Unless the small-string case is very common, doing one 4-byte chunk, then aligned 8-byte chunks would go about twice as fast as 4-byte.
And the 4-byte strategy has much slower cleanup than necessary for finding the byte within the dword containing the zero byte. It detects this by looking for a byte with its high bit set, so it should just mask off the other bits and use bsf (bit-scan forward). That has 3 cycle latency on modern CPUs (Intel and Ryzen). Or compilers can use rep bsf so it runs as tzcnt on CPUs that support BMI1, which is more efficient on AMD. bsf and tzcnt give the same result for non-zero inputs.
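As an illustration of that cleanup, here is a sketch in GNU C. It assumes the classic zero-byte bithack for the detection step (this is not a copy of what GCC emits) and uses __builtin_ctz, which compiles to bsf or tzcnt on x86. The function name first_zero_byte is hypothetical.

```c
#include <stdint.h>

/* Sketch: index (0..3) of the first zero byte in a little-endian dword,
   or -1 if there is none. */
int first_zero_byte(uint32_t v) {
    /* A zero byte gets its high bit set here. Bytes above the first zero
       may also be flagged because of the borrow, but the lowest set bit
       always belongs to the first zero byte. */
    uint32_t mask = (v - 0x01010101u) & ~v & 0x80808080u;
    if (mask == 0)
        return -1;
    return __builtin_ctz(mask) >> 3;   /* bit index -> byte index */
}
```

The bit scan replaces a chain of compare-and-branch steps on the individual bytes of the dword.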
GCC's 4-byte loop looks like it's compiled from pure C, or some target-independent logic, not taking advantage of bitscan. gcc does use andn to optimize it when compiling for x86 with BMI1, but it's still less than 4 bytes per cycle.
SSE2 pcmpeqb + bsf is much much better for both short and long inputs. x86-64 guarantees that SSE2 is available, and the x86-64 System V ABI has alignof(max_align_t) = 16, so calloc will always return pointers that are at least 16-byte aligned.
I wrote a replacement for the strlen block to test performance
As expected it's about 4x faster on Skylake going 16 bytes at a time instead of 4.
(I compiled the original source to asm with -O3, then edited the asm to see what performance should have been with this strategy for inline expansion of strlen. I also ported it to inline asm inside the C source; see that version on Godbolt.)
# at this point gcc has `s` in RDX, `i` in ECX
pxor %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do {
#ifdef __AVX__
vpcmpeqb (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb %xmm0, %xmm1 # xmm1 = -1 where there was a 0 in memory
#endif
add $16, %rdx # ptr++
pmovmskb %xmm1, %eax # extract high bit of each byte to a 16-bit mask
test %eax, %eax
jz .Lstrlen16 # }while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf %eax, %eax # EAX = bit-index of the lowest set bit
movb $'A', -16(%rdx, %rax)
Note that I optimized part of the strlen cleanup into the store addressing mode: I correct for the overshoot with the -16 displacement. Also note that this just finds the end of the string; it doesn't actually calculate the length and then index, as GCC was already doing after inlining its 4-byte-at-a-time loop.
To get actual string length (instead of pointer to the end), you'd subtract rdx-start and then add rax-16 (maybe with an LEA to add 2 registers + a constant, but 3-component LEA has more latency.)
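For readers who prefer intrinsics to asm, the same strategy can be sketched in C. This is an approximation of the loop above, not the code that was benchmarked; the name strlen_aligned16 is made up, and the function assumes x86 with SSE2, GNU C (for __builtin_ctz), and a pointer that really is 16-byte aligned.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch: strlen for a 16-byte-aligned pointer, 16 bytes per iteration.
   Aligned 16-byte loads never cross a page boundary, so reading past the
   terminator within the same chunk cannot fault. */
size_t strlen_aligned16(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);   /* movdqa */
        /* pcmpeqb + pmovmskb: bit i is set where byte i of the chunk is 0 */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);   /* bsf */
        p += 16;
    }
}
```

Unlike the asm, this version returns the length instead of a pointer to the terminator, so it includes the subtraction described above.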
With AVX to allow load+compare in one instruction without destroying the zeroed register, the whole loop is only 4 uops, down from 5. (test/jz macro-fuses into one uop on both Intel and AMD. vpcmpeqb with a non-indexed memory-source can keep it micro-fused through the whole pipeline, so it's only 1 fused-domain uop for the front-end.)
(Note that mixing 128-bit AVX with SSE does not cause stalls even on Haswell, as long as you're in clean-upper state to start with. So I didn't bother about changing the other instructions to AVX, only the one that mattered. There seemed to be some minor effect where pxor was actually slightly better than vpxor on my desktop, though, for an AVX loop body. It seemed somewhat repeatable, but it's weird because there's no code-size difference and thus no alignment difference.)
pmovmskb is a single-uop instruction. It has 3-cycle latency on Intel and Ryzen (worse on Bulldozer-family). For short strings, the trip through the SIMD unit and back to integer is an important part of the critical path dependency chain for latency from input memory bytes to store-address being ready. But only SIMD has packed-integer compares, so scalar would have to do more work.
For the very-small string case (like 0 to 3 bytes), it might be possible to achieve slightly lower latency for that case by using pure scalar (especially on Bulldozer-family), but having all strings from 0 to 15 bytes take the same branch path (loop branch never taken) is very nice for most short-strings use-cases.
Being very good for all strings up to 15 bytes seems like a good choice, when we know we have 16-byte alignment. More predictable branching is very good. (And note that when looping, pmovmskb latency only affects how quickly we can detect branch mispredicts to break out of the loop; branch prediction + speculative execution hides the latency of the independent pmovmskb in each iteration.)
If we expected longer strings to be common, we could unroll a bit, but at that point you should just call the libc function so it can dispatch to AVX2 if available at runtime. Unrolling to more than 1 vector complicates the cleanup, hurting the simple cases.
On my machine i7-6700k Skylake at 4.2GHz max turbo (and energy_performance_preference = performance), with gcc8.2 on Arch Linux, I get somewhat consistent benchmark timing because my CPU clock speed ramps up during the memset. But maybe not always to max turbo; Skylake's hw power management downclocks when memory-bound. perf stat showed I typically got right around 4.0GHz when running this to average the stdout output and see perf summary on stderr.
perf stat -r 100 ./a.out | awk '{sum+= $1} END{print sum/100;}'
I ended up copying my asm into a GNU C inline-asm statement, so I could put the code on the Godbolt compiler explorer.
For large strings, same length as in the question: times on ~4GHz Skylake
~62100 clock_t time units: -O1 rep scas: (clock() is a bit obsolete, but I didn't bother changing it.)
~15900 clock_t time units: -O3 gcc 4-byte loop strategy: avg of 100 runs = . (Or maybe ~15800 with -march=native for andn)
~1880 clock_t time units: -O3 with glibc strlen function calls, using AVX2
~3190 clock_t time units: (AVX1 128-bit vectors, 4 uop loop) hand-written inline asm that gcc could/should inline.
~3230 clock_t time units: (SSE2 5 uop loop) hand-written inline asm that gcc could/should inline.
My hand-written asm should be very good for short strings, too, because it doesn't need to branch specially. Known alignment is very good for strlen, and libc can't take advantage of it.
My version is 1.7x slower than libc for that case, which matters little if we expect large strings to be rare. The length of 1M bytes means it won't be staying hot in L2 (256k) or L1d cache (32k) on my CPU, so even bottlenecked on L3 cache the libc version was faster. (Probably an unrolled loop and 256-bit vectors doesn't clog up the ROB with as many uops per byte, so OoO exec can see farther ahead and get more memory parallelism, especially at page boundaries.)
But L3 cache bandwidth is probably a bottleneck stopping the 4-uop version from running at 1 iteration per clock, so we're seeing less benefit from AVX saving us a uop in the loop. With data hot in L1d cache, we should get 1.25 cycles per iteration vs. 1.
But a good AVX2 implementation can read up to 64 bytes per cycle (2x 32 byte loads) using vpminub to combine pairs before checking for zeros and going back to find where they were. The gap between this and libc opens wider for sizes of ~2k to ~30 kiB or so that stay hot in L1d.
Some read-only testing with length=1000 indicates that glibc strlen really is about 4x faster than my loop for medium size strings hot in L1d cache. That's large enough for AVX2 to ramp up to the big unrolled loop, but still easily fits in L1d cache. (Read-only avoids store-forwarding stalls, and so we can do many iterations.)
If your strings are that big, you should be using explicit-length strings instead of needing to strlen at all, so inlining a simple loop still seems like a reasonable strategy, as long as it's actually good for short strings and not total garbage for medium (like 300 bytes) and very long (> cache size) strings.
Benchmarking small strings with this:
I ran into some oddities in trying to get the results I expected:
I tried s[31] = 0 to truncate the string before every iteration (allowing short constant length). But then my SSE2 version was almost the same speed as GCC's version. Store-forwarding stalls were the bottleneck! A byte store followed by a wider load makes store-forwarding take the slow path that merges bytes from the store buffer with bytes from L1d cache. This extra latency is part of a loop-carried dep chain through the last 4-byte or 16-byte chunk of the string, to calculate the store index for the next iteration.
GCC's slower 4-byte-at-a-time code could keep up by processing the earlier 4-byte chunks in the shadow of that latency. (Out-of-order execution is pretty fantastic: slow code can sometimes not affect the overall speed of your program).
I eventually solved it by making a read-only version, and using inline asm to stop the compiler from hoisting strlen out of the loop.
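A sketch of what such a read-only loop can look like in GNU C. This is an assumption about the general technique, not the benchmark's actual source; the function name repeated_strlen is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Sketch: call strlen on the same read-only string `reps` times.
   The empty asm statement claims to modify `s` and memory, so the compiler
   cannot prove the result is loop-invariant and hoist the call. */
size_t repeated_strlen(const char *s, int reps) {
    size_t total = 0;
    for (int i = 0; i < reps; i++) {
        __asm__ volatile("" : "+r"(s) : : "memory");
        total += strlen(s);
    }
    return total;
}
```

The asm statement emits no instructions, so the loop body is still just the strlen work plus the accumulation.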
But store-forwarding is a potential issue with using 16-byte loads. If other C variables are stored past the end of the array, we might hit a SF stall due to loading off the end of the array farther than with narrower stores. For recently-copied data, we're fine if it was copied with 16-byte or wider aligned stores, but glibc memcpy for small copies does 2x overlapping loads that cover the whole object, from the start and end of the object. Then it stores both, again overlapping, handling the memmove src overlaps dst case for free. So the 2nd 16-byte or 8-byte chunk of a short string that was just memcpy'd might give us a SF stall for reading the last chunk. (The one that has the data dependency for the output.)
Just running slower so you don't get to the end before it's ready isn't good in general, so there's no great solution here. I think most of the time you're not going to strlen a buffer you just wrote, usually you're going to strlen an input that you're only reading so store-forwarding stalls aren't a problem. If something else just wrote it, then efficient code hopefully wouldn't have thrown away the length and called a function that required recalculating it.
Other weirdness I haven't totally figured out:
Code alignment is making a factor of 2 difference for read-only, size=1000 (s[1000] = 0;). But the inner-most asm loop itself is aligned with .p2align 4 or .p2align 5. Increasing the loop alignment can slow it down by a factor of 2!
# slow version, with *no* extra HIDE_ALIGNMENT function call before the loop.
# using my hand-written asm, AVX version.
i<1280000 read-only at strlen(s)=1000 so strlen time dominates the total runtime (not startup overhead)
.p2align 5 in the asm inner loop. (32-byte code alignment with NOP padding)
gcc -DUSE_ASM -DREAD_ONLY -DHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk '{sum+= $1} END{print sum/100;}'
Performance counter stats for './a.out' (100 runs):
40.92 msec task-clock # 0.996 CPUs utilized ( +- 0.20% )
2 context-switches # 0.052 K/sec ( +- 3.31% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.008 M/sec ( +- 0.05% )
168,103,223 cycles # 4.108 GHz ( +- 0.20% )
82,293,840 branches # 2011.269 M/sec ( +- 0.00% )
1,845,647 branch-misses # 2.24% of all branches ( +- 0.74% )
412,769,788 instructions # 2.46 insn per cycle ( +- 0.00% )
466,515,986 uops_issued.any # 11401.694 M/sec ( +- 0.22% )
487,011,558 uops_executed.thread # 11902.607 M/sec ( +- 0.13% )
0.0410624 +- 0.0000837 seconds time elapsed ( +- 0.20% )
40326.5 (clock_t)
real 0m4.301s
user 0m4.050s
sys 0m0.224s
Note branch misses definitely non-zero, vs. almost exactly zero for the fast version. And uops issued is much higher than the fast version: it may be speculating down the wrong path for a long time on each of those branch misses.
Probably the inner and outer loop-branches are aliasing each other, or not.
Instruction count is nearly identical, just different by some NOPs in the outer loop ahead of the inner loop. But IPC is vastly different: without problems, the fast version runs an average of 4.82 instructions per clock for the whole program. (Most of that is in the inner-most loop running 5 instructions per cycle, thanks to a test/jz that macro-fuses 2 instructions into 1 uop.) And note that uops_executed is much higher than uops_issued: that means micro-fusion is working well to get more uops through the front-end bottleneck.
fast version, same read-only strlen(s)=1000 repeated 1280000 times
gcc -DUSE_ASM -DREAD_ONLY -UHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk '{sum+= $1} END{print sum/100;}'
Performance counter stats for './a.out' (100 runs):
21.06 msec task-clock # 0.994 CPUs utilized ( +- 0.10% )
1 context-switches # 0.056 K/sec ( +- 5.30% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.015 M/sec ( +- 0.04% )
86,239,943 cycles # 4.094 GHz ( +- 0.02% )
82,285,261 branches # 3906.682 M/sec ( +- 0.00% )
17,645 branch-misses # 0.02% of all branches ( +- 0.15% )
415,286,425 instructions # 4.82 insn per cycle ( +- 0.00% )
335,057,379 uops_issued.any # 15907.619 M/sec ( +- 0.00% )
409,255,762 uops_executed.thread # 19430.358 M/sec ( +- 0.00% )
0.0211944 +- 0.0000221 seconds time elapsed ( +- 0.10% )
20504 (clock_t)
real 0m2.309s
user 0m2.085s
sys 0m0.203s
I think it's just the branch prediction, not other front-end stuff that's a problem. The test/branch instructions aren't getting split across a boundary that would prevent macro-fusion.
Changing .p2align 5 to .p2align 4 reverses them: -UHIDE_ALIGNMENT becomes slow.
This Godbolt binary link reproduces the same padding I'm seeing with gcc8.2.1 on Arch Linux for both cases: 2x 11-byte nopw + a 3-byte nop inside the outer loop for the fast case. It also has the exact source I was using locally.
short strlen read-only micro-benchmarks:
Tested with stuff chosen so it doesn't suffer from branch mispredicts or store-forwarding, and can test the same short length repeatedly for enough iterations to get meaningful data.
strlen=33, so the terminator is near the start of the 3rd 16-byte vector. (Makes my version look as bad as possible vs. the 4-byte version.) -DREAD_ONLY, and i<1280000 as an outer-loop repeat loop.
1933 clock_t: my asm: nice and consistent best-case time (not noisy / bouncing around when re-running the average.) Equal perf with/without -DHIDE_ALIGNMENT, unlike for the longer strlen. The loop branch is much more easily predictable with that much shorter pattern. (strlen=33, not 1000).
3220 clock_t: gcc -O3 call glibc strlen. (-DHIDE_ALIGNMENT)
6100 clock_t: gcc -O3 4-byte loop
37200 clock_t: gcc -O1 repz scasb
So for short strings, my simple inline loop beats a library function call to strlen that has to go through the PLT (call + jmp [mem]), then run strlen's startup overhead that can't depend on alignment.
There were negligible branch-mispredicts, like 0.05% for all the versions with strlen(s)=33. The repz scasb version had 0.46%, but that's out of fewer total branches. No inner loop to rack up many correctly predicted branches.
With branch predictors and code-cache hot, repz scasb is more than 10x worse than calling glibc strlen for a 33-byte string. It would be less bad in real use cases where strlen could branch miss or even miss in code-cache and stall, but straight-line repz scasb wouldn't. But 10x is huge, and that's for a fairly short string.

How to calculate speedup and efficiency in a hybrid CPU and GPU algorithm?

I have an algorithm that I have executed in parallel using only CPU and I have achieved a speedup of 30x. That is, an efficiency equal to 0.93 (efficiency = speedup/cores, i.e. 0.93 = 30/32).
Later I added 2 GPUs (Tesla C2075 of 448 cores each) together to the 32 CPU cores.
To calculate the efficiency including CPUs and GPUs, should I add the amount of GPU cores to the CPU cores? That is, I would calculate the efficiency using 928 cores (32 + 448 + 448 = 928). Or should it be calculated differently?
Speedup and efficiency has been calculated based on what has been said here:
https://software.intel.com/en-us/articles/predicting-and-measuring-parallel-performance

GPUs have bigger "core complex" units called "SM" or "CU", each with tens of pipelines. These are not "very" similar to the SIMD units of a CPU: they can issue commands to these pipelines in parallel from "single-threaded" kernel code.
You have counted CPU "cores" and not SIMD pipelines (of which there are 4 to 16 times as many as cores), so it wouldn't be wrong to count the SM units of Nvidia, the CUs of AMD, the slice subsets of Intel, etc.
Tesla C2075 has 14 SM units so you could add 14 for each GPU (32+14+14).
If you have also used SIMDified code for CPU, then it wouldn't be wrong to count each pipeline of a GPU which is 32 to 192 times the number of SM/CU(like 448 per GPU of yours) (32*SIMD_WIDTH + 448 + 448).
At least this is how I would compute "core efficiency" and "pipeline efficiency". If data transfer to/from GPU is not a bottleneck, efficiency should not drop much after GPUs are added.
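The two ways of counting can be put into a small sketch. The function names are made up for illustration, and any hybrid speedup plugged in would have to be measured; none is given in the question.

```c
/* Sketch: efficiency = speedup / number of processing units. */
double efficiency(double speedup, int units) {
    return speedup / units;
}

/* "Core" counting: CPU cores plus one unit per SM/CU on each GPU. */
int core_units(int cpu_cores, int gpus, int sm_per_gpu) {
    return cpu_cores + gpus * sm_per_gpu;
}

/* "Pipeline" counting: CPU cores times SIMD width plus GPU pipelines. */
int pipeline_units(int cpu_cores, int simd_width, int gpus, int pipes_per_gpu) {
    return cpu_cores * simd_width + gpus * pipes_per_gpu;
}
```

For the CPU-only run, efficiency(30.0, 32) evaluates to 0.9375, which the question truncates to 0.93. For the hybrid run, core_units(32, 2, 14) gives 60 units, and pipeline_units(32, w, 2, 448) gives 32*w + 896 for a SIMD width w.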

When should we use prefetch?

Some CPUs and compilers supply prefetch instructions, e.g. __builtin_prefetch in the GCC documentation. Although there is a comment in GCC's documentation, it's too short for me.
I want to know, in practice, when should we use prefetch? Are there some examples?

This question isn't really about compilers as they're just providing some hook to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats but you can just ignore all these and (carefully) add it directly in assembly code.
Now the real question seems to be "when are prefetches useful", and the answer is: in any scenario where you're bound by memory latency and the access pattern isn't regular and distinguishable enough for the HW prefetcher to capture (organized in a stream or strides), or when you suspect there are too many different streams for the HW to track simultaneously.
Most compilers only very seldom insert their own prefetches for you, so it's basically up to you to play with your code and benchmark how prefetches could be useful.
The link by @Mysticial shows a nice example, but here's a more straightforward one that I think can't be caught by HW:
#include <stdio.h>
#include <sys/timeb.h>
#include <emmintrin.h>

#define N 4096
#define REP 200
#define ELEM int

/* Static so that the 64 MiB array does not overflow the stack. */
static ELEM __attribute__((aligned(4096))) a[N][N];

int main(void) {
    int i, j, k, b;
    const int blksize = 64 / sizeof(ELEM);  /* elements per 64-byte cache line */

    for (i = 0; i < N; ++i) {
        for (j = 0; j < N; ++j) {
            a[i][j] = 1;
        }
    }

    unsigned long long sum = 0;
    struct timeb start, end;
    unsigned long long delta;

    /* Baseline: plain row-major traversal. */
    ftime(&start);
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j++) {
                sum += a[i][j];
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf("Prefetching off: N=%d, sum=%llu, time=%llu\n", N, sum, delta);

    /* Same traversal, plus one prefetch per cache line into the row below. */
    ftime(&start);
    sum = 0;
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j += blksize) {
                for (b = 0; b < blksize; ++b) {
                    sum += a[i][j + b];
                }
                /* A prefetch never faults, so pointing one row past the
                   end on the last row is harmless in practice. */
                _mm_prefetch((const char *)&a[i + 1][j], _MM_HINT_T2);
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf("Prefetching on: N=%d, sum=%llu, time=%llu\n", N, sum, delta);
    return 0;
}
What I do here is traverse each matrix row (enjoying the HW prefetcher's help with the consecutive cache lines), but prefetch ahead the element with the same column index from the next row, which resides in a different page (and which the HW prefetcher should be hard pressed to catch). I sum the data just so that it's not optimized away. The important thing is that I basically just loop over a matrix, which should have been pretty straightforward and simple to detect, and yet I still get a speedup.
Built with gcc 4.8.1 -O3, it gives me an almost 20% boost on an Intel Xeon X5670:
Prefetching off: N=4096, sum=3355443200, time=1839
Prefetching on: N=4096, sum=3355443200, time=1502
Note that I get the speedup even though I made the control flow more complicated (an extra loop nesting level): the branch predictor should easily catch the pattern of that short block-size loop, and it saves executing unneeded prefetches.
Note that Ivy Bridge and onward should have a "next-page prefetcher", so the HW may be able to mitigate that on these CPUs (if anyone has one available and cares to try, I'll be happy to know). In that case I'd modify the benchmark to sum every second row (and the prefetch would look ahead two rows every time); that should confuse the hell out of the HW prefetchers.
Skylake results
Here are some results from a Skylake i7-6700HQ, running at 2.6 GHz (no turbo) with gcc:
Compile flags: -O3 -march=native
Prefetching off: N=4096, sum=28147495993344000, time=896
Prefetching on: N=4096, sum=28147495993344000, time=1222
Prefetching off: N=4096, sum=28147495993344000, time=886
Prefetching on: N=4096, sum=28147495993344000, time=1291
Prefetching off: N=4096, sum=28147495993344000, time=890
Prefetching on: N=4096, sum=28147495993344000, time=1234
Prefetching off: N=4096, sum=28147495993344000, time=848
Prefetching on: N=4096, sum=28147495993344000, time=1220
Prefetching off: N=4096, sum=28147495993344000, time=852
Prefetching on: N=4096, sum=28147495993344000, time=1253
Compile flags: -O2 -march=native
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on: N=4096, sum=28147495993344000, time=1813
Prefetching off: N=4096, sum=28147495993344000, time=1956
Prefetching on: N=4096, sum=28147495993344000, time=1814
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on: N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1961
Prefetching on: N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1965
Prefetching on: N=4096, sum=28147495993344000, time=1814
So using prefetch is either about 40% slower, or 8% faster depending on if you use -O3 or -O2 respectively for this particular example. The big slowdown for -O3 is actually due to a code generation quirk: at -O3 the loop without prefetch is vectorized, but the extra complexity of the prefetch variant loop prevents vectorization on my version of gcc anyway.
So the -O2 results are probably more apples-to-apples, and the benefit is about half (8% speedup vs 16%) of what we saw on Leeor's Westmere. Still it's worth noting that you have to be careful not to change code generation such that you get a big slowdown.
This test probably isn't ideal, in that going int by int implies a lot of CPU overhead rather than stressing the memory subsystem (that's why vectorization helped so much).

On recent Intel chips one reason you apparently might want to use prefetching is to avoid CPU power-saving features artificially limiting your achieved memory bandwidth. In this scenario, simple prefetching can as much as double your performance versus the same code without prefetching, but it depends entirely on the selected power management plan.
I ran a simplified version (code here) of the test in Leeor's answer, which stresses the memory subsystem a bit more (since that's where prefetch will help, hurt or do nothing). The original test stressed the CPU in parallel with the memory subsystem since it added together every int on each cache line. Since typical memory read bandwidth is in the region of 15 GB/s, that's 3.75 billion integers per second, putting a pretty hard cap on the maximum speed (code that isn't vectorized will usually process 1 int or less per cycle, so a 3.75 GHz CPU will be about equally CPU and memory bound).
First, I got results that seemed to show prefetching kicking butt on my i7-6700HQ (Skylake):
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=204, MiB/s=12549
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=200, MiB/s=12800
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=201, MiB/s=12736
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=197, MiB/s=12994
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Eyeballing the numbers, prefetch achieves something a bit above 16 GiB/s and without only about 12.5, so prefetch is increasing speed by about 30%. Right?
Not so fast. Remembering that the powersaving mode has all sorts of wonderful interactions on modern chips, I changed my Linux CPU governor to performance from the default of powersave [1]. Now I get:
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=155, MiB/s=16516
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=152, MiB/s=16842
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=159, MiB/s=16100
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=163, MiB/s=15705
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=161, MiB/s=15900
It's a total toss-up. Both with and without prefetching seem to perform identically. So either hardware prefetching is less aggressive in the high powersaving modes, or there is some other interaction with power saving that behaves differently with the explicit software prefetches.
Investigation
In fact, the difference between prefetching and not is even more extreme if you change the benchmark. The existing benchmark alternates between runs with prefetching on and off, and it turns out that this helped the "off" variant because the speed increase which occurs in the "on" test partly carries over to the subsequent "off" test [2]. If you run only the "off" test you get results around 9 GiB/s:
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=280, MiB/s=9142
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=277, MiB/s=9241
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=285, MiB/s=8982
... versus about 17 GiB/s for the prefetching version:
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=149, MiB/s=17181
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
Prefetching on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
So the prefetching version is almost twice as fast.
Let's take a look at what's going on with perf stat, first for the off version:
Performance counter stats for './prefetch-test off':
2907.485684 task-clock (msec) # 1.000 CPUs utilized
3,197,503,204 cycles # 1.100 GHz
2,158,244,139 instructions # 0.67 insns per cycle
429,993,704 branches # 147.892 M/sec
10,956 branch-misses # 0.00% of all branches
... and the on version:
1502.321989 task-clock (msec) # 1.000 CPUs utilized
3,896,143,464 cycles # 2.593 GHz
2,576,880,294 instructions # 0.66 insns per cycle
429,853,720 branches # 286.126 M/sec
11,444 branch-misses # 0.00% of all branches
The difference is that the version with prefetching on consistently runs at the max non-turbo frequency of ~2.6 GHz (I have disabled turbo via an MSR). The version without prefetching, however, has decided to run at a much lower speed of 1.1 GHz. Such large CPU frequency differences often also reflect a large difference in uncore frequency, which can explain the worse bandwidth.
Now we've seen this before, and it is probably an outcome of the Energy Efficient Turbo feature on recent Intel chips, which try to ramp down the CPU frequency when they determine a process is mostly memory bound, presumably since increased CPU core speed doesn't provide much benefit in those cases. As we can see here, this assumption isn't always true, but it isn't clear to me if the tradeoff is a bad one in general, or perhaps the heuristic only occasionally gets it wrong.
[1] I'm running the intel_pstate driver, which is the default for Intel chips on recent kernels and which implements "hardware p-states", also known as "HWP". Command used: sudo cpupower -c 0,1,2,3 frequency-set -g performance.
[2] Conversely, the slowdown from the "off" test partly carries over into the "on" test, although the effect is less extreme, possibly because the powersaving "ramp up" behavior is faster than "ramp down".

Here's a brief summary of cases that I'm aware of in which software prefetching may prove especially useful. Some may not apply to all hardware.
This list should be read from the point of view that the most obvious place software prefetches could be used is where the stream of accesses can be predicted in software, and yet this case isn't necessarily such an obvious win for SW prefetch because out-of-order processing often ends up having a similar effect since it can execute behind existing misses in order to get more misses in flight.
So this list is more a "in light of the fact that SW prefetch isn't as obviously useful as it might first seem, here are some places it might still be useful anyways", often compared to the alternative of either just letting out-of-order processing do its thing or just using "plain loads" to load some values before they are needed.
Fitting more loads in the out-of-order window
Although out-of-order processing can potentially expose the same type of MLP (Memory-Level Parallelism) as software prefetches, there are limits inherent to the total possible lookahead distance after a cache miss. These include reorder-buffer capacity, load buffer capacity, scheduler capacity and so on. See this blog post for an example of where extra work seriously hinders MLP because the CPU can't run ahead far enough to get enough loads executing at once.
In this case, software prefetch allows you to effectively stuff more loads earlier in the instruction stream. As an example, imagine you have a loop which performs one load and then 20 instructions worth of work on the loaded data, and your CPU has an out-of-order buffer of 100 instructions and the loads are independent from each other (e.g., accessing an array with a known stride).
After the first miss, you can run ahead 99 more instructions, so the window holds 95 non-load and 5 load instructions (including the first load). So your MLP is inherently limited to 5 by the size of the out-of-order buffer. If instead you paired every load with a software prefetch to a location say 6 or more iterations ahead, you'd end up with 90 non-load instructions, 5 loads and 5 software prefetches, and since all of those are memory accesses that can miss independently, you've just doubled your MLP to 10 [2].
There is of course no limit of one additional prefetch per load: you could add more to hit higher numbers, but there is a point of diminishing and then negative returns as you hit the MLP limits of your machine and the prefetches take up resources you'd rather spend on other things.
This is similar to software pipelining, where you load data for a future iteration, and then don't touch that register until after a significant amount of other work. This was mostly used on in-order machines to hide latency of computation as well as memory. Even on a RISC with 32 architectural registers, software-pipelining typically can't place the loads as far ahead of use as an optimal prefetch-distance on a modern machine; the amount of work a CPU can do during one memory latency has grown a lot since the early days of in-order RISCs.
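A minimal sketch of the "pair each load with a prefetch some iterations ahead" idea, assuming GCC or Clang (for __builtin_prefetch). The function name and the PREFETCH_DISTANCE value of 6 are illustrative only; a real distance has to be tuned by benchmarking, and the simple addition stands in for the 20 instructions of work in the example above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch. */
#define PREFETCH_DISTANCE 6

/* Sum every stride-th element of data, issuing one software prefetch
   per demand load for the element PREFETCH_DISTANCE iterations ahead. */
long strided_sum_prefetch(const int *data, size_t n, size_t stride)
{
    assert(stride > 0);
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_DISTANCE * stride;
        if (ahead < n) {
            /* Second argument 0 = prefetch for reading,
               third argument 3 = keep in all cache levels. */
            __builtin_prefetch(&data[ahead], 0, 3);
        }
        sum += data[i];  /* stands in for the load plus its dependent work */
    }
    return sum;
}
```

The bounds check keeps the sketch tidy; since a prefetch does not fault, some implementations drop it to keep the loop body smaller.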
In-order machines
Not all machines are big out-of-order cores: in-order CPUs are still common in some places (especially outside x86), and you'll also find "weak" out-of-order cores that don't have the capability to run ahead very far and so partly act like in-order machines.
On these machines software prefetches may help gain MLP that you wouldn't otherwise be able to access (of course, an in-order machine probably doesn't support a lot of inherent MLP otherwise).
Working around hardware prefetch restrictions
Hardware prefetch may have restrictions which you could work around using software prefetch.
For example, Leeor's answer has an example of hardware prefetch stopping at page boundaries, while software prefetch doesn't have any such restriction.
Another example might be any time that hardware prefetch is too aggressive or too conservative (after all it has to guess at your intentions): you might use software prefetch instead since you know exactly how your application will behave.
Examples of the latter include prefetching discontiguous areas: such as rows in a sub-matrix of a larger matrix: hardware prefetch won't understand the boundaries of the "rectangular" region and will constantly prefetch beyond the end of each row, and then take a bit of time to pick up the new row pattern. Software prefetching can get this exactly right: never issuing any useless prefetches at all (but it often requires ugly splitting of loops).
If you do enough software prefetches, the hardware prefetchers should in theory mostly shut down, because the activity of the memory subsystem is one heuristic they use to decide whether to activate.
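The sub-matrix case can be sketched roughly as follows, assuming GCC or Clang, 64-byte cache lines and 4-byte int. The function name and parameters (ld for the row length of the enclosing matrix, r0/c0 for the top-left corner) are hypothetical. While one row of the rectangle is summed, the matching cache lines of the next row are prefetched, and nothing outside the rectangle is requested.

```c
#include <assert.h>
#include <stddef.h>

enum { INTS_PER_LINE = 16 };  /* assumes 64-byte lines and 4-byte int */

/* Sum a rows x cols rectangle starting at (r0, c0) inside a row-major
   matrix m whose full rows are ld elements long. */
long submatrix_sum(const int *m, size_t ld,
                   size_t r0, size_t c0, size_t rows, size_t cols)
{
    assert(c0 + cols <= ld);
    long sum = 0;
    for (size_t r = 0; r < rows; ++r) {
        const int *row  = m + (r0 + r) * ld + c0;
        /* No next row to prefetch on the last iteration. */
        const int *next = (r + 1 < rows) ? row + ld : NULL;
        for (size_t c = 0; c < cols; ++c) {
            /* One prefetch per cache line, never past the row's end. */
            if (next != NULL && c % INTS_PER_LINE == 0)
                __builtin_prefetch(&next[c], 0, 3);
            sum += row[c];
        }
    }
    return sum;
}
```

Keeping the branch inside the inner loop is the simple form; splitting the loop per cache line, as the text notes, removes the branch at the cost of uglier code.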
Counterpoint
I should note here that software prefetching is not equivalent to hardware prefetching when it comes to possible speedups for cases the hardware prefetching can pick up: hardware prefetching can be considerably faster. That is because hardware prefetching can start working closer to memory (e.g., from the L2) where it has a lower latency to memory and also access to more buffers (in the so-called "superqueue" on Intel chips) and so more concurrency. So if you turn off hardware prefetching and try to implement a memcpy or some other streaming load with pure software prefetching, you'll find that it is likely slower.
Special load hints
Prefetching may give you access to special hints that you can't achieve with regular loads. For example x86 has the prefetchnta, prefetcht0, prefetcht1, and prefetchw instructions which hint to the processor how to treat the loaded data in the caching subsystem. You can't achieve the same effect with plain loads (at least on x86).
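As a hedged illustration of such hints, GCC and Clang expose them through the second and third arguments of __builtin_prefetch (read/write intent and temporal locality). On x86 these are typically lowered to the instructions named above (locality 0 to prefetchnta, write intent to prefetchw where the target supports it), but the exact mapping depends on compiler and target flags. The function names and the AHEAD distance below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    INTS_PER_LINE = 16,  /* assumes 64-byte cache lines and 4-byte int */
    AHEAD         = 64   /* hypothetical look-ahead, in elements */
};

/* Sum a buffer that is read once and not reused: hint with
   locality 0 ("no temporal locality") to limit cache pollution. */
long sum_once(const int *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i % INTS_PER_LINE == 0 && i + AHEAD < n)
            __builtin_prefetch(&buf[i + AHEAD], 0, 0);
        sum += buf[i];
    }
    return sum;
}

/* Increment every counter: hint that the line will be written
   (second argument 1 = prefetch for write). */
void bump_all(int *counters, size_t n)
{
    assert(counters != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (i % INTS_PER_LINE == 0 && i + AHEAD < n)
            __builtin_prefetch(&counters[i + AHEAD], 1, 3);
        counters[i] += 1;
    }
}
```

Both access patterns here are simple streams that hardware prefetchers usually handle well, so the sketch only shows how the hints are spelled, not a case where they are expected to win.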
[2] It's not actually as simple as just adding a single prefetch to the loop, since after the first five iterations, the loads will start hitting already prefetched values, reducing your MLP back to 5, but the idea still holds. A real implementation would also involve reorganizing the loop so that the MLP can be sustained (e.g., "jamming" the loads and prefetches together every few iterations).

There are definitely situations where software prefetch provides significant performance improvements.
For example, if you are accessing a relatively slow memory device such as Optane DC Persistent Memory, which has access times of several hundred nanoseconds, prefetching can reduce effective latency by 50 percent or more if you can do it far enough in advance of the read or write.
This isn't a very common case at present but it will become a lot more common if and when such storage devices become mainstream.

The article "What Every Programmer Should Know About Memory" by Ulrich Drepper discusses situations where prefetching is advantageous:
http://www.akkadia.org/drepper/cpumemory.pdf (warning: this is quite a long article that discusses things like memory architecture, how the CPU works, etc.)
Prefetching helps if the data is aligned to cache lines and if you are loading data that is about to be accessed by the algorithm.
In any event, one should only do this when trying to optimize heavily used code; benchmarking is a must, and things usually work out differently than one might expect.

It seems that the best policy is to never use __builtin_prefetch (and its friend, __builtin_expect) at all. On some platforms those may help (and even help a lot); however, one must always do some benchmarking to confirm this. The real question is whether the short-term performance gains are worth the trouble in the long run.
First, one may ask the following question: what do these statements actually do when fed to a higher-end modern CPU? The answer is: nobody really knows (except, maybe, a few guys on the CPU's core architecture team, but they are not going to tell anybody). Modern CPUs are very complex machines, capable of instruction reordering, speculative execution of instructions across possibly not taken branches, etc. Moreover, the details of this complex behavior may (and will) differ considerably between CPU generations and vendors (Intel Core vs Intel I* vs AMD Opteron; with more fragmented platforms like ARM the situation is even worse).
One neat example (not prefetch related, but still) of CPU functionality which used to speed things up on older Intel CPUs, but sucks badly on the more modern one is outlined here: http://lists-archives.com/git/744742-git-gc-speed-it-up-by-18-via-faster-hash-comparisons.html. In that particular case, it was possible to achieve 18% performance increase by replacing the optimized version of gcc supplied memcmp with an explicit ("naive" so to say) loop.

How to derive the Peak performance in GFlop/s of Intel Xeon E5-2690?

I was able to find the theoretical DP peak performance 371 GFlop/s for the Xeon E5-2690 in this Processor Comparison (interesting that it is easier to find this information in Intel's competitor than Intel support pages itself). However, when I try to derive that peak performance my derivation doesn't match:
The frequency (in Turbo mode) for each core of the Xeon E5-2690 = 3.8 GHz
The processor can do an add and mul operation per cycle so we get: 3.8 x 2 = 7.6
Given it has AVX support it can do 4 double operations per cycle: 7.6 x 4 = 30.4
Finally, it has 8 cores, therefore we get: 8 x 30.4 = 243.2
Thus, the peak performance in Gflop/s would be 243.2 GFlop/s and not 371 GFlop/s?

Turbo mode is not used to calculate theoretical peak performance; you have to consider something like:
CPU speed = 2.9 GHz
CPU cores = 8
Floating-point operations per cycle per core = 8 (AVX-256: a 256-bit unit can hold 8 single-precision values) x 2 (add and mul operations, like you said) = 16
Putting it all together:
2.9 x 8 x 16 = 371 GFlop/s
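The arithmetic above can be written as a small C helper. This is only a sketch of the formula used in this answer (clock x cores x SIMD lanes x operations per lane per cycle); the function and parameter names are hypothetical.

```c
#include <assert.h>

/* Theoretical peak in GFlop/s:
   clock (GHz) x cores x SIMD lanes x FP operations per lane per cycle. */
double peak_gflops(double ghz, int cores, int simd_lanes, int ops_per_lane)
{
    assert(ghz > 0.0 && cores > 0 && simd_lanes > 0 && ops_per_lane > 0);
    return ghz * cores * simd_lanes * ops_per_lane;
}
```

With the answer's inputs, peak_gflops(2.9, 8, 8, 2) gives about 371.2, and the question's own derivation corresponds to peak_gflops(3.8, 8, 4, 2), about 243.2. Under the same assumptions, 4 lanes (doubles in a 256-bit register) at 2.9 GHz would give about 185.6, which suggests that the 371 figure corresponds to single precision.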
