GCC optimization flag -O2 makes code much slower than -O0 [duplicate] - gcc

I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
    char *s = calloc(1 << 20, 1);
    memset(s, 65, 1000000);
    clock_t start = clock();
    for (int i = 0; i < 128; ++i) {
        s[strlen(s)] = 'A';
    }
    clock_t end = clock();
    printf("%lld\n", (long long)(end - start));
    return 0;
}
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations makes the code run longer.

Testing your code on Godbolt's Compiler Explorer provides this explanation:
at -O0 or without optimisations, the generated code calls the C library function strlen;
at -O1 the generated code uses a simple inline expansion using a rep scasb instruction;
at -O2 and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows substantial variations from one run to another, but increasing the number of iterations shows that:
the -O1 code is much slower than the C library implementation: 32240 vs 3090
the -O2 code is faster than the -O1 but still substantially slower than the C library code: 8570 vs 3090.
This behavior is specific to gcc and the GNU libc. The same test on OS/X with clang and Apple's Libc does not show significant differences, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.
This could be considered a bug in gcc/glibc, but more extensive benchmarking might show that the overhead of calling strlen matters more than the poor performance of the inline code for small strings. The strings in your benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1 is always slower, by as much as a factor of 10 for moderately long strings, while -O2 is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen is quite efficient for most string lengths, at least on my specific hardware. Also keep in mind that caching has a major impact on benchmark measurements.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

void benchmark(int repeat, int minlen, int maxlen) {
    char *s = malloc(maxlen + 1);
    memset(s, 'A', minlen);
    long long bytes = 0, calls = 0;
    clock_t clk = clock();
    for (int n = 0; n < repeat; n++) {
        for (int i = minlen; i < maxlen; ++i) {
            bytes += i + 1;
            calls += 1;
            s[i] = '\0';
            s[strlen(s)] = 'A';
        }
    }
    clk = clock() - clk;
    free(s);
    double avglen = (minlen + maxlen - 1) / 2.0;
    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/call\n",
           avglen, ns / bytes, ns / calls);
}

int main() {
    benchmark(10000000, 0, 1);
    benchmark(1000000, 0, 10);
    benchmark(1000000, 5, 15);
    benchmark(100000, 0, 100);
    benchmark(100000, 50, 150);
    benchmark(10000, 0, 1000);
    benchmark(10000, 500, 1500);
    benchmark(1000, 0, 10000);
    benchmark(1000, 5000, 15000);
    benchmark(100, 1000000 - 50, 1000000 + 50);
    return 0;
}
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call

GCC's inline strlen patterns are much slower than what it could do with SSE2 pcmpeqb / pmovmskb, and bsf, given the 16-byte alignment from calloc. This "optimization" is actually a pessimization.
My simple hand-written loop that takes advantage of 16-byte alignment is 5x faster than what gcc -O3 inlines for large buffers, and ~2x faster for short strings. (And faster than calling strlen for short strings). I've added a comment to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 to propose this for what gcc should inline at -O2 / -O3 when it's able. (With a suggestion for ramping up to 16-byte if we only know 4-byte alignment to start with.)
When gcc knows it has 4-byte alignment for the buffer (guaranteed by calloc), it chooses to inline strlen as a 4-byte-at-a-time scalar bithack using GP integer registers (-O2 and higher).
(Reading 4 bytes at a time is only safe if we know we can't cross into a page that doesn't contain any string bytes and thus might be unmapped. See Is it safe to read past the end of a buffer within the same page on x86 and x64? (TL;DR: yes, in asm it is, so compilers can emit code that does that even if doing so in the C source is UB; libc strlen implementations take advantage of that too. See my answer there for links to glibc strlen and a summary of how it runs so fast for large strings.)
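For reference, the classic 4-byte-at-a-time zero-detection bithack looks roughly like this in C. This is a sketch of the general technique, not necessarily GCC's exact expansion; strlen_word is a made-up name, and like the asm it assumes 4-byte alignment and bends ISO C rules (strict aliasing, reading whole words past the terminator):
#include <stddef.h>
#include <stdint.h>

static size_t strlen_word(const char *s)
{
    const uint32_t *w = (const uint32_t *)s;   // assumes s is 4-byte aligned
    for (;;) {
        uint32_t v = *w;
        // non-zero iff some byte of v is 0; the lowest set bit marks the first zero byte
        uint32_t found = (v - 0x01010101u) & ~v & 0x80808080u;
        if (found) {
            // deliberately simple byte-by-byte cleanup; a bit-scan on `found`
            // would be faster (see below)
            const char *p = (const char *)w;
            while (*p)
                p++;
            return (size_t)(p - s);
        }
        w++;
    }
}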
At -O1, gcc always (even without known alignment) chooses to inline strlen as repnz scasb, which is very slow (about 1 byte per clock cycle on modern Intel CPUs). "Fast strings" only applies to rep stos and rep movs, not the repz/repnz instructions, unfortunately. Their microcode just moves 1 byte at a time, and they still have some startup overhead. (https://agner.org/optimize/)
(We can test this by "hiding" the pointer from the compiler by storing / reloading s to a volatile void *tmp, for example. gcc has to make zero assumptions about the pointer value that's read back from a volatile, destroying any alignment info.)
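Such a test might look like the following sketch (variable names are made up; note that for the trick to work the pointer object itself needs to be volatile, so the compiler is forced to reload it and can no longer prove the 16-byte alignment that calloc guarantees):
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *s = calloc(1 << 20, 1);
    memset(s, 'A', 1000000);

    void *volatile tmp = s;       // store the pointer into a volatile object...
    char *hidden = (char *)tmp;   // ...and read it back: the alignment info is gone

    // With unknown alignment, gcc should fall back to calling the library strlen
    // instead of inlining the 4-byte bithack (per the -minline-all-stringops docs
    // quoted below).
    return (int)strlen(hidden);
}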
GCC does have some x86 tuning options like -mstringop-strategy=libcall vs. unrolled_loop vs. rep_byte for inlining string operations in general (not just strlen; memcmp would be another major one that can be done with rep or a loop). I haven't checked what effect these have here.
The docs for another option also describe the current behaviour. We could get this inlining (with extra code for alignment-handling) even in cases where we wanted it on unaligned pointers. (This used to be an actual perf win, especially for small strings, on targets where the inline loop wasn't garbage compared to what the machine can do.)
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to at least a 4-byte boundary. This enables more inlining and increases code size, but may improve performance of code that depends on fast memcpy, strlen, and memset for short lengths.
GCC also has per-function attributes you can apparently use to control this, like __attribute__((no-inline-all-stringops)) void foo() { ... }, but I haven't played around with it. (That's the opposite of inline-all. It doesn't mean inline none, it just goes back to only inlining when 4-byte alignment is known.)
Both of gcc's inline strlen strategies fail to take advantage of 16-byte alignment, and are pretty bad for x86-64
Unless the small-string case is very common, doing one 4-byte chunk, then aligned 8-byte chunks would go about twice as fast as 4-byte.
And the 4-byte strategy has much slower cleanup than necessary for finding the byte within the dword containing the zero byte. It detects this by looking for a byte with its high bit set, so it should just mask off the other bits and use bsf (bit-scan forward). That has 3 cycle latency on modern CPUs (Intel and Ryzen). Or compilers can use rep bsf so it runs as tzcnt on CPUs that support BMI1, which is more efficient on AMD. bsf and tzcnt give the same result for non-zero inputs.
GCC's 4-byte loop looks like it's compiled from pure C, or some target-independent logic, not taking advantage of bitscan. gcc does use andn to optimize it when compiling for x86 with BMI1, but it's still less than 4 bytes per cycle.
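The faster cleanup being suggested is just a bit-scan. In C it might look like this (a sketch with a made-up helper name, relying on the GCC builtin __builtin_ctz, which compiles to bsf/tzcnt):
#include <stdint.h>

// `v` is the 4-byte little-endian chunk known to contain the terminator, so the
// bithack mask is non-zero here. Its lowest set bit is the high bit of the first
// zero byte, so bit-index / 8 is the byte offset within the word.
static unsigned zero_byte_index(uint32_t v)
{
    uint32_t found = (v - 0x01010101u) & ~v & 0x80808080u;
    return __builtin_ctz(found) >> 3;   // bsf/tzcnt, then convert bit index to byte index
}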
SSE2 pcmpeqb + bsf is much, much better for both short and long inputs. x86-64 guarantees that SSE2 is available, and the x86-64 System V ABI has alignof(max_align_t) = 16, so calloc will always return pointers that are at least 16-byte aligned.
I wrote a replacement for the strlen block to test performance
As expected it's about 4x faster on Skylake going 16 bytes at a time instead of 4.
(I compiled the original source to asm with -O3, then edited the asm to see what performance should have been with this strategy for inline expansion of strlen. I also ported it to inline asm inside the C source; see that version on Godbolt.)
# at this point gcc has `s` in RDX, `i` in ECX
pxor %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do {
#ifdef __AVX__
vpcmpeqb (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb %xmm0, %xmm1 # xmm1 = -1 where there was a 0 in memory
#endif
add $16, %rdx # ptr++
pmovmskb %xmm1, %eax # extract high bit of each byte to a 16-bit mask
test %eax, %eax
jz .Lstrlen16 # }while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf %eax, %eax # EAX = bit-index of the lowest set bit
movb $'A', -16(%rdx, %rax)
Note that I optimized part of the strlen cleanup into the store addressing mode: the -16 displacement corrects for the overshoot. This just finds the end of the string rather than actually calculating the length and then indexing, which is what GCC was already doing after inlining its 4-byte-at-a-time loop.
To get actual string length (instead of pointer to the end), you'd subtract rdx-start and then add rax-16 (maybe with an LEA to add 2 registers + a constant, but 3-component LEA has more latency.)
With AVX to allow load+compare in one instruction without destroying the zeroed register, the whole loop is only 4 uops, down from 5. (test/jz macro-fuses into one uop on both Intel and AMD. vpcmpeqb with a non-indexed memory-source can keep it micro-fused through the whole pipeline, so it's only 1 fused-domain uop for the front-end.)
(Note that mixing 128-bit AVX with SSE does not cause stalls even on Haswell, as long as you're in clean-upper state to start with. So I didn't bother about changing the other instructions to AVX, only the one that mattered. There seemed to be some minor effect where pxor was actually slightly better than vpxor on my desktop, though, for an AVX loop body. It seemed somewhat repeatable, but it's weird because there's no code-size difference and thus no alignment difference.)
pmovmskb is a single-uop instruction. It has 3-cycle latency on Intel and Ryzen (worse on Bulldozer-family). For short strings, the trip through the SIMD unit and back to integer is an important part of the critical path dependency chain for latency from input memory bytes to store-address being ready. But only SIMD has packed-integer compares, so scalar would have to do more work.
For the very-small string case (like 0 to 3 bytes), it might be possible to achieve slightly lower latency for that case by using pure scalar (especially on Bulldozer-family), but having all strings from 0 to 15 bytes take the same branch path (loop branch never taken) is very nice for most short-strings use-cases.
Being very good for all strings up to 15 bytes seems like a good choice, when we know we have 16-byte alignment. More predictable branching is very good. (And note that when looping, pmovmskb latency only affects how quickly we can detect branch mispredicts to break out of the loop; branch prediction + speculative execution hides the latency of the independent pmovmskb in each iteration.)
If we expected longer strings to be common, we could unroll a bit, but at that point you should just call the libc function so it can dispatch to AVX2 if available at runtime. Unrolling to more than 1 vector complicates the cleanup, hurting the simple cases.
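For readers more comfortable with intrinsics, here is a rough C equivalent of the 128-bit loop above (a sketch with a made-up name, assuming the 16-byte alignment that calloc guarantees here; it is not what gcc currently emits, and reading aligned 16-byte chunks past the terminator is fine in asm but UB in strict ISO C):
#include <emmintrin.h>   // SSE2
#include <stddef.h>

static size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);  // aligned 16-byte load
        __m128i eq    = _mm_cmpeq_epi8(chunk, zero);          // 0xFF where a byte is 0
        int mask = _mm_movemask_epi8(eq);                     // one bit per byte
        if (mask)
            return (size_t)(p - s) + __builtin_ctz(mask);     // bsf: first zero byte
        p += 16;
    }
}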
On my machine, an i7-6700k Skylake at 4.2GHz max turbo (and energy_performance_preference = performance), with gcc8.2 on Arch Linux, I get somewhat consistent benchmark timing because my CPU clock speed ramps up during the memset. But maybe not always to max turbo; Skylake's hw power management downclocks when memory-bound. perf stat showed that I typically got right around 4.0GHz when running this command to average the stdout output and see the perf summary on stderr:
perf stat -r 100 ./a.out | awk '{sum+= $1} END{print sum/100;}'
I ended up copying my asm into a GNU C inline-asm statement, so I could put the code on the Godbolt compiler explorer.
For large strings, same length as in the question: times on ~4GHz Skylake
~62100 clock_t time units: -O1 rep scas: (clock() is a bit obsolete, but I didn't bother changing it.)
~15900 clock_t time units: -O3 gcc 4-byte loop strategy, avg of 100 runs. (Or maybe ~15800 with -march=native for andn)
~1880 clock_t time units: -O3 with glibc strlen function calls, using AVX2
~3190 clock_t time units: (AVX1 128-bit vectors, 4 uop loop) hand-written inline asm that gcc could/should inline.
~3230 clock_t time units: (SSE2 5 uop loop) hand-written inline asm that gcc could/should inline.
My hand-written asm should be very good for short strings, too, because it doesn't need to branch specially. Known alignment is very good for strlen, and libc can't take advantage of it.
If we expect large strings to be rare, being 1.7x slower than libc for that case is acceptable. The length of 1M bytes means it won't stay hot in L2 (256k) or L1d cache (32k) on my CPU, so even bottlenecked on L3 cache the libc version was faster. (An unrolled loop with 256-bit vectors probably doesn't clog up the ROB with as many uops per byte, so OoO exec can see farther ahead and get more memory parallelism, especially at page boundaries.)
But L3 cache bandwidth is probably a bottleneck stopping the 4-uop version from running at 1 iteration per clock, so we're seeing less benefit from AVX saving us a uop in the loop. With data hot in L1d cache, we should get 1.25 cycles per iteration vs. 1.
But a good AVX2 implementation can read up to 64 bytes per cycle (2x 32 byte loads) using vpminub to combine pairs before checking for zeros and going back to find where they were. The gap between this and libc opens wider for sizes of ~2k to ~30 kiB or so that stay hot in L1d.
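The vpminub trick amounts to something like this (an illustrative sketch with a made-up name, assuming AVX2 and that reading 64 bytes at p is safe; since the unsigned minimum of two chunks is zero in a byte position exactly when either chunk has a zero there, one compare + movemask checks 64 bytes):
#include <immintrin.h>   // AVX2

// Returns a non-zero mask if any of the 64 bytes starting at p is 0.
static int has_zero_64(const char *p)
{
    __m256i a  = _mm256_loadu_si256((const __m256i *)p);
    __m256i b  = _mm256_loadu_si256((const __m256i *)(p + 32));
    __m256i mn = _mm256_min_epu8(a, b);                          // 0 where either chunk had 0
    __m256i eq = _mm256_cmpeq_epi8(mn, _mm256_setzero_si256());
    return _mm256_movemask_epi8(eq);                             // then go back and locate which byte
}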
Some read-only testing with length=1000 indicates that glibc strlen really is about 4x faster than my loop for medium size strings hot in L1d cache. That's large enough for AVX2 to ramp up to the big unrolled loop, but still easily fits in L1d cache. (Read-only avoids store-forwarding stalls, so we can do many iterations.)
If your strings are that big, you should be using explicit-length strings instead of needing to strlen at all, so inlining a simple loop still seems like a reasonable strategy, as long as it's actually good for short strings and not total garbage for medium (like 300 bytes) and very long (> cache size) strings.
Benchmarking small strings with this:
I ran into some oddities in trying to get the results I expected:
I tried s[31] = 0 to truncate the string before every iteration (allowing short constant length). But then my SSE2 version was almost the same speed as GCC's version. Store-forwarding stalls were the bottleneck! A byte store followed by a wider load makes store-forwarding take the slow path that merges bytes from the store buffer with bytes from L1d cache. This extra latency is part of a loop-carried dep chain through the last 4-byte or 16-byte chunk of the string, to calculate the store index for the next iteration.
GCC's slower 4-byte-at-a-time code could keep up by processing the earlier 4-byte chunks in the shadow of that latency. (Out-of-order execution is pretty fantastic: slow code can sometimes not affect the overall speed of your program).
I eventually solved it by making a read-only version, and using inline asm to stop the compiler from hoisting strlen out of the loop.
But store-forwarding is a potential issue with using 16-byte loads. If other C variables are stored past the end of the array, we might hit a SF stall from reading farther off the end of the array than narrower loads would. For recently-copied data, we're fine if it was copied with 16-byte or wider aligned stores, but glibc memcpy for small copies does 2x overlapping loads that cover the whole object, from the start and end of the object. Then it stores both, again overlapping, handling the memmove src-overlaps-dst case for free. So the 2nd 16-byte or 8-byte chunk of a short string that was just memcpy'd might give us a SF stall for reading the last chunk. (The one that has the data dependency for the output.)
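As an aside, the overlapping-chunk technique described above looks roughly like this for a small copy (a sketch of the general idea, not glibc's actual code; copy_9_to_16 is a made-up helper name):
#include <stdint.h>
#include <string.h>

// Copy n bytes where 8 <= n <= 16: two 8-byte chunks loaded from the start and
// the end of the source (overlapping when n < 16), then stored the same way.
// This also handles overlapping src/dst (memmove semantics) for free.
static void copy_9_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;
    memcpy(&head, src, 8);                           // first 8 bytes
    memcpy(&tail, (const char *)src + n - 8, 8);     // last 8 bytes
    memcpy(dst, &head, 8);
    memcpy((char *)dst + n - 8, &tail, 8);
}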
Just running slower so you don't get to the end before it's ready isn't good in general, so there's no great solution here. I think most of the time you're not going to strlen a buffer you just wrote, usually you're going to strlen an input that you're only reading so store-forwarding stalls aren't a problem. If something else just wrote it, then efficient code hopefully wouldn't have thrown away the length and called a function that required recalculating it.
Other weirdness I haven't totally figured out:
Code alignment is making a factor of 2 difference for read-only, size=1000 (s[1000] = 0;). But the inner-most asm loop itself is aligned with .p2align 4 or .p2align 5. Increasing the loop alignment can slow it down by a factor of 2!
# slow version, with *no* extra HIDE_ALIGNMENT function call before the loop.
# using my hand-written asm, AVX version.
i<1280000 read-only at strlen(s)=1000 so strlen time dominates the total runtime (not startup overhead)
.p2align 5 in the asm inner loop. (32-byte code alignment with NOP padding)
gcc -DUSE_ASM -DREAD_ONLY -DHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk '{sum+= $1} END{print sum/100;}'
Performance counter stats for './a.out' (100 runs):
40.92 msec task-clock # 0.996 CPUs utilized ( +- 0.20% )
2 context-switches # 0.052 K/sec ( +- 3.31% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.008 M/sec ( +- 0.05% )
168,103,223 cycles # 4.108 GHz ( +- 0.20% )
82,293,840 branches # 2011.269 M/sec ( +- 0.00% )
1,845,647 branch-misses # 2.24% of all branches ( +- 0.74% )
412,769,788 instructions # 2.46 insn per cycle ( +- 0.00% )
466,515,986 uops_issued.any # 11401.694 M/sec ( +- 0.22% )
487,011,558 uops_executed.thread # 11902.607 M/sec ( +- 0.13% )
0.0410624 +- 0.0000837 seconds time elapsed ( +- 0.20% )
40326.5 (clock_t)
real 0m4.301s
user 0m4.050s
sys 0m0.224s
Note that branch misses are definitely non-zero, vs. almost exactly zero for the fast version. And uops issued is much higher than in the fast version: it may be speculating down the wrong path for a long time on each of those branch misses.
Probably the inner and outer loop branches are aliasing each other in the branch predictor, or maybe not.
Instruction count is nearly identical, just different by some NOPs in the outer loop ahead of the inner loop. But IPC is vastly different: the fast version runs an average of 4.82 instructions per clock for the whole program. (Most of that is in the inner-most loop running 5 instructions per cycle, thanks to a test/jz that macro-fuses 2 instructions into 1 uop.) And note that uops_executed is much higher than uops_issued: that means micro-fusion is working well to get more uops through the front-end bottleneck.
fast version, same read-only strlen(s)=1000 repeated 1280000 times
gcc -DUSE_ASM -DREAD_ONLY -UHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk '{sum+= $1} END{print sum/100;}'
Performance counter stats for './a.out' (100 runs):
21.06 msec task-clock # 0.994 CPUs utilized ( +- 0.10% )
1 context-switches # 0.056 K/sec ( +- 5.30% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.015 M/sec ( +- 0.04% )
86,239,943 cycles # 4.094 GHz ( +- 0.02% )
82,285,261 branches # 3906.682 M/sec ( +- 0.00% )
17,645 branch-misses # 0.02% of all branches ( +- 0.15% )
415,286,425 instructions # 4.82 insn per cycle ( +- 0.00% )
335,057,379 uops_issued.any # 15907.619 M/sec ( +- 0.00% )
409,255,762 uops_executed.thread # 19430.358 M/sec ( +- 0.00% )
0.0211944 +- 0.0000221 seconds time elapsed ( +- 0.10% )
20504 (clock_t)
real 0m2.309s
user 0m2.085s
sys 0m0.203s
I think it's just the branch prediction, not other front-end stuff that's a problem. The test/branch instructions aren't getting split across a boundary that would prevent macro-fusion.
Changing .p2align 5 to .p2align 4 reverses them: -UHIDE_ALIGNMENT becomes slow.
This Godbolt binary link reproduces the same padding I'm seeing with gcc8.2.1 on Arch Linux for both cases: 2x 11-byte nopw + a 3-byte nop inside the outer loop for the fast case. It also has the exact source I was using locally.
short strlen read-only micro-benchmarks:
Tested with stuff chosen so it doesn't suffer from branch mispredicts or store-forwarding, and can test the same short length repeatedly for enough iterations to get meaningful data.
strlen=33, so the terminator is near the start of the 3rd 16-byte vector. (Makes my version look as bad as possible vs. the 4-byte version.) -DREAD_ONLY, and i<1280000 as an outer-loop repeat loop.
1933 clock_t: my asm: nice and consistent best-case time (not noisy / bouncing around when re-running the average.) Equal perf with/without -DHIDE_ALIGNMENT, unlike for the longer strlen. The loop branch is much more easily predictable with that much shorter pattern. (strlen=33, not 1000).
3220 clock_t: gcc -O3 call glibc strlen. (-DHIDE_ALIGNMENT)
6100 clock_t: gcc -O3 4-byte loop
37200 clock_t: gcc -O1 repnz scasb
So for short strings, my simple inline loop beats a library function call to strlen that has to go through the PLT (call + jmp [mem]), then run strlen's startup overhead that can't depend on alignment.
There were negligible branch-mispredicts, like 0.05% for all the versions with strlen(s)=33. The repnz scasb version had 0.46%, but that's out of fewer total branches. No inner loop to rack up many correctly predicted branches.
With branch predictors and code-cache hot, repnz scasb is more than 10x worse than calling glibc strlen for a 33-byte string. It would be less bad in real use cases where strlen could branch-miss or even miss in code-cache and stall, while straight-line repnz scasb wouldn't. But 10x is huge, and that's for a fairly short string.

Related

Loop takes less than 1 cycle despite dependency between iterations

I wanted to benchmark the time needed to do a single addition on my Skylake (i5-6500) CPU. C is low-level enough for me, so I wrote the following code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>  // __rdtscp

int main(void) {
    // Initializing stuff
    int a = rand();
    int b = rand();
    const unsigned long loop_count = 1000000000;
    unsigned int ignored; // used for __rdtscp
    // Warming up whatever needs to be warmed up
    for (int i = 0; i < 100000; i++) {
        asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
        a += b;
    }
    // The actual measurement
    uint64_t timer = __rdtscp(&ignored);
    for (unsigned long i = 0; i < loop_count; i++) {
        asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
        a += b;
    }
    timer = __rdtscp(&ignored) - timer;
    printf("%.2f cycles/iteration\n", (double)timer / loop_count);
    return 0;
}
Compiling with Clang 7.0.0 -O3, I get the following assembly (for the loop only):
# %bb.2:
rdtscp
movq %rdx, %rdi
movl %ecx, 4(%rsp)
shlq $32, %rdi
orq %rax, %rdi
movl $1000000000, %eax # imm = 0x3B9ACA00
.p2align 4, 0x90
.LBB0_3: # =>This Inner Loop Header: Depth=1
#APP
#NO_APP
addl %esi, %ebx
addq $-1, %rax
jne .LBB0_3
# %bb.4:
rdtscp
And running this code outputs
0.94 cycles/iteration
(or a number pretty much always between 0.93 and 0.96)
I'm surprised that this loop can execute in less than 1 cycle/iteration, since there is a data dependency on a that should prevent parallel execution of a += b.
IACA also confirms that the expected throughput is 0.96 cycles. llvm-mca on the other hand predicts a total of 104 cycles to execute 100 iterations of the loop. (I can edit in the traces if needed; let me know)
I observe a similar behavior when I use SSE registers rather than general purpose ones.
I can imagine that the CPU is smart enough to notice that b is constant and since addition is commutative, it could unroll the loop and optimize the additions somehow. However, I've never heard nor read anything about this. And furthermore, if this were what's going on, I'd expect better performance (i.e. fewer cycles/iteration) than 0.94 cycles/iteration.
What is going on? How is this loop able to execute in less than 1 cycle per iteration?
Some background, for completeness. Ignore the remainder of the question if you're not interested in why I'm trying to benchmark a single addition.
I know that there are tools (llvm-exegesis for instance) designed to benchmark a single instruction and that I should use them instead (or just look at Agner Fog's docs). However, I'm actually trying to compare three different additions: one doing a single addition in a loop (the object of my question); one doing 3 additions per loop (on SSE registers, which should maximize port usage and not be limited by data dependencies); and one where the addition is implemented as a circuit in software. While the results are mostly as I expected, the 0.94 cycles/iteration for the version with a single addition in a loop left me puzzled.
The core frequency and the TSC frequency can be different. Your loop is expected to run at 1 core cycle per iteration. If the core frequency happens to be twice the TSC frequency for the duration of the loop execution, the throughput would be 0.5 TSC cycles per iteration, which is equivalent to 1 core cycle per iteration.
In your case, it appears that the average core frequency was slightly higher than the TSC frequency. If you don't want to take dynamic frequency scaling into account when doing experiments, it'd be easier to just fix the core frequency to be equal to the TSC frequency so that you don't have to convert the numbers. Otherwise, you'd have to measure the average core frequency as well.
On processors that support per-core frequency scaling, you have to either fix the frequency on all the cores or pin the experiments to a single core with fixed frequency. Alternatively, instead of measuring in TSC cycles, you can use a tool like perf to easily measure the time in core cycles or seconds.
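The conversion itself is just a frequency ratio; a trivial helper to illustrate the arithmetic (the frequencies are values you would have to measure yourself):
// core cycles = TSC cycles * (core frequency / TSC frequency).
// E.g. 0.94 TSC cycles/iter with the core running ~6% above the TSC frequency
// works out to ~1.0 core cycle per iteration.
static double core_cycles(double tsc_cycles, double core_hz, double tsc_hz)
{
    return tsc_cycles * core_hz / tsc_hz;
}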
See also: How to get the CPU cycle count in x86_64 from C++?.

MOVSD performance depends on arguments

I just noticed that pieces of my code exhibit different performance when copying memory. A test showed that memory copy performance degrades if the address of the destination buffer is greater than the address of the source. Sounds ridiculous, but the following code shows the difference (Delphi):
const MEM_CHUNK = 50 * 1024 * 1024;
ROUNDS_COUNT = 100;
LpSrc := VirtualAlloc(0,MEM_CHUNK,MEM_COMMIT,PAGE_READWRITE);
LpDest := VirtualAlloc(0,MEM_CHUNK,MEM_COMMIT,PAGE_READWRITE);
QueryPerformanceCounter(LTick1);
for i := 0 to ROUNDS_COUNT - 1 do
CopyMemory(LpDest,LpSrc,MEM_CHUNK);
QueryPerformanceCounter(LTick2);
// show timings
QueryPerformanceCounter(LTick1);
for i := 0 to ROUNDS_COUNT - 1 do
CopyMemory(LpSrc,LpDest,MEM_CHUNK);
QueryPerformanceCounter(LTick2);
// show timings
Here CopyMemory is based on MOVSD. The results:
Starting Memory Bandwidth Test...
LpSrc 0x06FC0000
LpDest 0x0A1C0000
src->dest Transfer: 5242880000 bytes in 1,188 sec @4,110 GB/s.
dest->src Transfer: 5242880000 bytes in 0,805 sec @6,066 GB/s.
src->dest Transfer: 5242880000 bytes in 1,142 sec @4,275 GB/s.
dest->src Transfer: 5242880000 bytes in 0,832 sec @5,871 GB/s.
Tried on two systems, the results are consistent no matter how many times repeated.
Never saw anything like that. Was unable to google it. Is this a known behavior? Is this just another cache-related peculiarity?
Update:
Here are the final results with page-aligned buffers and forward direction of MOVSD (DF=0):
Starting Memory Bandwidth Test...
LpSrc 0x06F70000
LpDest 0x0A170000
src->dest Transfer: 5242880000 bytes in 0,781 sec @6,250 GB/s.
dest->src Transfer: 5242880000 bytes in 0,731 sec @6,676 GB/s.
src->dest Transfer: 5242880000 bytes in 0,750 sec @6,510 GB/s.
dest->src Transfer: 5242880000 bytes in 0,735 sec @6,640 GB/s.
src->dest Transfer: 5242880000 bytes in 0,742 sec @6,585 GB/s.
dest->src Transfer: 5242880000 bytes in 0,750 sec @6,515 GB/s.
... and so on.
Here the transfer rates are constant.
Normally fast-strings or ERMSB microcode makes rep movsb/w/d/q and rep stosb/w/d/q fast for large counts (copying in 16, 32, or maybe even 64-byte chunks). And possibly with an RFO-avoiding protocol for the stores. (Other repe/repne scas/cmps are always slow).
Some conditions of the inputs can interfere with that best-case, notably having DF=1 (backward) instead of the normal DF=0.
rep movsd performance can depend on alignment of src and dst, including their relative misalignment. Apparently having both pointers = 32*n + same is not too bad, so most of the copy can be done after reaching an alignment boundary. (Absolute misalignment, but the pointers are aligned relative to each other. i.e. dst-src is a multiple of 32 or 64 bytes).
Performance does not depend on src > dst or src < dst per se. If the pointers are within 16 or 32 bytes of overlapping, that can also force a fall-back to 1 element at a time.
Intel's optimization manual has a section about memcpy implementations and comparing rep movs with well-optimized SIMD loops. Startup overhead is one of the biggest downsides for rep movs, but so are misalignments that it doesn't handle well. (IceLake's "fast short rep" feature presumably addresses that.)
I did not disclose the CopyMemory body - and it indeed used copying backwards (df=1) when avoiding overlaps.
Yup, there's your problem. Only copy backwards if there would be actual overlap you need to avoid, not just based on which address is higher. And then do it with SIMD vectors, not rep movsd.
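The decision boils down to something like this (a C sketch of the overlap test only, with a made-up helper name and plain byte loops standing in for the real SIMD copy):
#include <stddef.h>

// Copy forwards unless the destination starts inside the source region,
// which is the only case where a backward copy is actually required.
static void copy_mem(void *dst, const void *src, size_t n)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    if (d > s && d < s + n) {
        for (size_t i = n; i-- > 0; )   // backward: real code would use SIMD here too
            d[i] = s[i];
    } else {
        for (size_t i = 0; i < n; i++)  // forward: the fast, common case
            d[i] = s[i];
    }
}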
rep movsd is only fast with DF=0 (ascending addresses), at least on Intel CPUs. I just checked on Skylake: 1000000 reps of copying 4096 non-overlapping bytes from page-aligned buffers with rep movsb runs in:
174M cycles with cld (DF=0 forwards). About 42ms at about 4.1GHz, or about 90GiB/s L1d read+write bandwidth achieved. About 23 bytes per cycle, so startup overhead of each rep movsb seems to be hurting us. An AVX copy loop should achieve close to 32 bytes per cycle with this easy case of pure L1d cache hits, even with a branch mispredict on loop exit from an inner loop.
4161M cycles with std (DF=1 backwards). About 1010ms at about 4.1GHz, or about 3.77GiB/s read+write. About 0.98 bytes / cycle, consistent with rep movsb being totally un-optimized. (1 count per cycle, so rep movsd would be about 4x that bandwidth with cache hits.)
uops_executed perf counter also confirms that it's spending many more uops when copying backwards. (This was inside a dec ebp / jnz loop in long mode under Linux. The same test loop as Can x86's MOV really be "free"? Why can't I reproduce this at all? built with NASM, with the buffers in the BSS. The loop did cld or std / 2x lea / mov ecx, 4096 / rep movsb. Hoisting cld out of the loop didn't make much difference.)
You were using rep movsd which copies 4 bytes at a time, so for backwards copying we can expect 4 bytes / cycle if they hit in cache. And you were probably using large buffers so cache misses bottleneck the forward direction to not much faster than backwards. But the extra uops from backward copy would hurt memory parallelism: fewer cache lines are touched by the load uops that fit in the out-of-order window. Also, some prefetchers work less well going backwards, in Intel CPUs. The L2 streamer works in either direction, but I think L1d prefetch only goes forward.
Related: Enhanced REP MOVSB for memcpy Your Sandybridge is too old for ERMSB, but Fast Strings for rep movs/rep stos has existed since original P6. Your Clovertown Xeon from ~2006 is pretty much ancient by today's standards. (Conroe/Merom microarchitecture). Those CPUs might be so old that a single core of a Xeon can saturate the meagre memory bandwidth, unlike today's many-core Xeons.
My buffers were page-aligned. For downward, I tried having the initial RSI/RDI point to the last byte of a page so the initial pointers were not aligned but the total region to be copied was. I also tried lea rdi, [buf+4096] so the starting pointers were page-aligned, so [buf+0] didn't get written. Neither made backwards copy any faster; rep movs is just garbage with DF=1; use SIMD vectors if you need to copy backwards.
Usually a SIMD vector loop can be at least as fast as rep movs, if you can use vectors as wide as the machine supports. That means having SSE, AVX, and AVX512 versions... In portable code without runtime dispatching to a memcpy implementation tuned for the specific CPU, rep movsd is often pretty good, and should be even better on future CPUs like IceLake.
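To make the "SIMD vector loop" concrete, here is a minimal SSE2 sketch (a made-up helper; it assumes non-overlapping buffers and n a multiple of 16, and a real memcpy adds alignment handling, tails, wider vectors, and possibly NT stores depending on size):
#include <emmintrin.h>   // SSE2
#include <stddef.h>

static void copy_forward_sse2(void *dst, const void *src, size_t n)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));  // unaligned 16-byte load
        _mm_storeu_si128((__m128i *)(d + i), v);                 // unaligned 16-byte store
    }
}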
You don't actually need page alignment for rep movs to be fast. IIRC, 32-byte aligned source and destination is sufficient. But also 4k aliasing could be a problem: if dst & 4095 is slightly higher than src & 4095, the load uops might internally have to wait some extra cycles for the store uops because the fast-path mechanism for detecting when a load is reloading a recent store only looks at page-offset bits.
Page alignment is one way to make sure you get the optimal case for rep movs, though.
Normally you get best performance from a SIMD loop, but only if you use SIMD vectors as wide as the machine supports (like AVX, or maybe even AVX512). And you should choose NT stores vs. normal depending on the hardware and the surrounding code.

Does a series of x86 call/ret instructions form a dependent chain?

Consider the following x86-64 assembly:
inner:
...
ret
outer:
.top:
call inner
dec rdi
jnz .top
ret
The function outer simply repeatedly makes a call to the function inner (whose body isn't shown - it may be empty).
Does the series of call instructions in outer, and the corresponding ret instructions inside inner form a dependent chain in practice (for the purposes of estimating performance)?
There is more than one way this chain could be formed. For example, does the ret depend on the latency of the preceding call instruction and then does the subsequent call instruction depend on the ret, forming a call -> ret -> call chain? Or perhaps the ret is independent but the call is not, forming a call -> call chain? If there is a chain, is it through memory, a register, the stack engine, the return address predictor1, or what?
Motivation: This question originated from a series of comments on another question, mostly this comment and earlier ones.
1 The terminology might be somewhat unclear here: the stack engine is normally understood to handle transforming rsp-modifying instructions into a single access with an appropriate offset, so that push rax; push rbx might be transformed into something like mov [t0], rax; mov [t0 - 8], rbx where t0 is some temporary register that captured the value of rsp at some point. It is also understood to handle a similar transformation for call and ret instructions, which both modify the stack in a way similar to push and pop, as well as including a direct (respectively indirect) jump. The CPU also includes a mechanism to predict that return indirect jump, which some lump under "stack engine" - but here I'm separating that out into "return address predictor".
No, branch-prediction + speculative execution break the store/reload dependency.
RIP is (speculatively) known by the front-end, from the return-address predictor. The next call instruction can thus push a return address without waiting for the ret to execute (and actually load and verify the correctness of the predicted return address against the data from the stack).
Speculative stores can enter the store buffer and be store-forwarded.
There is of course a dependency chain, but it's not loop-carried. Out-of-order execution hides it by keeping many iterations in flight.
Proof: call's store breaks what would otherwise be a loop-carried memory dependency chain.
align 64
global _start
_start:
mov ebp, 250000000 ; I had been unrolling by 4, should have changed this to 5000... before measuring, but forgot.
align 32
.mainloop:
call delay_retaddr
call delay_retaddr
dec ebp
jg .mainloop
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)
;; Placing this function *before* _start, or increasing the alignment,
;; makes it somewhat slower!
align 32
delay_retaddr:
add qword [rsp], 0
add qword [rsp], 0 ; create latency for the ret addr
ret
Assemble and link with yasm -felf64 -Worphan-labels -gdwarf2 foo.asm && ld -o foo foo.o, producing a static ELF binary.
Profiled (on an i7-6700k) with ocperf.py, I get 0.99 instructions per core clock cycle:
$ taskset -c 3 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,dsb2mite_switches.penalty_cycles -r2 ./foo
Performance counter stats for './foo' (2 runs):
645.770390 task-clock (msec) # 1.000 CPUs utilized ( +- 0.05% )
1 context-switches # 0.002 K/sec ( +-100.00% )
0 cpu-migrations # 0.000 K/sec
2 page-faults # 0.004 K/sec ( +- 20.00% )
2,517,412,984 cycles # 3.898 GHz ( +- 0.09% )
1,250,159,413 branches # 1935.919 M/sec ( +- 0.00% )
2,500,838,090 instructions # 0.99 insn per cycle ( +- 0.00% )
4,010,093,750 uops_issued_any # 6209.783 M/sec ( +- 0.03% )
7,010,150,784 uops_executed_thread # 10855.485 M/sec ( +- 0.02% )
62,838 dsb2mite_switches_penalty_cycles # 0.097 M/sec ( +- 30.92% )
0.645899414 seconds time elapsed ( +- 0.05% )
With the called function before _start, and alignment values of 128, IPC can go down from 0.99 to 0.84, which is super-weird. Counts for dsb2mite switches are still low-ish, so it's mostly still running from the uop cache, not the legacy decoders. (This Skylake CPU has the microcode update that disables the loop buffer, in case that would be relevant with all this jumping.)
To sustain good throughput, the CPU has to keep many iterations of the inner loop in flight because we've significantly lengthened the independent dep chains that need to overlap.
Changing the add [rsp], 0 instructions to [rsp+16] creates a loop-carried dependency chain on a different location, which isn't being stored to by call. So the loop bottlenecks on that store-forwarding latency and runs at ~half speed.
# With add qword [rsp+16], 0
Performance counter stats for './foo' (2 runs):
1212.339007 task-clock (msec) # 1.000 CPUs utilized ( +- 0.04% )
2 context-switches # 0.002 K/sec ( +- 60.00% )
0 cpu-migrations # 0.000 K/sec
2 page-faults # 0.002 K/sec
4,727,361,809 cycles # 3.899 GHz ( +- 0.02% )
1,250,292,058 branches # 1031.306 M/sec ( +- 0.00% )
2,501,537,152 instructions # 0.53 insn per cycle ( +- 0.00% )
4,026,138,227 uops_issued_any # 3320.967 M/sec ( +- 0.02% )
7,026,457,222 uops_executed_thread # 5795.786 M/sec ( +- 0.01% )
230,287 dsb2mite_switches_penalty_cycles # 0.190 M/sec ( +- 68.23% )
1.212612110 seconds time elapsed ( +- 0.04% )
Note that I'm still using an RSP-relative address so there's still a stack-sync uop. I could have kept both cases the same and avoided it in both by using an address relative to a different register (e.g. rbp) to address the location where call/ret store/reload the return address.
I don't think the variable latency of store-forwarding (worse in simple back-to-back reload right away cases) is sufficient to explain the difference. Adding a redundant assignment speeds up code when compiled without optimization. This is a factor of 2 speedup from breaking the dependency. (0.99 IPC vs. 0.53 IPC, with the same instructions just different addressing mode.)
The instructions are 1 byte longer with the disp8 in the addressing mode, and there was front-end weirdness with alignment in the faster version, but moving things around doesn't seem to change anything with the [rsp+16] version.
Using a version that creates a store-forwarding stall (with add dword [rsp], 0) makes the dep chain too long for OoO exec to hide easily. I didn't play around with this a huge amount.

reduction is very slow in OpenMP

I am doing some optimization on my code. First I parallelized it with OpenMP. Then I used the optimization flags provided by the GNU GCC compiler. I also included an SSE instruction to compute the inverse square root. But I finally realized that the problem is the last operation, when each thread writes the result into the reduction variable: it takes ~80% of the time. Here is the parallel loop:
time(&t5);
# pragma omp parallel for shared(NTOT) private(dx,dy,d,H,V,E,F,G,K) reduction(+:dU)
for(j = 1; j <= NTOT; j++){
    if(!(j-i)) continue;
    dx = (X[2*j-2]-X[2*i-2])*a;
    dy = (X[2*j-1]-X[2*i-1])*a;
    d = rsqrtSSE(dx*dx+dy*dy);
    H = D*d*d*d;
    V = dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1];
    E = dS[0]*dx+dS[1]*dy;
    F = spin[2*j-2]*dx+spin[2*j-1]*dy;
    G = -3*d*d*E*F;
    K = H*(V+G);
    dU += K;
}
time(&t6);
t_loop = difftime(t6, t5);
where rsqrtSSE() is a function based on the _mm_rsqrt_ps(__m128 X) intrinsic from xmmintrin.h.
Is there a solution to overcome this problem, or is this due to a bandwidth limitation?
I compile with gcc -o prog prog.c -lm -fopenmp -O3 -ffast-math -march=native
Here is some info about my computer:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
Stepping: 1
CPU MHz: 849.382
CPU max MHz: 2600.0000
CPU min MHz: 800.0000
BogoMIPS: 4589.17
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
and with turbo boost:
CPU Avg_MHz %Busy Bzy_MHz TSC_MHz
- 2294 99.97 2300 2295
0 2295 100.00 2300 2295
1 2295 100.00 2300 2295
2 2292 99.87 2300 2295
3 2295 100.00 2300 2295
Your measurement is flawed.
The first approach, removing the line altogether, allows the compiler to optimize away most of the computations, simply because the result is unused.
The second approach, if I understand it correctly, was to place timing instructions inside the loop itself, e.g. before/after dU += K. Unfortunately this also has no hope of producing meaningful results. Any library timing call is orders of magnitude slower than this operation, so you basically measure the time it takes to get the time. You can try that by repeating multiple timing calls and comparing the difference.
From what I have seen, I suspect that OpenMP implementations simply keep dU as a thread-private variable and only perform the reduction with atomic/locked operations after the loop has completed.
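Conceptually, the reduction clause behaves roughly like the following self-contained sketch (an illustration of the idea, not the code any particular OpenMP runtime generates): each thread accumulates privately, and the shared variable is touched only once per thread.
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;              // thread-private accumulator
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            local += (double)i * 0.5;    // no sharing inside the hot loop
        #pragma omp atomic
        sum += local;                    // one atomic add per thread, after the loop
    }
    printf("%f\n", sum);
    return 0;
}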
A better approach to determine the performance of individual lines is to use sampling. Take a look at the perf tool for Linux; it can be very helpful. The results still may not always be perfectly reliable because there can be bias. Due to compiler optimizations (e.g. unrolling) and hardware optimizations (e.g. pipelining), multiple lines of code can be executing at the same time, so it is very difficult to say how much time a single line of code takes.
One approach to see if you have a problem is to try to figure out the theoretical hardware performance. For _mm_rsqrt_ps that seems difficult. You can also take a look at hardware performance counters to see if you have a lot of cache misses, etc. perf can also help you with that.

Why isn't MOVNTI slower, in a loop storing repeatedly to the same address?

section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times 81920 db "A"
According to perf it runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, it has to be stored in memory (RAM) so it should be slow.
P.S. Is there any loop-carried dependency?
EDIT
section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array+rcx], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times n dq 0
Now each iteration takes 5 cycles. Why? After all, there is still no loop-carried dependency.
movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.
I think movnti keeps writing into the same fill buffer, and it's not getting flushed very often because there are no other loads or stores happening. (That link is about copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)
So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.
DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
We can further prove this is the case by running the test on multiple cores at once. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.
The reason your experiment doesn't show your loop running at 4 instructions per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k cycles barely accounts for the startup overhead (which perf's timing includes).
For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.
A more reasonable microbenchmark shows one iteration (one movnti) per cycle:
global _start
_start:
xor ecx,ecx
.begin:
movnti [array], eax
dec ecx
jnz .begin ; 2^32 iterations
mov eax, 60 ; __NR_exit
xor edi,edi
syscall ; exit(0)
section .bss
array resb 81920
(asm-link is a script I wrote)
$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address
Performance counter stats for './movnti-same-address':
1835.056710 task-clock (msec) # 0.995 CPUs utilized
4,398,731,563 cycles # 2.397 GHz
12,891,491,495 instructions # 2.93 insns per cycle
1.843642514 seconds time elapsed
Running in parallel:
$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &
real 0m1.844s / user 0m1.828s # running alone
[1] 12523
[2] 12524
peter@tesla:~/src/SO$
real 0m1.855s / user 0m1.824s # running together
real 0m1.984s / user 0m1.808s
# output compacted by hand to save space
I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)
zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.
However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.
P.S. Is there any loop-carried dependency?
Yes, the loop counter. (1 cycle). BTW, you could have saved an insn by counting down towards zero with dec / jg instead of counting up and having to use a cmp.
The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".
This is called an architectural hazard. I think the term still applies when talking about memory, rather than registers.
The result is plausible. Your loop code consists of the following instructions. According to Agner Fog's instruction tables, these have the following timings:
Instruction          Operands   Fused uops   Unfused uops   Ports      Latency   Reciprocal Throughput
---------------------------------------------------------------------------------------------------------------------------
MOVNTI m,r 2 2 p23 p4 ~400 1
ADD r,r/i 1 1 p0156 1 0.25
CMP r,r/i 1 1 p0156 1 0.25
Jcc short 1 1 p6 1 1-2 if predicted that the jump is taken
Fused CMP+Jcc short 1 1 p6 1 1-2 if predicted that the jump is taken
So
MOVNTI consumes 2 uOps, 1 in port 2 or 3 and one in port 4
ADD consumes 1 uOps in port 0 or 1 or 5 or 6
CMP and Jcc macro-fuse to the last line in the table resulting in a consumption of 1 uOp
Because neither ADD nor CMP+Jcc depends on the result of MOVNTI, they can be executed (nearly) in parallel on recent architectures, for example using ports 1, 2, 4, and 6. The worst case would be a latency of 1 between ADD and CMP+Jcc.
This is most likely a design error in your code: you're essentially writing to the same address [array] 100000 times, because you do not adjust the address.
The repeated writes can even go to the L1-cache under the condition that
The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region.
but that doesn't appear to be the case here and won't make a great difference anyway, because even if the writes go to memory, the memory speed will be the limiting factor.
For example, if you have a 3GHz CPU and 1600MHz DDR3-RAM this will result in 3/1.6 = 1.875 CPU cycles per memory cycle. This seems plausible.

Resources