Variations in measurements of parallel code - parallel-processing

I'm seeing far too much variation between runs of the sequential and, especially, the parallel code. For example, the sequential version takes 418 s. The parallel versions take (times in seconds):
2 threads - 250.630453 ; 339.735046 ; 256.153005 ; 256.153005 ; 311.177856
4 threads - 119.442949 ; 116.032005 ; 165.095566 ; 149.539717 ; 180.880198
8 threads - 73.856070 ; 68.082326 ; 76.318023 ; 68.922623 ; 55.321316
16 threads - 56.687378 ; 45.672769 ; 48.757555 ; 42.978104 ; 36.978891
32 threads - 24.421824 ; 21.459057 ; 23.815743 ; 24.936219 ; 24.581316
64 threads - 14.789693 ; 15.312125 ; 16.770807 ; 13.371806 ; 14.282328
The machine has 2 sockets, 32 physical cores in total (Intel Xeon E5-2698 v3), and hyperthreading. There are no other user processes running on the machine.
How normal is this? Some runs vary by more than 55%. The parallel code does interfere with the convergence rate of the algorithm (which is iterative), but not to this extent. In particular, I ran this very same code on another computer and it was far more stable. I haven't yet tried other parallel codes to see how stable they are.
EDIT: Forgot to say that (1) the sequential version has a lot of variation itself (at least 20%), and (2) I tried all combinations of affinities, and neither stability nor performance became consistently better.
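For reference, the kind of affinity pinning meant above can be done per run with standard Linux tools (illustrative commands only; ./solver is a placeholder name, since the question doesn't name the program):
taskset -c 0-31 ./solver                        # restrict threads to the 32 physical cores
numactl --cpunodebind=0 --membind=0 ./solver    # keep threads and their memory on one socket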

Related

Perf Result Conflict During Multiplexing

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor (Linux 4.15.0-20-generic kernel). In a relatively idle situation, I ran the following Perf commands, and their outputs are shown below. The counters are offcore_response.all_data_rd.l3_miss.local_dram, offcore_response.all_code_rd.l3_miss.local_dram and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss
^C
Performance counter stats for 'system wide':
229,579 offcore_response.all_data_rd.l3_miss.local_dram (99.72%)
489,151 offcore_response.all_code_rd.l3_miss.local_dram (99.77%)
110,543 mem_load_uops_retired.l3_miss (99.79%)
2.868899111 seconds time elapsed
As can be seen, event multiplexing occurred due to PMU sharing among these three events. In a similar scenario, I used the same command, except that I appended :D (mentioned in The Linux perf Event Scheduling Algorithm) to prevent multiplexing for the third event:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D
^C
Performance counter stats for 'system wide':
539,397 offcore_response.all_data_rd.l3_miss.local_dram (68.71%)
890,344 offcore_response.all_code_rd.l3_miss.local_dram (68.67%)
193,555 mem_load_uops_retired.l3_miss:D
2.853095575 seconds time elapsed
But adding :D leads to much larger values for all counters, and this seems to happen only when event multiplexing occurs. Is this output normal? Are the percentage values in parentheses valid? How can the differences in counter values be prevented?
UPDATE:
I also traced the following loop implementation:
#include <iostream>
using namespace std;

int main()
{
    for (unsigned long i = 0; i < 3 * 1e9; i++)
        ;
    return 0;
}
This time Perf was executed 7 times (both with and without :D), but without the -a option. The commands are as follows:
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop
and
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop
The values of the three counters are compared in the following figures (omitted here): for all_data_rd, :D has almost no effect; for all_code_rd it reduces the values; and for load_uops_retired it produces larger values.
UPDATE 2:
Based on Peter Cordes's comments, I used a memory-intensive program as follows:
#include <iostream>
#include <cstring>
#define DIM_SIZE 10000
using namespace std;

char from[DIM_SIZE][DIM_SIZE], to[DIM_SIZE][DIM_SIZE];

int main()
{
    for (char x = 'a'; x <= 'z'; x++)
    {
        // set the 100-million-element char array 'from' to x
        for (int i = 0; i < DIM_SIZE; i++)
            memset(from[i], x, DIM_SIZE);
        // copy the entire char array 'from' to char array 'to'
        for (int i = 0; i < DIM_SIZE; i++)
            memcpy(to[i], from[i], DIM_SIZE);
    }
    return 0;
}
The following Perf commands and outputs show that the counter values are almost the same:
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop
Performance counter stats for './loop':
19,836,745 offcore_response.all_data_rd.l3_miss.local_dram (50.04%)
47,309 offcore_response.all_code_rd.l3_miss.local_dram (49.96%)
6,556,957 mem_load_uops_retired.l3_miss:D
0.592795335 seconds time elapsed
and
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop
Performance counter stats for './loop':
18,742,540 offcore_response.all_data_rd.l3_miss.local_dram (66.64%)
76,854 offcore_response.all_code_rd.l3_miss.local_dram (66.64%)
6,967,919 mem_load_uops_retired.l3_miss (66.72%)
0.575828303 seconds time elapsed
In your first two tests, you're not doing anything to generate a consistent amount of off-core memory traffic, so all you're measuring is the background "noise" on a mostly idle system. (System-wide for the first test with -a; for the second, just interrupt handlers which run some code that does miss in cache while this task is current on a logical core.)
You never said whether those differences are repeatable across runs, e.g. with perf stat ... -r5 to re-run the same test 5 times and print the average and variance. I'd expect it's not repeatable, just random fluctuation of background stuff like network and keyboard/mouse interrupts, but if the :D version makes a consistent or statistically significant difference, that might be interesting.
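For example, reusing the event list from the question (a sketch; ./loop is the test binary from the update above):
sudo perf stat -r 5 -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop
perf then prints each count averaged over the 5 runs, with a +- percentage showing the run-to-run spread at a glance.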
In your 2nd test, the loop you used won't create any extra memory traffic; it will either run purely in registers or, if compiled without optimization, load/store the same single cache line, so everything will hit in L1d cache. You still get a tiny number of counts from it, mostly from the CRT startup and libc init code. From your main itself, probably one code fetch and zero data loads that miss in L3, since stack memory was probably already hot in cache.
As I commented, a loop that does some memcpy or memset might make sense; alternatively, a simple known-good, already-written example is the STREAM benchmark. So that third test is finally something useful.
The code-read counts in the 3rd test are also just idle noise: glibc memset and memcpy are unrolled but small enough that they fit in the uop cache, so it doesn't even need to touch L1i cache.
We do have an interesting difference in counts for mem_load_uops_retired.l3_miss with / without :D, and for off-core data-read L3 misses.
With the program having two different phases (memset and memcpy), different multiplexing timing will sample different amounts of each phase. Extrapolating from a few of those sample periods to the whole run time won't always be correct.
If memset uses NT stores (or rep stosb on a CPU with ERMSB), it won't be doing any reads, just writes. Those methods use a no-RFO store protocol to avoid fetching the old value of a cache line, because they're optimized for overwriting the full line. Without that, plain stores could generate offcore_response.all_data_rd.l3_miss.local_dram I think, if RFO (Read For Ownership) count as part of all-data-read. (On my Skylake, perf list has separate events for offcore_response.demand_data_rd vs. offcore_response.demand_rfo, but "all" would include both, and prefetches (non-"demand").)
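To see which offcore_response variants a given CPU and kernel expose (e.g. whether demand reads and RFOs are split out as separate events), perf list can be filtered:
perf list | grep offcore_response
The exact event names vary by microarchitecture.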
memcpy has to read its source data, so it has a source of mem_load_uops_retired.l3_miss. It can also use NT stores or ERMSB to avoid load requests on the destination.
Glibc memcpy / memset do use NT stores or rep movsb / rep stosb for large-enough buffers. So there will be a large difference in rate of offcore_response.all_data_rd.l3_miss.local_dram as well as mem_load_uops_retired.l3_miss between the memcpy and memset portions of your workload.
(I wrote memcpy or memset when I was suggesting creating a simple consistent workload because that would make it uniform with time. I hadn't looked carefully at what events you were counting, that it was only load uops and offcore load requests. So memset wasn't a useful suggestion.)
And of course code reads that miss all the way to L3 are hard to generate. In hand-written asm you could make a huge block of fully-unrolled instructions with GAS .rept 10000000 ; lea %fs:0x1234(,%rax,4), %rcx ; .endr. Or with NASM times 10000000 lea rcx, [fs: 0x1234 + rax*4]. (The segment prefix is just there to make it longer; it has no effect on the result.) That's a longish instruction (7 bytes) with no false dependency that can run at least 2 per clock on modern Intel and AMD CPUs, so it should go quickly through the CPU's legacy decoders, maybe fast enough to exceed prefetch and result in lots of demand loads for code that miss in L3.
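As a concrete sketch of that idea (NASM, Linux x86-64; the repeat count and the exit stub are my additions, not from the original): millions of copies of a multi-byte instruction produce tens of MB of straight-line code, far larger than any L3, so code fetch must demand-miss all the way to DRAM.
global _start
section .text
_start:
    xor eax, eax                                    ; rax*4 = 0, keeps the address constant
    times 10000000 lea rcx, [fs: 0x1234 + rax*4]    ; ~7 bytes each => roughly 70 MB of code
    mov eax, 60                                     ; __NR_exit
    xor edi, edi
    syscall                                         ; exit(0)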

Is bash in Windows implemented differently from native bash, specifically for loops

I ran the following command in an ad hoc fashion on Macs in an Apple Store:
time for x in {1..5000000}; do if ! (($x % 10000)); then echo $x; fi; done
to perform a very rudimentary benchmark. What this does is create the list 1 to 5000000, check whether each number is divisible by 10000, and print it if it is; time measures how long the whole process takes to execute. I've been getting around 40 seconds on MacBook Airs and 32 on Pros, all 8th-gen Intel processors. A particular pattern I noticed is that the loop freezes for a long time before printing anything; presumably this is because it creates the whole list from 1 to 5000000 and puts it in memory.
However, my friend who uses Windows reported faster times, on the order of 15 seconds, on a 5th-gen Core m processor with the Windows 10 native bash shell. I suspect that the Windows bash treats for x in {1..5000000} as a generator, so the list is never fully materialized in memory, everything stays in cache, and it achieves greater speed. Can anyone confirm whether bash for loops are implemented the same or differently between the Windows implementation and the Linux/mac implementations?
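One way to test the list-vs-generator hypothesis directly: a C-style arithmetic for loop never builds the word list, so if it closes the timing gap on the Macs, brace expansion is the culprit.
# arithmetic loop: the counter is generated on the fly, no 5-million-word list up front
time for ((x = 1; x <= 5000000; x++)); do
    if ! ((x % 10000)); then echo "$x"; fi
done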

Deterministic execution time on x86-64

Is there an x86-64 instruction (or instruction sequence) that takes a fixed amount of time, regardless of microarchitectural state such as caches, branch predictors, etc.?
For instance, if a hypothetical add or increment instruction always takes n cycles, then I can implement a timer in my program by performing that add instruction multiple times. Perhaps an increment instruction with register operands would work, but it's not clear to me whether Intel's spec guarantees that it takes a deterministic number of cycles. Note that I am not interested in the current time, only in a primitive / instruction sequence that takes a fixed number of cycles.
Assume that I have a way to force atomic execution, i.e. no context switches during the timer's execution: only my program gets to run.
On a related note, I also cannot use system services to keep track of time, because I am working in a setting where my program is a user-level program running on an untrusted OS.
The x86 ISA documents don't guarantee anything about what takes a certain amount of cycles. The ISA allows things like Transmeta's Crusoe that JIT-compiled x86 instructions to an internal VLIW instruction set. It could conceivably do optimizations between adjacent instructions.
The best you can do is write something that will work on as many known microarchitectures as possible. I'm not aware of any x86-64 microarchitectures that are "weird" like Transmeta, only the usual superscalar decode-to-uops designs like Intel and AMD use.
Simple integer ALU instructions like ADD are almost all 1c latency, and tiny loops that don't touch memory are very predictable. If they run a lot of iterations, they're almost totally unaffected by the impact of surrounding code on the out-of-order core, and they recover very quickly from disruptions like timer interrupts.
On nearly every Intel microarchitecture, this loop will run at one iteration per clock:
    mov ecx, 1234567    ; or use a 64-bit register for higher counts
ALIGN 16
.loop:
    sub ecx, 1          ; not dec, because of Pentium 4
    jnz .loop
Agner Fog's microarch guide and instruction tables say that VIA Nano3000 has a taken-branch throughput of one per 3 cycles, so this loop would only run at one iteration per 3 clocks there. AMD Bulldozer-family and Jaguar similarly have a max throughput of one taken JCC per 2 clocks.
See also other performance links in the x86 tag wiki.
If you want a more power-efficient loop, you could use PAUSE in the loop, but it waits ~100 cycles on Skylake, up from ~5 cycles on previous microarchitectures. (You can make cycle-accurate predictions for more complicated loops that don't touch memory, but that depends on microarchitectural details.)
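A sketch of that variant (same style as the loop above; as noted, the per-iteration time is explicitly not portable across microarchitectures):
    mov ecx, 1234567
ALIGN 16
.loop:
    pause               ; ~5 cycles before Skylake, ~100 cycles on Skylake
    sub ecx, 1
    jnz .loop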
You could make a more reliable loop that's less likely to have different bottlenecks on different CPUs by making a longer dependency chain within each iteration. Since each instruction depends on the previous one, it can still only run at one instruction per cycle (not counting the branch), which drastically reduces the branches per cycle.
# One add/sub per clock, limited by latency.
# Should run one iteration per 6 cycles on every CPU listed in Agner Fog's tables,
# and should be the same on all future CPUs unless they do magic inter-instruction optimizations.
# Or it could be slower on CPUs that always have a bubble on taken branches,
# but it seems unlikely anyone would design one.
ALIGN 16
.loop:
    add ecx, 1
    sub ecx, 1      ; net result: ecx+0
    add ecx, 1
    sub ecx, 1      ; net result: ecx+0
    add ecx, 1
    sub ecx, 2      ; net result: ecx-1
    jnz .loop
Unrolling like this ensures that front-end effects are not a bottleneck. It gives the frontend decoders plenty of time to queue up the 6 add/sub insns and the jcc before the next branch.
Using add/sub instead of dec/inc avoids a partial-flag false dependency on Pentium 4. (Although I don't think that would be an issue anyway.)
Pentium4's double-clocked ALUs can each run two ADDs per clock, but the latency is still one cycle. i.e. apparently it can't forward a result internally to chew through this dependency chain twice as fast as any other CPU.
And yes, Prescott P4 is an x86-64 CPU, so we can't quite ignore P4 if we need a general purpose answer.

Why isn't MOVNTI slower, in a loop storing repeatedly to the same address?

section .text
%define n 100000
_start:
    xor rcx, rcx
    jmp .cond
.begin:
    movnti [array], eax
.cond:
    add rcx, 1
    cmp rcx, n
    jl .begin

section .data
array times 81920 db "A"
According to perf, it runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, it has to be stored in memory (RAM), so it should be slow.
P.S. Is there any loop-carried dependency?
EDIT
section .text
%define n 100000
_start:
    xor rcx, rcx
    jmp .cond
.begin:
    movnti [array+rcx], eax
.cond:
    add rcx, 1
    cmp rcx, n
    jl .begin

section .data
array times n dq 0
Now it takes 5 cycles per iteration. Why? After all, there is still no loop-carried dependency.
movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.
I think movnti keeps writing into the same fill buffer, and it's not getting flushed very often because there are no other loads or stores happening. (The linked answer is about copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)
So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.
DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
We can further prove this is the case by running the test on multiple cores at once. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.
The reason your experiment doesn't show your loop running at 4 instructions per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k cycles barely accounts for the startup overhead (which perf's timing includes).
For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.
A more reasonable microbenchmark shows one iteration (one movnti) per cycle:
global _start
_start:
    xor ecx, ecx
.begin:
    movnti [array], eax
    dec ecx
    jnz .begin          ; 2^32 iterations

    mov eax, 60         ; __NR_exit
    xor edi, edi
    syscall             ; exit(0)

section .bss
array resb 81920
(asm-link is a script I wrote)
$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address
Performance counter stats for './movnti-same-address':
1835.056710 task-clock (msec) # 0.995 CPUs utilized
4,398,731,563 cycles # 2.397 GHz
12,891,491,495 instructions # 2.93 insns per cycle
1.843642514 seconds time elapsed
Running in parallel:
$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &
real 0m1.844s / user 0m1.828s # running alone
[1] 12523
[2] 12524
peter@tesla:~/src/SO$
real 0m1.855s / user 0m1.824s # running together
real 0m1.984s / user 0m1.808s
# output compacted by hand to save space
I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)
zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.
However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.
P.S. Is there any loop-carried dependency?
Yes, the loop counter. (1 cycle). BTW, you could have saved an insn by counting down towards zero with dec / jg instead of counting up and having to use a cmp.
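A sketch of that count-down version (dec already sets the flags that jg tests, so the cmp disappears):
    mov rcx, n          ; count down from n instead of up to n
.begin:
    movnti [array], eax
    dec rcx
    jg .begin           ; loop while rcx > 0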
The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".
This is called an architectural hazard. I think the term still applies when talking about memory, rather than registers.
The result is plausible. Your loop code consists of the following instructions. According to Agner Fog's instruction tables, these have the following timings:
Instruction      Operands  Fused uops  Unfused uops  Ports   Latency  Reciprocal throughput
-------------------------------------------------------------------------------------------
MOVNTI           m,r       2           2             p23 p4  ~400     1
ADD              r,r/i     1           1             p0156   1        0.25
CMP              r,r/i     1           1             p0156   1        0.25
Jcc              short     1           1             p6      1        1-2 if predicted taken
Fused CMP+Jcc    short     1           1             p6      1        1-2 if predicted taken
So
MOVNTI consumes 2 uOps: one in port 2 or 3, and one in port 4
ADD consumes 1 uOp in port 0, 1, 5 or 6
CMP and Jcc macro-fuse to the last line in the table, resulting in a consumption of 1 uOp
Because neither ADD nor CMP+Jcc depends on the result of MOVNTI, they can be executed (nearly) in parallel on recent architectures, for example using ports 1, 2, 4 and 6. The worst case would be a latency of 1 between ADD and CMP+Jcc.
This is most likely a design error in your code: you're essentially writing to the same address [array] 100000 times, because you do not adjust the address.
The repeated writes can even go to the L1 cache, because (per the Intel SDM):
"The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region."
but it doesn't look like that is the case here, and it wouldn't make a great difference anyway, because even when writing to memory, the memory speed will be the limiting factor.
For example, if you have a 3GHz CPU and 1600MHz DDR3-RAM this will result in 3/1.6 = 1.875 CPU cycles per memory cycle. This seems plausible.

Why is MPI slower on my laptop

I am running MPI on my laptop (Intel i7 quad-core 4700M, 12 GB RAM) and the efficiency drops even for codes that involve no inter-process communication. Obviously I cannot just throw 100 processes at it, since my machine is only quad-core, but I thought it should scale well up to 8 processes (a quad-core Intel CPU presents itself as 8 logical processors?). For example, consider this simple toy Fortran code:
program test
    use mpi
    implicit none
    integer, parameter :: root = 0
    integer :: ierr, rank, nproc, tt, i
    integer :: n = 100000
    real :: s = 0.0, tstart, tend
    ! nproc below is replaced by hand with the actual process count (see note)
    complex, dimension(100000/nproc) :: u = 2.0, v = 0.0

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
    call cpu_time(tstart)
    do tt = 1, 200000
        v = 0.0
        do i = 1, 100000/nproc
            v(i) = v(i) + 0.1*u(i)
        enddo
    enddo
    call cpu_time(tend)
    if (rank == root) then
        print *, 'total time was: ', tend - tstart
    endif
    call MPI_FINALIZE(ierr)
end program test
For 2 processes it takes half the time, but even with 4 processes (shouldn't it be a quarter of the time?) it starts to become less efficient, and with 8 processes there is no improvement whatsoever. Basically I am wondering if this is just because I am running on a laptop and it has something to do with shared memory, or if I am making some fundamental mistake in my code. Thanks
Note: In the above example I manually change nproc in the array declaration and the inner loop to be equal to the number of processes I am using.
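A sketch of one way to avoid that hand editing (not the original code): size the arrays at run time with allocatable arrays, after MPI_COMM_SIZE has returned nproc.
program test_alloc
    use mpi
    implicit none
    integer, parameter :: root = 0
    integer :: ierr, rank, nproc, tt, i
    real :: tstart, tend
    complex, allocatable :: u(:), v(:)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
    allocate(u(100000/nproc), v(100000/nproc))   ! per-rank share of the work
    u = 2.0
    v = 0.0
    call cpu_time(tstart)
    do tt = 1, 200000
        v = 0.0
        do i = 1, 100000/nproc
            v(i) = v(i) + 0.1*u(i)
        enddo
    enddo
    call cpu_time(tend)
    if (rank == root) print *, 'total time was: ', tend - tstart
    call MPI_FINALIZE(ierr)
end program test_alloc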
A quad-core processor, thanks to hyperthreading, shows itself as having 8 hardware threads, but physically there are just 4 cores. The other 4 are scheduled by the hardware itself, using free slots in the execution pipelines.
It turns out that, especially with compute-intensive loads, this approach often does not pay off at all, and can even be counter-productive under extreme loads, because of overheads and not-always-optimal cache usage.
You can try to disable hyperthreading in the BIOS and compare it: you will have just 4 threads, 4 cores.
Even going from 1 to 4 processes, there are resources in competition. In particular, each core has its own L1 and L2 caches (256 KB of L2 per core on this CPU), while all 4 cores share the L3 cache.
And all the cores obviously share the memory channels.
So you cannot expect linear scaling as you occupy more and more cores, since they will have to share resources that, in the sequential case, were dedicated to a single core/thread.
All of this without involving communications at all.
The same behavior happens on desktops/servers, in particular for memory-intensive loads like the one in your test case.
It's less evident, for example, with matrix-matrix multiplication, which is compute-intensive: for an NxN matrix you have O(N^2) memory accesses but O(N^3) floating-point operations.
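A quick worked example of that ratio: for N = 2000, a matrix multiply performs about 2N^3 = 1.6e10 floating-point operations on only 3N^2 = 1.2e7 matrix elements, so each element is reused on the order of N times once loaded. The loop in the question touches each element only once per pass, which is why it saturates the shared memory channels almost immediately.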
