Can perf account for all cache misses? - performance

I'm trying to understand the cache misses recorded by perf. I have a minimal program:
int main(void)
{
    return 0;
}
If I compile this:
gcc -std=c99 -W -Wall -Werror -O3 -S -o test.S test.c
I get an expectedly small program:
.file "test.c"
.section .text.startup,"ax",#progbits
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
xorl %eax, %eax
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Debian 4.7.2-5) 4.7.2"
.section .note.GNU-stack,"",#progbits
With only the two instructions, xorl and ret, the program should be less than a cache line in size, so I would expect that if I run perf stat -e "cache-misses:u" ./test I should see only a single cache miss. However, I instead see between 2 and ~400. Similarly, perf stat -e "cache-misses" ./test results in ~700 to ~2500.
Is this simply a case of perf estimating counts or is there something about the way cache misses occur that makes reasoning about them approximate? For example, if I generate and then read an array of integers in memory, can I reason about the prefetching (sequential access should allow for perfect prefetching) or is there something else at play?

You created a main instead of _start, and probably built it into a dynamically-linked executable! So there's all the CRT startup code, initializing libc, and several system calls. Run strace ./test and see how many system calls it's making. (And of course there's lots of work in user-space that doesn't involve system calls.)
What would be more interesting is a statically linked executable that just makes an _exit(0) or exit_group(0) system call with the syscall instruction, from the _start entry point.
Given an exit.s with these contents:
mov $231, %eax
syscall
build it into a static executable so these two instructions are the only ones executed in user-space:
$ gcc -static -nostdlib exit.s
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
# the default is fine, our instructions are at the start of the .text section
$ perf stat -e cache-misses:u ./a.out
Performance counter stats for './a.out':
6 cache-misses:u
0.000345362 seconds time elapsed
0.000382000 seconds user
0.000000000 seconds sys
I told it to count cache-misses:u to only measure user-space cache misses, instead of everything on the core the process was running on. (That would include kernel cache misses before entering user-space and while handling the exit_group() system call. And potentially interrupt handlers).
(There is hardware support in the PMU for events to count when the privilege level is user, kernel, or both. So we should expect counts to be off by at most 1 or 2 from counting stuff done during the transition from kernel->user or user->kernel: changing CS, potentially resulting in a load from the GDT of the segment descriptor indexed by the new CS value.)
But what event does cache-misses actually count?
How does Linux perf calculate the cache-references and cache-misses events explains:
perf apparently maps cache-misses to a HW event that counts last-level cache misses. So it's something like the number of DRAM accesses.
Multiple attempts to access the same line in L1d or L1i cache while an L1 miss is already outstanding just add another thing waiting for the same incoming cache line. So it's not counting loads (or code-fetch) that have to wait for cache.
Multiple loads can coalesce into one access.
But also remember that code-fetch needs to go through the iTLB, triggering a page-walk. Page-walk loads are cached, i.e. they're fetched through the cache hierarchy. So they're counted by the cache-misses event if they do miss.
Repeated runs of the program can result in 0 cache-miss events. The executable binary is a file, and the file is cached (OS's disk cache) by the pagecache. That physical memory is mapped into the address-space of the process running it. It can certainly stay hot in L3 across process start/stop. More interesting is that apparently the page-table stays hot, too. (Not literally "stays" hot; I assume the kernel has to write a new one every time. But presumably the page-walker is hitting at least in L3 cache.)
Or at least whatever else was causing the "extra" cache-miss events doesn't have to happen.
I used perf stat -r16 to run it 16 times and show mean +- stddev:
$ perf stat -e instructions:u,L1-dcache-loads:u,L1-dcache-load-misses:u,cache-misses:u,itlb_misses.walk_completed:u -r 16 ./exit
Performance counter stats for './exit' (16 runs):
3 instructions:u
1 L1-dcache-loads
5 L1-dcache-load-misses # 506.25% of all L1-dcache hits ( +- 6.37% )
1 cache-misses:u ( +-100.00% )
2 itlb_misses.walk_completed:u
0.0001422 +- 0.0000108 seconds time elapsed ( +- 7.57% )
Note the +-100% on cache-misses.
I don't know why we have 2 itlb_misses.walk_completed events, not just 1. Counting itlb_misses.miss_causes_a_walk:u instead gives us 4 consistently.
Reducing to -r 1 and running repeatedly with manual up-arrow, cache-misses bounces around between 3 and 13. The system is mostly idle but with a bit of background network traffic.
I also don't know why anything is showing as an L1D load, or how there can be 6 misses from one load. But Hadi's answer says that perf's L1-dcache-load-misses event actually counts L1D.REPLACEMENT, so the page-walks could account for that. While L1-dcache-loads counts MEM_INST_RETIRED.ALL_LOADS. mov-immediate isn't a load, and I wouldn't have thought syscall is either. But maybe it is, otherwise the HW is falsely counting a kernel instruction or there's an off-by-1 somewhere.

This is not an easy topic, but if you are interested in counting cache misses from (for example) accessing an array, then that is what you should start with.
There are numerous pitfalls, but the simplest approach that is likely to lead to insight would start with a program that allocates an array, stores values into the array, and then reads the array a programmable number of times.
Storing values into the array is necessary to create the virtual to physical page mappings. The performance counter results for this section are likely to be incomprehensible because of the tricks that the OS uses in initializing these pages -- e.g., starting with a mapping to a zero-filled page and setting the access to "copy on write".
After the pages are instantiated, the performance counts for the reads are likely to make a lot more sense. I use a programmable number of reads so that I can take the differences between the counter values for 20 reads and 10 reads (for example).
The array size should be chosen to be significantly larger than the available cache at the level you want to test.
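The differencing approach described above can be sketched as plain arithmetic. This is only an illustration: the 64-byte line size, the array size, and all counter values are assumed placeholders, not measurements from any real system.

```python
# Sketch of the "difference of two runs" method described above.
# All counter values here are hypothetical placeholders, not real measurements.

CACHE_LINE_BYTES = 64  # assumption: typical x86 cache line size


def expected_lines_per_pass(array_bytes, line_bytes=CACHE_LINE_BYTES):
    """Cache lines touched by one sequential read pass over the array."""
    return -(-array_bytes // line_bytes)  # ceiling division


def events_per_pass(count_many, count_few, passes_many, passes_few):
    """Per-pass event count with the fixed setup overhead cancelled out."""
    return (count_many - count_few) / (passes_many - passes_few)


array_bytes = 256 * 1024 * 1024  # 256 MiB, assumed much larger than any cache
lines = expected_lines_per_pass(array_bytes)

# Hypothetical counter readings for runs with 10 and 20 read passes.
overhead = 1_500_000  # made-up cost of page instantiation, startup, etc.
count_10 = overhead + 10 * lines
count_20 = overhead + 20 * lines

per_pass = events_per_pass(count_20, count_10, 20, 10)
print(lines, per_pass)
```

If the measured per-pass difference is close to the expected number of lines, the event is probably counting what you think it counts; a large gap suggests prefetches or other traffic are being included or excluded.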
Unfortunately, "perf" makes it relatively difficult to figure out what is actually being programmed into the performance counters at the hardware level (which is the only level that counts!). The more "generic" the event, the harder it is to guess what is actually being measured.... On my recent Intel-based systems, "perf list" gives a long (>3600 lines) listing of available events. The events starting in the section labelled "cache:" are direct translations of the hardware events that are described in Chapter 19 of Volume 3 of the Intel Architectures Software Developer's Manual.
You are correct to be concerned about how hardware prefetches are counted. In recent Intel architectures, events that report cache accesses can typically be configured to count demand accesses, hardware prefetches, or both. Events that report source locations for load instructions won't give any insight into where the HW prefetch found the data -- only how close to the processor it had gotten by the time the load operation executed.
I have found the event "l1d.replacement" to be a reliable L1 Data Cache Miss indicator on recent Intel processors. It simply counts all cache lines moved into the L1 Data Cache (whether due to loads, stores, prefetches, etc). At the other end of the hierarchy, the DRAM counters (e.g., "uncore_imc_0/cas_count_read/") are also reliable, but are subject to contamination due to any other activity in the system. Counters for "two-sided" caches (e.g., L2 & L3) are more likely to be confusing because it is not always clear whether the event is counting cache lines sent in from one side or the other or both (e.g., "l2_lines_in.all"). With some carefully controlled experiments, it is usually possible to find a subset of reliable & understandable events at these intermediate levels. It is not always possible to find enough reliable counters to make a full accounting of all traffic at each level of the memory hierarchy, but that is a longer story....

The process memory space is not only about your code; there are different sources, such as the heap, the stack, and the data segment, that also contribute to cache misses.
I don't think you can estimate cache-miss numbers, just like you cannot predict the running sequence of every thread in a multithreaded program.
However, cache-miss analysis is useful to find and target false sharing. Here are some useful links you can refer to:
http://igoro.com/archive/gallery-of-processor-cache-effects/
http://qqibrow.github.io/CPU-Cache-Effects-and-Linux-Perf/

Related

x86 store when data is in 2 different blocks

Suppose linux-32: the alignment rules say, for example, that doubles (8 bytes) must be aligned to 4 bytes. This means that, if we assume 64-byte cache blocks (a typical value for modern processors), we can have a double aligned at offset 60, which means that this double will be in 2 different cache blocks.
It could even happen that both parts of the double were in 2 different cache blocks located in 2 different 4KB pages.
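The two cases above can be checked with a little address arithmetic. This is a minimal sketch; the helper name crosses_boundary is hypothetical, and the 64-byte line and 4 KiB page sizes are the assumptions stated in the question.

```python
LINE_BYTES = 64    # assumed cache block size, as in the question
PAGE_BYTES = 4096  # assumed page size, as in the question


def crosses_boundary(addr, size, boundary):
    """True if the bytes [addr, addr + size) fall into two boundary-sized blocks."""
    return addr // boundary != (addr + size - 1) // boundary


# An 8-byte double at offset 60 occupies bytes 60..67 and is split across lines.
print(crosses_boundary(60, 8, LINE_BYTES))
# At offset 56 it occupies bytes 56..63 and fits in a single line.
print(crosses_boundary(56, 8, LINE_BYTES))
# At offset 4092 it is split across two lines and also across two pages.
print(crosses_boundary(4092, 8, PAGE_BYTES))

# Of the sixteen 4-byte-aligned offsets within one line, count how many split.
split_offsets = [off for off in range(0, LINE_BYTES, 4)
                 if crosses_boundary(off, 8, LINE_BYTES)]
print(split_offsets)
```

Under these assumptions, only one of the sixteen 4-byte-aligned positions in a line (offset 60) produces a line split, and 8-byte alignment rules out splits entirely.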
After this brief introduction to put the question in context, I have a couple of doubts:
1- For assembly programming where we seek maximum performance, it is recommended to prevent these things from happening by using alignment directives, right? Or, for some reason I'm unaware of, does aligning the double so that it lies in only 1 block not make any performance difference?
2- How will the store instruction be decoded in the mentioned case? (Suppose a modern Intel microarchitecture.) I mean, I know that a normal x86 store instruction is decoded into a micro-fused pair of str-addr and str-data, but in this case, where 2 different cache blocks (and maybe even 2 different 4KB pages) are involved, will it be decoded into 2 micro-fused pairs of str-addr and str-data (one for the first 4 bytes of the double and another for the last 4 bytes)? Or will it be decoded into a single micro-fused pair, with both the str-addr and the str-data doing twice the work before finally being able to leave the execution port?
Yes, of course you should align a double whenever possible, like compilers do except when forced by ABI struct-layout rules to misalign them. (The ABI was designed when i386 was current so a double always required 2 loads anyway.)
The current version of the i386 System V ABI requires 16-byte stack alignment, so local doubles (that have to get spilled at all instead of kept in regs) can be aligned, and malloc has to return memory suitable for any type, and alignof(max_align_t) = 16 on 32-bit Linux (8 on 32-bit Windows) so 32-bit malloc will always give you at least 16 (or 8)-byte aligned memory. And of course in static storage you control the alignment with align (NASM) or .p2align (GAS) directives.
For the perf downsides of cacheline splits and page splits, see How can I accurately benchmark unaligned access speed on x86_64
re: decoding: The address isn't known at decode time, so obviously any effects of a line split or page split are resolved later. For stores, probably no effect until the store-buffer entry has to commit to L1d cache. Are two store buffer entries needed for split line/page stores on recent Intel? - probably no, allocating a 2nd entry after executing the store-address uop is implausible.
For loads, re-running the load through the execution unit to get the other half (or whatever uneven split), using internal line-split buffers to combine data. (Not re-dispatching from the RS, just internally handled in the load port. But the RS does aggressively replay uops waiting for the result of a load.)
Re-running the store-data uop for a misaligned store seems unlikely, too. I don't think we see extra counts for uops_dispatched_port.port_4 perf events.

L2 instruction fetch misses much higher than L1 instruction fetch misses

I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script:
#!/usr/bin/env python
import tempfile
import random
import sys
if __name__ == '__main__':
    functions = list()
    for i in range(10000):
        func_name = "f_{}".format(next(tempfile._get_candidate_names()))
        sys.stdout.write("void {}() {{\n".format(func_name))
        sys.stdout.write(" double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
        sys.stdout.write(" res = pi*r*r*h;\n")
        sys.stdout.write(" res = res/(e*e);\n")
        sys.stdout.write("}\n")
        functions.append(func_name)
    sys.stdout.write("int main() {\n")
    sys.stdout.write("unsigned int i;\n")
    sys.stdout.write("for(i =0 ; i < 100000 ;i ++ ){\n")
    for i in range(10000):
        r = random.randint(0, len(functions)-1)
        sys.stdout.write("{}();\n".format(functions[r]))
    sys.stdout.write("}\n")
    sys.stdout.write("}\n")
What the code does is simply generating a C program that consists of a lot of randomly named dummy functions that are in turn called in random order in main(). I am compiling the resulting code with gcc 4.8.5 under CentOS 7 with -O0. The code is running on a dual socket machine fitted with 2x Intel Xeon E5-2630v3 (Haswell architecture).
What I am interested in is understanding instruction-related counters reported by perf when profiling the binary compiled from the C code (not the Python script, that is only used to automatically generate the code). In particular, I am observing the following counters with perf stat:
instructions
L1-icache-load-misses (instruction fetches that miss L1, aka r0280 on Haswell)
r2424, L2_RQSTS.CODE_RD_MISS (instruction fetches that miss L2)
rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)
I first profiled the code with all hardware prefetchers disabled in the BIOS, i.e.
MLC Streamer Disabled
MLC Spatial Prefetcher Disabled
DCU Data Prefetcher Disabled
DCU Instruction Prefetcher Disabled
and the results are the following (process is pinned to first core of second CPU and corresponding NUMA domain, but I guess this doesn't make much difference):
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,108,610,204 instructions
2,613,075,664 L1-icache-load-misses
5,065,167,059 r2424
17 rf824
33.696954142 seconds time elapsed
Considering the figures above, I cannot explain such a high number of instruction fetch misses in L2. I have disabled all prefetchers, and L2_RQSTS.ALL_PF confirms so. But why do I see twice as many instruction fetch misses in L2 as in L1i? In my (simple) mental processor model, if an instruction is looked up in L2, it must necessarily have been looked up in L1i before. Clearly I am wrong, so what am I missing?
I then tried to run the same code with all the hardware prefetchers enabled, i.e.
MLC Streamer Enabled
MLC Spatial Prefetcher Enabled
DCU Data Prefetcher Enabled
DCU Instruction Prefetcher Enabled
and the results are the following:
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,109,877,626 instructions
2,599,883,072 L1-icache-load-misses
5,054,883,231 r2424
908,494 rf824
Now, L2_RQSTS.ALL_PF seems to indicate that something more is happening and although I expected the prefetcher to be a bit more aggressive, I imagine that the instruction prefetcher is severely put to the test due to the jump-intensive type of workload and data prefetcher has not much to do with this kind of workload. But again, L2_RQSTS.CODE_RD_MISS is still too high with the prefetchers enabled.
So, to sum up, my question is:
With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS seems to be much higher than L1-icache-load-misses. Even with hardware prefetchers enabled, I still cannot explain it. What is the reason behind such a high count of L2_RQSTS.CODE_RD_MISS compared to L1-icache-load-misses?
The instruction prefetcher can generate requests that don't count as accesses to the L1I cache, but are counted as code fetch requests at higher-numbered memory levels, such as the L2. This is generally true on all Intel microarchitectures with an instruction prefetcher. L2_RQSTS.CODE_RD_MISS counts both demand and prefetch requests from the L1I. Demand requests are generated by a multiplexing unit in the IFU that chooses a target fetch linear address from among the different units in the pipeline that may change the flow, such as the branch prediction units. Prefetch requests are generated by the L1I instruction prefetcher on an L1I miss if possible.
In general, the number of prefetch fetch requests is nearly proportional to the number of L1I misses. For instruction fetches from memory regions of cacheable memory types, the following formula holds:
ICACHE.MISSES <= L2_RQSTS.CODE_RD_MISS + L2_RQSTS.CODE_RD_HIT
I'm not sure whether this formula also holds for uncacheable fetch requests. I didn't test it in that condition. I know these requests are counted as ICACHE.MISSES, but not sure about the other events.
In your case, most instruction fetches will miss in the L1I and L2. You have 10,000 functions, each of which nearly fully spans two 64-byte cache lines (here is a version with only two functions), so the code size is much larger than the 256 KiB L2 available on Haswell. The functions are being called in a non-sequential and unpredictable order, so the L1I and L2 prefetchers won't significantly help. The only noteworthy exception is returns, all of which will be predicted correctly using the RSB mechanism.
Each of the 10,000 functions is being called 100,000 times in a loop. Most fetch requests are for lines occupied by these functions. The total number of useful instruction fetch requests is about 2 lines per function * 10,000 functions * 100,000 iterations = 2,000,000,000 lines, most of which will miss in the L1I and L2 (but probably hit in the L3 after the first cold iteration). Several million other requests will be for lines occupied by the loop body. Your measurements show that there are about 30% more instruction fetches that miss in the L1I. This is because of branch mispredictions, which cause fetch requests for incorrect lines that may not even be in the L1I and/or L2. Each L1I miss may trigger a prefetch, so it's normal for L2 instruction fetches to be within two times the number of L1I misses. This is consistent with your numbers.
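The arithmetic in the paragraph above can be reproduced with the counter values from the question's prefetchers-disabled run. The variable names are only illustrative.

```python
# Expected "useful" instruction-fetch lines, as estimated in the answer above.
lines_per_function = 2
calls_per_iteration = 10_000
iterations = 100_000
useful_lines = lines_per_function * calls_per_iteration * iterations

# Counter values reported in the question (hardware prefetchers disabled).
l1i_misses = 2_613_075_664       # L1-icache-load-misses
l2_code_rd_miss = 5_065_167_059  # r2424, L2_RQSTS.CODE_RD_MISS

extra_l1i = l1i_misses / useful_lines - 1  # excess over the useful estimate
l2_to_l1i = l2_code_rd_miss / l1i_misses   # L2 code misses per L1I miss

print(useful_lines)
print(round(extra_l1i * 100, 1), "% more L1I misses than useful lines")
print(round(l2_to_l1i, 2), "L2 code read misses per L1I miss")
```

The ratio comes out just under 2, which fits the explanation that each L1I miss may trigger at most one extra prefetch request.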
In my two-function version, I'm counting 24 instructions per invoked function, so I expect the total number of retired instructions to be approximately 24 billion, but you got 25 billion. Either I don't know how to count, or you have 25 instructions per function for some reason.

Why do memory instructions take 4 cycles in ARM assembly?

Memory instructions such as ldr, str or b take 4 cycles each in ARM assembly.
Is it because each memory location is 4 bytes long?
ARM has a pipelined architecture. Each clock cycle advances the pipeline by one step (e.g. fetch/decode/execute/read...). Since the pipeline is continuously fed, the overall time to execute each instruction can approach 1 cycle, but the actual time for an individual instruction from 'fetch' through completion can be 3+ cycles. ARM has a good explanation on their website:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0222b/ch01s01s01.html
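The difference between per-instruction latency and overall throughput can be sketched with an idealized pipeline model. This assumes a hypothetical 3-stage pipeline with no stalls; it does not model any specific ARM core.

```python
def total_cycles(n_instructions, stages):
    """Cycles for n instructions on an ideal pipeline with no stalls.

    The first instruction needs `stages` cycles to pass through; each
    following instruction completes one cycle after the previous one.
    """
    return stages + (n_instructions - 1)


def cycles_per_instruction(n_instructions, stages):
    """Average cycles per instruction for the same ideal pipeline."""
    return total_cycles(n_instructions, stages) / n_instructions


STAGES = 3  # assumption: fetch / decode / execute

print(total_cycles(1, STAGES))               # one instruction alone
print(total_cycles(1000, STAGES))            # a long instruction stream
print(cycles_per_instruction(1000, STAGES))  # approaches 1 as n grows
```

A single instruction still takes several cycles from fetch to completion, but the average cost per instruction approaches one cycle as the stream gets longer.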
Memory latency adds another layer of complication to this idea. ARM employs a multi-level cache system which aims to have the most frequently used data available in the fewest cycles. Even a read from the fastest (L0) cache involves several cycles of latency. The pipeline includes facilities to allow read requests to complete at a later time if the data is not used right away. It's easier to understand by way of example:
LDR R0,[R1]
MOV R2,R3 // Allow time for memory read to occur
ADD R4,R4,#200 // by interleaving other instructions
CMP R0,#0 // before trying to use the value
// By trying to access the data immediately, this will cause a pipeline
// 'stall' and waste time waiting for the data to become available.
LDR R0,[R1]
CMP R0,#0 // Wastes at least 1 cycle due to pipeline not having the data
The idea is to hide the inherent latencies in the pipeline and, if you can, hide additional latencies in the memory access by delaying dependencies on registers (aka instruction interleaving).

Deterministic execution time on x86-64

Is there an x64 instruction(s) that takes a fixed amount of time, regardless of the micro-architectural state such as caches, branch predictors, etc.?
For instance, if a hypothetical add or increment instruction always takes n cycles, then I can implement a timer in my program by performing that add instruction multiple times. Perhaps an increment instruction with register operands may work, but it's not clear to me whether Intel's spec guarantees that it would take deterministic number of cycles. Note that I am not interested in current time, but only a primitive / instruction sequence that takes a fixed number of cycles.
Assume that I have a way to force atomic execution i.e. no context switches during timer's execution i.e. only my program gets to run.
On a related note, I also cannot use system services to keep track of time, because I am working in a setting where my program is a user-level program running on an untrusted OS.
The x86 ISA documents don't guarantee anything about what takes a certain amount of cycles. The ISA allows things like Transmeta's Crusoe that JIT-compiled x86 instructions to an internal VLIW instruction set. It could conceivably do optimizations between adjacent instructions.
The best you can do is write something that will work on as many known microarchitectures as possible. I'm not aware of any x86-64 microarchitectures that are "weird" like Transmeta, only the usual superscalar decode-to-uops designs like Intel and AMD use.
Simple integer ALU instructions like ADD are almost all 1c latency, and tiny loops that don't touch memory are almost totally unaffected by anything, and are very predictable. If they run a lot of iterations, they're also almost totally unaffected by anything to do with the impact of surrounding code on the out-of-order core, and recover very quickly from disruptions like timer interrupts.
On nearly every Intel microarchitecture, this loop will run at one iteration per clock:
mov ecx, 1234567 ; or use a 64-bit register for higher counts.
ALIGN 16
.loop:
sub ecx, 1 ; not dec because of Pentium 4.
jnz .loop
Agner Fog's microarch guide and instruction tables say that VIA Nano3000 has a taken-branch throughput of one per 3 cycles, so this loop would only run at one iteration per 3 clocks there. AMD Bulldozer-family and Jaguar similarly have a max throughput of one taken JCC per 2 clocks.
See also other performance links in the x86 tag wiki.
If you want a more power-efficient loop, you could use PAUSE in the loop, but it waits ~100 cycles on Skylake, up from ~5 cycles on previous microarchitectures. (You can make cycle-accurate predictions for more complicated loops that don't touch memory, but that depends on microarchitectural details.)
You could make a more reliable loop that's less likely to have different bottlenecks on different CPUs by making a longer dependency chain within each iteration. Since each instruction depends on the previous, it can still only run at one instruction per cycle (not counting the branch), drastically reducing the branches per cycle.
# one add/sub per clock, limited by latency
# should run one iteration per 6 cycles on every CPU listed in Agner Fog's tables
# And should be the same on all future CPUs unless they do magic inter-instruction optimizations.
# Or it could be slower on CPUs that always have a bubble on taken branches, but it seems unlikely anyone would design one.
ALIGN 16
.loop:
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 2 ; net result ecx-1
jnz .loop
Unrolling like this ensures that front-end effects are not a bottleneck. It gives the frontend decoders plenty of time to queue up the 6 add/sub insns and the jcc before the next branch.
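To size such a delay loop, the expected cycle count can be estimated as below. This is a rough sketch assuming the 6-instruction dependency chain above runs at exactly one add/sub per cycle; the helper names are hypothetical, and the small simulation only mirrors the arithmetic effect of the loop body on ecx, not its timing.

```python
CHAIN_LENGTH = 6  # add/sub instructions per iteration, 1 cycle each (assumed)


def loop_iterations(ecx):
    """Mirror the loop body's effect on ecx and count iterations until zero."""
    assert ecx > 0, "sketch only handles a positive starting count"
    iterations = 0
    while True:
        ecx += 1
        ecx -= 1  # net ecx+0
        ecx += 1
        ecx -= 1  # net ecx+0
        ecx += 1
        ecx -= 2  # net ecx-1
        iterations += 1
        if ecx == 0:  # jnz falls through when the last sub produces zero
            return iterations


def expected_cycles(start_count):
    """Latency-bound estimate: one cycle per add/sub in the chain."""
    return CHAIN_LENGTH * start_count


def start_count_for(target_cycles):
    """Smallest starting ecx whose estimated delay is at least target_cycles."""
    return -(-target_cycles // CHAIN_LENGTH)  # ceiling division


print(loop_iterations(5))
print(expected_cycles(1234567))
print(start_count_for(1000))
```

The estimate ignores loop startup and the final branch misprediction on loop exit, so it is only meaningful for large iteration counts.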
Using add/sub instead of dec/inc avoids a partial-flag false dependency on Pentium 4. (Although I don't think that would be an issue anyway.)
Pentium4's double-clocked ALUs can each run two ADDs per clock, but the latency is still one cycle. i.e. apparently it can't forward a result internally to chew through this dependency chain twice as fast as any other CPU.
And yes, Prescott P4 is an x86-64 CPU, so we can't quite ignore P4 if we need a general purpose answer.

perf reports misses larger than total accesses

I am using the below code to understand the behavior of cache misses. I am compiling the code using gcc and then reporting cache statistics from perf command in unix.
What I observe is that the number of misses reported is much larger than the total number of accesses. I couldn't figure out the reason for this. If anyone else has seen similar behavior and could shed some light, it would really help.
C-code:
#define N 30000
static char array[N][N];
int main(void) {
    register int i,j;
    for (i=0;i<N;i++)
        for(j=0;j<N;j++)
            array[j][i]++;
    return 0;
}
command for compile:
gcc test1.c -O0 -o test1.out
command for running the perf tool:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./test1.out
Recently, I came across a document which mentions: "If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch." I am not sure whether this is also true for the data cache.
If L1 misses are retried before the line is loaded from L2, the same applies to L1 data loads. For the 30000 by 30000 array, you are accessing data 30000 bytes apart in each iteration, so the L1-dcache load misses and the L1-dcache-load-misses counter increases. Then, before the data arrives from L2 or the LLC, the load is retried to see if the data is still missing, which increases the L1-dcache-load-misses counter further. This could be a good explanation for why L1-dcache-load-misses is higher than L1-dcache-loads in your case.
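The stride argument above can be made concrete with a short calculation. This sketch assumes 64-byte cache lines and a 32 KiB L1 data cache, neither of which is stated in the question; it treats the array as starting at offset 0.

```python
N = 30000                 # array dimension from the question
LINE_BYTES = 64           # assumption: cache line size
L1D_BYTES = 32 * 1024     # assumption: typical L1 data cache capacity

# array[j][i] with a fixed i and increasing j steps N bytes (one row) each time.
stride = N * 1            # sizeof(char) == 1

# Distinct cache lines touched by one pass of the inner loop (j = 0..N-1).
lines_touched = len({(j * stride) // LINE_BYTES for j in range(N)})

# Bytes of cache needed to keep all of those lines until the next value of i.
footprint = lines_touched * LINE_BYTES

print(lines_touched)
print(footprint, "bytes needed vs", L1D_BYTES, "bytes of L1d")
```

Because the footprint of one inner-loop pass is far larger than the assumed L1d, each line is evicted before the next value of i can reuse it, so nearly every access is expected to miss in L1d.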
As an extra note:
When you measure the CPU performance of a program with perf, the measurement contains the CPU cycles consumed by your program and by the perf program itself. So be sure to take into account only the cycles that are used by your program. perf-stat may not allow you to do that. You can use perf-record and perf-report to see the cache loads and cache misses of your program.
As an example, you can use following perf command to record measurements.
perf record -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./test1.out
Then you can use the following perf command to see the recording only for your test1.out program.
perf report --dsos=test1.out
