perf reports misses larger than total accesses - caching

I am using the code below to understand the behavior of cache misses. I compile it with gcc and then report cache statistics with the perf command on Linux.
What I observe is that the number of misses reported is much larger than the total number of accesses, and I can't figure out why. If anyone has seen similar behavior and can shed some light on it, that would really help.
C-code:
#define N 30000
static char array[N][N];

int main(void) {
    register int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            array[j][i]++;
    return 0;
}
command for compile:
gcc test1.c -O0 -o test1.out
command for running the perf tool:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./test1.out

Recently, I came across a document which mentions: "If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch." I am not sure whether the same is true for the data cache.
If L1 misses are retried before the line is loaded from L2, the same could apply to L1 data loads. For the 30000-by-30000 array, each iteration accesses data 30000 bytes apart, so the L1 d-cache load misses and the L1-dcache-load-misses counter increases. Then, before the data arrives from L2 or the LLC, the load is retried to see whether it really misses, incrementing L1-dcache-load-misses further. This could explain why L1-dcache-load-misses is higher than L1-dcache-loads in your case.
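As a quick contrast (my own sketch, not part of your original code), swapping the two indices so the inner loop walks along a row makes the accesses sequential, and the miss counters should drop dramatically:

#define N 30000
static char array[N][N];

int main(void) {
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            array[i][j]++;   /* row-major: consecutive bytes share a cache line */
    return 0;
}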
One extra note:
When you measure the CPU performance of a program with perf, the measurement includes cycles consumed both by your program and by perf itself. So be sure to account only for the cycles used by your program; perf stat may not let you do that. You can use perf record and perf report to see the cache loads and misses attributable to your program alone.
As an example, you can use following perf command to record measurements.
perf record -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./test1.out
Then you can use the following perf command to see the recording only for your test1.out program.
perf report --dsos=test1.out

Related

Perf Result Conflict During Multiplexing

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor (Linux 4.15.0-20-generic kernel). In a relatively idle situation, I ran the following Perf commands and their outputs are shown below. The counters are offcore_response.all_data_rd.l3_miss.local_dram, offcore_response.all_code_rd.l3_miss.local_dram and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss
^C
Performance counter stats for 'system wide':
229,579 offcore_response.all_data_rd.l3_miss.local_dram (99.72%)
489,151 offcore_response.all_code_rd.l3_miss.local_dram (99.77%)
110,543 mem_load_uops_retired.l3_miss (99.79%)
2.868899111 seconds time elapsed
As can be seen, event multiplexing occurred due to PMU sharing among these three events. In a similar scenario, I used the same command, except that I appended :D (mentioned in The Linux perf Event Scheduling Algorithm) to prevent multiplexing for the third event:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D
^C
Performance counter stats for 'system wide':
539,397 offcore_response.all_data_rd.l3_miss.local_dram (68.71%)
890,344 offcore_response.all_code_rd.l3_miss.local_dram (68.67%)
193,555 mem_load_uops_retired.l3_miss:D
2.853095575 seconds time elapsed
But adding :D leads to much larger values for all counters, and this seems to occur only when event multiplexing occurs. Is this output normal? Are the percentage values in parentheses valid? How can the differences in counter values be prevented?
UPDATE:
I also traced the following loop implementation:
#include <iostream>

using namespace std;

int main()
{
    for (unsigned long i = 0; i < 3 * 1e9; i++)
        ;
    return 0;
}
This time Perf was executed 7 times (both with and without :D), but without the -a option. The commands are as follows:
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop
and
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop
The values of the three counters are compared in the following figures.
For all_data_rd, :D has almost no effect; for all_code_rd it reduces the values; and for load_uops_retired it causes larger values.
UPDATE 2:
Based on Peter Cordes's comments, I used a memory-intensive program as follows:
#include <iostream>
#include <cstring>

#define DIM_SIZE 10000

using namespace std;

char from[DIM_SIZE][DIM_SIZE], to[DIM_SIZE][DIM_SIZE];

int main()
{
    for (char x = 'a'; x <= 'z'; x++)
    {
        // set the 100-million-element char array 'from' to x
        for (int i = 0; i < DIM_SIZE; i++)
            memset(from[i], x, DIM_SIZE);
        // copy the entire char array 'from' to char array 'to'
        for (int i = 0; i < DIM_SIZE; i++)
            memcpy(to[i], from[i], DIM_SIZE);
    }
    return 0;
}
The following Perf commands and outputs show that the counter values are almost the same:
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop
Performance counter stats for './loop':
19,836,745 offcore_response.all_data_rd.l3_miss.local_dram (50.04%)
47,309 offcore_response.all_code_rd.l3_miss.local_dram (49.96%)
6,556,957 mem_load_uops_retired.l3_miss:D
0.592795335 seconds time elapsed
and
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop
Performance counter stats for './loop':
18,742,540 offcore_response.all_data_rd.l3_miss.local_dram (66.64%)
76,854 offcore_response.all_code_rd.l3_miss.local_dram (66.64%)
6,967,919 mem_load_uops_retired.l3_miss (66.72%)
0.575828303 seconds time elapsed
In your first two tests, you're not doing anything to generate a consistent amount of off-core memory traffic, so all you're measuring is the background "noise" on a mostly-idle system. (System-wide for the first test with -a, for the second just interrupt handlers which run some code that does miss in cache while this task is the current on a logical core.)
You never said anything about whether those differences are repeatable across runs, e.g. with perf stat ... -r5 to re-run the same test 5 times, and print average and variance. I'd expect it's not repeatable, just random fluctuation of background stuff like network and keyboard/mouse interrupts, but if the :D version makes a consistent or statistically significant difference that might be interesting.
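For example, something like this (my own sketch; sleep 2 is just a dummy workload so each system-wide run has a fixed length):
sudo perf stat -a -r 5 -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss sleep 2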
In your 2nd test, the loop you used won't create any extra memory traffic; it will either be purely registers, or if compiled without optimization will load/store the same one cache line so everything will hit in L1d cache. So you still get a tiny amount of counts from it, mostly from the CRT startup and libc init code. From your main itself, probably one code fetch and zero data loads that miss in L3 since stack memory was probably already hot in cache.
As I commented, a loop that does some memcpy or memset might make sense, or a simple known-good already written example is the STREAM benchmark, so that third test is finally something useful.
The code-read counts in the 3rd test are also just idle noise: glibc memset and memcpy are unrolled but small enough that they fit in the uop cache, so it doesn't even need to touch L1i cache.
We do have an interesting difference in counts for mem_load_uops_retired.l3_miss with / without :D, and off-core data-reads L3 misses.
With the program having two different phases (memset and memcpy), different multiplexing timing will sample different amounts of each phase. Extrapolating to the whole run from a few of those sample periods won't always be correct.
If memset uses NT stores (or rep stosb on a CPU with ERMSB), it won't be doing any reads, just writes. Those methods use a no-RFO store protocol to avoid fetching the old value of a cache line, because they're optimized for overwriting the full line. Without that, plain stores could generate offcore_response.all_data_rd.l3_miss.local_dram I think, if RFO (Read For Ownership) count as part of all-data-read. (On my Skylake, perf list has separate events for offcore_response.demand_data_rd vs. offcore_response.demand_rfo, but "all" would include both, and prefetches (non-"demand").)
memcpy has to read its source data, so it has a source of mem_load_uops_retired.l3_miss. It can also use NT stores or ERMSB to avoid load requests on the destination.
Glibc memcpy / memset do use NT stores or rep movsb / rep stosb for large-enough buffers. So there will be a large difference in rate of offcore_response.all_data_rd.l3_miss.local_dram as well as mem_load_uops_retired.l3_miss between the memcpy and memset portions of your workload.
(I wrote memcpy or memset when I was suggesting creating a simple consistent workload because that would make it uniform with time. I hadn't looked carefully at what events you were counting, that it was only load uops and offcore load requests. So memset wasn't a useful suggestion.)
And of course code reads that miss all the way to L3 are hard to generate. In hand-written asm you could make a huge block of fully-unrolled instructions with GAS .rept 10000000 ; lea %fs:0x1234(,%rax,4), %rcx ; .endr. Or with NASM times 10000000 lea rcx, [fs: 0x1234 + rax*4]. (The segment prefix is just there to make it longer; it has no effect on the result.) That's a long-ish instruction (7 bytes) with no false dependency that can run at least 2 per clock on modern Intel and AMD CPUs, so it should go quickly through the CPU's legacy decoders, maybe fast enough to exceed prefetch and result in lots of demand loads for code that miss in L3.
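A bare-bones GAS version of that idea might look like this (my own untested sketch; the _start / exit boilerplate is an assumption, borrowed from the static-executable example further down):

# bigcode.s -- build with: gcc -static -nostdlib bigcode.s -o bigcode
.globl _start
_start:
    .rept 10000000
    lea %fs:0x1234(,%rax,4), %rcx    # 7-byte instruction; ~70 MB of straight-line code
    .endr
    mov $231, %eax                   # __NR_exit_group
    xor %edi, %edi                   # status = 0
    syscall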

Adding a redundant assignment speeds up code when compiled without optimization

I find an interesting phenomenon:
#include <stdio.h>
#include <time.h>

int main() {
    int p, q;
    clock_t s, e;
    s = clock();
    for (int i = 1; i < 1000; i++){
        for (int j = 1; j < 1000; j++){
            for (int k = 1; k < 1000; k++){
                p = i + j * k;
                q = p; // Removing this line can increase running time.
            }
        }
    }
    e = clock();
    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf\n", t);
    return 0;
}
I use GCC 7.3.0 on an i5-5257U running macOS to compile the code without any optimization. Here is the average run time over 10 runs:
Other people have also tested the case on other Intel platforms and got the same result.
I post the assembly generated by GCC here. The only difference between the two assembly listings is that, before addl $1, -12(%rbp), the faster one has two more instructions:
movl -44(%rbp), %eax
movl %eax, -48(%rbp)
So why does the program run faster with such an assignment?
Peter's answer is very helpful. Tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show the opposite result, which supports the point that this store-forwarding speedup is specific to some Intel CPUs.
BeeOnRope's comments and advice drove me to rewrite the question. :)
The core of this question is an interesting phenomenon related to processor architecture and assembly, so I think it is worth discussing.
TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
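For reference, the loop-carried part of typical gcc -O0 output looks roughly like this (my own sketch, not your exact asm; -12(%rbp) is where k lives, matching the instruction you quoted):

.L3:
    ...                        # p = i + j * k;  q = p;  (not on the critical path)
    addl    $1, -12(%rbp)      # k++ : load k, add, store k -> store-forwarding latency
    cmpl    $999, -12(%rbp)    # reload k for the k < 1000 check
    jle     .L3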
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp) which can sustain one per 6 cycle throughput when run back-to-back might instead only create a bottleneck of one iteration per 4 or 5 cycles.
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, at the risk of making things worse on some other CPUs. See discussion in comments.
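As a minimal illustration of that pass-by-reference pattern (my own sketch; the function names are made up), a callee in a separate translation unit forces the accumulator through memory, so every call pays a store-forwarding round trip:

/* add_to.c : separate file, so without LTO it can't inline into the caller */
void add_to(int *acc, int x) { *acc += x; }

/* caller.c */
void add_to(int *acc, int x);

int sum_loop(const int *v, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        add_to(&total, v[i]);   /* total must live in memory: stored in the callee,
                                   reloaded on the next iteration's call */
    return total;
}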

L2 instruction fetch misses much higher than L1 instruction fetch misses

I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script:
#!/usr/bin/env python
import tempfile
import random
import sys
if __name__ == '__main__':
functions = list()
for i in range(10000):
func_name = "f_{}".format(next(tempfile._get_candidate_names()))
sys.stdout.write("void {}() {{\n".format(func_name))
sys.stdout.write(" double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
sys.stdout.write(" res = pi*r*r*h;\n")
sys.stdout.write(" res = res/(e*e);\n")
sys.stdout.write("}\n")
functions.append(func_name)
sys.stdout.write("int main() {\n")
sys.stdout.write("unsigned int i;\n")
sys.stdout.write("for(i =0 ; i < 100000 ;i ++ ){\n")
for i in range(10000):
r = random.randint(0, len(functions)-1)
sys.stdout.write("{}();\n".format(functions[r]))
sys.stdout.write("}\n")
sys.stdout.write("}\n")
The script simply generates a C program consisting of a lot of randomly named dummy functions that are in turn called in random order from main(). I am compiling the resulting code with gcc 4.8.5 under CentOS 7 with -O0. The code runs on a dual-socket machine fitted with 2x Intel Xeon E5-2630 v3 (Haswell architecture).
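For reference, I generate and build it like this (the script name is just what I happen to use):
python generate.py > /tmp/code.c
gcc -O0 /tmp/code.c -o /tmp/code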
What I am interested in is understanding instruction-related counters reported by perf when profiling the binary compiled from the C code (not the Python script, that is only used to automatically generate the code). In particular, I am observing the following counters with perf stat:
instructions
L1-icache-load-misses (instruction fetches that miss L1, aka r0280 on Haswell)
r2424, L2_RQSTS.CODE_RD_MISS (instruction fetches that miss L2)
rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)
I first profiled the code with all hardware prefetchers disabled in the BIOS, i.e.
MLC Streamer Disabled
MLC Spatial Prefetcher Disabled
DCU Data Prefetcher Disabled
DCU Instruction Prefetcher Disabled
and the results are the following (the process is pinned to the first core of the second CPU and the corresponding NUMA domain, but I guess this doesn't make much difference):
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,108,610,204 instructions
2,613,075,664 L1-icache-load-misses
5,065,167,059 r2424
17 rf824
33.696954142 seconds time elapsed
Considering the figures above, I cannot explain such a high number of instruction fetch misses in L2. I have disabled all prefetchers, and L2_RQSTS.ALL_PF confirms this. But why do I see twice as many instruction fetch misses in L2 as in L1i? In my (simple) mental model of the processor, if an instruction is looked up in L2, it must necessarily have been looked up in L1i first. Clearly I am wrong; what am I missing?
I then tried to run the same code with all the hardware prefetchers enabled, i.e.
MLC Streamer Enabled
MLC Spatial Prefetcher Enabled
DCU Data Prefetcher Enabled
DCU Instruction Prefetcher Enabled
and the results are the following:
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code
Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':
25,109,877,626 instructions
2,599,883,072 L1-icache-load-misses
5,054,883,231 r2424
908,494 rf824
Now, L2_RQSTS.ALL_PF seems to indicate that something more is happening, and although I expected the prefetchers to be a bit more aggressive, I imagine that the instruction prefetcher is severely put to the test by this jump-intensive workload, while the data prefetcher has little to do with it. But again, L2_RQSTS.CODE_RD_MISS is still too high with the prefetchers enabled.
So, to sum up, my question is:
With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS seems to be much higher than L1-icache-load-misses. Even with hardware prefetchers enabled, I still cannot explain it. What is the reason behind such a high count of L2_RQSTS.CODE_RD_MISS compared to L1-icache-load-misses?
The instruction prefetcher can generate requests that don't count as accesses to the L1I cache, but are counted as code fetch requests at higher-numbered memory levels, such as the L2. This is generally true on all Intel microarchitectures with an instruction prefetcher. L2_RQSTS.CODE_RD_MISS counts both demand and prefetch requests from the L1I. Demand requests are generated by a multiplexing unit in the IFU that chooses a target fetch linear address from among the different units in the pipeline that may change the flow, such as the branch prediction units. Prefetch requests are generated by the L1I instruction prefetcher on an L1I miss, if possible.
In general, the number of prefetch fetch requests is nearly proportional to the number of L1I misses. For instruction fetches from memory regions of cacheable memory types, the following formula holds:
ICACHE.MISSES <= L2_RQSTS.CODE_RD_MISS + L2_RQSTS.CODE_RD_HIT
I'm not sure whether this formula also holds for uncacheable fetch requests. I didn't test it in that condition. I know these requests are counted as ICACHE.MISSES, but not sure about the other events.
In your case, most instruction fetches will miss in the L1I and L2. You have 10,000 functions, each of which nearly fully spans 2 64-byte cache lines (here is a version with only two functions), so the code size is much larger than the 256 KiB L2 available on Haswell. The functions are called in a non-sequential, unpredictable order, so the L1I and L2 prefetchers won't help significantly. The only noteworthy exception is returns, all of which will be predicted correctly by the RSB mechanism.
Each of the 10,000 functions is called 100,000 times in a loop. Most fetch requests are for lines occupied by these functions. The total number of useful instruction fetch requests is about 2 lines per function * 10,000 functions * 100,000 iterations = 2,000,000,000 lines, most of which will miss in the L1I and L2 (but probably hit in the L3 after the first cold iteration). Several million other requests will be for lines occupied by the loop body. Your measurements show about 30% more instruction fetches that miss in the L1I. This is because of branch mispredictions, which cause fetch requests for incorrect lines that may not even be in the L1I and/or L2. Each L1I miss may trigger a prefetch, so it's normal for L2 instruction fetches to be within twice the number of L1I misses, which is consistent with your numbers.
In my two-function version, I'm counting 24 instructions per invoked function, so I expect the total number of retired instructions to be approximately 24 billion, but you got 25 billion. Either I don't know how to count, or you have 25 instructions per function for some reason.

Can perf account for all cache misses?

I'm trying to understand the cache misses recorded by perf. I have a minimal program:
int main(void)
{
    return 0;
}
If I compile this:
gcc -std=c99 -W -Wall -Werror -O3 -S -o test.S test.c
I get an expectedly small program:
.file "test.c"
.section .text.startup,"ax",#progbits
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
xorl %eax, %eax
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Debian 4.7.2-5) 4.7.2"
.section .note.GNU-stack,"",#progbits
With only the two instructions, xorl and ret, the program should be less than a cache line in size, so I would expect that if I run perf stat -e "cache-misses:u" ./test I should see only a single cache miss. However, I instead see between 2 and ~400. Similarly, perf stat -e "cache-misses" ./test results in ~700 to ~2500.
Is this simply a case of perf estimating counts or is there something about the way cache misses occur that makes reasoning about them approximate? For example, if I generate and then read an array of integers in memory, can I reason about the prefetching (sequential access should allow for perfect prefetching) or is there something else at play?
You created a main instead of _start, and probably built it into a dynamically-linked executable!! So there's all the CRT startup code, initializing libc, and several system calls. Run strace ./test and see how many system calls it's making. (And of course there's lots of work in user-space that doesn't involve system calls.)
What would be more interesting is a statically linked executable that just makes an _exit(0) or exit_group(0) system call with the syscall instruction, from the _start entry point.
Given an exit.s with these contents:
mov $231, %eax
syscall
build it into a static executable so these two instructions are the only ones executed in user-space:
$ gcc -static -nostdlib exit.s
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
# the default is fine, our instructions are at the start of the .text section
$ perf stat -e cache-misses:u ./a.out
Performance counter stats for './a.out':
6 cache-misses:u
0.000345362 seconds time elapsed
0.000382000 seconds user
0.000000000 seconds sys
I told it to count cache-misses:u to only measure user-space cache misses, instead of everything on the core the process was running on. (That would include kernel cache misses before entering user-space and while handling the exit_group() system call. And potentially interrupt handlers).
(There is hardware support in the PMU for events to count only when the privilege level is user, kernel, or both. So we should expect counts to be off by at most 1 or 2 from counting stuff done during the transition from kernel to user or user to kernel: changing CS potentially results in a load from the GDT of the segment descriptor indexed by the new CS value.)
But what event does cache-misses actually count?
How does Linux perf calculate the cache-references and cache-misses events explains:
perf apparently maps cache-misses to a HW event that counts last-level cache misses. So it's something like the number of DRAM accesses.
Multiple attempts to access the same line in L1d or L1i cache while an L1 miss is already outstanding just adds another thing waiting for the same incoming cache line. So it's not counting loads (or code-fetch) that have to wait for cache.
Multiple loads can coalesce into one access.
But also remember that code-fetch needs to go through the iTLB, triggering a page-walk. Page-walk loads are cached, i.e. they're fetched through the cache hierarchy. So they're counted by the cache-misses event if they do miss.
Repeated runs of the program can result in 0 cache-miss events. The executable binary is a file, and the file is cached (OS's disk cache) by the pagecache. That physical memory is mapped into the address-space of the process running it. It can certainly stay hot in L3 across process start/stop. More interesting is that apparently the page-table stays hot, too. (Not literally "stays" hot; I assume the kernel has to write a new one every time. But presumably the page-walker is hitting at least in L3 cache.)
Or at least whatever else was causing the "extra" cache-miss events doesn't have to happen.
I used perf stat -r16 to run it 16 times and show mean +stddev
$ perf stat -e instructions:u,L1-dcache-loads:u,L1-dcache-load-misses:u,cache-misses:u,itlb_misses.walk_completed:u -r 16 ./exit
Performance counter stats for './exit' (16 runs):
3 instructions:u
1 L1-dcache-loads
5 L1-dcache-load-misses # 506.25% of all L1-dcache hits ( +- 6.37% )
1 cache-misses:u ( +-100.00% )
2 itlb_misses.walk_completed:u
0.0001422 +- 0.0000108 seconds time elapsed ( +- 7.57% )
Note the +-100% on cache-misses.
I don't know why we have 2 itlb_misses.walk_completed events, not just 1. Counting itlb_misses.miss_causes_a_walk:u instead gives us 4 consistently.
Reducing to -r 1 and running repeatedly with manual up-arrow, cache-misses bounces around between 3 and 13. The system is mostly idle but with a bit of background network traffic.
I also don't know why anything is showing as an L1D load, or how there can be 6 misses from one load. But Hadi's answer says that perf's L1-dcache-load-misses event actually counts L1D.REPLACEMENT, so the page-walks could account for that. While L1-dcache-loads counts MEM_INST_RETIRED.ALL_LOADS. mov-immediate isn't a load, and I wouldn't have thought syscall is either. But maybe it is, otherwise the HW is falsely counting a kernel instruction or there's an off-by-1 somewhere.
This is not an easy topic, but if you are interested in counting cache misses from (for example) accessing an array, then that is what you should start with.
There are numerous pitfalls, but the simplest approach that is likely to lead to insight would start with a program that allocates an array, stores values into the array, and then reads the array a programmable number of times.
Storing values into the array is necessary to create the virtual to physical page mappings. The performance counter results for this section are likely to be incomprehensible because of the tricks that the OS uses in initializing these pages -- e.g., starting with a mapping to a zero-filled page and setting the access to "copy on write".
After the pages are instantiated, the performance counts for the reads are likely to make a lot more sense. I use a programmable number of reads so that I can take the differences between the counter values for 20 reads and 10 reads (for example).
The array size should be chosen to be significantly larger than the available cache at the level you want to test.
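A minimal sketch of that kind of harness (my own; the array size and the default pass count are arbitrary) might look like:

#include <stdio.h>
#include <stdlib.h>

#define NELEMS (64 * 1024 * 1024)   /* 256 MiB of ints: far larger than a typical L3 */

int main(int argc, char **argv) {
    int passes = (argc > 1) ? atoi(argv[1]) : 10;   /* programmable number of read passes */
    int *a = malloc(NELEMS * sizeof(int));
    long long sum = 0;

    /* Store pass: instantiates the virtual-to-physical mappings.
       Counts for this phase are likely to be confusing (zero page + copy-on-write tricks). */
    for (size_t i = 0; i < NELEMS; i++)
        a[i] = (int)i;

    /* Read passes: compare counter deltas between, e.g., 20 passes and 10 passes. */
    for (int r = 0; r < passes; r++)
        for (size_t i = 0; i < NELEMS; i++)
            sum += a[i];

    printf("%lld\n", sum);   /* keep the reads from being optimized away */
    free(a);
    return 0;
}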
Unfortunately, "perf" makes it relatively difficult to figure out what is actually being programmed into the performance counters at the hardware level (which is the only level that counts!). The more "generic" the event, the harder it is to guess what is actually being measured.... On my recent Intel-based systems, "perf list" gives a long (>3600 lines) listing of available events. The events starting in the section labelled "cache:" are direct translations of the hardware events that are described in Chapter 19 of Volume 3 of the Intel Architectures Software Developers Manual.
You are correct to be concerned about how hardware prefetches are counted. In recent Intel architectures, events that report cache accesses can typically be configured to count demand accesses, hardware prefetches, or both. Events that report source locations for load instructions won't give any insight into where the HW prefetch found the data -- only how close to the processor it had gotten by the time the load operation executed.
I have found the event "l1d.replacements" to be a reliable L1 Data Cache Miss indicator on recent Intel processors. It simply counts all cache lines moved into the L1 Data Cache (whether due to loads, stores, prefetches, etc). At the other end of the hierarchy, the DRAM counters (e.g., "uncore_imc_0/cas_count_read/") are also reliable, but are subject to contamination due to any other activity in the system. Counters for "two-sided" caches (e.g., L2 & L3) are more likely to be confusing because it is not always clear whether the event is counting cache lines sent in from one side or the other or both (e.g., "l2_lines_in.all"). With some carefully controlled experiments, it is usually possible to find a subset of reliable & understandable events at these intermediate levels. It is not always possible to find enough reliable counters to make a full accounting of all traffic at each level of the memory hierarchy, but that is a longer story....
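For example, counting those events around a run of the read harness sketched above (./array_test is a hypothetical name for it; 20 is the number of read passes) might look something like the following. Note that perf list may spell the first event l1d.replacement rather than l1d.replacements, event names vary across generations, and the uncore IMC counters are per-socket, so they need system-wide mode:
perf stat -e l1d.replacement ./array_test 20
perf stat -a -e uncore_imc_0/cas_count_read/ ./array_test 20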
The process memory space is not only about your code; other sources such as the heap, the stack, and the data segment also contribute to the cache misses.
(diagram of the process memory layout; source: tenouk.com)
I don't think you can estimate cache-miss numbers, just as you cannot predict the running sequence of every thread in a multithreaded program.
However, cache-miss analysis is useful to find and target false sharing. Here are some useful links you can refer to:
http://igoro.com/archive/gallery-of-processor-cache-effects/
http://qqibrow.github.io/CPU-Cache-Effects-and-Linux-Perf/

Is it possible to know the address of a cache miss?

Whenever a cache miss occurs, is it possible to know the address of that missed cache line? Are there any hardware performance counters in modern processors that can provide such information?
Yes, on modern Intel hardware there are precise memory sampling events that track not only the address of the instruction but the data address as well. These events also include a great deal of other information, such as the level of the cache hierarchy in which the memory access was satisfied, the total latency, and so on.
You can use perf mem to sample this information and produce a report.
For example, the following program:
#include <stddef.h>

#define SIZE (100 * 1024 * 1024)

int p[SIZE] = {1};

void do_writes(volatile int *p) {
    for (size_t i = 0; i < SIZE; i += 5) {
        p[i] = 42;
    }
}

void do_reads(volatile int *p) {
    volatile int sink;
    for (size_t i = 0; i < SIZE; i += 5) {
        sink = p[i];
    }
}

int main(int argc, char **argv) {
    do_writes(p);
    do_reads(p);
}
compiled with:
g++ -g -O1 -march=native perf-mem-test.cpp -o perf-mem-test
and run with:
sudo perf mem record -U ./perf-mem-test && sudo perf mem report
Produces a report of memory accesses sorted by latency like this:
The Data Symbol column shows which address the load was targeting: most here show up as something like p+0xa0658b4, which means an offset of 0xa0658b4 from the start of p. That makes sense, since the code is reading and writing p. The list is sorted by "local weight", which is the access latency in reference cycles1.
Note that the information recorded is only a sample of memory accesses: recording every miss would usually be way too much information. Furthermore, it only records loads with a latency of 30 cycles or more by default, but you can apparently tweak this with command line arguments.
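For instance, raising the threshold looks something like this (an assumption on my part: I believe the option is called --ldlat and goes before the record subcommand, but check perf mem --help, since the spelling and placement may differ between perf versions):
sudo perf mem --ldlat 50 record -U ./perf-mem-test && sudo perf mem report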
If you're only interested in accesses that miss in all levels of cache, you're looking for the "Local RAM hit" lines2. Perhaps you can restrict your sampling to only cache misses - I'm pretty sure the Intel memory sampling stuff supports that, and I think you can tell perf mem to look at only misses.
Finally, note that here I'm using the -U argument after record, which instructs perf mem to only record userspace events. By default it will include kernel events, which may or may not be useful for you. For the example program, there are many kernel events associated with copying the p array from the binary into writable process memory.
Keep in mind that I specifically arranged my program such that the global array p ended up in the initialized .data section (the binary is ~400 MB!), so that it shows up with the right symbol in the listing. The vast majority of the time your process is going to be accessing dynamically allocated or stack memory, which will just give you a raw address. Whether you can map that back to a meaningful object depends on whether you track enough information to make that possible.
1 I think it's in reference cycles, but I could be wrong and the kernel may have already converted it to nanoseconds?
2 The "Local" and "hit" part here refer to the fact that we hit the RAM attached to the current core, i.e., we didn't have go to the RAM associated with another socket in a multi-socket NUMA configuration.
If you want to know the exact virtual or physical address of every cache miss on a particular processor, that would be very hard and sometimes impossible. But you are more likely to be interested in expensive memory access patterns; those patterns that incur large latencies because they miss in one or more levels of the cache subsystem. Note that it is important to keep in mind that a cache miss on one processor might be a cache hit on another depending on design details of each processor and depending also on the operating system.
There are several ways to find such patterns, two are commonly used. One is to use a simulator such as gem5 or Sniper. Another is to use hardware performance events. Events that represent cache misses are available but they do not provide any details on why or where a miss occurred. However, using a profiler, you can approximately associate cache misses as reported by the corresponding hardware performance events with the instructions that caused them which in turn can be mapped back to locations in the source code using debug information. Examples of such profilers include Intel VTune Amplifier and AMD CodeXL. The results produced by simulators and profilers may not be accurate and so you have to be careful when interpreting them.
