Perf Result Conflict During Multiplexing - linux-kernel

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor (Linux 4.15.0-20-generic kernel). In a relatively idle situation, I ran the following Perf commands, whose outputs are shown below. The counters are offcore_response.all_data_rd.l3_miss.local_dram, offcore_response.all_code_rd.l3_miss.local_dram and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss
^C
Performance counter stats for 'system wide':
229,579 offcore_response.all_data_rd.l3_miss.local_dram (99.72%)
489,151 offcore_response.all_code_rd.l3_miss.local_dram (99.77%)
110,543 mem_load_uops_retired.l3_miss (99.79%)
2.868899111 seconds time elapsed
As can be seen, event multiplexing occurred due to PMU sharing among these three events. In a similar scenario, I used the same command, except that I appended :D (mentioned in The Linux perf Event Scheduling Algorithm) to prevent multiplexing for the third event:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D
^C
Performance counter stats for 'system wide':
539,397 offcore_response.all_data_rd.l3_miss.local_dram (68.71%)
890,344 offcore_response.all_code_rd.l3_miss.local_dram (68.67%)
193,555 mem_load_uops_retired.l3_miss:D
2.853095575 seconds time elapsed
But adding :D leads to much larger values for all counters, and this seems to happen only when event multiplexing occurs. Is this output normal? Are the percentage values in parentheses valid? How can the differences in counter values be prevented?
UPDATE:
I also traced the following loop implementation:
#include <iostream>
using namespace std;

int main()
{
    for (unsigned long i = 0; i < 3 * 1e9; i++)
        ;
    return 0;
}
This time, Perf was executed 7 times (both with and without :D), but without the -a option. The commands are as follows:
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop
and
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop
The values of the three counters are compared in the following figures:
For all_data_rd, :D has almost no effect; for all_code_rd it reduces the values; and for load_uops_retired it results in larger values.
UPDATE 2:
Based on Peter Cordes's comments, I used a memory-intensive program as follows:
#include <iostream>
#include <cstring>
#define DIM_SIZE 10000
using namespace std;
char from[DIM_SIZE][DIM_SIZE], to[DIM_SIZE][DIM_SIZE];

int main()
{
    for (char x = 'a'; x <= 'z'; x++)
    {
        // set the 100 million element char array 'from' to x
        for (int i = 0; i < DIM_SIZE; i++)
            memset(from[i], x, DIM_SIZE);
        // copy the entire char array 'from' to char array 'to'
        for (int i = 0; i < DIM_SIZE; i++)
            memcpy(to[i], from[i], DIM_SIZE);
    }
    return 0;
}
The following Perf commands and outputs show that the counter values are almost the same:
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop
Performance counter stats for './loop':
19,836,745 offcore_response.all_data_rd.l3_miss.local_dram (50.04%)
47,309 offcore_response.all_code_rd.l3_miss.local_dram (49.96%)
6,556,957 mem_load_uops_retired.l3_miss:D
0.592795335 seconds time elapsed
and
sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop
Performance counter stats for './loop':
18,742,540 offcore_response.all_data_rd.l3_miss.local_dram (66.64%)
76,854 offcore_response.all_code_rd.l3_miss.local_dram (66.64%)
6,967,919 mem_load_uops_retired.l3_miss (66.72%)
0.575828303 seconds time elapsed

In your first two tests, you're not doing anything to generate a consistent amount of off-core memory traffic, so all you're measuring is the background "noise" on a mostly-idle system. (System-wide for the first test with -a; for the second, just interrupt handlers which run some code that does miss in cache while this task is the current task on a logical core.)
You never said anything about whether those differences are repeatable across runs, e.g. with perf stat ... -r5 to re-run the same test 5 times, and print average and variance. I'd expect it's not repeatable, just random fluctuation of background stuff like network and keyboard/mouse interrupts, but if the :D version makes a consistent or statistically significant difference that might be interesting.
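For example, something like:
sudo perf stat -a -r 5 -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss sleep 2
(-r needs a command to repeat rather than stopping with ^C; sleep 2 just provides a fixed system-wide measurement window.)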
In your 2nd test, the loop you used won't create any extra memory traffic; it will either be purely registers, or if compiled without optimization will load/store the same one cache line so everything will hit in L1d cache. So you still get a tiny amount of counts from it, mostly from the CRT startup and libc init code. From your main itself, probably one code fetch and zero data loads that miss in L3 since stack memory was probably already hot in cache.
As I commented, a loop that does some memcpy or memset might make sense; a simple known-good, already-written example is the STREAM benchmark. So that third test is finally something useful.
The code-read counts in the 3rd test are also just idle noise: glibc memset and memcpy are unrolled but small enough that they fit in the uop cache, so it doesn't even need to touch L1i cache.
We do have an interesting difference in counts for mem_load_uops_retired.l3_miss with / without :D, and off-core data-reads L3 misses.
With the program having two different phases (memset and memcpy), different multiplexing timing will sample different amounts of each phase. Extrapolating to the whole run time from a few of those sample periods won't always be correct.
If memset uses NT stores (or rep stosb on a CPU with ERMSB), it won't be doing any reads, just writes. Those methods use a no-RFO store protocol to avoid fetching the old value of a cache line, because they're optimized for overwriting the full line. Without that, plain stores could generate offcore_response.all_data_rd.l3_miss.local_dram I think, if RFO (Read For Ownership) count as part of all-data-read. (On my Skylake, perf list has separate events for offcore_response.demand_data_rd vs. offcore_response.demand_rfo, but "all" would include both, and prefetches (non-"demand").)
memcpy has to read its source data, so it has a source of mem_load_uops_retired.l3_miss. It can also use NT stores or ERMSB to avoid load requests on the destination.
Glibc memcpy / memset do use NT stores or rep movsb / rep stosb for large-enough buffers. So there will be a large difference in rate of offcore_response.all_data_rd.l3_miss.local_dram as well as mem_load_uops_retired.l3_miss between the memcpy and memset portions of your workload.
(I wrote memcpy or memset when I was suggesting creating a simple consistent workload because that would make it uniform with time. I hadn't looked carefully at what events you were counting, that it was only load uops and offcore load requests. So memset wasn't a useful suggestion.)
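To see the RFO effect in isolation, here is a minimal sketch of my own (not something the OP ran; assumes AVX2, compile with something like g++ -std=c++17 -O2 -mavx2) that writes the same large buffer once with plain vector stores and once with NT stores. If RFOs are indeed counted under all_data_rd, the plain-store pass should generate off-core data reads for the destination lines while the NT-store pass should not:
#include <immintrin.h>
#include <cstddef>
#include <cstdlib>

int main()
{
    const size_t BYTES = 256 * 1024 * 1024;   // much larger than L3
    char *buf = static_cast<char *>(aligned_alloc(64, BYTES));

    // Plain stores: a destination line that misses is fetched first (an RFO read), then written.
    for (size_t i = 0; i < BYTES; i += 32)
        _mm256_store_si256(reinterpret_cast<__m256i *>(buf + i), _mm256_set1_epi8(1));

    // NT stores: full-line writes avoid the RFO, so no data-read traffic for the destination.
    for (size_t i = 0; i < BYTES; i += 32)
        _mm256_stream_si256(reinterpret_cast<__m256i *>(buf + i), _mm256_set1_epi8(2));
    _mm_sfence();          // order the NT stores before anything after them

    return buf[0];         // keep the buffer from being optimized away
}
Measuring each pass separately (e.g. by commenting one loop out) with the same offcore_response events should show the difference.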
And of course code reads that miss all the way to L3 are hard to generate. In hand-written asm you could make a huge block of fully-unrolled instructions with GAS .rept 10000000 ; lea %fs:0x1234(,%rax,4), %rcx ; .endr. Or with NASM times 10000000 lea rcx, [fs: 0x1234 + rax*4]. (The segment prefix is just there to make it longer, no effect on the result.) That's a long-ish instruction (7 bytes) with no false dependency that can run at least 2 per clock on modern Intel and AMD CPUs, so it should go quickly through the CPU's legacy decoders, maybe fast enough to exceed prefetch and result in lots of demand-loads for code that miss in L3.

Related

Does using SIMD load the main CPU registers?

Let's imagine we have a software developer whose goal is to achieve the absolute maximum of the CPU's performance.
Today's CPUs have many cores, we can load data into cache for faster processing, and we also have SIMD instructions (AVX, for example) that let us sum/multiply/do other operations on arrays of items (multiply 8 integers per CPU clock). The disadvantage of these instructions is the cost of sending data and instructions to the SIMD module plus the overhead of converting between vector types and primitive types (sorry, I'm only familiar with C#'s Vector). (We're not looking at code complexity for now.)
As far as I understand, while we are using SIMD, the main registers of the CPU are used only for sending and receiving data to the SIMD registers, and the main ALU blocks used for general-purpose calculations are idle at this time.
And here is my question: will using SIMD instructions load the main CPU blocks? For example, if we have a huge amount of different calculations (let's imagine 40% of them run best on SIMD and 60% of them run better as usual scalar code), will SIMD allow us to gain a performance boost in this way: 100% of all cores' performance + n% of SIMD's boost?
I'm asking because, for example, with GPGPU we can use the GPU for parallel calculations while the CPU in this case is used only for sending and receiving data, so it's idle all the time and we can utilize its performance for latency-sensitive tasks.
Looks like this is a question about out-of-order execution? Modern x64 CPUs have a number of execution ports, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can start in parallel on an Intel Skylake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.
So for example, you may be able to dispatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will have to wait for the operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should be able to keep each of those ports busy (or at least, that's the basic aim!).
Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self-evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule).
Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)
#include <immintrin.h>

typedef __m256 float8;
#define mul8f _mm256_mul_ps

void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
    for (int i = 0; i < count; ++i)
    {
        a[i] = mul8f(a[i], b[i]);
        b[i] = mul8f(b[i], c[i]);
    }
}
Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).
The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for a[i], b[i], and c[i])
2. to load/store to those memory locations
3. to increment the loop counter
4. to test whether the count has been reached
It just so happens that we are also making use of the SIMD units to do a couple of multiplications...
Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The CPU hardware used to compute float and float8 types is exactly the same. There are simply a couple of bits in the machine-code encoding that specify the choice between float/__m128/__m256.
i.e. https://godbolt.org/z/xTcLrf
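For instance, a scalar counterpart (my own variant, not from the original answer): built with auto-vectorization disabled it compiles to vmulss where the AVX2 version uses vmulps, and both issue to the same FP multiply ports.
// Scalar version of computeThing: one float per multiply instead of eight,
// but the multiplies run on the same FP execution units.
void computeThingScalar(float a[], float b[], float c[], int count)
{
    for (int i = 0; i < count; ++i)
    {
        a[i] = a[i] * b[i];
        b[i] = b[i] * c[i];
    }
}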

Adding a redundant assignment speeds up code when compiled without optimization

I find an interesting phenomenon:
#include <stdio.h>
#include <time.h>

int main() {
    int p, q;
    clock_t s, e;
    s = clock();
    for (int i = 1; i < 1000; i++) {
        for (int j = 1; j < 1000; j++) {
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q = p; // Removing this line can increase running time.
            }
        }
    }
    e = clock();
    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf\n", t);
    return 0;
}
I use GCC 7.3.0 on an i5-5257U (Mac OS) to compile the code without any optimization. Here is the average run time over 10 runs:
Other people have also tested the case on other Intel platforms and got the same result.
I posted the assembly generated by GCC here. The only difference between the two assembly listings is that before addl $1, -12(%rbp) the faster one has two more instructions:
movl -44(%rbp), %eax
movl %eax, -48(%rbp)
So why does the program run faster with such an assignment?
Peter's answer is very helpful. The tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show an opposite result, which supports the idea that the store-forwarding speedup is specific to some Intel CPUs.
And BeeOnRope's comment and advice drove me to rewrite the question. :)
The core of this question is an interesting phenomenon related to processor architecture and assembly, so I think it is worth discussing.
TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp) which can sustain one per 6 cycle throughput when run back-to-back might instead only create a bottleneck of one iteration per 4 or 5 cycles.
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
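For example, something like:
gcc -O3 -march=native test.c -o test
perf stat -r 10 ./test
(with the caveat that a loop with no side effects, like the one above, will be optimized away entirely, so make sure the optimized build does real work whose result is used.)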
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time-optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs and maybe make things worse on some other CPUs. See discussion in comments

perf reports misses larger than total accesses

I am using the code below to understand the behavior of cache misses. I am compiling it with gcc and then reporting cache statistics with the perf command.
What I observe is that the number of misses reported is much larger than the total number of accesses. I couldn't figure out the reason for this. If anyone else has seen similar behavior and could shed some light, it would really help.
C-code:
#define N 30000
static char array[N][N];

int main(void) {
    register int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            array[j][i]++;
    return 0;
}
Command to compile:
gcc test1.c -O0 -o test1.out
Command to run the perf tool:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./test1.out
Recently, I came across a document which mentions: "If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch." I am not sure whether it is true for the data cache or not.
If L1 misses are retried before the line is loaded from L2, the same applies to L1 loads as well. For the 30000-by-30000 array, you are accessing data 30000 bytes apart in each iteration, so the L1 dcache load fails and the l1-dcache-load-misses counter increases. Then, before the data is loaded from L2 or the LLC, the access is retried to see whether the data really misses, and the L1-dcache-load-misses counter increases further. This could be a good explanation for your case of L1-dcache-load-misses being higher than L1-dcache-loads.
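For comparison (my own variant, not from the question): swapping the loop order so that consecutive iterations touch adjacent bytes of the same row means each 64-byte line serves 64 consecutive accesses, so L1-dcache-load-misses should drop far below L1-dcache-loads:
#define N 30000
static char array[N][N];

int main(void) {
    register int i, j;
    /* array[i][j] walks memory sequentially: one new cache line per 64 accesses */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            array[i][j]++;
    return 0;
}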
On extra note:
When you measure the CPU performance of a program with perf, the measurement contains CPU cycles consumed by your program and by the perf program itself. So be sure to account only for the cycles used by your program; perf stat may not allow you to do that. You can use perf record and perf report to see the cache loads and cache misses attributable to your program.
As an example, you can use following perf command to record measurements.
perf record -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./test1.out
Then you can use the following perf command to see the recording only for your test1.out program.
perf report --dsos=test1.out

Can perf account for all cache misses?

I'm trying to understand the cache misses recorded by perf. I have a minimal program:
int main(void)
{
    return 0;
}
If I compile this:
gcc -std=c99 -W -Wall -Werror -O3 -S -o test.S test.c
I get an expectedly small program:
.file "test.c"
.section .text.startup,"ax",#progbits
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
xorl %eax, %eax
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Debian 4.7.2-5) 4.7.2"
.section .note.GNU-stack,"",#progbits
With only the two instructions, xorl and ret, the program should be less than a cache line in size, so I would expect that if I run perf stat -e "cache-misses:u" ./test I should see only a single cache miss. However, I instead see between 2 and ~400. Similarly, perf stat -e "cache-misses" ./test results in ~700 to ~2500.
Is this simply a case of perf estimating counts or is there something about the way cache misses occur that makes reasoning about them approximate? For example, if I generate and then read an array of integers in memory, can I reason about the prefetching (sequential access should allow for perfect prefetching) or is there something else at play?
You created a main instead of _start, and probably built it into a dynamically-linked executable!! So there's all the CRT startup code, initializing libc, and several system calls. Run strace ./test and see how many system calls it's making. (And of course there's lots of work in user-space that doesn't involve system calls).
What would be more interesting is a statically linked executable that just makes an _exit(0) or exit_group(0) system call with the syscall instruction, from the _start entry point.
Given an exit.s with these contents:
mov $231, %eax
syscall
build it into a static executable so these two instructions are the only ones executed in user-space:
$ gcc -static -nostdlib exit.s
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
# the default is fine, our instructions are at the start of the .text section
$ perf stat -e cache-misses:u ./a.out
Performance counter stats for './a.out':
6 cache-misses:u
0.000345362 seconds time elapsed
0.000382000 seconds user
0.000000000 seconds sys
I told it to count cache-misses:u to only measure user-space cache misses, instead of everything on the core the process was running on. (That would include kernel cache misses before entering user-space and while handling the exit_group() system call. And potentially interrupt handlers).
(There is hardware support in the PMU for events to count when the privilege level is user, kernel, or both. So we should expect counts to be off by at most 1 or 2 from counting stuff done during the transition from kernel->user or user->kernel. (Changing CS, potentially resulting in a load from the GDT of the segment descriptor indexed by the new CS value.))
But what event does cache-misses actually count?
How does Linux perf calculate the cache-references and cache-misses events explains:
perf apparently maps cache-misses to a HW event that counts last-level cache misses. So it's something like the number of DRAM accesses.
Multiple attempts to access the same line in L1d or L1i cache while an L1 miss is already outstanding just adds another thing waiting for the same incoming cache line. So it's not counting loads (or code-fetch) that have to wait for cache.
Multiple loads can coalesce into one access.
But also remember that code-fetch needs to go through the iTLB, triggering a page-walk. Page-walk loads are cached, i.e. they're fetched through the cache hierarchy. So they're counted by the cache-misses event if they do miss.
Repeated runs of the program can result in 0 cache-miss events. The executable binary is a file, and the file is cached (OS's disk cache) by the pagecache. That physical memory is mapped into the address-space of the process running it. It can certainly stay hot in L3 across process start/stop. More interesting is that apparently the page-table stays hot, too. (Not literally "stays" hot; I assume the kernel has to write a new one every time. But presumably the page-walker is hitting at least in L3 cache.)
Or at least whatever else was causing the "extra" cache-miss events doesn't have to happen.
I used perf stat -r 16 to run it 16 times and show the mean and standard deviation:
$ perf stat -e instructions:u,L1-dcache-loads:u,L1-dcache-load-misses:u,cache-misses:u,itlb_misses.walk_completed:u -r 16 ./exit
Performance counter stats for './exit' (16 runs):
3 instructions:u
1 L1-dcache-loads
5 L1-dcache-load-misses # 506.25% of all L1-dcache hits ( +- 6.37% )
1 cache-misses:u ( +-100.00% )
2 itlb_misses.walk_completed:u
0.0001422 +- 0.0000108 seconds time elapsed ( +- 7.57% )
Note the +-100% on cache-misses.
I don't know why we have 2 itlb_misses.walk_completed events, not just 1. Counting itlb_misses.miss_causes_a_walk:u instead gives us 4 consistently.
Reducing to -r 1 and running repeatedly with manual up-arrow, cache-misses bounces around between 3 and 13. The system is mostly idle but with a bit of background network traffic.
I also don't know why anything is showing as an L1D load, or how there can be 6 misses from one load. But Hadi's answer says that perf's L1-dcache-load-misses event actually counts L1D.REPLACEMENT, so the page-walks could account for that. While L1-dcache-loads counts MEM_INST_RETIRED.ALL_LOADS. mov-immediate isn't a load, and I wouldn't have thought syscall is either. But maybe it is, otherwise the HW is falsely counting a kernel instruction or there's an off-by-1 somewhere.
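If you want to cross-check that, you could count the underlying events by their architecture-specific names (these names exist in perf list on my Skylake; they may differ on other microarchitectures):
perf stat -e l1d.replacement:u,mem_inst_retired.all_loads:u -r 16 ./exit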
This is not an easy topic, but if you are interested in counting cache misses from (for example) accessing an array, then that is what you should start with.
There are numerous pitfalls, but the simplest approach that is likely to lead to insight would start with a program that allocates an array, stores values into the array, and then reads the array a programmable number of times.
Storing values into the array is necessary to create the virtual to physical page mappings. The performance counter results for this section are likely to be incomprehensible because of the tricks that the OS uses in initializing these pages -- e.g., starting with a mapping to a zero-filled page and setting the access to "copy on write".
After the pages are instantiated, the performance counts for the reads are likely to make a lot more sense. I use a programmable number of reads so that I can take the differences between the counter values for 20 reads and 10 reads (for example).
The array size should be chosen to be significantly larger than the available cache at the level you want to test.
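A minimal sketch of that kind of harness (my own, under the assumptions above; not a polished benchmark): touch the array once to instantiate the pages, then do a programmable number of read passes so you can difference the counter values between, say, 20 passes and 10 passes.
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    // Array size: pick something well beyond the cache level you want to test.
    const size_t N = 256 * 1024 * 1024;
    const int passes = (argc > 1) ? atoi(argv[1]) : 10;

    char *a = (char *)malloc(N);

    // Store pass: instantiates the virtual-to-physical mappings. Counts from
    // this phase are hard to interpret (zero pages, copy-on-write, etc.).
    for (size_t i = 0; i < N; i++)
        a[i] = (char)i;

    // Read passes: the phase whose counter deltas are meaningful. Run with
    // different pass counts and subtract the perf counter values.
    unsigned long sum = 0;
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < N; i += 64)   // one access per 64-byte cache line
            sum += a[i];

    printf("%lu\n", sum);   // keep the reads from being optimized away
    free(a);
    return 0;
}
Run it under perf stat with the events you care about for two different pass counts and subtract the results.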
Unfortunately, "perf" makes it relatively difficult to figure out what is actually being programmed into the performance counters at the hardware level (which is the only level that counts!). The more "generic" the event, the harder it is to guess what is actually being measured.... On my recent Intel-based systems, "perf list" gives a long (>3600 lines) listing of available events. The events starting in the section labelled "cache:" are direct translations of the hardware events that are described in Chapter 19 of Volume 3 of the Intel Architectures Software Developers Manual.
You are correct to be concerned about how hardware prefetches are counted. In recent Intel architectures, events that report cache accesses can typically be configured to count demand accesses, hardware prefetches, or both. Events that report source locations for load instructions won't give any insight into where the HW prefetch found the data -- only how close to the processor it had gotten by the time the load operation executed.
I have found the event "l1d.replacements" to be a reliable L1 Data Cache Miss indicator on recent Intel processors. It simply counts all cache lines moved into the L1 Data Cache (whether due to loads, stores, prefetches, etc). At the other end of the hierarchy, the DRAM counters (e.g., "uncore_imc_0/cas_count_read/") are also reliable, but are subject to contamination due to any other activity in the system. Counters for "two-sided" caches (e.g., L2 & L3) are more likely to be confusing because it is not always clear whether the event is counting cache lines sent in from one side or the other or both (e.g., "l2_lines_in.all"). With some carefully controlled experiments, it is usually possible to find a subset of reliable & understandable events at these intermediate levels. It is not always possible to find enough reliable counters to make a full accounting of all traffic at each level of the memory hierarchy, but that is a longer story....
The process memory space is not only about your code; there are other sources such as the heap, stack, and data segment that will also contribute to the cache misses.
(figure: process memory layout, source: tenouk.com)
I don't think you can estimate cache-miss numbers, just like you cannot predict the running sequence of every thread in a multithreaded program.
However, cache-miss analysis is useful for finding and targeting false sharing. Here are some useful links you can refer to:
http://igoro.com/archive/gallery-of-processor-cache-effects/
http://qqibrow.github.io/CPU-Cache-Effects-and-Linux-Perf/

Is it possible to know the address of a cache miss?

Whenever a cache miss occurs, is it possible to know the address of that missed cache line? Are there any hardware performance counters in modern processors that can provide such information?
Yes, on modern Intel hardware there are precise memory sampling events that track not only the address of the instruction, but the data address as well. These events also include a great deal of other information, such as which level of the cache hierarchy the memory access was satisfied in, the total latency, and so on.
You can use perf mem to sample this information and produce a report.
For example, the following program:
#include <stddef.h>

#define SIZE (100 * 1024 * 1024)

int p[SIZE] = {1};

void do_writes(volatile int *p) {
    for (size_t i = 0; i < SIZE; i += 5) {
        p[i] = 42;
    }
}

void do_reads(volatile int *p) {
    volatile int sink;
    for (size_t i = 0; i < SIZE; i += 5) {
        sink = p[i];
    }
}

int main(int argc, char **argv) {
    do_writes(p);
    do_reads(p);
}
compiled with:
g++ -g -O1 -march=native perf-mem-test.cpp -o perf-mem-test
and run with:
sudo perf mem record -U ./perf-mem-test && sudo perf mem report
Produces a report of memory accesses sorted by latency like this:
The Data Symbol column shows what address the load was targeting - most here show up as something like p+0xa0658b4, which means an offset of 0xa0658b4 from the start of p; that makes sense as the code is reading and writing p. The list is sorted by "local weight", which is the access latency in reference cycles1.
Note that the information recorded is only a sample of memory accesses: recording every miss would usually be way too much information. Furthermore, it only records loads with a latency of 30 cycles or more by default, but you can apparently tweak this with command line arguments.
If you're only interested in accesses that miss in all levels of cache, you're looking for the "Local RAM hit" lines2. Perhaps you can restrict your sampling to only cache misses - I'm pretty sure the Intel memory sampling stuff supports that, and I think you can tell perf mem to look at only misses.
Finally, note that here I'm using the -U argument after record, which instructs perf mem to only record userspace events. By default it will include kernel events, which may or may not be useful to you. For the example program, there are many kernel events associated with copying the p array from the binary into writable process memory.
Keep in mind that I specifically arranged my program such that the global array p ended up in the initialized .data section (the binary is ~400 MB!), so that it shows up with the right symbol in the listing. The vast majority of the time your process is going to be accessing dynamically allocated or stack memory, which will just give you a raw address. Whether you can map this back to a meaningful object depends on whether you track enough information to make that possible.
1 I think it's in reference cycles, but I could be wrong and the kernel may have already converted it to nanoseconds?
2 The "Local" and "hit" part here refer to the fact that we hit the RAM attached to the current core, i.e., we didn't have go to the RAM associated with another socket in a multi-socket NUMA configuration.
If you want to know the exact virtual or physical address of every cache miss on a particular processor, that would be very hard and sometimes impossible. But you are more likely to be interested in expensive memory access patterns; those patterns that incur large latencies because they miss in one or more levels of the cache subsystem. Note that it is important to keep in mind that a cache miss on one processor might be a cache hit on another depending on design details of each processor and depending also on the operating system.
There are several ways to find such patterns, two are commonly used. One is to use a simulator such as gem5 or Sniper. Another is to use hardware performance events. Events that represent cache misses are available but they do not provide any details on why or where a miss occurred. However, using a profiler, you can approximately associate cache misses as reported by the corresponding hardware performance events with the instructions that caused them which in turn can be mapped back to locations in the source code using debug information. Examples of such profilers include Intel VTune Amplifier and AMD CodeXL. The results produced by simulators and profilers may not be accurate and so you have to be careful when interpreting them.
