I have developed code that takes a large 2-D image (up to 64 MPixels) as input and:
Applies a filter to each row
Transposes the image (using blocking to avoid lots of cache misses)
Applies a filter to the columns (now-rows) of the image
Transposes the filtered image back to carry on with other calculations
Although it doesn't change anything, for the sake of completeness of my question: the filtering applies a discrete wavelet transform and the code is written in C.
My end goal is to make this run as fast as possible. The speedups I have so far are more than 10X through the use of the blocked matrix transpose, vectorization, multithreading, compiler-friendly code, etc.
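For reference, the blocked transpose is along these lines (a simplified sketch only; the real code is vectorized and multithreaded, and BLOCK is an illustrative tile size that should be tuned to the cache):

    /* Sketch of a cache-blocked transpose: one BLOCK x BLOCK tile at a time,
       so both the reads and the writes stay within a few cache lines. */
    #define BLOCK 64

    static void transpose_blocked(const float *restrict src, float *restrict dst,
                                  int rows, int cols)
    {
        for (int ib = 0; ib < rows; ib += BLOCK)
            for (int jb = 0; jb < cols; jb += BLOCK)
                for (int i = ib; i < rows && i < ib + BLOCK; i++)
                    for (int j = jb; j < cols && j < jb + BLOCK; j++)
                        dst[(size_t)j * rows + i] = src[(size_t)i * cols + j];
    }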
Coming to my question: The latest profiling stats of the code I have (using perf stat -e) have troubled me.
76,321,873 cache-references
8,647,026,694 cycles # 0.000 GHz
7,050,257,995 instructions # 0.82 insns per cycle
49,739,417 cache-misses # 65.171 % of all cache refs
0.910437338 seconds time elapsed
The ratio (# of cache-misses)/(# of instructions) is low, at around ~0.7%. Here it is mentioned that this number is a good one to keep in mind when checking for memory efficiency.
On the other hand, the ratio of cache-misses to cache-references is significantly high (65%!), which as I see it could indicate that something is going wrong with the execution in terms of cache efficiency.
The detailed stat from perf stat -d is:
2711.191150 task-clock # 2.978 CPUs utilized
1,421 context-switches # 0.524 K/sec
50 cpu-migrations # 0.018 K/sec
362,533 page-faults # 0.134 M/sec
8,518,897,738 cycles # 3.142 GHz [40.13%]
6,089,067,266 stalled-cycles-frontend # 71.48% frontend cycles idle [39.76%]
4,419,565,197 stalled-cycles-backend # 51.88% backend cycles idle [39.37%]
7,095,514,317 instructions # 0.83 insns per cycle
# 0.86 stalled cycles per insn [49.66%]
858,812,708 branches # 316.766 M/sec [49.77%]
3,282,725 branch-misses # 0.38% of all branches [50.19%]
1,899,797,603 L1-dcache-loads # 700.724 M/sec [50.66%]
153,927,756 L1-dcache-load-misses # 8.10% of all L1-dcache hits [50.94%]
45,287,408 LLC-loads # 16.704 M/sec [40.70%]
26,011,069 LLC-load-misses # 57.44% of all LL-cache hits [40.45%]
0.910380914 seconds time elapsed
Here the frontend and backend stalled cycles are also high, and the last-level cache seems to suffer from a high miss rate of ~57%.
Which metric is the most appropriate for this scenario? One idea I had is that the workload no longer requires further "touching" of the LL caches after the initial image load (it loads the values once and is done with them after that), so the workload is more CPU-bound than memory-bound, being an image filtering algorithm.
The machine I'm running this on is a Xeon E5-2680 (20MB of Smart Cache shared L3, plus 256KB of L2 cache per core, 8 cores).
The first thing you want to make sure of is that no other compute-intensive process is running on your machine. It's a server CPU, so that could well be a problem.
If you use multi-threading in your program and you distribute an equal amount of work between threads, you might be interested in collecting metrics on only one CPU.
I suggest disabling hyper-threading in the optimization phase, as it can lead to confusion when interpreting the profiling metrics (e.g. an increased # of cycles spent in the back-end). Also, if you distribute work to 3 threads, there is a high chance that 2 threads will share the resources of one core while the 3rd will have an entire core for itself - and it will be faster.
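If you want to experiment with explicit thread placement while profiling, pinning each thread to its own physical core is easy on Linux (a sketch using pthread_setaffinity_np; pin_to_cpu is an illustrative helper, and the CPU IDs you pass depend on your machine's topology, since with hyper-threading two logical CPUs share one physical core):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Sketch: pin the calling thread to one logical CPU (Linux/glibc).
       Which logical CPUs share a physical core is machine-specific, so
       check your topology (e.g. /proc/cpuinfo) before choosing cpu_id. */
    static int pin_to_cpu(int cpu_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu_id, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }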
Perf has never been very good at explaining the metrics. Judging by the order of magnitude, the cache references are the L2 misses that hit the LLC. A high LLC miss number compared with LLC references is not always a bad thing if the number of LLC references / #Instructions is low. In your case, you have 0.018 so that means that most of your data is being used from L2. The high LLC miss ratio means that you still need to get data from RAM and write it back.
Regarding the FE-bound and BE-bound cycle counts, I'm a bit concerned about the values, because they neither sum to 100% nor to the total number of cycles. You have 8G cycles in total, yet 6G cycles are reported stalled in the FE and 4G cycles stalled in the BE. That does not seem right.
If the FE stall count is high, that means you have misses in the instruction cache or bad branch speculation. If the BE stall count is high, that means you are waiting for data.
Anyway, regarding your question: the most relevant metric to assess the performance of your code is Instructions / Cycle (IPC). Your CPU can execute up to 4 instructions / cycle; you only execute 0.8. That means resources are underutilized, except for the case where you have many vector instructions. After IPC you need to check branch misses and L1 misses (data and code), because those generate most of the penalties.
A final suggestion: you may be interested in trying Intel's VTune Amplifier. It gives a much better explanation of the metrics and points you to the potential problems in your code.
Related
To measure the impact of cache misses in a program, I want to compare the latency caused by cache misses to the cycles used for actual computation.
I use perf stat to measure the cycles, L1-loads, L1-misses, LLC-loads and LLC-misses in my program. Here is an example output:
467 769,70 msec task-clock # 1,000 CPUs utilized
1 234 063 672 432 cycles # 2,638 GHz (62,50%)
572 761 379 098 instructions # 0,46 insn per cycle (75,00%)
129 143 035 219 branches # 276,083 M/sec (75,00%)
6 457 141 079 branch-misses # 5,00% of all branches (75,00%)
195 360 583 052 L1-dcache-loads # 417,643 M/sec (75,00%)
33 224 066 301 L1-dcache-load-misses # 17,01% of all L1-dcache hits (75,00%)
20 620 655 322 LLC-loads # 44,083 M/sec (50,00%)
6 030 530 728 LLC-load-misses # 29,25% of all LL-cache hits (50,00%)
Then my question is:
How to convert the number of cache-misses into a number of "lost" clock cycles?
Or alternatively, what is the proportion of time spent for fetching data?
I think this factor should be known by the manufacturer. My processor is an Intel Core i7-10810U, and I couldn't find this information in the specifications nor in this list of benchmarked CPUs.
This related problem describes how to measure the number of cycles lost in a cache-miss, but is there a way to obtain this as hardware information? Ideally, the output would be something like:
L1-hit: 3 cycles
L2-hit: 10 cycles
LLC-hit: 30 cycles
RAM: 300 cycles
Out-of-order exec and memory-level parallelism exist to hide some of that latency by overlapping useful work with the time data is in flight. If you simply multiplied the L3 miss count by, say, 300 cycles each, that could exceed the total number of cycles your whole program took. The perf event cycle_activity.stalls_l3_miss (which exists on my Skylake CPU) should count cycles when no uops execute and there's an outstanding L3 cache miss, i.e. cycles when execution is fully stalled. But there will also be cycles with some work, just less than there would be without a cache miss, and that's harder to evaluate.
TL:DR: memory access is heavily pipelined; the whole core doesn't stop on one cache miss - that's the whole point. A pointer-chasing benchmark (to measure latency) is merely a worst case, where the only work is dereferencing a load result. See Modern Microprocessors: A 90-Minute Guide!, which has a section about memory and the "memory wall". See also https://agner.org/optimize/ and https://www.realworldtech.com/haswell-cpu/ to learn more about the details of out-of-order exec CPUs and how they can continue doing independent work while one instruction is waiting for data from a cache miss, up to the limit of their out-of-order window size. (https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/)
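For reference, a minimal pointer-chasing latency loop is roughly this (a sketch; N and ITERS are illustrative, the buffer size chooses which level of the hierarchy you measure, and the permutation below is sequential for brevity - shuffle it to defeat the hardware prefetchers):

    /* Sketch of a pointer-chasing latency benchmark: each load depends on the
       previous one, so time per iteration approaches the load-use latency of
       whichever cache level the working set fits in. */
    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1u << 20)        /* working set = N * sizeof(size_t) = 8 MiB */
    #define ITERS 100000000L

    int main(void)
    {
        size_t *next = malloc(N * sizeof *next);
        for (size_t i = 0; i < N; i++)
            next[i] = (i + 1) % N;  /* cyclic permutation; shuffle to beat HW prefetch */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (long i = 0; i < ITERS; i++)
            p = next[p];            /* serialized: one outstanding load at a time */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns per load (p=%zu)\n", ns / ITERS, p);
        free(next);
        return 0;
    }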
Re: numbers from vendors:
L3 and RAM latencies aren't a fixed number of core clock cycles: first, core frequency is variable (and independent of the uncore and memory clocks), and second, they depend on contention from other cores and on the number of hops over the interconnect. (Related: Is cycle count itself reliable on program timing? discusses some effects of core frequency independent of L3 and memory.)
That said, Intel's optimization manual does include a table with exact latencies for L1 and L2, and typical for L3, and DRAM on Skylake-server. (2.2.1.3 Skylake Server Microarchitecture Cache Recommendations)
https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#optimization - they say SKX L3 latency is typically 50-70 cycles. DRAM speed depends some on the timing of your DIMMs.
Other people have tested specific CPUs, like https://www.7-cpu.com/cpu/Skylake.html.
So calloc() works by asking the OS for some virtual memory. The OS is working in cahoots with the MMU, and cleverly responds with a virtual memory address which actually maps to a copy-on-write, read-only page full of zeroes. When a program tries to write to anywhere in that page, a page fault occurs (because you cannot write to read-only pages), a copy of the page is created, and your program's virtual memory is mapped to this brand new copy of those zeroes.
Now that Meltdown is a thing, OSes have been patched so that it's no longer possible to speculatively execute across the kernel-user boundary. This means that whenever user code calls kernel code, it effectively causes a pipeline stall. Typically, when the pipeline stalls in a loop, it's devastating for performance, since the CPU ends up wasting time waiting for data, whether from cache or main memory.
Given such, what I want to know is:
When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?
Is the page fault copy-on-write functionality implemented at the OS level or the MMU level?
If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?
And finally, if it is real, is there any case where this effect is significant to real-world performance?
Your premise is wrong. Page faults were never pipelined / super-cheap. Meltdown (and Spectre) mitigation does make them more expensive, though, along with system calls and all other user->kernel transitions.
Speculative execution across the kernel/user boundary was never possible; Intel CPUs don't rename the privilege level, i.e. kernel/user transitions always required a full pipeline flush. I think you're misunderstanding Meltdown: it's caused purely by speculative execution in user-space and delayed handling of the privilege checks on TLB hits.
This is universal in CPU design, AFAIK. I'm not aware of any microarchitectures that rename the privilege level or otherwise speculate into kernel code, x86 or otherwise.
The cost added by Meltdown mitigation is that entering the kernel flushes the TLB. (Or on CPUs with TLB process-context ID support, the kernel can use PCIDs to make using separate page-tables for kernel vs. user-space much cheaper).
The kernel entry point (on Linux) becomes a trampoline that swaps page tables and jumps to the real kernel entry point, to avoid exposing the kernel ASLR offset to user-space. But other than that and an extra mov cr3, reg on entry and exit from the kernel (setting a new page table), nothing else is changed.
(Spectre mitigation is tricky, too, and required more changes like retpolines... and might also significantly increase the cost of user->kernel->user. IDK about page fault costs.)
@BeeOnRope reports (see comments and his answer for full details) that without Spectre patches, with just the Meltdown patches applied but the nopti boot option used to "disable" them, the cost of a round trip to the kernel on a Skylake CPU (a syscall with a bogus RAX, returning -ENOSYS right away) went up from ~100 to ~300 cycles. So that's maybe the cost of the trampoline? And with actual page-table isolation enabled, it went up to ~700 cycles. That's without Spectre mitigation patches at all. (Also, that's the x86-64 syscall entry point, not page-fault. They're likely similar, though.)
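That kind of round-trip measurement is easy to reproduce; roughly (a sketch for x86-64 Linux only; syscall number 999 is unused so the kernel returns -ENOSYS immediately, and __rdtsc counts reference cycles rather than core clock cycles):

    /* Sketch: time user->kernel->user round trips with a bogus syscall number. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <x86intrin.h>

    int main(void)
    {
        const long iters = 1000000;
        unsigned long long t0 = __rdtsc();
        for (long i = 0; i < iters; i++)
            syscall(999);            /* not a valid syscall number -> -ENOSYS */
        unsigned long long t1 = __rdtsc();
        printf("%llu reference cycles per round trip\n", (t1 - t0) / iters);
        return 0;
    }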
Page fault exceptions:
CPUs don't predict page faults, so they couldn't speculatively execute the handler anyway. Prefetch or decode of the page fault entry point could maybe happen while the pipeline was flushing, but that process wouldn't start until the page-faulting instruction tried to retire. A faulting load/store is marked to take effect on retirement, and doesn't re-steer the front-end; the whole key to Meltdown is the lack of action on a faulting load until it reaches retirement.
Related: When an interrupt occurs, what happens to instructions in the pipeline?
Also: Out-of-order execution vs. speculative execution has some detail about what kind of speculation really causes Meltdown, and how CPUs handle faults.
When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?
Yes, page faults are handled by the kernel's page-fault handler. There's no pure-hardware handling for copy-on-write.
If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?
Yes. The kernel doesn't fault-around for zeroed pages (unlike for file-backed mappings when data is hot in the pagecache). So every new page touched causes a page fault, even for small 4k normal pages. (Thanks to @BeeOnRope for accurate info on this.) With anonymous hugepages, you'll only pagefault once per 2MiB (x86-64), which is tremendously better.
If you want to avoid per-page costs, allocate with mmap(MAP_POPULATE) to prefault all the pages into the HW page table, on a Linux system. I'm not sure if madvise can prefault pages for you, e.g. madvise(MADV_WILLNEED) on an already-mapped region. But madvise(MADV_HUGEPAGE) will encourage the kernel to use anonymous hugepages (and maybe to defrag physical memory to free up contiguous 2M blocks to enable that, if you don't have it configured to do that without madvise).
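Roughly, such an allocation could look like this (a sketch; alloc_prefaulted is an illustrative helper, the flags are Linux-specific, error handling is trimmed, and the madvise call is only a hint):

    /* Sketch: allocate anonymous memory pre-faulted up front (MAP_POPULATE)
       and hint the kernel to back it with transparent hugepages. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    static void *alloc_prefaulted(size_t bytes)
    {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        madvise(p, bytes, MADV_HUGEPAGE);   /* best-effort hint; ignore failure */
        return p;
    }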
Related: Two TLB-miss per mmap/access/munmap has some perf results on a Linux kernel with KPTI patches.
Yes, use of calloc()-allocated memory will suffer a performance degradation due to the Meltdown and Spectre patches.
In fact, calloc() isn't special here: malloc(), new, and more generally all allocated memory will probably suffer approximately the same performance impact. Both calloc() and malloc() are ultimately backed by pages returned by the OS (although the allocator will re-use them after they are freed). The only real difference is that a smart allocator, when it goes down the path of using new pages from the OS (rather than re-using a previously freed allocation), can omit the zeroing in the calloc() case because the OS-provided pages are guaranteed to be zero. Other than that, the allocator behavior is largely the same and the OS-level zeroing behavior is the same (there is usually no option to ask the OS for non-zero pages).
So the performance impact applies more broadly than you thought, but the performance impact is likely smaller than you suggest, since a page fault is already doing a lot of work anyways, so you aren't talking an order of magnitude degradation or anything. See Peter's answer on the reasons the performance impact is likely to be limited. I wrote this answer mostly because the answer to your headline question is still yes as there is some impact.
To estimate the impact on a malloc-heavy workflow, I tried running an allocation- and page-fault-heavy test on a current kernel (4.13.0-39-generic) with the Spectre and Meltdown mitigations, as well as on an older kernel prior to these mitigations.
The test code is very simple:
#include <stdlib.h>
#include <stdio.h>

#define SIZE    (40 * 1024 * 1024)
#define PG_SIZE 4096

int main() {
    char *mem = malloc(SIZE);
    /* touch one byte per page so every page takes a fault */
    for (volatile char *p = mem; p < mem + SIZE; p += PG_SIZE) {
        *p = 'z';
    }
    printf("pages touched: %d\npointer value : %p\n", SIZE / PG_SIZE, mem);
}
The results on the newer kernel were about ~3700 cycles per page fault, and on the older kernel without mitigations around ~3300 cycles. The overall regression (presumably) due to the mitigations was about 14%. Note that this is on Skylake hardware (i7-6700HQ), where some of the Spectre mitigations are somewhat cheaper and the kernel supports PCID, which makes the KPTI Meltdown mitigations cheaper. The results might be worse on different hardware.
Oddly, the results on the new kernel with Spectre and Meltdown mitigations disabled at boot (using spectre_v2=off nopti) were much worse than either the new kernel default or the old kernel, coming in at about 5050 cycles per page fault, something like a 35% regression over the same kernel with the mitigations enabled. So something is going really wrong, performance-wise when the mitigations are disabled.
Full Results
Here is the full perf stat output for the two runs.
Old Kernel (4.10.0-42)
pages touched: 10240
pointer value : 0x7f7d2561e010
Performance counter stats for './pagefaults':
12.980048 task-clock (msec) # 0.976 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
10,286 page-faults # 0.792 M/sec
33,662,397 cycles # 2.593 GHz
27,230,864 instructions # 0.81 insn per cycle
4,535,443 branches # 349.417 M/sec
11,760 branch-misses # 0.26% of all branches
0.013293417 seconds time elapsed
New Kernel (4.13.0-39)
pages touched: 10240
pointer value : 0x7f306ad69010
Performance counter stats for './pagefaults':
14.789615 task-clock (msec) # 0.966 CPUs utilized
8 context-switches # 0.541 K/sec
0 cpu-migrations # 0.000 K/sec
10,288 page-faults # 0.696 M/sec
38,318,595 cycles # 2.591 GHz
28,796,523 instructions # 0.75 insn per cycle
4,693,944 branches # 317.381 M/sec
26,853 branch-misses # 0.57% of all branches
0.015312764 seconds time elapsed
New Kernel (4.13.0-39) spectre_v2=off nopti
pages touched: 10240
pointer value : 0x7ff079ede010
Performance counter stats for './pagefaults':
16.690621 task-clock (msec) # 0.982 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
10,286 page-faults # 0.616 M/sec
51,964,080 cycles # 3.113 GHz
28,602,441 instructions # 0.55 insn per cycle
4,699,608 branches # 281.572 M/sec
25,064 branch-misses # 0.53% of all branches
0.017001581 seconds time elapsed
We are writing a Front End that is supposed to process large volume of traffic (in our case it is Diameter traffic, but that may be irrelevant to the question). As client connects, the server socket gets assigned to one of the Worker processes that perform all the actual traffic processing. In other words, Worker does all the work, and more Workers should be added when more clients get connected.
One would expect the CPU load per message to be the same for different numbers of Workers, because the Workers are totally independent and serve different sets of client connections. Yet our tests show that it takes more CPU time per message as the number of Workers grows.
To be more precise, the CPU load depends on the TPS (Transactions or Request-Responses per second) as follows.
For 1 Worker:
60K TPS - 16%, 65K TPS - 17%... i.e. ~0.26% CPU per KTPS
For 2 Workers:
80K TPS - 29%, 85K TPS - 30%... i.e. ~0.35% CPU per KTPS
For 4 Workers:
85K TPS - 33%, 90K TPS - 37%... i.e. ~0.41% CPU per KTPS
What is the explanation for this? Workers are independent processes and there is no inter-process communication between them. Also each Worker is single-threaded.
The programming language is C++
This effect is observed on any hardware close to this one: 2 Intel Xeon CPUs, 4-6 cores, 2-3 GHz
OS: RedHat Linux (RHEL) 5.8, 6.4
CPU load measurements are done using mpstat and top.
If either the size of the program code used by a worker or the size of the data processed by a worker (or both) is not small, the reason could be the reduced effectiveness of the various caches: the locality-over-time of how a single worker accesses its program code and/or its data is disturbed by other workers intervening.
The effect can be quite complicated to understand, because:
it depends massively on the structure of your code's computations,
modern CPUs have about three levels of cache,
each cache has a different size,
some caches are local to one core, others are not,
how often the workers intervene depends on your operating system's scheduling strategy
which gets even more complicated if there are multiple cores,
unless your programming language's run-time system also intervenes,
in which case it is more complicated still,
your network interface is a computer of its own and has a cache, too,
and probably more.
Caveat: Given the relatively coarse granularity of process scheduling, the effect of this ought not to be as large as it is, I think.
But then: Have you looked up how "percent of CPU" is even defined?
Until you reach CPU saturation on your machine you cannot be sure that the effect is actually as large as it looks. And when you do reach saturation, it may not be the CPU at all that is the bottleneck here, so are you sure you need to care about CPU load?
I completely agree with @Lutz Prechelt. Here I just want to add a method for investigating the issue, and the answer is perf.
Perf is a performance analysis tool in Linux which collects both kernel and userspace events and provides some nice metrics. It's been widely used in my team to find bottlenecks in CPU-bound applications.
The output of perf looks like this:
Performance counter stats for './cache_line_test 0 1 2 3':
1288.050638 task-clock # 3.930 CPUs utilized
185 context-switches # 0.144 K/sec
8 cpu-migrations # 0.006 K/sec
395 page-faults # 0.307 K/sec
3,182,411,312 cycles # 2.471 GHz [39.95%]
2,720,300,251 stalled-cycles-frontend # 85.48% frontend cycles idle [40.28%]
764,587,902 stalled-cycles-backend # 24.03% backend cycles idle [40.43%]
1,040,828,706 instructions # 0.33 insns per cycle
# 2.61 stalled cycles per insn [51.33%]
130,948,681 branches # 101.664 M/sec [51.48%]
20,721 branch-misses # 0.02% of all branches [50.65%]
652,263,290 L1-dcache-loads # 506.396 M/sec [51.24%]
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
4,846,815 LLC-loads # 3.763 M/sec [40.18%]
301 LLC-load-misses # 0.01% of all LL-cache hits [39.58%]
It outputs your cache miss rate, which will make it easier to tune your program and see the effect.
I wrote an article about cache line effects and perf, and you can read it for more details.
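As a small illustration of a cache-line effect (a sketch, not the actual cache_line_test above): two threads incrementing counters on the same cache line run much slower than with padded counters, and perf will show the extra cache traffic.

    /* Sketch of a false-sharing test: the pad[] keeps the two counters on
       separate 64B cache lines; remove it and both land on one line, which
       is much slower.  Build with -pthread (C11 for _Alignas). */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000L

    static _Alignas(64) struct {
        volatile long a;
        char pad[56];               /* separates a and b onto different lines */
        volatile long b;
    } counters;

    static void *bump(void *arg)
    {
        volatile long *p = arg;
        for (long i = 0; i < ITERS; i++)
            (*p)++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, (void *)&counters.a);
        pthread_create(&t2, NULL, bump, (void *)&counters.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", counters.a, counters.b);
        return 0;
    }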
I have come across a strange performance regression from Linux kernel 3.11 to 3.12 on x86_64 systems.
Running Mark Stock's Radiance benchmark on Fedora 20, 3.12 is noticeably slower. Nothing else is changed - identical binary, identical glibc - I just boot a different kernel version, and the performance changes.
The timed program, rpict, is 100% CPU bound user-level code.
Before I report this as a bug, I'd like to find the cause for this behavior. I don't know a lot about the Linux kernel, and the change log from 3.11 to 3.12 does not give me any clue.
I observed this on two systems, an Intel Haswell (i7-4771) and an AMD Richland (A8-6600K).
On the Haswell system user time went from 895 sec with 3.11 to 962 with 3.12. On the Richland, from 1764 to 1844. These times are repeatable to within a few seconds.
I did some profiling with perf, and found that IPC went down in the same proportion as the slowdown. On the Haswell system, this seems to be caused by more missed branches, but why should the prediction rate go down? Radiance does use the random number generator - could "better" randomness cause the missed branches? But apart from OMAP4 support, the RNG does not seem to have changed in 3.12.
On the AMD system, perf just points to more idle backend cycles, but the cause is not clear.
Haswell system:
3.11.10 895s user, 3.74% branch-misses, 1.65 insns per cycle
3.12.6 962s user, 4.22% branch-misses, 1.52 insns per cycle
Richland system:
3.11.10 1764s user, 8.23% branch-misses, 0.75 insns per cycle
3.12.6 1844s user, 8.26% branch-misses, 0.72 insns per cycle
I also looked at a diff from the dmesg output of both kernels, but did not see anything that might have caused such a slowdown of a CPU-bound program.
I tried switching the cpufreq governor from the default ondemand to performance, but that did not have any effect.
The executable was compiled using gcc 4.7.3 but not using AVX instructions. libm still seems to use some AVX (e.g. __ieee754_pow_fma4) but these functions are only 0.3% of total execution time.
Additional info:
Diff of kernel configs
diff of the dmesg outputs on the Haswell system.
diff of /proc/pid/maps - 3.11 maps only one heap region; 3.12 lots.
perf stat output from the A8-6600K system
perf stats w/ TLB misses dTLB stats look very different!
/usr/bin/time -v output from the A8-6600K system
Any ideas (apart from bisecting the kernel changes)?
Let's check your perf stat outputs: http://www.chr-breitkopf.de/tmp/perf-stat.A8.txt
Kernel 3.11.10
1805057.522096 task-clock # 0.999 CPUs utilized
183,822 context-switches # 0.102 K/sec
109 cpu-migrations # 0.000 K/sec
40,451 page-faults # 0.022 K/sec
7,523,630,814,458 cycles # 4.168 GHz [83.31%]
628,027,409,355 stalled-cycles-frontend # 8.35% frontend cycles idle [83.34%]
2,688,621,128,444 stalled-cycles-backend # 35.74% backend cycles idle [33.35%]
5,607,337,995,118 instructions # 0.75 insns per cycle
# 0.48 stalled cycles per insn [50.01%]
825,679,208,404 branches # 457.425 M/sec [66.67%]
67,984,693,354 branch-misses # 8.23% of all branches [83.33%]
1806.804220050 seconds time elapsed
Kernel 3.12.6
1875709.455321 task-clock # 0.999 CPUs utilized
192,425 context-switches # 0.103 K/sec
133 cpu-migrations # 0.000 K/sec
40,356 page-faults # 0.022 K/sec
7,822,017,368,073 cycles # 4.170 GHz [83.31%]
634,535,174,769 stalled-cycles-frontend # 8.11% frontend cycles idle [83.34%]
2,949,638,742,734 stalled-cycles-backend # 37.71% backend cycles idle [33.35%]
5,607,926,276,713 instructions # 0.72 insns per cycle
# 0.53 stalled cycles per insn [50.01%]
825,760,510,232 branches # 440.239 M/sec [66.67%]
68,205,868,246 branch-misses # 8.26% of all branches [83.33%]
1877.263511002 seconds time elapsed
There are almost 300 Gcycles more for 3.12.6 in the "cycles" field; only 6.5 Gcycles of those were frontend stalls, while 261 Gcycles were stalled in the backend. You have only 0.2 G of additional branch misses (each costs about 20 cycles, per the optimization manual page 597, so ~4 Gcycles), so I think your performance problems are related to the memory subsystem (the backend event is the more realistic one, and it can be influenced by the kernel). The page-fault diffs and migration counts are low, and I think they will not slow down the test directly (but migrations may move the program to a worse place).
You should go deeper into the perf counters to find the exact type of problem (it will be easier if you have shorter test runs). Intel's manual http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf will help you. Check page 587 (B.3.2) for the overall event hierarchy (FE and BE stalls are there too), and B.3.2.1-B.3.2.3 for info on backend stalls and how to start digging (checking cache events, etc.).
How can the kernel influence the memory subsystem? It can set up a different virtual-to-physical mapping (hardly your case here), or it can move the process farther from its data. You have a non-NUMA machine, but Haswell is not exactly UMA either - there is a ring bus, and some cores are closer to the memory controller or to some parts of the shared LLC (last level cache). You can test your program with the taskset utility, binding it to one core - the kernel will then not move it to another core.
UPDATE: After checking your new perf stats from the A8, we see that there are more dTLB misses with 3.12.6. With the changes in /proc/pid/maps (lots of short [heap] sections instead of a single [heap], still no exact info why), I think the difference may be in transparent hugepages (THP; with 2M hugepages, fewer TLB entries are needed for the same amount of memory, so there are fewer TLB misses); for example, in 3.12 THP may not be applied because of the short heap sections.
You can check your /proc/PID/smaps for AnonHugePages and /proc/vmstat for thp* values to see the THP results. The values are documented here: kernel.org/doc/Documentation/vm/transhuge.txt
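If you want to check it programmatically, something like this sums the AnonHugePages fields of an smaps file (a sketch; anon_hugepages_kb is an illustrative helper, and the field is present only on kernels built with THP support):

    /* Sketch: total AnonHugePages of a process, i.e. how much of its anonymous
       memory is backed by transparent hugepages.
       Pass e.g. "/proc/self/smaps" or "/proc/<PID>/smaps". */
    #include <stdio.h>

    long anon_hugepages_kb(const char *smaps_path)
    {
        FILE *f = fopen(smaps_path, "r");
        char line[256];
        long kb, total = 0;

        if (!f)
            return -1;
        while (fgets(line, sizeof line, f))
            if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
                total += kb;
        fclose(f);
        return total;
    }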
@osgx You found the cause! After echo never > /sys/kernel/mm/transparent_hugepage/enabled, 3.11.10 takes as long as 3.12.6!
Good news!
Additional info on how to disable the randomization, and on where to report this as a bug (a 7% performance regression is quite severe), would be appreciated.
I was wrong; this multi-heap-section effect is not the brk randomisation (which changes only the beginning of the heap). It is a failure of VMA merging in do_brk; I don't know why, but some changes for VM_SOFTDIRTY were made in mm between 3.11.10 and 3.12.6.
UPDATE2: Possible cause of not-merging VMA:
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L2580 do_brk in 3.11
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L2577 do_brk in 3.12
3.12 just added at the end of do_brk
2663 vma->vm_flags |= VM_SOFTDIRTY;
2664 return addr;
And a bit above that we have:
2635 /* Can we just expand an old private anonymous mapping? */
2636 vma = vma_merge(mm, prev, addr, addr + len, flags,
2637 NULL, NULL, pgoff, NULL);
and inside vma_merge there is a test for vm_flags:
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L994 3.11
http://lxr.missinglinkelectronics.com/linux+v3.12/mm/mmap.c#L994 3.12
1004 /*
1005 * We later require that vma->vm_flags == vm_flags,
1006 * so this tests vma->vm_flags & VM_SPECIAL, too.
1007 */
vma_merge --> can_vma_merge_before --> is_mergeable_vma ...
898 if (vma->vm_flags ^ vm_flags)
899 return 0;
But at the time of the check, the new vma is not yet marked VM_SOFTDIRTY, while the old one is already marked.
This change could be a likely candidate: http://marc.info/?l=linux-kernel&m=138012715018064. I say this loosely, as I don't have the resources to confirm it. It's worth noting that this was the only significant change to the scheduler between 3.11.10 and 3.12.6.
Anyhow I'm very interested to see the end results of your findings so keep us posted.
I've realized that Little's Law limits how fast data can be transferred at a given latency and with a given level of concurrency. If you want to transfer something faster, you either need larger transfers, more transfers "in flight", or lower latency. For the case of reading from RAM, the concurrency is limited by the number of Line Fill Buffers.
A Line Fill Buffer is allocated when a load misses the L1 cache. Modern Intel chips (Nehalem, Sandy Bridge, Ivy Bridge, Haswell) have 10 LFB's per core, and thus are limited to 10 outstanding cache misses per core. If RAM latency is 70 ns (plausible), and each transfer is 128 Bytes (64B cache line plus its hardware prefetched twin), this limits bandwidth per core to: 10 * 128B / 75 ns = ~16 GB/s. Benchmarks such as single-threaded Stream confirm that this is reasonably accurate.
The obvious way to reduce the latency would be prefetching the desired data with x64 instructions such as PREFETCHT0, PREFETCHT1, PREFETCHT2, or PREFETCHNTA so that it doesn't have to be read from RAM. But I haven't been able to speed anything up by using them. The problem seems to be that the _mm_prefetch() intrinsics themselves consume LFBs, so they too are subject to the same limits. Hardware prefetches don't touch the LFBs, but also will not cross page boundaries.
But I can't find any of this documented anywhere. The closest I've found is a 15-year-old article that mentions that prefetch on the Pentium III uses the Line Fill Buffers. I worry things may have changed since then. And since I think the LFBs are associated with the L1 cache, I'm not sure why a prefetch to L2 or L3 would consume them. And yet, the speeds I measure are consistent with this being the case.
So: Is there any way to initiate a fetch from a new location in memory without using up one of those 10 Line Fill Buffers, thus achieving higher bandwidth by skirting around Little's Law?
Based on my testing, all types of prefetch instructions consume line fill buffers on recent Intel mainstream CPUs.
In particular, I added some load & prefetch tests to uarch-bench, which use large-stride loads over buffers of various sizes. Here are typical results on my Skylake i7-6700HQ:
Benchmark Cycles Nanos
16-KiB parallel loads 0.50 0.19
16-KiB parallel prefetcht0 0.50 0.19
16-KiB parallel prefetcht1 1.15 0.44
16-KiB parallel prefetcht2 1.24 0.48
16-KiB parallel prefetchtnta 0.50 0.19
32-KiB parallel loads 0.50 0.19
32-KiB parallel prefetcht0 0.50 0.19
32-KiB parallel prefetcht1 1.28 0.49
32-KiB parallel prefetcht2 1.28 0.49
32-KiB parallel prefetchtnta 0.50 0.19
128-KiB parallel loads 1.00 0.39
128-KiB parallel prefetcht0 2.00 0.77
128-KiB parallel prefetcht1 1.31 0.50
128-KiB parallel prefetcht2 1.31 0.50
128-KiB parallel prefetchtnta 4.10 1.58
256-KiB parallel loads 1.00 0.39
256-KiB parallel prefetcht0 2.00 0.77
256-KiB parallel prefetcht1 1.31 0.50
256-KiB parallel prefetcht2 1.31 0.50
256-KiB parallel prefetchtnta 4.10 1.58
512-KiB parallel loads 4.09 1.58
512-KiB parallel prefetcht0 4.12 1.59
512-KiB parallel prefetcht1 3.80 1.46
512-KiB parallel prefetcht2 3.80 1.46
512-KiB parallel prefetchtnta 4.10 1.58
2048-KiB parallel loads 4.09 1.58
2048-KiB parallel prefetcht0 4.12 1.59
2048-KiB parallel prefetcht1 3.80 1.46
2048-KiB parallel prefetcht2 3.80 1.46
2048-KiB parallel prefetchtnta 16.54 6.38
The key thing to note is that none of the prefetching techniques are much faster than loads at any buffer size. If any prefetch instruction didn't use the LFB, we would expect it to be very fast for a benchmark that fit into the level of cache it prefetches to. For example prefetcht1 brings lines into the L2, so for the 128-KiB test we might expect it to be faster than the load variant if it doesn't use LFBs.
More conclusively, we can examine the l1d_pend_miss.fb_full counter, whose description is:
Number of times a request needed a FB (Fill Buffer) entry but there was no entry available for it. A request includes cacheable/uncacheable demands that are load, store or SW prefetch instructions.
The description already indicates that SW prefetches need LFB entries and testing confirmed it: for all types of prefetch, this figure was very high for any test where concurrency was a limiting factor. For example, for the 512-KiB prefetcht1 test:
Performance counter stats for './uarch-bench --test-name 512-KiB parallel prefetcht1':
38,345,242 branches
1,074,657,384 cycles
284,646,019 mem_inst_retired.all_loads
1,677,347,358 l1d_pend_miss.fb_full
The fb_full value is more than the number of cycles, meaning that the LFB was full almost all the time (it can be more than the number of cycles since up to two loads might want an LFB per cycle). This workload is pure prefetches, so there is nothing to fill up the LFBs except prefetch.
The results of this test also contradict the claimed behavior in the section of the manual quoted by Leeor:
There are cases where a PREFETCH will not perform the data prefetch.
These include:
...
If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.
Clearly this is not the case here: the prefetch requests are not dropped when the LFBs fill up, but are stalled like a normal load until resources are available (this is not an unreasonable behavior: if you asked for a software prefetch, you probably want to get it, perhaps even if it means stalling).
We also note the following interesting behaviors:
It seems like there is some small difference between prefetcht1 and prefetcht2 as they report different performance for the 16-KiB test (the difference varies, but is consistently different), but if you repeat the test you'll see that this is more likely just run-to-run variation as those particular values are somewhat unstable (most other values are very stable).
For the L2 contained tests, we can sustain 1 load per cycle, but only one prefetcht0 prefetch. This is kind of weird because prefetcht0 should be very similar to a load (and it can issue 2 per cycle in the L1 cases).
Even though the L2 has ~12 cycle latency, we are able to fully hide that latency with only 10 LFBs: we get 1.0 cycles per load (limited by L2 throughput), not the 12 / 10 == 1.2 cycles per load that we'd expect (best case) if the LFB were the limiting factor (and the very low values for fb_full confirm it). That's probably because the 12 cycle latency is the full load-to-use latency all the way to the execution core, which includes several cycles of additional latency (e.g., L1 latency is 4-5 cycles), so the actual time spent in the LFB is less than 10 cycles.
For the L3 tests, we see values of 3.8-4.1 cycles, very close to the expected 42/10 = 4.2 cycles based on the L3 load-to-use latency. So we are definitely limited by the 10 LFBs when we hit the L3. Here prefetcht1 and prefetcht2 are consistently 0.3 cycles faster than loads or prefetcht0. Given the 10 LFBs, that equals 3 cycles less occupancy, more or less explained by the prefetch stopping at L2 rather than going all the way to L1.
prefetchtnta generally has much lower throughput than the others outside of L1. This probably means that prefetchtnta is actually doing what it is supposed to, and appears to bring lines into L1, not into L2, and only "weakly" into L3. So for the L2-contained tests it has concurrency-limited throughput as if it was hitting the L3 cache, and for the 2048-KiB case (1/3 of the L3 cache size) it has the performance of hitting main memory. prefetchnta limits L3 cache pollution (to something like only one way per set), so we seem to be getting evictions.
Could it be different?
Here's an older answer I wrote before testing, speculating on how it could work:
In general, I would expect any prefetch that results in data ending up in L1 to consume a line fill buffer, since I believe that the only path between L1 and the rest of the memory hierarchy is the LFB [1]. So SW and HW prefetches that target the L1 probably both use LFBs.
However, this leaves open the probability that prefetches that target L2 or higher levels don't consume LFBs. For the case of hardware prefetch, I'm quite sure this is the case: you can find many reference that explain that HW prefetch is a mechanism to effectively get more memory parallelism beyond the maximum of 10 offered by the LFB. Furthermore, it doesn't seem like the L2 prefetchers could use the LFBs if they wanted: they live in/near the L2 and issue requests to higher levels, presumably using the superqueue and wouldn't need the LFBs.
That leaves software prefetches that target the L2 (or higher), such as prefetcht1 and prefetcht2 [2]. Unlike requests generated by the L2, these start in the core, so they need some way to get from the core out, and this could be via the LFB. From the Intel Optimization guide we have the following interesting quote (emphasis mine):
Generally, software prefetching into the L2 will show more benefit than L1 prefetches. A software prefetch into L1 will consume critical hardware resources (fill buffer) until the cacheline fill completes. A software prefetch into L2 does not hold those resources, and it is less likely to have a negative performance impact. If you do use L1 software prefetches, it is best if the software prefetch is serviced by hits in the L2 cache, so the length of time that the hardware resources are held is minimized.
This would seem to indicate that software prefetches don't consume LFBs - but this quote only applies to the Knights Landing architecture, and I can't find similar language for any of the more mainstream architectures. It appears that the cache design of Knights Landing is significantly different (or the quote is wrong).
[1] In fact, I think that even non-temporal stores use the LFBs to get out of the execution core - but their occupancy time is short, because as soon as they get to the L2 they can enter the superqueue (without actually going into the L2) and then free up their associated LFB.
[2] I think both of these target the L2 on recent Intel, but this is also unclear - perhaps the t2 hint actually targets the LLC on some uarchs?
First of all a minor correction - read the optimization guide, and you'll note that some HW prefetchers belong in the L2 cache, and as such are not limited by the number of fill buffers but rather by the L2 counterpart.
The "spatial prefetcher" (the colocated-64B line you meantion, completing to 128B chunks) is one of them, so in theory if you fetch every other line you'll be able to get a higher bandwidth (some DCU prefetchers might try to "fill the gaps for you", but in theory they should have lower priority so it might work).
However, the "king" prefetcher is the other guy, the "L2 streamer". Section 2.1.5.4 reads:
Streamer : This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page
The important part is -
The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
This 2:1 ratio means that for a stream of accesses recognized by this prefetcher, it will always run ahead of your accesses. It's true that you won't see these lines in your L1 automatically, but it does mean that if all works well, you should always get L2 hit latency for them (once the prefetch stream has had enough time to run ahead and mitigate L3/memory latencies). You may only have 10 LFBs, but as you noted in your calculation: the shorter the access latency becomes, the faster you can replace them and the higher the bandwidth you can reach. This essentially splits the L1 <-- mem latency into parallel streams of L1 <-- L2 and L2 <-- mem.
As for the question in your headline - it stands to reason that prefetches attempting to fill the L1 would require a line fill buffer to hold the retrieved data for that level. This should probably include all L1 prefetches. As for SW prefetches, section 7.4.3 says:
There are cases where a PREFETCH will not perform the data prefetch. These include:
PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.
An access to the specified address that causes a fault/exception.
If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.
...
So I assume you're right and SW prefetches are not a way to artificially increase your number of outstanding requests. However, the same explanation applies here as well - if you know how to use SW prefetching to access your lines well enough in advance, you may be able to mitigate some of the access latency and increase your effective BW. This however won't work for long streams for two reasons: 1) your cache capacity is limited (even if the prefetch is temporal, like t0 flavor), and 2) you still need to pay the full L1-->mem latency for each prefetch, so you're just moving your stress ahead a bit - if your data manipulation is faster than memory access, you'll eventually catch up with your SW prefetching. So this only works if you can prefetch all you need well enough in advance, and keep it there.
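In code, "well enough in advance" usually means a fixed prefetch distance ahead of the consuming loop, roughly like this (a sketch; sum_with_prefetch and PF_DIST are illustrative, and the distance is a per-machine, per-workload tuning knob):

    /* Sketch of software prefetch at a fixed distance ahead of the consumer.
       Too small a distance hides nothing; too large risks eviction before use. */
    #include <xmmintrin.h>
    #include <stddef.h>

    #define PF_DIST 512   /* illustrative: 512 floats = 32 cache lines ahead */

    float sum_with_prefetch(const float *a, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }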