Good resources on how to program PEBS (Precise Event-Based Sampling) counters?

I have been trying to log all memory accesses of a program, which, from what I have read, seems to be impossible. I have been trying to see how far I can go in logging at least a major portion of the memory accesses, if not all of them. So I was looking to program the PEBS counters in such a way that I could see changes in the number of memory-access samples collected. I want to know whether I can do this by modifying the counter-reset value of the PEBS counters. (Usually this is set to zero, but I want to set it to a higher value.)
So I was looking to program these PEBS counters on my own. Has anybody had experience manipulating PEBS counters? Specifically, I am looking for good sources that show how to program them. I have gone through the Intel documentation and understood the steps, but I would like to study some sample programs. I have gone through the GitHub repo below:
https://github.com/pyrovski/powertools
But I am not quite sure how and where to start. Are there any other good sources I should look at? Any suggestions for good resources to understand PEBS and start programming it would be very helpful.

Please don't mix tracing and timing measurements in a single run. It is impossible both to have the fastest run of SPEC and to have all memory accesses traced. Do one run for timing and another (longer, slower) run for memory-access tracing.
In https://github.com/pyrovski/powertools, the frequency of collected events is controlled by the reset_val argument of pebs_init:
https://github.com/pyrovski/powertools/blob/0f66c5f3939a9b7b88ec73f140f1a0892cfba235/msr_pebs.c#L72
void
pebs_init(int nRecords, uint64_t *counter, uint64_t *reset_val ){
// 1. Set up the precise event buffering utilities.
// a. Place values in the
// i. precise event buffer base,
// ii. precise event index
// iii. precise event absolute maximum,
// iv. precise event interrupt threshold,
// v. and precise event counter reset fields
// of the DS buffer management area.
//
// 2. Enable PEBS. Set the Enable PEBS on PMC0 flag
// (bit 0) in IA32_PEBS_ENABLE_MSR.
//
// 3. Set up the IA32_PMC0 performance counter and
// IA32_PERFEVTSEL0 for an event listed in Table
// 18-10.
// IA32_DS_AREA points to 0x58 bytes of memory.
// (11 entries * 8 bytes each = 88 bytes.)
// Each PEBS record is 0xB0 bytes long.
...
pds_area->pebs_counter0_reset = reset_val[0];
pds_area->pebs_counter1_reset = reset_val[1];
pds_area->pebs_counter2_reset = reset_val[2];
pds_area->pebs_counter3_reset = reset_val[3];
...
write_msr(0, PMC0, reset_val[0]);
write_msr(1, PMC1, reset_val[1]);
write_msr(2, PMC2, reset_val[2]);
write_msr(3, PMC3, reset_val[3]);
This project is a library for accessing PEBS; no examples of its usage are included in the project (as far as I found, there is only one disabled test in other projects by tpatki).
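For orientation, a call might look roughly like this minimal sketch. The event-selection values and the record count are placeholders I made up from the header, not documented usage:

#include <stdint.h>

// Declaration from powertools' msr_pebs.c (linked above).
void pebs_init(int nRecords, uint64_t *counter, uint64_t *reset_val);

int main(void) {
    // A reset value of -N (as a 64-bit two's-complement number) makes the
    // counter overflow, and thus produce a PEBS record, every Nth event;
    // a smaller |N| means more frequent samples.
    uint64_t reset_val[4] = {
        (uint64_t)-10, (uint64_t)-10, (uint64_t)-10, (uint64_t)-10
    };
    uint64_t counter[4] = {0, 0, 0, 0}; // per-PMC event selections (placeholder)
    pebs_init(64, counter, reset_val);  // room for 64 PEBS records (assumed)
    // ... run the region of interest, then drain and parse the DS buffer ...
    return 0;
}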
Check the Intel SDM, Vol. 3B (this is the only really good resource for PEBS programming) for the meaning of the fields and for PEBS configuration and output:
https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-734.html
18.15.7 Processor Event-Based Sampling
PEBS permits the saving of precise architectural information associated with one or more performance events in the precise event records buffer, which is part of the DS save area (see Section 17.4.9, “BTS and DS Save Area”).
To use this mechanism, a counter is configured to overflow after it has counted a preset number of events. After the counter overflows, the processor copies the current state of the general-purpose and EFLAGS registers and instruction pointer into a record in the precise event records buffer. The processor then resets the count in the performance counter and restarts the counter. When the precise event records buffer is nearly full, an interrupt is generated, allowing the precise event records to be saved. A circular buffer is not supported for precise event records.
... After the PEBS-enabled counter has overflowed, PEBS record is recorded
(So the reset value is probably negative: -1000 to get every 1000th event, -10 to get every 10th event. The counter counts upward, and a PEBS record is written when the counter overflows.)
and https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-656.html 18.4.4 Processor Event Based Sampling (PEBS), "Table 18-10" - on Intel Core, only L1/L2/DTLB misses have a PEBS event. (Find the PEBS section for your CPU and search for memory events; PEBS-capable events are really rare.)
So, to have more events recorded, you probably want to set the reset value in this function to a smaller absolute value, like -50 or -10. With PEBS this may work. (Also try perf record -e cycles:upp -c 10 - don't ask to profile the kernel at such a high frequency; restrict sampling to user space with :u, ask for precise sampling with :pp, and ask for a period of 10 events with -c 10. perf has all the PEBS mechanics implemented, both the MSR setup and the buffer parsing.)
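If you'd rather not poke MSRs yourself, the same PEBS machinery is reachable from user space through the kernel's perf_event_open(2). A minimal sketch (error handling elided), where sample_period plays the role of the counter reset value:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_precise_counter(unsigned long period) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = period; // e.g. 10 => sample every 10th event
    attr.precise_ip = 2;         // request PEBS, like the :pp suffix
    attr.exclude_kernel = 1;     // user space only, like the :u suffix
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR;
    // measure the calling process/thread on any CPU
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}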
Other good resources for the PMU (hardware performance monitoring unit) also come from Intel: the PMU Programming Guides. They have short and compact descriptions of both the usual PMU and PEBS. There is the public "Nehalem Core PMU" guide, most of which is still useful for newer CPUs: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf (And there are uncore PMU guides too, e.g. the E5-2600 Uncore PMU Guide, 2012: https://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf)
An external PDF about PEBS: https://www.blackhat.com/docs/us-15/materials/us-15-Herath-These-Are-Not-Your-Grand-Daddys-CPU-Performance-Counters-CPU-Hardware-Performance-Counters-For-Security.pdf#page=23 - "PMCs: Setting Up for PEBS", from "Black Hat USA 2015 - These are Not Your Grand Daddy's CPU Performance Counters".
You may want to start with a short and simple program (not the ref inputs of a recent SPEC CPU) and use the Linux perf tool (perf_events) to find an acceptable ratio of memory requests recorded to all memory requests. PEBS is used with perf by adding the :p or :pp suffix to the event specifier: record -e event:pp. Also try pmu-tools' ocperf.py for easier encoding of Intel event names.
Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on memory tests like the ones below. These are the worst case for memory-recording overhead, sitting on the far left of the Arithmetic Intensity scale of the Roofline model - STREAM is BLAS1, and GUPS and memlat are almost SpMV; real tasks are usually not so far left on that scale:
STREAM test (linear access to memory),
RandomAccess (GUPS) test,
some memory latency test (memlat of 7z, lat_mem_rd of lmbench).
Do you want to trace every load/store instruction, or do you only want to record requests that missed all (or some) caches and were sent to the main RAM of the PC (or to L3)?
Why do you want no overhead with all memory accesses recorded? That is simply impossible, as every memory access requires several bytes of trace to be recorded to memory. So, having memory tracing enabled (more than 10% of memory accesses traced) clearly will limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticed, but its effect (overhead) is smaller.
Your CPU, an E5-2620 v4, is Broadwell-EP (14 nm), so it may also have an earlier variant of Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog on PT: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb"
PT support in hardware: Broadwell (5th generation Core, Xeon v4) - more overhead, no fine-grained timing.
PS: Scholars who studied SPEC CPU's memory behavior worked with memory access dumps/traces, and the dumps were generated slowly:
http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses were recorded for offline analysis; no timing was recorded from the tracing runs.
http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all loads/stores were instrumented by writing into an additional huge tracing buffer, with periodic (rare) online aggregation. Such instrumentation is 2x slower or worse, especially for memory bandwidth / latency limited code.
http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation - the program code was modified and instrumented to write memory-access metadata into a buffer. Such instrumentation is 2x slower or worse, especially for memory bandwidth / latency limited code. The paper lists and explains the instrumentation overhead and caveats:
Instrumentation Overhead: Instrumentation involves injecting extra code dynamically or statically into the target application. The additional code causes an application to spend extra time in executing the original application ... Additionally, for multi-threaded applications, instrumentation can modify the ordering of instructions executed between different threads of the application. As a result, IDS with multi-threaded applications comes at the lack of some fidelity
Lack of Speculation: Instrumentation only observes instructions executed on the correct path of execution. As a result, IDS may not be able to support wrong-path ...
User-level Traffic Only: Current binary instrumentation tools only support user-level instrumentation. Thus, applications that are kernel intensive are unsuitable for user-level IDS.

Related

How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

I am on the hook to analyze some "timing channels" of some x86 binary code. I am posting one question to help me comprehend the bsf/bsr opcodes.
At a high level, these two opcodes can be modeled as a "loop" that counts the leading and trailing zeros of a given operand. The x86 manual has a good formalization of these opcodes, something like the following:
IF SRC = 0
  THEN
    ZF ← 1;
    DEST is undefined;
  ELSE
    ZF ← 0;
    temp ← OperandSize – 1;
    WHILE Bit(SRC, temp) = 0
    DO
      temp ← temp - 1;
    OD;
    DEST ← temp;
FI;
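In C, that pseudocode for bsr corresponds to a naive, data-dependent loop like this (my own transcription for a 32-bit operand, not from the manual):

#include <stdint.h>

// Scan from the most significant bit down until a set bit is found;
// the loop trip count depends on the position of the highest set bit.
static int bsr32(uint32_t src, uint32_t *dest, int *zf) {
    if (src == 0) {
        *zf = 1;            // DEST is left undefined by the manual
        return -1;
    }
    *zf = 0;
    int temp = 32 - 1;      // OperandSize - 1
    while (((src >> temp) & 1) == 0)
        temp = temp - 1;
    *dest = (uint32_t)temp;
    return 0;
}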
But to my surprise, the bsf/bsr instructions seem to have fixed CPU cycle counts. According to some documents I found here: https://gmplib.org/~tege/x86-timing.pdf, it seems that they always take 8 CPU cycles to finish.
So here are my questions:
I want to confirm that these instructions have fixed CPU cycle counts - in other words, that no matter what operand is given, they always take the same amount of time to process, and that there is no "timing channel" behind them. I cannot find corresponding specifications in Intel's official documents.
Then why is this possible? Apparently this is a "loop", or something like one, at least at a high level. What is the design decision behind it? Is it easier for CPU pipelines?
BSF/BSR performance is not data dependent on any modern CPUs. See https://agner.org/optimize/, https://uops.info/, or http://instlatx64.atw.hu/ for experimental timing results, as well as the https://gmplib.org/~tege/x86-timing.pdf you found.
On modern Intel, they decode to 1 uop with 3 cycle latency and 1/clock throughput, running only on port 1. Ryzen also runs them with 3c latency for BSF, 4c latency for BSR, but multiple uops. Earlier AMD is sometimes even slower.
(Prefer rep bsf aka tzcnt in code that might run on AMD CPUs, if you don't need the FLAGS difference between bsf and tzcnt for zero inputs. lzcnt and tzcnt are fast on AMD as well, like 1 cycle latency with 3/clock throughput for lzcnt on Zen 2 (https://uops.info/). Unfortunately lzcnt and bsr aren't compatible that way, so you can't use it in an "optimistic" forward-compatible way, you have to know which you're getting.)
Your "8 cycle" (latency and throughput) cost appears to be for 32-bit BSF on AMD K8, from Granlund's table that you linked. Agner Fog's table agrees (and shows it decodes to 21 uops instead of having a dedicated bit-scan execution unit; but the microcoded implementation is presumably still branchless and not data-dependent). No clue why you picked that number; K8 doesn't have SMT / Hyperthreading, so the opportunity for an ALU-timing side channel is much reduced.
Do note that they have an output dependency on the destination register, which they leave unmodified if the input was zero. AMD documents this behaviour, Intel implements it in hardware but documents it as an "undefined" result, so unfortunately compilers won't take advantage of it and human programmers maybe should be cautious. IDK if some ancient 32-bit only CPU had different behaviour, or if Intel is planning to ever change (doubtful!), but I wish Intel would document the behaviour at least for 64-bit mode (which excludes any older CPUs).
lzcnt/tzcnt and popcnt on Intel CPUs (but not AMD) have the same output dependency before Skylake and before Cannon Lake (respectively), even though architecturally the result is well-defined for all inputs. They all use the same execution unit. (How is POPCNT implemented in hardware?). AMD Bulldozer/Ryzen builds their bit-scan execution unit without the output dependency baked in, so BSF/BSR are slower than LZCNT/TZCNT (multiple uops to handle the input=0 case, and probably also setting ZF according to the input, not the result).
(Taking advantage of that with intrinsics isn't possible; not even with MSVC's _BitScanReverse64 which uses a by-reference output arg that you could set first. MSVC doesn't respect the previous value and assumes it's output-only. VS: unexpected optimization behavior with _BitScanReverse64 intrinsic)
The pseudocode in the manual is not the implementation
(i.e. it's not necessarily how hardware or microcode works).
It gives precisely the same result in all cases, so you can use it to understand exactly what will happen for any corner cases the text leaves you wondering about. That is all.
The point is to be simple and easy to understand, and that means modeling things in terms of simple 2-input operations which happen serially. C / Fortran / typical pseudocode doesn't have operators for many-input AND, OR, or XOR, but you can build that in hardware up to a point (limited by fan-in, the opposite of fan-out).
Integer addition can be modelled as bit-serial ripple carry, but that's not how it's implemented! Instead, we get single-cycle latency for 64-bit addition with far fewer than 64 gate delays using tricks like carry lookahead adders.
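As a toy illustration of the lookahead trick, here is a 4-bit carry-lookahead adder modeled in C: every carry is a flat boolean function of the inputs, so nothing ripples. (My illustration only; of course not Intel's actual adder.)

#include <stdio.h>

// Generate (g) and propagate (p) signals let all carries be computed
// in parallel from the inputs instead of rippling bit by bit.
static unsigned add4_cla(unsigned a, unsigned b) {
    unsigned g = a & b, p = a ^ b;                   // per-bit signals
    unsigned g0 = g & 1, g1 = (g >> 1) & 1, g2 = (g >> 2) & 1, g3 = (g >> 3) & 1;
    unsigned p1 = (p >> 1) & 1, p2 = (p >> 2) & 1, p3 = (p >> 3) & 1;
    unsigned c1 = g0;                                // carry into bit 1
    unsigned c2 = g1 | (p1 & g0);                    // carry into bit 2
    unsigned c3 = g2 | (p2 & g1) | (p2 & p1 & g0);   // carry into bit 3
    unsigned c4 = g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0);
    return ((p ^ ((c1 << 1) | (c2 << 2) | (c3 << 3))) & 0xF) | (c4 << 4);
}

int main(void) {
    for (unsigned a = 0; a < 16; a++)       // exhaustive self-check
        for (unsigned b = 0; b < 16; b++)
            if (add4_cla(a, b) != a + b)
                printf("mismatch: %u+%u\n", a, b);
    return 0;
}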
The actual implementation techniques used in Intel's bit-scan / popcnt execution unit are described in US Patent US8214414 B2.
Abstract
A merged datapath for PopCount and BitScan is described. A hardware circuit includes a compressor tree utilized for a PopCount function, which is reused by a BitScan function (e.g., bit scan forward (BSF) or bit scan reverse (BSR)).
Selector logic enables the compressor tree to operate on an input word for the PopCount or BitScan operation, based on a microprocessor instruction. The input word is encoded if a BitScan operation is selected.
The compressor tree receives the input word, operates on the bits as though all bits have same level of significance (e.g., for an N-bit input word, the input word is treated as N one-bit inputs). The result of the compressor tree circuit is a binary value representing a number related to the operation performed (the number of set bits for PopCount, or the bit position of the first set bit encountered by scanning the input word).
It's fairly safe to assume that Intel's actual silicon works similarly to this. Other Intel patents for things like out-of-order machinery (ROB, RS) do tend to match up with performance experiments we can perform.
AMD may do something different, but regardless we know from performance experiments that it's not data-dependent.
It's well known that fixed latency is a hugely beneficial thing for out-of-order scheduling, so it's very surprising when instructions don't have fixed latency. Sandybridge even went so far as to standardize uop latencies to simplify the scheduler and reduce the opportunities for write-back conflicts (e.g. a 3-cycle latency uop followed by a 2-cycle latency uop to the same port would produce 2 results in the same cycle). This meant making complex-LEA (with all 3 components: [disp + base + idx*scale]) take 3 cycles instead of just 2 for the 2 additions like on previous CPUs. There are no 2-cycle latency uops on Sandybridge-family. (There are some 2-cycle latency instructions, because they decode to 2 uops with 1c latency each. The scheduler schedules uops, not instructions.)
One of the few exceptions to the rule of fixed latency for ALU uops is division / sqrt, which uses a not-fully-pipelined execution unit. Division is inherently iterative, unlike multiplication where you can make wide hardware that does the partial products and partial additions in parallel.
On Intel CPUs, variable-latency for L1d cache access can produce replays of dependent uops if the data wasn't ready when the scheduler optimistically hoped it would be.
Is there a penalty when base+offset is in a different page than the base?
Why does the number of uops per iteration increase with the stride of streaming loads?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
The 80x86 manual has a good description of the expected behavior, but that has nothing to do with how it's actually implemented in silicon in any model from any manufacturer.
Let's say that there's been 50 different CPU designs from Intel, 25 CPU designs from AMD, then 25 more from other manufacturers (VIA, Cyrix, SiS/Vortex, NSC, ...). Out of those 100 different CPU designs, maybe there's 20 completely different ways that BSF has been implemented, and maybe 10 of them have fixed timing, 5 have timing that depends on every bit of the source operand, and 5 depend on groups of bits of the source operand (e.g. maybe like "if highest 32 bits of 64-bit operand are zeros { switch to 32-bit logic that's 2 cycles faster }").
I want to confirm that these instructions have fixed CPU cycle counts - in other words, that no matter what operand is given, they always take the same amount of time to process, and that there is no "timing channel" behind them. I cannot find corresponding specifications in Intel's official documents.
You can't. More specifically, you can test or research existing CPUs, but that's a waste of time because next week Intel (or AMD or VIA or someone else) can release a new CPU that has completely different timing.
As soon as you rely on "measured from existing CPUs", you're doing it wrong. You have to rely on "architectural guarantees" that apply to all future CPUs. There is no such architectural guarantee, so you have to assume that there may be a timing side-channel (even if there isn't one on current CPUs).
Then why is this possible? Apparently this is a "loop", or something like one, at least at a high level. What is the design decision behind it? Is it easier for CPU pipelines?
Instead of doing a 64-bit BSF, why not split it into a pair of 32-bit pieces and do them in parallel, then merge the results? Why not split it into eight 8-bit pieces? Why not use a table lookup for each 8-bit piece?
The posted answers have explained well that the implementation is different from the pseudocode. But if you are still curious why the latency is fixed and not data-dependent, and doesn't use any loops for that matter, you need to look at the electronic side of things.
One way you could implement this feature in hardware is by using a priority encoder.
A priority encoder accepts n input lines that can be on or off (0 or 1) and gives out the index of the highest-priority line that is on. Below is a table from the linked Wikipedia article, modified for a most-significant-set-bit function.
input | output (index of first set bit)
0000  | xx (undefined)
0001  | 00 (0)
001x  | 01 (1)
01xx  | 10 (2)
1xxx  | 11 (3)
(x denotes a bit whose value does not matter and can be anything)
If you look at the circuit diagram in the article, there are no loops of any kind; it is all parallel.
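To make the "all parallel" point concrete, here is the 4-bit table above modeled in C as pure combinational logic: every input goes through exactly the same operations, with no loop and no data-dependent branch. (My illustration, not real hardware.)

#include <stdio.h>

// Combinational model of the 4-bit priority encoder table above.
static unsigned msb_encode4(unsigned in, unsigned *zf) {
    unsigned b3 = (in >> 3) & 1, b2 = (in >> 2) & 1,
             b1 = (in >> 1) & 1, b0 = in & 1;
    *zf = !(b3 | b2 | b1 | b0);           // input 0000: output undefined
    unsigned i1 = b3 | b2;                // high index bit: set for 1xxx, 01xx
    unsigned i0 = b3 | ((1u ^ b2) & b1);  // low index bit: set for 1xxx, 001x
    return (i1 << 1) | i0;
}

int main(void) {
    for (unsigned in = 0; in < 16; in++) {
        unsigned zf, idx = msb_encode4(in, &zf);
        printf("%2u -> zf=%u index=%u\n", in, zf, idx);
    }
    return 0;
}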

Store forwarding Address vs Data: What's the difference between STD and STA in the Intel Optimization guide?

I'm wondering if any Intel experts out there can tell me the difference between STD and STA with respect to the Intel Skylake core.
In the Intel optimization guide, there's a picture describing the "super-scalar ports" of the Intel Cores.
Here's the PDF. The picture is on page 40.
Here's another picture, from page 78, describing "Store Address" and "Store Data":
Prepares the store forwarding and store retirement logic with the address of the data being stored.
Prepares the store forwarding and store retirement logic with the data being stored.
Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle, I was curious what the difference was between these two.
It seems "natural" to me that store-forwarding would be done on the address of the data. But I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done. Are there any assembly / optimization experts out there who can help me understand exactly what the difference between STD and STA is?
Intel CPUs have been splitting stores into store-address and store-data since the first P6-family microarchitecture, Pentium Pro.
But store-address and store-data uops can micro-fuse into one fused-domain uop. On Sandy/IvyBridge, indexed addressing modes are un-laminated as described in Intel's optimization manual. But Haswell and later can keep them micro-fused even in the ROB, so they aren't un-laminated. See Micro fusion and addressing modes. (Intel doesn't mention this, and Agner Fog hasn't had time to test extensively for Haswell/Skylake so his usually-good microarch PDF doesn't even mention un-lamination at all. But you should still definitely read it to learn more about how uops work and how instructions are decoded and go through the pipeline. See also other x86 performance links in the x86 tag wiki)
Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle
Ports 2 and 3 can also run load uops on their AGUs, leaving the load-data part of the port unused that cycle. Port7 only has a dedicated store-AGU for simple addressing modes.
Store addressing modes with an index register can't use port 7, only p2/p3. But if you do use "simple" addressing modes for stores, the peak throughput is 2 loads + 1 store per clock.
On Nehalem and earlier (P6 family), p2 was the only load port, p3 was the store-address port, and p4 was store-data.
On IvyBridge/Sandybridge, there weren't separate ports for store-address uops, they always just ran on the AGU (Address Generation Unit) in the load ports (p23). With 256b loads / stores, the AGU was only needed every other cycle (256b load or store uops occupy the load or store-data ports for 2 cycles, but the load ports can accept a store-address uop during that 2nd cycle). So 2 load / 1 store per clock was in theory sustainable on Sandybridge, but only if most of it was with AVX 256-bit vector loads / stores running as two 128-bit halves.
Haswell added the dedicated store-AGU on port7 and widened the load/store execution units to 256b, because there aren't spare cycles when the load ports don't need their AGUs if there's a steady supply of loads.
A store-address uop writes the address (and width, I guess) into the store buffer (aka Memory Order Buffer in Intel's terminology). Having this happen separately, and possibly before the data to be stored is even ready lets later loads (in program order) detect whether they overlap the store or not.
Out-of-order execution of loads when there are pending stores with unknown address is problematic: a wrong guess means having to roll back the pipeline. (I think the machine_clears.memory_ordering perf counter event includes this. It is possible to get non-zero counts for this from single-threaded code, but I forget if I had definite evidence that Skylake sometimes speculatively guesses that loads don't overlap unknown-address stores).
As David Kanter points out in his Haswell microarch writeup, a load uop also needs to probe the store buffer to check for forwarding / conflicts, so an execution unit that only runs store-address uops is cheaper to build.
Anyway, I'm not sure what the performance implications would be if Intel redesigned things so port7 had a full AGU that could handle indexed addressing modes, too, and made store-address uops only run on p7, not p2/p3.
That would stop store-address uops from "stealing" p23, which does happen and which reduces max sustained L1D bandwidth from 96 bytes / cycle (2 load + 1 store of 32-byte YMM vectors) down to ~81 bytes / cycle for Skylake according to a table in Intel's optimization manual. But under the right circumstances, Skylake can sustain 2 loads + 1 store per clock of 4-byte operands, so maybe that 81-byte / cycle number is limited by some other microarchitectural limit. The peak is 96B/clock, but apparently that can't happen back-to-back indefinitely.
One downside to stopping store-address uops from running on p23 is that it would take longer for store addresses to be known, maybe delaying loads more.
I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done.
A store/reload can have the load take the data from the store buffer, instead of waiting for it to commit to L1D and reading it from there.
How does store to load forwarding happens in case of unaligned memory access?
Store-to-Load Forwarding and Memory Disambiguation in x86 Processors
Store/reload can happen when a function spills some registers before calling a function, or as part of passing args on the stack (especially with crappy stack-args calling conventions that pass all args on the stack). Or when passing something by reference to a non-inline function. Or in a histogram: if the same bin is hit repeatedly, you're basically doing a memory-destination increment in a loop.
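The histogram case in C, for illustration: each bins[data[i]]++ is a load + add + store, and whenever the same bin repeats, the load is served by store-forwarding from the store buffer instead of waiting for the previous store to commit to L1d.

#include <stddef.h>
#include <stdint.h>

// Memory-destination increment in a loop: repeated hits to the same bin
// create back-to-back store -> reload chains through the store buffer.
void histogram(const uint8_t *data, size_t n, uint32_t bins[256]) {
    for (size_t i = 0; i < n; i++)
        bins[data[i]]++;
}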
It's been a few days without a response, so here's my best guess at "answering my own question".
The raw x86 instruction set isn't executed directly by modern processors. Instead, the x86 instruction set is "compiled" down into Micro-ops (uOps) before being executed by the Intel core. This shouldn't be too surprising, because some x86 instructions can be complex. An example taken from the optimization guide is as follows:
Similarly, the following store instruction has three register sources and is broken into "generate store address" and "generate store data" sub-components.
MOV [ESP+ECX*4+12345678], AL
This is currently found on page 50 of the optimization manual (2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD)).
In this case, the address of the store operation is complex, so it is its own uOp. So at very least, this singular x86 instruction gets converted into two uOps internally. The names of these two uOps are "Store Address" and "Store Data". The manual doesn't describe the internal uOps at all, so it may take even more than two uOps to accomplish.
Since there's only one "store data" port on Skylake systems, Skylake can modify at most one memory location per cycle. The three "Store Address" ports mean that Skylake can calculate the effective addresses of several instructions simultaneously (possibly because some very complicated addresses may take more than one uOp to execute??).

Interpreting Intel VTune's Memory Bound Metric

I see the following when I run Intel VTune on my workload:
Memory Bound 50.8%
I read the Intel doc, which says:
Memory Bound measures a fraction of slots where pipeline could be stalled due to demand load or store instructions. This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation in addition to less common cases where stores could imply back-pressure on the pipeline.
Does that mean that roughly half of the instructions in my app are stalled waiting for memory, or is it more subtle than that?
The pipeline-slots concept used by VTune is explained, e.g., here: https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win.
In short, a pipeline slot represents the hardware resources needed to process one uOp. So for 4-wide CPUs (most Intel processors), we can issue 4 uOps each cycle, and the total number of slots is measured as 4 * CPU_CLK_UNHALTED.THREAD by VTune.
The Memory Bound metric is built on the CYCLE_ACTIVITY.STALLS_MEM_ANY event, which gives you stalls due to memory directly, taking out-of-order execution into account: the counter is incremented only if the CPU is stalled and, at the same time, has loads in flight. If there are loads in flight but the CPU is kept busy, it is not counted as a memory stall.
So the Memory Bound metric provides a quite accurate estimate of how much the workload is bound by memory performance issues. A value of 50% means that half of the time was wasted waiting for data from memory.
A slot is an execution port of the pipeline. In general in the VTune documentation, a stall could either mean "not retired" or "not dispatched for execution". In this case, it refers to the number of cycles in which zero uops were dispatched.
According to the VTune include configuration files, Memory Bound is calculated as follows:
Memory_Bound = Memory_Bound_Fraction * BackendBound
Memory_Bound_Fraction is basically the fraction of slots mentioned in the documentation. However, according to the top-down method discussed in the optimization manual, the memory bound metric is relative to the backend bound metric. So this is why it is multiplied by BackendBound.
I'll focus on the first term of the formula, Memory_Bound_Fraction. The formula for the second term, BackendBound, is actually complicated.
Memory_Bound_Fraction is calculated as follows:
Memory_Bound_Fraction = (CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) * NUM_OF_PORTS / (Backend_Bound_Cycles * NUM_OF_PORTS)
NUM_OF_PORTS is the number of execution ports of the microarchitecture of the target CPU. The NUM_OF_PORTS factors cancel, so this can be simplified to:
Memory_Bound_Fraction = (CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / Backend_Bound_Cycles
CYCLE_ACTIVITY.STALLS_MEM_ANY and RESOURCE_STALLS.SB are performance events. Backend_Bound_Cycles is calculated as follows:
Backend_Bound_Cycles = CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - Few_Uops_Executed_Threshold - Frontend_RS_Empty_Cycles + RESOURCE_STALLS.SB
Few_Uops_Executed_Threshold is either UOPS_EXECUTED.CYCLES_GE_2_UOP_EXEC or UOPS_EXECUTED.CYCLES_GE_3_UOP_EXEC depending on some other metric. Frontend_RS_Empty_Cycles is either RS_EVENTS.EMPTY_CYCLES or zero depending on some metric.
I realize this answer still needs a lot of additional explanation, and BackendBound needs to be expanded, but this early edit makes the answer accurate.
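For what it's worth, once you have the raw counts, the final arithmetic is trivial; a sketch (the struct and names here are mine, not VTune's):

#include <stdint.h>

// Raw event counts collected over the measurement interval.
typedef struct {
    uint64_t stalls_mem_any;       // CYCLE_ACTIVITY.STALLS_MEM_ANY
    uint64_t resource_stalls_sb;   // RESOURCE_STALLS.SB
    uint64_t backend_bound_cycles; // per the Backend_Bound_Cycles formula
} tma_counts;

// Memory_Bound_Fraction after the NUM_OF_PORTS factors cancel.
static double memory_bound_fraction(const tma_counts *c) {
    return (double)(c->stalls_mem_any + c->resource_stalls_sb)
         / (double)c->backend_bound_cycles;
}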

Is it possible to know the address of a cache miss?

Whenever a cache miss occurs, is it possible to know the address of that missed cache line? Are there any hardware performance counters in modern processors that can provide such information?
Yes, on modern Intel hardware there are precise memory sampling events that track not only the address of the instruction, but the data address as well. These events also include a great deal of other information, such as which level of the cache hierarchy the memory access was satisfied in, the total latency, and so on.
You can use perf mem to sample this information and produce a report.
For example, the following program:
#include <stddef.h>

#define SIZE (100 * 1024 * 1024)

int p[SIZE] = {1};

void do_writes(volatile int *p) {
    for (size_t i = 0; i < SIZE; i += 5) {
        p[i] = 42;
    }
}

void do_reads(volatile int *p) {
    volatile int sink;
    for (size_t i = 0; i < SIZE; i += 5) {
        sink = p[i];
    }
}

int main(int argc, char **argv) {
    do_writes(p);
    do_reads(p);
}
compiled with:
g++ -g -O1 -march=native perf-mem-test.cpp -o perf-mem-test
and run with:
sudo perf mem record -U ./perf-mem-test && sudo perf mem report
Produces a report of memory accesses sorted by latency like this:
The Data Symbol column shows which address each load was targeting - most here show up as something like p+0xa0658b4, which means an offset of 0xa0658b4 from the start of p; that makes sense, as the code is reading and writing p. The list is sorted by "local weight", which is the access latency in reference cycles1.
Note that the information recorded is only a sample of memory accesses: recording every miss would usually be way too much information. Furthermore, it only records loads with a latency of 30 cycles or more by default, but you can apparently tweak this with command line arguments.
If you're only interested in accesses that miss in all levels of cache, you're looking for the "Local RAM hit" lines2. Perhaps you can restrict your sampling to only cache misses - I'm pretty sure the Intel memory sampling stuff supports that, and I think you can tell perf mem to look at only misses.
Finally, note that here I'm using the -U argument after record, which instructs perf mem to record only userspace events. By default it will include kernel events, which may or may not be useful for you. For the example program, there are many kernel events associated with copying the p array from the binary into writable process memory.
Keep in mind that I specifically arranged my program so that the global array p ended up in the initialized .data section (the binary is ~400 MB!), so that it shows up with the right symbol in the listing. The vast majority of the time, your process is going to be accessing dynamically allocated or stack memory, which will just give you a raw address. Whether you can map this back to a meaningful object depends on whether you track enough information to make that possible.
1 I think it's in reference cycles, but I could be wrong and the kernel may have already converted it to nanoseconds?
2 The "Local" and "hit" part here refer to the fact that we hit the RAM attached to the current core, i.e., we didn't have go to the RAM associated with another socket in a multi-socket NUMA configuration.
If you want to know the exact virtual or physical address of every cache miss on a particular processor, that is very hard and sometimes impossible. But you are more likely to be interested in expensive memory access patterns: those that incur large latencies because they miss in one or more levels of the cache subsystem. Keep in mind that a cache miss on one processor might be a cache hit on another, depending on the design details of each processor and also on the operating system.
There are several ways to find such patterns, two are commonly used. One is to use a simulator such as gem5 or Sniper. Another is to use hardware performance events. Events that represent cache misses are available but they do not provide any details on why or where a miss occurred. However, using a profiler, you can approximately associate cache misses as reported by the corresponding hardware performance events with the instructions that caused them which in turn can be mapped back to locations in the source code using debug information. Examples of such profilers include Intel VTune Amplifier and AMD CodeXL. The results produced by simulators and profilers may not be accurate and so you have to be careful when interpreting them.

Linux Multi-Threaded Performance Enhancements for File open()

I’m working on tuning performance of a high-performance, high-capacity data engine which ultimately services an end-user web experience. Specifically, the piece delegated to me revolves around characterizing multi-threaded file IO and memory mapping of the data to local cache. In writing test applications to isolate the timing tall-poles, several questions have been exposed. The code has been minimized to perform only a system file open (open(O_RDONLY)) call. I’m hoping that the result of this query helps us understand the fundamental low-level system processes so that a complete predictive (or at least relational) timing model can be understood. Suggestions are always welcome. We seem to have hit a timing barrier, and would like to understand the behavior and determine whether that barrier can be broken.
The test program:
Is written in C, compiled using the gnu C compiler as noted below;
Is minimally written to isolate the discovered issues to a single system file “open()”;
Is configurable to simultaneously launch a requested number of pthreads;
loads a list of 1000 text files of ~8K size;
creates the threads (simply) with no attribute modifications;
each thread performs multiple, sequential file open() calls on the next available file from the pre-determined list until the file list is exhausted, such that a single thread should open all 1000 files, 2 threads should theoretically open 500 files each (not proven as of yet), etc. (a minimal sketch of this harness appears after this list).
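A minimal sketch of that harness, reconstructed from the description above (mine; timing, error handling, and file-list loading are elided):

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define NFILES 1000
static char *files[NFILES];  // pre-loaded list of ~8K text files
static int next_file;        // index of the next available file
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&list_lock);
        int i = (next_file < NFILES) ? next_file++ : -1;
        pthread_mutex_unlock(&list_lock);
        if (i < 0)
            break;           // file list exhausted
        int fd = open(files[i], O_RDONLY);  // the call under test
        if (fd >= 0)
            close(fd);
    }
    return NULL;
}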
We’ve run tests multiple times, parametrically varying the thread count, file sizes, and whether the files are located on a local or remote server. Several questions have come up.
Observed results (opening remote files):
File open times are higher the first time through (as expected, due to file caching);
Running the test app with one thread to load all the remote files takes X seconds;
It appears that running the app with a thread count between 1 and # of available CPUs on the machine results in times that are proportional to the number of CPUs (nX seconds).
Running the app using a thread count > #CPUs results in run times that seem to level out at approximately the same value as the time it takes to run with #CPUs threads (is this coincidental, a systematic limit, or what?).
Running multiple, concurrent processes (for example, 25 concurrent instances of the same test app) results in the times being approximately linear with number of processes for a selected thread count.
Running app on different servers shows similar results
Observed results (opening files residing locally):
Orders of magnitude faster times (as to be expected);
With increasing thread count, a LOW timing inflection point occurs at around 4-5 active threads; times then increase again until the number of threads equals the CPU count, then level off;
Running multiple, concurrent processes (same test) results in the times being approximately linear with number of processes for a constant thread count (same result as #5 above).
Also, we noticed that local opens take about 0.01 ms and sequential network opens are 100x slower, at 1 ms each. Opening network files, we get a linear throughput increase up to 8x with 8 threads, but 9+ threads do nothing. The network open calls seem to block after more than 8 simultaneous requests. What we expected was an initial delay equal to the network round trip, and then approximately the same throughput as local. Perhaps there is extra mutex locking done on the local and remote systems that takes 100x longer. Perhaps there is some internal queue of remote calls that only holds 8.
Expected results and questions to be answered either by test or by answers from forums like this one:
Running multiple threads would result in the same work done in shorter time;
Is there an optimal number of threads;
Is there a relationship between the number of threads and CPUs available?
Is there some other systematic reason that an 8-10 file limit is observed?
How does the system call “open()” work in a multi-threaded process?
Each thread gets its context-switched time-slice;
Does the open() call block and wait until the file is open/loaded into file cache? Or does the call allow context switching to occur while the operation is in progress?
When the open() completes, does the scheduler re-prioritize that thread to execute sooner, or does the thread have to wait until its turn comes around, round-robin style;
Would having the mounted volume on which the 1000 files reside set as read-only or read/write make a difference?
When open() is called with a full path, is each element in the path stat()ed? Would it make more sense to open() a common directory in the list of files tree, and then open() the files under that common directory by relative path?
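(On that last question: the "open a common directory, then open by relative path" idea maps directly onto openat(2), which starts the kernel's path walk at an already-resolved directory. A sketch:)

#include <fcntl.h>
#include <unistd.h>

// Resolve the common parent once, then open each file relative to it,
// so the per-file path walk skips the shared prefix.
static int open_in_dir(int dirfd, const char *name) {
    return openat(dirfd, name, O_RDONLY);
}

// usage: int dirfd = open("/mnt/data/common", O_RDONLY | O_DIRECTORY);
//        int fd = open_in_dir(dirfd, "file0001.txt");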
Development test setup:
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
8-CPUS, each with characteristics as shown below:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
stepping : 6
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips : 6317.47
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
GNU C compiler, version:
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Not sure if this is one of your issues, but it may be of use.
The one thing that struck me while optimizing thousands of random reads on a single SATA disk was that performing non-blocking I/O cleanly in Linux isn't easy without extra threads.
It is (currently) impossible to issue a non-blocking read() on a block device; it will block for the ~5 ms seek time the disk needs (and 5 ms is an eternity at 3 GHz). Specifying O_NONBLOCK to open() only served some purpose for backward compatibility, with CD burners or something (this was a rather vague issue). Normally, open() doesn't block or cache anything; it's mostly just there to get a handle on a file for doing some data I/O later.
For my purposes, mmap() seemed to get me as close to the kernel's handling of the disk as possible. Using madvise() and mincore(), I was able to fully exploit the NCQ capabilities of the disk; I proved this simply by varying the queue depth of outstanding requests, which turned out to be inversely proportional to the total time taken to issue 10k reads.
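The pattern looked roughly like this (a simplified sketch; addresses must be page-aligned for mincore()):

#include <stddef.h>
#include <sys/mman.h>

// Ask the kernel to start reading these pages in the background,
// without blocking the calling thread.
static void queue_read(void *addr, size_t len) {
    madvise(addr, len, MADV_WILLNEED);
}

// Later, poll whether the first page of the range is resident before
// touching it, so the thread never blocks on a disk seek.
static int page_resident(void *addr) {
    unsigned char vec;
    return mincore(addr, 1, &vec) == 0 && (vec & 1);
}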
Thanks to 64-bit memory addressing, using mmap() to map an entire disk into memory is no problem at all. (On 32-bit platforms, you would need to map the parts of the disk you need using mmap64().)
