Variable event count based sampling using perf - performance

I am trying to read the PMU event counters whenever a particular event counter overflows using perf. I know that perf works with fixed sample period. What i am looking for is the possibility to read PMU counters each time with a different sample period within an application.? Yes, i have a requirement that demands variable sampling period. E.g If an application has 1000 instructions. I want to read PMU event counters at 200, 150, 300, 50, 150, 100, 50 instructions. Any directions would be really helpful.

Related

Is it possible to do a perf stat with an interval which would be not a time?

I would like to know if it is possible to modify easily perf linux with stat module to create an interval of cycles (or instructions by cycle) instead of an interval of time ? The goal is to optimize the precision of the counters got by interval. The time unit is not accurate enough.
I have a friend which submits this idea but I looked the source code a little and I understood (maybe I have wrong) that we :
create a condition for a time calculated with the rdtsc library (some clock_gettime)
create a "wait" in the perf processus
launch the program to test
test if we respect the time condition : we continue or we break the wait function with a save on the mapped register system information in perf (and call the wait function if it is not over)
I would like this result :
cycles counts unit events
10000 25000 instructions
10000 450 branch-misses
20000 21000 instructions
20000 850 branch-misses
Unfortunately, I'm seeing a big problem if I want to use the result of a counter like a condition I have not yet. Or should I get all the time this or these counter(s) which define my "interval condition" ? I also saw that for a time interval, we shouldn't get counters with a frequency lower than 100ms because it generates overhead. If I get some counters every 10000 cycles, would I have the same problems ? I don't know where is this overhead (calls system ?).

Performance Counters and IMC Counter Not Matching

I have an Intel(R) Core(TM) i7-4720HQ CPU # 2.60GHz (Haswell) processor. In a relatively idle situation, I ran the following Perf commands and their outputs are shown, below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':
3,713,037 offcore_response.all_data_rd.l3_miss.any_response
2,909,573 mem_load_uops_retired.l3_miss
10.016644133 seconds time elapsed
These two values seem consistent, as the latter excludes prefetch requests and those not targeted at DRAM. But they do not match the read counter in the IMC. This counter is called UNC_IMC_DRAM_DATA_READS and documented here. I read the counter reread it 1 second later. The difference was around 30,000,000 (EDITED). If multiplied by 10 (to estimate for 10 seconds) the resulting value will be around 300 million (EDITED), which is 100 times the value of the above-mentioned performance counters (EDITED). It is nowhere near 3 million! What am I missing?
P.S.: The difference is much smaller (but still large), when the system has more load.
The question is also asked, here:
https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
UPDATE:
Please note that PCM output matches my IMC counter reads.
This is the relevant PCM output:
The values for columns READ, WRITE and IO are calculated based on UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO will be either READ or WRITE. In other words, during the depicted one second interval, almost (because of the inaccuracy reported in the above-mentioned doc) 2.01GB of the 2.42GB READ and WRITE requests belong to IO. Based on this explanation, the above three columns seem consistent with each other.
The problem is that there still exists a LARGE gap between the IMC and PMC values!
The situation is the same when I boot in runlevel 1. The processes on the scheduler are one of swapper, kworker and migration. Disk IO is almost 85KB/s. I'm wondering what leads to such a (relatively) huge amount of IO. Is it possible to detect that (e.g., using a counter or a tool)?
UPDATE 2:
I think that there is something wrong with the IO column. It is always something in the range [1.99,2.01], regardless of the amount of load in the system!
UPDATE 3:
In runlevel 1, the average number of occurrences of the uops_retired.all event in a 1-second interval is 15,000,000. During the same period, the number of read requests recorded by the associated IMC counter is around 30,000,000. In other words, assuming that all memory accesses are directly caused by cpu instructions, for each retired micro-operation, there exists two memory accesses. This seems impossible specially concerning the fact that there exist multiple levels of caches. Therefore, in the idle scenario, perhaps, the read accesses are caused by IO.
Actually, it was mostly caused by the GPU device. This was the reason for exclusion from performance counters. Here is the relevant output for a sample execution of PCM on a relatively idle system with resolution 3840x2160 and refresh rate 60 using xrandr:
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):
As can be seen, changing screen resolution reduced read and IO traffic considerably (more than 100x!).

What’s the meaning of `Duration: 30.18s, Total samples = 26.26s (87.00%)` in go pprof?

As my understanding, pprof stops and samples go program at every 10ms. So a 30s program should got 3000 samples, but what’s the meaning of the 26.26s? How can the samples count be shown as time duration?
What’s more, I even ever got such output shows that the sample time is bigger than wall time, how could it be such result?
Duration: 5.13s, Total samples = 5.57s (108.58%)
That confusing wording was reported in google/pprof issue 128
The "Total samples" part is confusing.
Milliseconds are continuous, but samples are discrete — they're individual points, so how can you sum them up into a quantity of milliseconds?
The "sum" is the sum of a discrete number (a quantity of samples), not a continuous range (a time interval).
Reporting the sums makes perfect sense, but reporting discrete numbers using continuous units is just plain confusing.
Please update the formatting of the Duration line to give a clearer indication of what a quantity of samples reported in milliseconds actually means.
Raul Silvera's answer:
Each callstack in a profile is associated to a set of values. What is reported here is the sum of these values for all the callstacks in the profile, and is useful to understand the weight of individual frames over the full profile.
We're reporting the sum using the unit described in the profile.
Would you have a concrete suggestion for this example?
Maybe just a rewording would help, like:
Duration: 1.60s, Samples account for 14.50ms (0.9%)
There are still pprof improvements discussed for pprof in golang/go issue 36821
pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across different runs).
The issue is with the use of OS timers used for sampling; OS timers are coarse-grained and have a high skid.
I will propose a design to extend CPU profiling by sampling CPU Performance Monitoring Unit (PMU) aka hardware performance counters
It includes examples where Total samples exceeds duration.
Dependence on the number of cores and length of test execution:
The results of goroutine.go test depend on the number of CPU cores available.
On a multi-core CPU, if you set GOMAXPROCS=1, goroutine.go will not show a huge variation, since each goroutine runs for several seconds.
However, if you set GOMAXPROCS to a larger value, say 4, you will notice a significant measurement attribution problem.
One reason for this problem is that the itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired.
Since Go 1.17 (and improved in Go 1.18), you can add pprof labels to know more:
A cool feature of Go's CPU profiler is that you can attach arbitrary key value pairs to a goroutine. These labels will be inherited by any goroutine spawned from that goroutine and show up in the resulting profile.
Let's consider the example below that does some CPU work() on behalf of a user.
By using the pprof.Labels() and pprof.Do() API, we can associate the user with the goroutine that is executing the work() function.
Additionally the labels are automatically inherited by any goroutine spawned within the same code block, for example the backgroundWork() goroutine.
func work(ctx context.Context, user string) {
labels := pprof.Labels("user", user)
pprof.Do(ctx, labels, func(_ context.Context) {
go backgroundWork()
directWork()
})
}
How you use these labels is up to you.
You might include things such as user ids, request ids, http endpoints, subscription plan or other data that can allow you to get a better understanding of what types of requests are causing high CPU utilization, even when they are being processed by the same code paths.
That being said, using labels will increase the size of your pprof files. So you should probably start with low cardinality labels such as endpoints before moving on to high cardinality labels once you feel confident that they don't impact the performance of your application.

Good resources on how to program PEBS (Precise event based sampling) counters?

I have been trying to log all memory accesses of a program, which as I read seems to be impossible. I have been trying to see to what extent can I go to log at least a major portion of the memory accesses, if not all. So I was looking to program the PEBS counters in such a way that I could see changes in the number of memory access samples collected. I wanted to know if I can do this by modifying the counter-reset value of PEBS counters. (Usually this goes to zero, but I want to set it to a higher value)
So I was looking to program these PEBS counters on my own. Has anybody had experience manipulating the PEBS counters ? Specifically I was looking for good sources to see how to program them. I have gone through the Intel documentation and understood the steps. But I wanted to understand some sample programs. I have gone through the below github repo :-
https://github.com/pyrovski/powertools
But I am not quite sure, how and where to start. Are there any other good sources that I need to look ? Any suggestion for good resources to understand and start programming will be very helpful.
Please, don't mix tracing and timing measurements in single run.
It is just impossible both to have fastest run of Spec and all memory accesses traced. Do one run for timing and other (longer,slower) for memory access tracing.
In https://github.com/pyrovski/powertools the frequency of collected events is controlled by reset_val argument of pebs_init:
https://github.com/pyrovski/powertools/blob/0f66c5f3939a9b7b88ec73f140f1a0892cfba235/msr_pebs.c#L72
void
pebs_init(int nRecords, uint64_t *counter, uint64_t *reset_val ){
// 1. Set up the precise event buffering utilities.
// a. Place values in the
// i. precise event buffer base,
// ii. precise event index
// iii. precise event absolute maximum,
// iv. precise event interrupt threshold,
// v. and precise event counter reset fields
// of the DS buffer management area.
//
// 2. Enable PEBS. Set the Enable PEBS on PMC0 flag
// (bit 0) in IA32_PEBS_ENABLE_MSR.
//
// 3. Set up the IA32_PMC0 performance counter and
// IA32_PERFEVTSEL0 for an event listed in Table
// 18-10.
// IA32_DS_AREA points to 0x58 bytes of memory.
// (11 entries * 8 bytes each = 88 bytes.)
// Each PEBS record is 0xB0 byes long.
...
pds_area->pebs_counter0_reset = reset_val[0];
pds_area->pebs_counter1_reset = reset_val[1];
pds_area->pebs_counter2_reset = reset_val[2];
pds_area->pebs_counter3_reset = reset_val[3];
...
write_msr(0, PMC0, reset_val[0]);
write_msr(1, PMC1, reset_val[1]);
write_msr(2, PMC2, reset_val[2]);
write_msr(3, PMC3, reset_val[3]);
This project is library to access PEBS, and there are no examples of its usage included in project (as I found there is only one disabled test in other projects by tpatki).
Check intel SDM Manual Vol 3B (this is the only good resource for PEBS programming) for meaning of the fields and PEBS configuration and output:
https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-734.html
18.15.7 Processor Event-Based Sampling
PEBS permits the saving of precise architectural information associated with one or more performance events in the precise event records buffer, which is part of the DS save area (see Section 17.4.9, “BTS and DS Save Area”).
To use this mechanism, a counter is configured to overflow after it has counted a preset number of events. After the counter overflows, the processor copies the current state of the general-purpose and EFLAGS registers and instruction pointer into a record in the precise event records buffer. The processor then resets the count in the performance counter and restarts the counter. When the precise event records buffer is nearly full, an interrupt is generated, allowing the precise event records to be saved. A circular buffer is not supported for precise event
records.
... After the PEBS-enabled counter has overflowed, PEBS
record is recorded
(So, reset value is probably negative, equal to -1000 to get every 1000th event, -10 to get every 10th event. Counter will increment and PEBS is written at counter overflow.)
and https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-656.html 18.4.4 Processor Event Based Sampling (PEBS) "Table 18-10" - only L1/L2/DTLB misses have PEBS event in Intel Core. (Find PEBS section for your CPU and search for memory events. PEBS-capable events are really rare.)
So, to have more event recorded you probably want to set reset part of this function to smaller absolute value, like -50 or -10. With PEBS this may work (and try perf -e cycles:upp -c 10 - don't ask to profile kernel with so high frequency, only user-space :u and ask for precise with :pp and ask for -10 counter with -c 10. perf has all PEBS mechanics implemented both for MSR and for buffer parsing).
Another good resource for PMU (hardware performance monitoring unit) are also from Intel, PMU Programming Guides. They have short and compact description both of usual PMU and PEBS too. There is public "Nehalem Core PMU", most of it still useful for newer CPUs - https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf (And there are uncore PMU guides: E5-2600 Uncore PMU Guide, 2012 https://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf)
External pdf about PEBS: https://www.blackhat.com/docs/us-15/materials/us-15-Herath-These-Are-Not-Your-Grand-Daddys-CPU-Performance-Counters-CPU-Hardware-Performance-Counters-For-Security.pdf#page=23 PMCs: Setting Up for PEBS - from "Black Hat USA 2015 - These are Not Your Grand Daddy's CPU Performance Counters"
You may start from short and simple program (not the ref inputs of recent SpecCPU) and use perf linux tool (perf_events) to find acceptable ratio of memory requests recorded to all memory requests. PEBS is used with perf by adding :p and :pp suffix to the event specifier record -e event:pp. Also try pmu-tools ocperf.py for easier intel event name encoding.
Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on the memory tests like (worst case of memory recording overhead, left part on the Arithmetic Intensity scale of Roofline model - STREAM is BLAS1, GUPS and memlat are almost SpMV; real tasks are usually not so left on the scale):
STREAM test (linear access to memory),
RandomAccess (GUPS) test
some memory latency test (memlat of 7z, lat_mem_rd of lmbench).
Do you want to trace every load/store commands or you only want to record requests that missed all (some) caches and were sent to main RAM memory of PC (to L3)?
Why you want no overhead and all memory accesses recorded? It is just impossible as every memory access have tracing of several bytes to be recorded to the memory. So, having memory tracing enabled (more than 10% or mem.access tracing) clearly will limit available memory bandwidth and the program will run slower. Even 1% tracing can be noted, but it effect (overhead) is smaller.
Your CPU E5-2620 v4 is Broadwell-EP 14nm so it may have also some earlier variant of the Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog on pt: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb"
PT support in hardware: Broadwell (5th generation Core, Xeon v4) More overhead. No fine grained timing.
PS: Scholars who study SpecCPU for memory access worked with memory access dumps/traces, and dumps were generated slowly:
http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses recorded to offline analysis, no timing was recorded from tracing runs
http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all load/stores instrumented by writing into additional huge tracing buffer to periodic (rare) online aggregation. Such instrumentation is from 2x slow or slower, especially for memory bandwidth / latency limited core.
http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation - program code was modified and instrumented to write memory access metadata into buffer. Such instrumentation is from 2x slow or slower, especially for memory bandwidth / latency limited core. The paper lists and explains instrumentation overhead and Caveats:
Instrumentation Overhead: Instrumentation involves
injecting extra code dynamically or statically into the
target application. The additional code causes an
application to spend extra time in executing the original
application ... Additionally, for multi-threaded
applications, instrumentation can modify the ordering of
instructions executed between different threads of the
application. As a result, IDS with multi-threaded
applications comes at the lack of some fidelity
Lack of Speculation: Instrumentation only observes
instructions executed on the correct path of execution. As
a result, IDS may not be able to support wrong-path ...
User-level Traffic Only: Current binary instrumentation
tools only support user-level instrumentation. Thus,
applications that are kernel intensive are unsuitable for
user-level IDS.

How do I get the sampling rate of a canvas element?

I am currently investigating preprocessing for on-line handwriting recognition (see http://write-math.com). One interesting property of input devices is the sampling rate, that means the rate at which I get onmousemove and similar events.
I can record them and see the time delta between two events is varies from 1.00 ms - 700.00 ms, but is in average 27.34 ms for this recording.
(sampling rate is measured in Points / second. So the sampling rate would be or for the average case )
Is there any possibility to get this information from the client directly? Are there devices where the sampling rate is known? How does Javascript internally decide how often to fire those events? Can the "event firing rate" be increased / decreased?
Mousemove firing rates for a particular browser are stable over a sufficient sample size, but the firing rate of any individual mouse event is affected by non-canvas activities (garbage collection, background tasks, etc).
I don't know of any browser that allows adjustment of the mousemove timing--all internally defined.
Interesting project!

Resources