Using perf to monitor the memory accesses of every CPU

I'm trying to use the Linux perf tool to sample the memory accesses in my program. Specifically, I'm using perf to monitor the read/write accesses of every CPU on a NUMA system.
Now, I can monitor every single CPU's read and write memory accesses, but I also need to know whether each access is a local memory access or a remote memory access.
I have used perf list to go through the events list, but I have only found some events about socket-level memory access.
Questions
Is there any way to get every single CPU's remote memory accesses when using perf?
Is there a better option than perf ?

Yes, the PMU in your CPU can probably do what you want through the various uncore counters - in particular, they can count the various offcore responses for non-local memory accesses. This blog post is a reasonable starting point.
The main problem is that often the perf tool, which is tied to the specific kernel version, will lag behind in its support of modern processors [1], especially when it comes to uncore and NUMA-related events [2].
To work around that, you can use Andi Kleen's pmu-tools, which provide an ocperf wrapper script that uses whatever underlying perf you have on your system, but with up-to-date event IDs downloaded directly from Intel. That will usually give you access to the uncore events you need.
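For example, on a Haswell-EP class server (event names differ between microarchitectures, so check what ocperf.py list reports for yours), counting remote-DRAM accesses per CPU could look like this:

    # see which remote/NUMA events ocperf knows for this CPU
    ocperf.py list | grep -i remote
    # count remote-DRAM load misses system-wide, broken out per CPU
    # (-a = all CPUs, -A = no aggregation; this event name is Haswell-EP specific)
    ocperf.py stat -a -A -e mem_load_uops_l3_miss_retired.remote_dram -- sleep 10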
Of course, even when you get that working, these events are often very tough to interpret, especially because the mental model you have of demand-memory requests is complicated by a ton of factors such as prefetch behavior, requests-for-ownership, accesses that "hit" a line-fill buffer that is still in the process of being filled, etc.
[1] Partly because support for new processors/events is added with some lag, but especially because the tool is tied to the kernel: you likely aren't on a bleeding-edge kernel, so even though mainline perf might have support, you are stuck with the perf version associated with your kernel.
[2] Probably because most kernel developers, like developers in general, aren't working on NUMA systems.

Related

Is there a way to measure cache coherence misses

Given a program running on multiple cores, if two or more cores are operating on the same cache line, is there a way to measure the number of cache coherence invalidations/misses there are (i.e. when Core1 writes to the cache line, which then forces Core2 to refresh its copy of the cache line so that both cores are consistent)?
Let me know if I'm using the wrong terminology for this concept.
Yes, hardware performance counters can be used to do so.
However, the way to fetch them tends to depend on the operating system and your processor. On Linux, the perf tool can be used to track performance counters (specifically perf stat -e COUNTER_NAME_1,COUNTER_NAME_2,etc.). Alternatively, on both Linux and Windows, Intel VTune can do this too.
The list of the hardware counters can be retrieved using perf list (or with PMU-Tools).
The kind of metric you want to measure looks like Request For Ownership (RFO) in the MESI cache-coherence protocol. Fortunately, most modern (x86_64) processors include hardware events to measure RFOs. On Intel Skylake processors, there is the hardware event l2_rqsts.all_rfo, and more precisely l2_rqsts.rfo_hit and l2_rqsts.rfo_miss, to do this at the L2-cache level. Alternatively, there are many more advanced RFO-related hardware events that can be used at the offcore level.
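As a sketch under those assumptions (Skylake event names, Linux perf, and ./myapp standing in for your program):

    # count RFO requests at the L2, split into hits and misses
    perf stat -e l2_rqsts.all_rfo,l2_rqsts.rfo_hit,l2_rqsts.rfo_miss -- ./myapp
    # recent kernels also ship perf c2c, which is purpose-built for finding
    # contended cache lines (true/false sharing) between cores
    perf c2c record -- ./myapp
    perf c2c report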

How to watch and record the values of all the registers, such as `eax` and `ecx`, and the instructions the CPU is executing?

We would like to write code that watches and records the contents of all the registers (such as eax and ecx) and all the instructions the CPU is executing, so that we can use machine-learning methods to identify whether certain instruction sequences are malicious.
We previously modified translate.c in QEMU to record intermediate information, including registers and instructions; that is to say, we recorded all the information while QEMU was translating instructions from the virtual machine to the real computer.
But collecting information from the QEMU virtual machine is far less efficient than using the real machine, so we plan to write code that collects all the information under Windows 10 on a real computer.
The problem is that when we write code to obtain the value of the PC register, the value is always the address of the next line of our own code. How can we watch the instructions (or code) of the other programs executing in parallel on the CPU?
Would you mind offering some ideas? Thanks!
You can use the Precise Event Based Sampling (PEBS) feature of Intel CPUs.
PEBS enables storing architectural information, such as the general-purpose register contents, in a designated buffer each time a Performance Monitoring Counter (PMC) overflows.
A Performance Monitoring Interrupt (PMI) can then be raised when the buffer fills past a configurable threshold.
PEBS can be enabled only on IA32_PMC0 but that isn't a limitation.
It was introduced with the Intel NetBurst architecture (Pentium 4), so it's available on every modern Intel CPU.
Of the events for which PEBS is available, INSTR_RETIRED.ANY_P is probably the one you are looking for (note: I don't think this counter increments by exactly one every time, but that should be minor noise, irrelevant for your analysis).
The PMI is dispatched through the Local APIC, so you have to program it to map the PMI to an interrupt vector so that you can attach an ISR to it.
PEBS requires some setup, particularly the Debug Store (DS) area and some meta-structure for storing the recordings.
By programming PMC0 to count instructions retired (as a bonus, you can use different thresholds to tune the granularity of the recording), installing a PMI ISR that reads the PEBS records and saves them somewhere, and finally enabling PEBS, you will be able to record the contents of the registers.
Note that this must be done inside the OS, so you must be proficient with Windows internals and particularly the scheduler.
Also, the amount of information collected will be huge, so there will still be a considerable slowdown of the system.
A nice side-effect of using PMCs is that you can enable them only for user-code, kernel-code or both if needed.
A complete reference for the PEBS feature can be found in Chapter 18 of the Intel Software Developer's Manual, Volume 3.
Chapter 17 is prerequisite reading.
If PEBS is too cumbersome, you can try experimenting with the Trap Flag (TF), which forces the CPU to trap on every instruction.
Combined with the Last Branch Recording (LBR) feature you can step jump-by-jump instead of instruction-by-instruction.
It can also be combined with the Task State Segment (TSS) and related mechanisms to automate some of the recording (like spilling the context into memory) upon the trap interrupt.
I'm not aware, off the top of my head, of any other means to trace registers.
The Virtual Machine Extensions (VMX) are not useful, since changing the GP registers doesn't involve any instruction that causes a VM exit.
Under Linux, one could look into rr but this is not your case.
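That said, if a Linux-based experiment is ever an option, the stock perf tool already wires PEBS up for exactly this kind of register recording; a rough sketch (the :pp modifier requests precise/PEBS sampling, and ./myapp stands in for the program to trace):

    # sample every 1000 retired instructions; -I dumps the GP registers per sample
    perf record -I -e instructions:pp -c 1000 -- ./myapp
    # print the sampled instruction pointers and register values
    perf script -F ip,iregs | head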
If I may, perhaps tracing register contents at the system-call boundary (e.g. by hooking the syscall entry point, which is easy for a driver) is a better idea.
Processes change their registers constantly, but ultimately only what they pass to system calls affects the system (this neglects 0-days and shared memory).

Will gettimeofday() be slowed due to the fix to the recently announced Intel bug?

I have been estimating the impact of the recently announced Intel bug on my packet processing application using netmap. So far, I have measured that I process about 50 packets per each poll() system call made, but this figure doesn't include gettimeofday() calls. I have also measured that I can read from a non-existing file descriptor (which is about the cheapest thing that a system call can do) 16.5 million times per second. My packet processing rate is 1.76 million packets per second, or in terms of system calls, 0.0352 million system calls per second. This means performance reduction would be 0.0352 / 16.5 = 0.21333% if system call penalty doubles, hardly something I should worry about.
However, my application may use gettimeofday() system calls quite often. My understanding is that these are not true system calls, but rather implemented as virtual system calls, as described in What are vdso and vsyscall?.
Now, my question is, does the fix to the recently announced Intel bug (that may affect ARM as well and that probably won't affect AMD) slow down gettimeofday() system calls? Or is gettimeofday() an entirely different animal due to being implemented as a different kind of virtual system call?
In general, no.
The current patches keep things like the vDSO pages mapped in user-space, and only change the behavior for the remaining vast majority of kernel-only pages which will no longer be mapped in user-space.
On most architectures, gettimeofday() is implemented as a purely userspace call that never enters the kernel, so it doesn't incur the TLB flush or CR3 switch that KPTI implies, and you shouldn't see a performance impact.
Exceptions include unusual kernel or hardware configurations that don't use the vDSO mechanism, e.g. if you don't have a constant-rate TSC, or if you have explicitly disabled TSC-based timekeeping via a boot parameter. You'd probably already know if that were the case, since it would mean gettimeofday takes 100-200 ns rather than 15-20 ns, as it is already making a kernel call.
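A quick way to check which path you're on (a sketch; ./myapp stands in for your binary):

    # if the vDSO path is used, strace reports no gettimeofday syscalls at all
    strace -f -e trace=gettimeofday ./myapp
    # the vDSO fast path relies on an invariant TSC; check for these CPU flags
    grep -o 'constant_tsc\|nonstop_tsc' /proc/cpuinfo | sort -u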
Good question - the vDSO pages are kernel memory mapped into user space. If you single-step into gettimeofday(), you see a call into the vDSO page, where some code uses rdtsc and scales the result with scale factors it reads from another data page.
But these pages are supposed to be readable from user-space, so Linux can keep them mapped without any risk. The Meltdown vulnerability is that the U/S bit (user/supervisor) in page-table / TLB entries doesn't stop unprivileged loads (and further dependent instructions) from happening microarchitecturally, producing a change in the microarchitectural state which can then be read with cache-timing.

Profiling CPU Cache/Memory from the OS/Application?

I wish to write software which could essentially profile the CPU cache (L2,L3, possibly L1) and the memory, to analyze performance.
Am I right in thinking this is not doable, because software has no access to the cache contents?
Another way of wording my Q: is there any way to know, from the OS/Application level, what data has been loaded into cache/memory?
EDIT: The operating system is Windows or Linux, and the CPU is an Intel desktop/Xeon part.
You might want to look at Intel's PMU, i.e. Performance Monitoring Unit. Some processors have one. It is a bunch of special-purpose registers (Intel calls them Model-Specific Registers, or MSRs) which you can program to count events, like cache misses, using the RDMSR and WRMSR instructions.
Here is a document about Performance Analysis on i7 and Xeon 5500.
You might want to check out Intel's Performance Counter Monitor, which is basically some routines that abstract the PMU, which you can use in a C++ application to measure several performance metrics live, including cache misses. It also has some GUI/Commandline tools for standalone use.
Apparently, the Linux kernel has a facility for manipulating MSRs.
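That facility is the msr kernel module, exposed as /dev/cpu/*/msr and wrapped by the msr-tools utilities; a small sketch, with the MSR addresses taken from the Intel SDM:

    # load the msr driver, then read IA32_PERFEVTSEL0 (0x186) and IA32_PMC0 (0xC1) on CPU 0
    sudo modprobe msr
    sudo rdmsr -p 0 0x186
    sudo rdmsr -p 0 0xC1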
There are other utilities/APIs that also use the PMU: perf, PAPI.
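For instance, perf's generic cache events work without touching the MSRs yourself (./myapp is a stand-in for your program):

    # count generic cache events for one run of a program
    perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses -- ./myapp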
Cache performance is generally measured in terms of hit rate and miss rate.
There are many tools to do this for you. Check how Valgrind does cache profiling.
Also, cache performance is generally measured on a per-program basis. Well-written programs will result in fewer cache misses and better cache performance, and vice versa for poorly written code.
Measuring the actual cache speed is the headache of the hardware manufacturers, and you can refer to their manuals to find this value.
The Callgrind/Cachegrind combination can help you track cache hits/misses.
This has some examples.
TAU, an open-source profiler which works using PAPI, can also be used.
If, however, you want to write code to measure the cache statistics yourself, you can write a program using PAPI. PAPI allows the user to access the hardware counters without needing to know the system architecture.
The PMU uses Model-Specific Registers, hence you must have knowledge of the registers to be used.
Perf allows for measurement of L1 and LLC (which is L2); Cachegrind, on the other hand, allows the user to measure L1 and LLC (which can be L2 or L3, whichever the highest-level cache is). Use Cachegrind only if you have no need for fast results, because Cachegrind runs the program about 10x slower.
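For reference, a minimal Cachegrind run looks like this (Cachegrind names its output file after the process ID):

    # simulate the L1/LL caches; summary hit/miss counts are printed at exit
    valgrind --tool=cachegrind ./myapp
    # annotate per-function miss counts from the generated output file
    cg_annotate cachegrind.out.<pid>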

How to measure memory bandwidth utilization on Windows?

I have a highly threaded program but I believe it is not able to scale well across multiple cores because it is already saturating all the memory bandwidth.
Is there any tool out there which allows to measure how much of the memory bandwidth is being used?
Edit: Please note that typical profilers show things like memory leaks and memory allocation, which I am not interested in.
I am only interested in whether the memory bandwidth is being saturated or not.
If you have a recent Intel processor, you might try to use Intel(r) Performance Counter Monitor: http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ It can directly measure consumed memory bandwidth from the memory controllers.
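For example, the pcm-memory utility from that package reports read/write traffic per memory channel and per socket (a sketch; the binary has been named pcm-memory.x in some releases):

    # print memory-controller bandwidth once per second (needs root for MSR access)
    sudo pcm-memory 1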
I'd recommend the Visual Studio Sample Profiler which can collect sample events on specific hardware counters. For example, you can choose to sample on cache misses. Here's an article explaining how to choose the CPU counter, though there are other counters you can play with as well.
It would be hard to find a tool that measures memory bandwidth utilization for your application.
But since the issue you face is a suspected memory-bandwidth problem, you could try to measure whether your application is generating a lot of page faults per second, which would definitely mean that you are nowhere near the theoretical memory bandwidth.
You should also measure how cache-friendly your algorithms are. If they are thrashing the cache, your memory bandwidth utilization will be severely hampered. Google "measuring cache misses" for good sources that tell you how to do this.
It isn't possible to properly measure memory-bus utilisation with any kind of software-only solution. (It used to be, back in the '80s or so, but then we got pipelining, caches, out-of-order execution, multiple cores, non-uniform memory architectures with multiple busses, etc., etc.)
You absolutely have to have hardware monitoring the memory bus, to determine how 'busy' it is.
Fortunately, most PC platforms do have some, so you just need the drivers and other software to talk to it:
wenjianhn comments that there is a project specifically for Intel hardware (which they call the Processor Counter Monitor) at https://github.com/opcm/pcm
For other architectures on Windows, I am not sure. But there is a project (for linux) which has a grab-bag of support for different architectures at https://github.com/RRZE-HPC/likwid
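With likwid, for instance, the MEM performance group reads the memory-controller (uncore) counters directly; a sketch, assuming an Intel CPU and that ./myapp is your program:

    # measure memory bandwidth while running the program pinned to cores 0-3
    likwid-perfctr -C 0-3 -g MEM ./myapp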
In principle, a computer engineer could attach a suitable oscilloscope to almost any PC and do the monitoring 'directly', although this is likely to require both a suitably-trained computer engineer as well as quite high performance test instruments (read: both very costly).
If you try this yourself, know that you'll likely need instruments or at least analysis which is aware of the protocol of the bus you're intending to monitor for utilisation.
This can sometimes be really easy with some busses - e.g. old parallel FIFO hardware, which usually has a separate wire for 'fifo full' and another for 'fifo empty'.
Such chips are usually used between a faster bus and a slower one, on a one-way link. The 'fifo full' signal, even if it normally triggers only occasionally, can be monitored for excessively 'long' levels: for the example of a USB 2.0 Hi-Speed link, this happens when the OS isn't polling the USB FIFO hardware on time. Measuring the frequency and duration of these 'holdups' then lets you measure bus utilisation, but only for this USB 2.0 bus.
For a PC memory bus, I guess you could also try just monitoring how much power your RAM interface is using - which perhaps may scale with use. This might be quite difficult to do, but you may 'get lucky'. You want the current of the supply which feeds VccIO for the bus. This should actually work much better for newer PC hardware than those ancient 80's systems (which always just ran at full power when on).
A fairly ordinary oscilloscope is enough for either of those examples - you just need one that can trigger only on 'pulses longer than a given width', and leave it running until it does, which is a good way to do 'soak testing' over long periods.
Either way, you monitor utilisation by looking for the change in 'idle' time.
But modern PC memory busses are quite a bit more complex, and also much faster.
To do it directly by tapping the bus, you'll need at least an oscilloscope (and active probes) designed explicitly for monitoring the generation of DDR bus your PC has, along with the software analysis option (usually sold separately) to decode the protocol enough to figure out the kind of activity occurring on it, from which you can figure out what kind of activity you want to measure as 'idle'.
You may even need a motherboard designed to allow you to make those measurements.
This isn't as straightforward as just looking for periods of no activity - all DRAM needs regular refresh cycles at the very least, which may or may not happen along with obvious bus activity (some DRAMs do it automatically, some need a specific command to trigger it, some can continue to address and transfer data from banks not in refresh, some can't, etc.).
So the instrument needs to be able to analyse the data deeply enough for you to extract how busy it is.
Your best and simplest bet is to find a PC hardware (CPU) vendor who has tools which do what you want, and buy that hardware so you can use those tools.
This might even involve running your application in a VM, so you can benefit from better tools in a different OS hosting it.
To this end, you'll likely want to try Linux KVM (yes, even for Windows - there are Windows guest drivers for it), and also pin your VM to specific CPUs, whilst configuring Linux to avoid putting other jobs on those same CPUs.
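A sketch of that pinning, assuming a libvirt-managed guest hypothetically named vm1 (CPU numbers are illustrative):

    # pin guest vCPUs 0 and 1 to host CPUs 2 and 3
    virsh vcpupin vm1 0 2
    virsh vcpupin vm1 1 3
    # and keep the host scheduler off those CPUs via a kernel boot parameter:
    #   isolcpus=2,3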
