Reading performance registers from the kernel - linux-kernel

I want to read certain performance counters. I know that there are tools like perf that can do this for me from user space, but I want the code to be inside the Linux kernel.
I want to write a mechanism to monitor performance counters on an Intel(R) Core(TM) i7-3770 CPU. I am using Ubuntu with kernel 4.19.2. I got the following method from easyperf.
Here's part of my code to read instructions.
struct perf_event_attr pe;
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_INSTRUCTIONS;
pe.disabled = 0;
pe.exclude_kernel = 0;
pe.exclude_user = 0;
pe.exclude_hv = 0;
pe.exclude_idle = 0;
fd = syscall(__NR_perf_event_open, &pe, pid, cpu, grp, flags);
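/* typical values: pid = 0 (calling process), cpu = -1 (any CPU),
   grp = -1 (no event group), flags = 0 */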
uint64_t perf_read(int fd) {
    uint64_t val;
    int rc;

    rc = read(fd, &val, sizeof(val));
    assert(rc == sizeof(val));
    return val;
}
I want to put the same lines in the kernel code (in the context switch function) and check the values being read.
My end goal is to figure out a way to read the performance counters for a process, every time it switches to another, from the kernel (4.19.2) itself.
To achieve this I checked out the code for the system call number __NR_perf_event_open. It can be found here
To make it usable, I copied the code inside it into a separate function, named it perf_event_open(), in the same file, and exported it.
Now the problem is that whenever I call perf_event_open() in the same way as above, the descriptor returned is -2. Checking the error codes, I figured out that the error was ENOENT. In the perf_event_open() man page, the cause of this error is given as a wrong type field.
Since file descriptors are associated with the process that opened them, how can one use them from the kernel? Is there an alternative way to configure the PMU to start counting without involving file descriptors?

You probably don't want the overhead of reprogramming a counter inside the context-switch function.
The easiest thing would be to make system calls from user-space to program the PMU (to count some event, probably setting it to count in kernel mode but not user-space, just so the counter overflows less often).
Then just use rdpmc twice (to get start/stop counts) in your custom kernel code. The counter will stay running, and I guess the kernel perf code will handle interrupts when it wraps around. (Or when its PEBS buffer is full.)
IDK if it's possible to program a counter so it just wraps without interrupting, for use-cases like this where you don't care about totals or sample-based profiling, and just want to use rdpmc. If so, do that.
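As a rough illustration, a kernel-side read via rdpmc could look like the sketch below. This assumes the event was already programmed (e.g. from user space as described) and that it landed in general-purpose counter 0; the real counter index must match what perf assigned, and fixed-function counters need bit 30 set in ECX.
static inline u64 read_pmc(u32 counter)
{
    u32 lo, hi;

    /* RDPMC reads the counter selected by ECX into EDX:EAX */
    asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
    return ((u64)hi << 32) | lo;
}

/* in the code of interest, e.g. around a context switch: */
u64 start = read_pmc(0);
/* ... section being measured ... */
u64 delta = (read_pmc(0) - start) & ((1ULL << 48) - 1); /* PMCs are typically 48 bits wide */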
Old answer, addressing your old question, which was based on a buggy printf format string that printed non-zero garbage even though you weren't counting anything in user-space either.
Your inline asm looks correct, so the question is what exactly that PMU counter is programmed to count in kernel mode in the context where your code runs.
perf virtualizes the PMU counters on context-switch, giving the illusion of perf stat counting a single process even when it migrates across CPUs. Unless you're using perf -a to get system-wide counts, the PMU might not be programmed to count anything, so multiple reads would all give 0 even if at other times it's programmed to count a fast-changing event like cycles or instructions.
Are you sure you have perf set to count user + kernel events, not just user-space events?
perf stat will show something like instructions:u instead of instructions if it's limiting itself to user-space. (This is the default for non-root if you haven't lowered sysctl kernel.perf_event_paranoid to 0 or something from the safe default that doesn't let user-space learn anything about the kernel.)
There's HW support for programming a counter to only count when CPL != 0 (i.e. not in ring 0 / kernel mode). Higher values for kernel.perf_event_paranoid restrict the perf API to not allow programming counters to count in kernel+user mode, but even with paranoid = -1 it's possible to program them this way. If that's how you programmed a counter, then that would explain everything.
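For concreteness, in terms of the perf_event_attr fields from the question's code, the exclusion bits are what select this (a sketch, reusing the same pe variable):
pe.exclude_kernel = 1;   /* count only at CPL != 0: shows up as instructions:u */
pe.exclude_user   = 0;

/* versus counting in both modes (typically needs kernel.perf_event_paranoid <= 1, or root): */
pe.exclude_kernel = 0;
pe.exclude_user   = 0;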
We need to see your code that programs the counters. That doesn't happen automatically.
The kernel doesn't just leave the counters running all the time when no process has used a PAPI function to enable a per-process or system-wide counter; that would generate interrupts that slow the system down for no benefit.

Related

Allocate swappable memory in linux kernel

Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?
You can create a file in the internal shmem shared-memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(p)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);
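Putting the pieces together, a minimal sketch of the whole sequence (the function name is mine) might look like:
#include <linux/shmem_fs.h>
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/string.h>

static int shmem_scratch_demo(void)
{
    struct file *filp = shmem_file_setup("example", PAGE_SIZE, 0);
    struct page *p;
    void *data;

    if (IS_ERR(filp))
        return PTR_ERR(filp);

    p = shmem_read_mapping_page(filp->f_mapping, 0);
    if (IS_ERR(p)) {
        fput(filp);
        return PTR_ERR(p);
    }

    data = page_to_virt(p);
    memset(data, 0, PAGE_SIZE);   /* use the (swappable) page */

    put_page(p);                  /* release the page... */
    fput(filp);                   /* ...and the file */
    return 0;
}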
The answer to your question is a simple no, or a yes with complex modifications to the kernel source.
First, to enable swapping out, you have to ask yourself what happens when kswapd swaps pages out. Essentially it walks through all the processes and decides whether their memory can be swapped out or not. All of this memory has the hardware mode of ring 3, so SMAP forbids it from being read as data or executed as code in the kernel (ring 0):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
Check your distro's CONFIG_X86_SMAP; on my Ubuntu it defaults to "y", as it has for the past few years.
But if you keep your memory at a kernel address (ring 0), then you would need to change the kswapd operation to trigger swap-out of kernel addresses. Which kernel addresses should it walk first? And what if an address is part of kswapd's own kernel operation? The complexities involved are huge.
Next, consider the swap-in operation: when a read of the memory is attempted and its "not present" bit is set, a hardware exception triggers the Linux kernel memory fault handler (__do_page_fault()).
And looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and thereafter how it handles kernel addresses (do_kern_addr_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
which essentially just reports an error in every scenario. If you want to enable page faulting on kernel addresses, this path has to be modified.
Note too that the SMAP check (inside smap_violation) is done in the user-address page-fault path (do_user_addr_fault()).

Recompile Linux Kernel not to use specific CPU register

I'm doing an experiment that writes the index of a loop into CPU register R11, building it with gcc -ffixed-r11 to tell the compiler not to use that register, and finally measuring it with perf.
But when I check the report (using perf script), the R11 values of most record entries are not what I expected. They are supposed to form a number sequence like 1..2..3 or 1..4..7, etc., but actually they are just a few fixed values (possibly overwritten by system calls?).
How can I make perf record the value I set in the register in my program? Or must I recompile the whole kernel with -ffixed-r11 to achieve this?
Thanks everyone.
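(For reference, the experiment presumably looks something like this minimal sketch: a global register variable pinned to r11, built with gcc -ffixed-r11; the variable name is hypothetical.)
register unsigned long r11_val asm("r11");

int main(void)
{
    for (unsigned long i = 0; i < 100000000; i++)
        r11_val = i;   /* value to be sampled with perf record --intr-regs=r11 */
    return 0;
}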
You should not try to recompile the kernel when you just want to sample some register with perf. As I understand it, the kernel has its own set of registers and will not overwrite the user-space R11. The syscall interface uses some fixed registers which can't be changed (can you try a different register?), and there are often glibc gateways to syscalls which may use additional registers (they are not in the kernel; they are user-space code, often generated or written in assembler). You may try using gdb to watch the register for changes to find out who modified it. It can do this (hmm, one more link to the same user on SO): gdb: breakpoint when register will have value 0xffaa, e.g. gdb ./program, then the gdb commands start; watch $r11; continue; where.
Two weeks ago there was a question, perf-report show value of CPU register, about register value sampling with perf:
I follow this document and using perf record with --intr-regs=ax,bx,r15, trying to log additional CPU register information with PEBS record.
While that was x86 and PEBS, ARM may have --intr-regs implemented too. Check the output of perf record --intr-regs=\? (man perf-record: "To list the available registers use --intr-regs=\?") to find the support status and register names.
To print the registers, use the perf script -F ip,sym,iregs command. There is an example in some Linux kernel commits:
# perf record --intr-regs=AX,SP usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (8 samples) ]
# perf script -F ip,sym,iregs | tail -5
ffffffff8105f42a native_write_msr_safe AX:0xf SP:0xffff8802629c3c00
ffffffff8105f42a native_write_msr_safe AX:0xf SP:0xffff8802629c3c00
ffffffff81761ac0 _raw_spin_lock AX:0xffff8801bfcf8020 SP:0xffff8802629c3ce8
ffffffff81202bf8 __vma_adjust_trans_huge AX:0x7ffc75200000 SP:0xffff8802629c3b30
ffffffff8122b089 dput AX:0x101 SP:0xffff8802629c3c78
#
If you need a cycle-accurate profile of to-the-metal CPU activity, then perf is not the right tool: it is at best an approximation, since it only samples the program at select points. See this video on perf by Clang developer Chandler Carruth.
Instead, you should single-step through the program to monitor exactly what is happening to the registers. Or you could program your system bare-metal without an OS, but that is probably outside the scope here.

Linux multiprocessing in kernel space

I do multiprocessing in user space by creating multiple threads and pinning them to specific CPU cores. They process data in a specific place in memory, which is never swapped out and is constantly filled with new data by a streaming device driver. I want the best speed, so I thought of moving everything into kernel space to get rid of all the memory-pointer conversion and the kernel->user communication and vice versa. The process will be started with an ioctl to the driver. Then multiple threads will be spawned and pinned to specific cores. The process is terminated with another ioctl by the user.
So far I have gathered info about kthreads, which are not so well documented, and I wrote this in the ioctl function:
thread_ch1 = kthread_create(&Channel1_thread,(void *)buffer,"Channel1_thread");
thread_ch2 = kthread_create(&Channel2_thread,(void *)buffer,"Channel2_thread");
thread_ch3 = kthread_create(&Channel3_thread,(void *)buffer,"Channel3_thread");
thread_ch4 = kthread_create(&Channel4_thread,(void *)buffer,"Channel4_thread");
printk(KERN_WARNING "trd1: %p, trd2: %p, trd3: %p, trd4: %p\n",thread_ch1,thread_ch2,thread_ch3,thread_ch4);
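/* note: kthread_bind() must be called before the thread first runs,
   i.e. after kthread_create() but before wake_up_process() */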
kthread_bind(thread_ch1,0);
kthread_bind(thread_ch2,1);
kthread_bind(thread_ch3,2);
kthread_bind(thread_ch4,3);
wake_up_process(thread_ch1);
wake_up_process(thread_ch2);
wake_up_process(thread_ch3);
wake_up_process(thread_ch4);
and the ioctl returns.
Each Channeln_thread is just a for loop:
int Channel1_thread(void *buffer)
{
    uint64_t i;

    for (i = 0; i < 10000000000; i++)
        ;
    return 0;
}
The threads never seem to execute, and thread_ch1-4 are non-NULL pointers. If I add a small delay before the ioctl returns, I can see the threads running.
Can someone shed some light?
Thanks

mwait x86 instruction doesn't wait for DMA

I'm trying to use monitor/mwait instructions to monitor DMA writes from a device to a memory location. In a kernel module (char device) I have the following code (very similar to this piece of kernel code) that runs in a kernel thread:
static int do_monitor(void *arg)
{
    struct page *p = arg; // p is a 'struct page *'; it's also remapped to user space
    uint32_t *location_p = phys_to_virt(page_to_phys(p));
    uint32_t prev = 0;
    int i = 0;

    while (i++ < 20) // to avoid infinite loop
    {
        if (*location_p == prev)
        {
            __monitor(location_p, 0, 0);
            if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
                clflush(location_p);
            if (*location_p == prev)
                __mwait(0, 0);
        }
        prev = *location_p;
        printk(KERN_NOTICE "%d", prev);
    }
    return 0;
}
In user space I have the following test code:
int fd = open("/dev/mon_test_dev", O_RDWR);
unsigned char *mapped = (unsigned char *)mmap(0, mmap_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
for (int i = 1; i <= 5; ++i)
*mapped = i;
munmap(mapped, mmap_size);
close(fd);
And the kernel log looks like this:
1
2
3
4
5
5
5
5
5
5
5 5 5 5 5 5 5 5 5 5
I.e. it seems that mwait doesn't wait at all.
What could be the reason?
The definition of MONITOR/MWAIT semantics does not explicitly specify whether DMA transactions may or may not trigger it. Triggering is supposed to happen for a logical processor's stores.
The current descriptions of MONITOR and MWAIT in Intel's official Software Developer's Manual are quite vague in that respect. However, two clauses in the MONITOR section caught my attention:
The content of EAX is an effective address (in 64-bit mode, RAX is used). By default, the DS segment is used to create a linear address that is monitored.
The address range must use memory of the write-back type. Only write-back memory will correctly trigger the monitoring hardware.
The first clause states that MONITOR is meant to be used with linear addresses, not physical ones. Devices and their DMA work with physical addresses only. So basically this means that all agents relying on the same MONITOR range would have to operate in the same domain of virtual memory space.
The second clause requires the monitored memory region to be cacheable (write-back, WB). For DMA, the respective memory range usually has to be marked uncacheable, or write-combining at best (UC or WC). This is an even stronger indication that your intent to have MONITOR/MWAIT triggered by DMA is very unlikely to work on current hardware.
Considering your high-level goal (being able to tell when a device has written to a given memory range), I cannot think of any robust method to achieve it besides using virtualization for devices (VT-d, IOMMU, etc.). Basically, the classic approach for a peripheral device is to issue an interrupt when it is done writing to memory. Until an interrupt arrives, there is no way for the CPU to tell whether all DMA bytes have successfully reached their destination in memory.
Device virtualization allows physical addresses to be abstracted from a device in a transparent manner, providing the equivalent of a page fault when the device attempts to write to or read from memory.
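To illustrate that classic interrupt-driven approach, a driver typically does something like the following sketch (the handler and IRQ names are hypothetical):
#include <linux/interrupt.h>
#include <linux/completion.h>

static DECLARE_COMPLETION(dma_done);

static irqreturn_t my_dma_irq(int irq, void *dev_id)
{
    /* the device raises this interrupt once its DMA write has completed */
    complete(&dma_done);
    return IRQ_HANDLED;
}

/* setup: request_irq(my_irq, my_dma_irq, 0, "my_dma_dev", NULL); */

static void wait_for_dma(void)
{
    /* block until the device signals that all DMA bytes have landed */
    wait_for_completion(&dma_done);
}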

Why doesn't gcc handle volatile register?

I'm working on a timing loop for the AVR platform where I'm counting down a single byte inside an ISR. Since this task is a primary function of my program, I'd like to permanently reserve a processor register so that the ISR doesn't have to hit a memory barrier when its usual code path is decrement, compare to zero, and reti.
The avr-libc docs show how to bind a variable to a register, and I got that working without a problem. However, since this variable is shared between the main program (for starting the timer countdown) and the ISR (for actually counting and signaling completion), it should also be volatile to ensure that the compiler doesn't do anything too clever in optimizing it.
In this context (reserving a register across an entire monolithic build), the combination volatile register makes sense to me semantically, as "permanently store this variable in register rX, but don't optimize away checks because the register might be modified externally". GCC doesn't like this, however, and emits a warning that it might go ahead and optimize away the variable access anyway.
The bug history of this combination in GCC suggests that the compiler team is simply unwilling to consider the type of scenario I'm describing and thinks it's pointless to provide for it. Am I missing some fundamental reason why the volatile register approach is in itself a Bad Idea, or is this a case that makes semantic sense but that the compiler team just isn't interested in handling?
The semantics of volatile are not exactly as you describe "don't optimize away checks because the register might be modified externally" but are actually more narrow: Try to think of it as "don't cache the variable's value from RAM in a register".
Seen this way, it does not make any sense to declare a register as volatile because the register itself cannot be 'cached' and therefore cannot possibly be inconsistent with the variable's 'actual' value.
The fact that read accesses to volatile variables are usually not optimized away is merely a side effect of the above semantics, but it's not guaranteed.
I think GCC should assume by default that a value in a register is 'like volatile' but I have not verified that it actually does so.
Edit:
I just did a small test and found:
avr-gcc 4.6.2 does not treat global register variables like volatiles with respect to read accesses, and
the Naggy extension for Atmel Studio detects an error in my code: "global register variables are not supported".
Assuming that global register variables are actually considered "unsupported" I am not surprised that gcc treats them just like local variables, with the known implications.
My test code looks like this:
uint8_t var;
volatile uint8_t volVar;
register uint8_t regVar asm("r13");
#define NOP asm volatile ("nop\r\n":::)
int main(void)
{
    var = 1;        // <-- kept
    if (var == 0) {
        NOP;        // <-- optimized away, var is not volatile
    }

    volVar = 1;     // <-- kept
    if (volVar == 0) {
        NOP;        // <-- kept, volVar *is* volatile
    }

    regVar = 1;     // <-- optimized away, regVar is treated like a local variable
    if (regVar == 0) {
        NOP;        // <-- optimized away consequently
    }

    for (;;) {}
}
The reason you would use the volatile keyword on AVR variables is, as you said, to keep the compiler from optimizing access to the variable. The question now is: how does this happen?
A variable has two places it can reside: either in the general-purpose register file or in some location in RAM. Consider the case where the variable resides in RAM. To access the latest value of the variable, the compiler loads it from RAM using some form of the ld instruction, say lds r16, 0x000f. In this case, the variable was stored in RAM location 0x000f and the program made a copy of it in r16.
Now, here is where things get interesting if interrupts are enabled. Say that after loading the variable, inc r16 occurs, then an interrupt triggers and its corresponding ISR runs. Within the ISR, the variable is also used. There is a problem, however: the variable exists in two different versions, one in RAM and one in r16. Ideally, the compiler should use the version in r16, but that one is not guaranteed to exist, so it loads the variable from RAM instead, and now the code does not operate as intended.
Enter the volatile keyword. The variable is still stored in RAM; however, the compiler must ensure that the variable is updated in RAM before anything else happens, so the following assembly may be generated:
cli
lds r16, 0x000f
inc r16
sei
sts 0x000f, r16
First, interrupts are disabled. Then, the variable is loaded into r16. The variable is incremented, interrupts are enabled, and then the variable is stored. It may appear confusing that the global interrupt flag is enabled before the variable is stored back to RAM, but from the instruction set manual:
The instruction following SEI will be executed before any pending interrupts.
This means that the sts instruction will be executed before any interrupts trigger again, and that the interrupts are disabled for the minimum amount of time possible.
Consider now the case where the variable is bound to a register. Any operations done on the variable are done directly on the register. These operations, unlike operations done to a variable in RAM, can be considered atomic, as there is no read -> modify -> write cycle to speak of. If an interrupt triggers after the variable is updated, it will get the new value of the variable, since it will read the variable from the register it was bound to.
Also, since the variable is bound to a register, any test instructions will utilize the register itself and will not be optimized away on the grounds the compiler may have a "hunch" it is a static value, given that registers by their very nature are volatile.
Now, from experience, when using interrupts on AVR, I have sometimes noticed that global volatile variables never hit RAM: the compiler kept them in registers the whole time, bypassing the read -> modify -> write cycle altogether. This was due, however, to compiler optimizations, and it should not be relied upon. Different compilers are free to generate different assembly for the same piece of code. You can generate a disassembly of your final file, or of any particular object file, using the avr-objdump utility.
Cheers.
Reserving a register for one variable for a complete compilation unit is probably too restrictive for a compiler's code generator. That is, every C routine would have to NOT use that register.
How do you guarantee that other called routines do NOT use that register once your code goes out of scope? Even stuff like serial i/o routines would have to NOT use that reserved register. Compilers do NOT recompile their run-time libraries based on a data definition in a user program.
Is your application really so time-sensitive that the extra delay of bringing memory up from L2 or L3 can be detected? If so, then your ISR might be running so frequently that the required memory location is always available (i.e. it doesn't get pushed back down through the cache) and thus does NOT hit a memory barrier (I assume that by memory barrier you are referring to how memory in a CPU really operates, through caching, etc.). But for this to really be true, the CPU would have to have a fairly large L1 cache and the ISR would have to run at a very high frequency.
Finally, sometimes an application's requirements make it necessary to code it in ASM in which case you can do exactly what you are requesting!
