How can I capture combined kernel and userspace stacks with perf - linux-kernel

I'm trying to capture combined user and kernel stacks with perf, so I can see which user space code produces are particular kernel call chain.
Basically I want to create a flamegraph looking like this:
Unfortunately all my kernel stacks end at entry_SYSCALL_64_fastpath and there is no connection to the userspace stacks.
I'm using perf record -g --call-graph dwarf -F 99 --pid 12345 to capture. I have debug symbols for the kernel, libc and my program.
This is kernel 4.8.14 on a Fedora 25 system.

Try bcc utilities that use BPF technology. Take a look at profile util.
https://github.com/iovisor/bcc/blob/master/docs/tutorial.md

Related

what are the two numbers for an instruction location in objdump of a kernel module? [duplicate]

Consider the following Linux kernel dump stack trace; e.g., you can trigger a panic from the kernel source code by calling panic("debugging a Linux kernel panic");:
[<001360ac>] (unwind_backtrace+0x0/0xf8) from [<00147b7c>] (warn_slowpath_common+0x50/0x60)
[<00147b7c>] (warn_slowpath_common+0x50/0x60) from [<00147c40>] (warn_slowpath_null+0x1c/0x24)
[<00147c40>] (warn_slowpath_null+0x1c/0x24) from [<0014de44>] (local_bh_enable_ip+0xa0/0xac)
[<0014de44>] (local_bh_enable_ip+0xa0/0xac) from [<0019594c>] (bdi_register+0xec/0x150)
In unwind_backtrace+0x0/0xf8 what does +0x0/0xf8 stand for?
How can I see the C code of unwind_backtrace+0x0/0xf8?
How to interpret the panic's content?
It's just an ordinary backtrace, those functions are called in reverse order (first one called was called by the previous one and so on):
unwind_backtrace+0x0/0xf8
warn_slowpath_common+0x50/0x60
warn_slowpath_null+0x1c/0x24
ocal_bh_enable_ip+0xa0/0xac
bdi_register+0xec/0x150
The bdi_register+0xec/0x150 is the symbol + the offset/length there's more information about that in Understanding a Kernel Oops and how you can debug a kernel oops. Also there's this excellent tutorial on Debugging the Kernel
Note: as suggested below by Eugene, you may want to try addr2line first, it still needs an image with debugging symbols though, for example
addr2line -e vmlinux_with_debug_info 0019594c(+offset)
Here are two alternatives for addr2line. Assuming you have the proper target's toolchain, you can do one of the following:
Use objdump:
locate your vmlinux or the .ko file under the kernel root directory, then disassemble the object file :
objdump -dS vmlinux > /tmp/kernel.s
Open the generated assembly file, /tmp/kernel.s. with a text editor such as vim. Go to
unwind_backtrace+0x0/0xf8, i.e. search for the address of unwind_backtrace + the offset. Finally, you have located the problematic part in your source code.
Use gdb:
IMO, an even more elegant option is to use the one and only gdb. Assuming you have the suitable toolchain on your host machine:
Run gdb <path-to-vmlinux>.
Execute in gdb's prompt: list *(unwind_backtrace+0x10).
For additional information, you may checkout the following resources:
Kernel Debugging Tricks.
Debugging The Linux Kernel Using Gdb
In unwind_backtrace+0x0/0xf8 what the +0x0/0xf8 stands for?
The first number (+0x0) is the offset from the beginning of the function (unwind_backtrace in this case). The second number (0xf8) is the total length of the function. Given these two pieces of information, if you already have a hunch about where the fault occurred this might be enough to confirm your suspicion (you can tell (roughly) how far along in the function you were).
To get the exact source line of the corresponding instruction (generally better than hunches), use addr2line or the other methods in other answers.

Recompile Linux Kernel not to use specific CPU register

I'm doing an experiment that write the index of loop into a CPU register R11, then building it with gcc -ffixed-r11 try to let compiler know do not use that reg, and finally using perf to measure it.
But when I check the report (using perf script), the R11 value of most record entry is not what I expected, it supposed to be the number sequence like 1..2..3 or 1..4..7, etc. But actually it just a few fixed value. (possibly affected by system call overwriting?)
How can I let perf records the value I set to the register in my program? Or I must to recompile the whole kernel with -ffixed-r11 to achieve?
Thanks everyone.
You should not try to recompile kernel when you just want to sample some register with perf. As I understand, kernel has its own set of registers and will not overwrite user R11. syscall interface uses some fixed registers which can't be changed (can you try different reg?) and there are often glibc gateways to syscall which may use some additional registers (they are not in kernel, they are user-space code; often generated or written in assembler). You may try using gdb to monitor the register to change to find who did it. It can do this (hmm, one more link to the same user on SO): gdb: breakpoint when register will have value 0xffaa like gdb ./program then gdb commands start; watch $r11; continue; where.
Two weeks age there was question perf-report show value of CPU register about register value sampling with perf:
I follow this document and using perf record with --intr-regs=ax,bx,r15, trying to log additional CPU register information with PEBS record.
While there was x86 & PEBS, ARM may have --intr-regs implemented too. Check output of perf record --intr-regs=\? (man perf-record: "To list the available registers use --intr-regs=\?") to find support status and register names.
To print registers, use perf script -F ip,sym,iregs command. There was example in some linux commits:
# perf record --intr-regs=AX,SP usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (8 samples) ]
# perf script -F ip,sym,iregs | tail -5
ffffffff8105f42a native_write_msr_safe AX:0xf SP:0xffff8802629c3c00
ffffffff8105f42a native_write_msr_safe AX:0xf SP:0xffff8802629c3c00
ffffffff81761ac0 _raw_spin_lock AX:0xffff8801bfcf8020 SP:0xffff8802629c3ce8
ffffffff81202bf8 __vma_adjust_trans_huge AX:0x7ffc75200000 SP:0xffff8802629c3b30
ffffffff8122b089 dput AX:0x101 SP:0xffff8802629c3c78
#
If you need cycle accurate profile of to the metal CPU activity then perf is not the right tool, as it is at best an approximation due to the fact it only samples the program at select points. See this video on perf by Clang developer Chandler Carruth.
Instead, you should single step through the program in order to monitor exactly what is happening to the registers. Or you could program your system bare metal without an OS, but that is probably outside the scope here.

change smp_affinity from linux device driver

If I examine the
cat /proc/interrupts
command, all the IRQs are listed under cpu0 in SMP system.
I can change the smp_affinity mask to tag the IRQ to particular CPU using following command.
echo "4" > /proc/irq/230/smp_affinity
Above command sets the affinity mask of the interrupt 230 to CPU 2.
I would like achieve same from linux kernel module. How can I do this?
I see create_proc_entry method which allows to create new proc entry.
Is there any method which we can use to write existing proc entry?
In a kernel module you can just call the kernel API function irq_set_affinity(...) directly. No need to go through /proc. See: http://lxr.free-electrons.com/source/kernel/irq/manage.c#L189

How to switch from user mode to kernel mode?

I'm learning about the Linux kernel but I don't understand how to switch from user mode to kernel mode in Linux. How does it work? Could you give me some advice or give me some link to refer or some book about this?
The only way an user space application can explicitly initiate a switch to kernel mode during normal operation is by making an system call such as open, read, write etc.
Whenever a user application calls these system call APIs with appropriate parameters, a software interrupt/exception(SWI) is triggered.
As a result of this SWI, the control of the code execution jumps from the user application to a predefined location in the Interrupt Vector Table [IVT] provided by the OS.
This IVT contains an adress for the SWI exception handler routine, which performs all the necessary steps required to switch the user application to kernel mode and start executing kernel instructions on behalf of user process.
To switch from user mode to kernel mode you need to perform a system call.
If you just want to see what the stuff is going on under the hood, go to TLDP is your new friend and see the code (it is well documented, no need of additional knowledge to understand an assembly code).
You are interested in:
movl $len,%edx # third argument: message length
movl $msg,%ecx # second argument: pointer to message to write
movl $1,%ebx # first argument: file handle (stdout)
movl $4,%eax # system call number (sys_write)
int $0x80 # call kernel
As you can see, a system call is just a wrapper around the assembly code, that performs an interruption (0x80) and as a result a handler for this system call will be called.
Let's cheat a bit and use a C preprocessor here to build an executable (foo.S is a file where you put a code from the link below):
gcc -o foo -nostdlib foo.S
Run it via strace to ensure that we'll get what we write:
$ strace -t ./foo
09:38:28 execve("./foo", ["./foo"], 0x7ffeb5b771d8 /* 57 vars */) = 0
09:38:28 stat(NULL, Hello, world!
NULL) = 14
09:38:28 write(0, NULL, 14)
I just read through this, and it's a pretty good resource. It explains user mode and kernel mode, why changes happen, how expensive they are, and gives some interesting related reading.
https://blog.codinghorror.com/understanding-user-and-kernel-mode
Here's a short excerpt:
Kernel Mode
In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC.
User Mode
In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode.

What options do we have for communication between a user program and a Linux Kernel Module?

I am a new comer to Linux Kernel Module programming. From the material that I have read so far, I have found that there are 3 ways for a user program to request services or to communicate with a Linux Kernel Module
a device file in /dev
a file in /proc file system
ioctl() call
Question: What other options do we have for communication between user program and linux kernel module?
Your option 3) is really a sub-option of option 1) - ioctl() is one way of interacting with a device file (read() and write() being the usual ways).
Two other ways worth considering are:
The sysfs filesystem;
Netlink sockets.
Basically, many standard IPC mechanisms — cf. http://en.wikipedia.org/wiki/Inter-process_communication — can be used:
File and memory-mapped file: a device file (as above) or similarly special file in /dev, procfs, sysfs, debugfs, or a filesystem of your own, cartesian product with read/write, ioctl, mmap
Possibly signals (for use with a kthread)
Sockets: using a protocol of choice: TCP, UDP (cf. knfsd, but likely not too easy), PF_LOCAL, or Netlink (many subinterfaces - base netlink, genetlink, Connector, ...)
Furthermore,
 4. System calls (not really usable from modules though)
 5. Network interfaces (akin to tun).
Working examples of Netlink — just to name a few — can be found for example in
git://git.netfilter.org/libmnl (userspace side)
net/core/rtnetlink.c (base netlink)
net/netfilter/nf_conntrack_netlink.c (nfnetlink)
fs/quota/netlink.c (genetlink)
This includes all types with examples :)
http://people.ee.ethz.ch/~arkeller/linux/kernel_user_space_howto.html
Runnable examples of everything
Too much talk is making me bored!
file operations:
file types that implement file operations:
procfs. See also: proc_create() example for kernel module
debugfs
character devices. See also: https://unix.stackexchange.com/questions/37829/how-do-character-device-or-character-special-files-work/371758#371758
sysfs. See also: How to attach file operations to sysfs attribute in platform driver?
file operation syscalls themselves
open, read, write, close, lseek. See also: How to add poll function to the kernel module code?
poll. See also: How do I use ioctl() to manipulate my kernel module?
ioctl. See also: How do I use ioctl() to manipulate my kernel module?
mmap. See also: How to mmap a Linux kernel buffer to user space?
anonymous inodes. See also: What is an anonymous inode in Linux?
netlink sockets. See also: How to use netlink socket to communicate with a kernel module?
This Linux document gives some of the ways in which the kernel and user space can interact(communicate). They are the following.
Procfs, sysfs, and similar mechanisms. This includes /dev entries as well, and all the methods in which kernel space exposes a file in user space (/proc, /dev, etc. entries are basically files exposed from the kernel space).
Socket based mechanisms. Netlink is a type of socket, which is meant specially for communication between user space and kernel space.
System calls.
Upcalls. The kernel executes a code in user space. For example spawning a new process.
mmap - Memory mapping a region of kernel memory to user space. This allows both the kernel, and the user space to read/write to the same memory area.
Other than these, the following list adds some other mechanisms I know.
Interrupts. The user space can raise interrupts to talk to kernel space. For example some CPUs use int80 to make system calls (while others may use a different mechanism like syscall instruction). The kernel has to define the corresponding interrupt handler in advance.
vDSO/vsyscall - These are mechanisms in Linux kernel to optimize execution of some system calls. The idea is to have a shared memory region, and when a process makes a system call, the user space library gets data from this region, instead of actually calling the corresponding system call. This saves context switch overhead.

Resources