I know that you can't use malloc inside a kernel module because all functions used in the kernel must be defined in the kernel, but how exactly does the kernel achieve this lock-down?
It's not so much that it's locked down. It's just that your kernel module has no idea where malloc() is. The malloc() function is part of the C standard library, which is loaded alongside programs in userspace. When a userland program is executed, the linker loads the shared libraries the program needs and figures out where the needed functions are. So it loads libc at some address, and malloc() ends up at some offset within it. When your program calls malloc(), it is actually calling into libc.
Your kernel module isn't linked against libc or any other userspace component. It's linked against the kernel, which doesn't include malloc(). Your kernel driver can't depend on the address of anything in userspace, because it may have to run in the context of any userspace program, or even in no context at all, such as in an interrupt. So the code for malloc() may not even be in memory anywhere when your module runs. Now, if you knew you were running in the context of a process that had libc loaded, and you knew the address malloc() was located at, you could potentially call that address through a function pointer. Bad things would probably happen, possibly including a kernel panic. You don't want to cross the userspace/kernelspace boundary except through sane, well-defined interfaces.
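Inside the kernel you use the kernel's own allocators instead. Here is a minimal sketch of a module allocating memory with kmalloc() (the module name and size are arbitrary, just for illustration):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/slab.h>   /* kmalloc()/kfree() live here, not in libc */

static char *buf;

static int __init demo_init(void)
{
    /* GFP_KERNEL: ordinary allocation that may sleep;
     * use GFP_ATOMIC instead when in interrupt context */
    buf = kmalloc(128, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;
    return 0;
}

static void __exit demo_exit(void)
{
    kfree(buf);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");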
When you write a module for the kernel, you simply don't have these functions in your header files, and you don't have them in the files you are linking against.
Also, malloc's implementation ultimately makes system calls. A system call switches you into kernel (supervisor) mode and runs kernel code; there is no point doing that when you are already running in kernel mode.
You can see it here in more detail.
Related
How can I read from the PMU from inside Kernel space?
For a profiling task I need to read the retired-instructions counter provided by the PMU from inside the kernel. The perf_event_open system call seems to offer this capability. In my source code I
#include <linux/syscalls.h>
set my parameters for the perf_event_attr struct and call sys_perf_event_open(). The mentioned header contains the function declaration. Checking /proc/kallsyms confirms that there is a system call named sys_perf_event_open. The capital T indicates that the symbol is global:
ffffffff8113fe70 T sys_perf_event_open
So everything should work as far as I can tell.
Still, when compiling or inserting the LKM I get a warning/error that sys_perf_event_open does not exist.
WARNING: "sys_perf_event_open" [/home/vagrant/mods/lkm_read_pmu/read_pmu.ko] undefined!
What do I need to do in order to read that retired-instructions counter?
The /proc/kallsyms file shows all kernel symbols defined in the source. Right, the capital T indicates a global symbol in the text section of the kernel binary, but "global" here means global in the C-language sense: the symbol can be used in other files of the kernel itself. You can't call a kernel function from a kernel module just because it's global.
Kernel modules can only use kernel symbols that are exported with EXPORT_SYMBOL in the kernel source code. Since kernel 2.6.0, none of the system calls are exported, so you can't call any of them from a kernel module, including sys_perf_event_open. System calls are really designed to be called from user space. What this all means is that you can't use the perf_event subsystem from within a kernel module.
That said, I think you can modify the kernel to add EXPORT_SYMBOL to sys_perf_event_open. That will make it an exported symbol, which means it can be used from a kernel module.
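A minimal sketch of that change, assuming the syscall definition lives in kernel/events/core.c and an older kernel where syscall functions still take their arguments directly (the calling convention changed in later kernels):

/* kernel/events/core.c: immediately after the existing
 * SYSCALL_DEFINE5(perf_event_open, ...) definition, add: */
EXPORT_SYMBOL(sys_perf_event_open);

After rebuilding and booting this kernel, the module's undefined-symbol warning goes away. Bear in mind that the syscall still expects user-space pointers (the attr argument is declared __user), so calling it with kernel pointers brings its own set of problems.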
Imagine a situation like this: I take a function pointer, which is located in user space, from a syscall, and the kernel module calls this function back.
(It would be important for this function to run in user space)
Will the kernel module see the same memory address (the acquired function pointer) as the user-space application? (I mean the user's virtual address space, or linear address space.)
First off, you are trying to do something wrong. If you need custom code in the kernel, you provide it as a kernel module.
The answer in the linked duplicate (Executing a user-space function from the kernel space) is largely crap. It would "work" on certain architectures, as long as the function makes no syscalls and doesn't touch TLS or other userspace runtime state. In fact, this is how plenty of exploits do it.
I'll take a function pointer, which is located in the user space, from a syscall, and the kernel module calls back this function.
It really sounds like you are trying to do something backwards. If you need a userspace component, that's the thing which should have all the logic. Then you call the kernel telling it what to do.
(It would be important for this function to run in user space?)
Who are you asking? I can only state that calling a function which was planted by userspace does not mean it starts "running in user space". Switching to userspace is a lot of work, definitely not done by calling a function.
Will the kernel module see the same memory address (acquired function pointer) as the user space application?
That depends on the architecture; typically it will. But even then, there are hardware protections against using this "feature" (e.g. SMEP on x86), which have to be explicitly turned off.
But again, you DON'T want to do it. I strongly suggest you state the actual problem.
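To illustrate the sane direction, here is a minimal sketch (device name, ioctl number, and all identifiers are made up for illustration) where userspace keeps the logic and calls into the kernel through a device node, with data crossing the boundary via copy_from_user():

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

#define DEMO_IOCTL_DO_WORK _IOW('d', 1, int)   /* made-up ioctl number */

static long demo_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
    int value;

    if (cmd != DEMO_IOCTL_DO_WORK)
        return -ENOTTY;
    /* data crosses the boundary through copy_from_user, not raw pointers */
    if (copy_from_user(&value, (int __user *)arg, sizeof(value)))
        return -EFAULT;
    pr_info("demo: userspace asked us to do work with %d\n", value);
    return 0;
}

static const struct file_operations demo_fops = {
    .owner          = THIS_MODULE,
    .unlocked_ioctl = demo_ioctl,
};

static struct miscdevice demo_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "demo",
    .fops  = &demo_fops,
};

module_misc_device(demo_dev);
MODULE_LICENSE("GPL");

Userspace would then open /dev/demo and issue ioctl(fd, DEMO_IOCTL_DO_WORK, &value); the interface is explicit, and the kernel never calls back into user memory.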
Is it possible to pin a softirq, or any other bottom half, to a processor? I wonder whether this can be done from within the softirq code itself.
But then, inside a driver, is it possible to pin a particular IRQ to a core?
From user mode, you can easily do this by writing to /proc/irq/N/smp_affinity to control which processor(s) an interrupt is directed to. The symbols for the code implementing this are not exported, though, so it's difficult to do from the kernel (at least from a loadable module, which is how most drivers are structured).
The fact that the implementing functions' symbols aren't exported is a sign that the kernel developers don't want to encourage this, presumably because it takes control away from the user and embeds assumptions about the number of processors into the driver.
So, to answer your question: yes, it's possible, but it's discouraged, and you would need to do one of several "ugly" things to implement it: (a) change the kernel's exports, (b) link your driver statically into the main kernel, or (c) open and write to the proc file from kernel mode.
The usual way to achieve this is by writing a user-mode program (can even be a shell script) that programs core numbers/masks into the appropriate proc file. See Documentation/IRQ-affinity.txt in the kernel source directory for details.
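For completeness, here is a tiny sketch of that user-mode approach in C (the IRQ number 42 is made up; this needs root):

/* Pin IRQ 42 to CPU 0 from user space. The affinity file takes a
 * hex CPU bitmask: 0x1 = CPU0, 0x3 = CPU0+CPU1, and so on. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/42/smp_affinity", "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    fputs("1", f);   /* bitmask 0x1: direct IRQ 42 to CPU 0 only */
    fclose(f);
    return 0;
}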
I am trying to understand the mechanism used by Linux to invoke a system call. In particular, I am struggling to understand the vDSO mechanism. Can it be used to invoke all system calls? What is the difference between the vDSO page and the vsyscall page within the process memory? Are they always there?
For example, using cat /proc/self/maps:
7fff32938000-7fff32939000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
The vsyscall and vDSO segments are two mechanisms used to accelerate certain system calls in Linux. For instance, gettimeofday is usually invoked through this mechanism. The first mechanism introduced was vsyscall, which was added as a way to execute specific system calls that do not need any real level of privilege, in order to reduce system-call overhead. Following the previous example, all gettimeofday needs to do is read the current time from the kernel. There are applications that call gettimeofday frequently (e.g. to generate timestamps), to the point that they care about even a little bit of overhead. To address this concern, the kernel maps into user space a page containing the current time and a fast gettimeofday implementation (i.e. just a function which reads the time stored in that page). Using this virtual system call, the C library can provide a fast gettimeofday which avoids the context-switch overhead between kernel space and user space usually introduced by the classic system call model (INT 0x80 or SYSCALL).
However, this vsyscall mechanism has some limitations: the memory allocated is small and allows only 4 system calls, and, more importantly, the vsyscall page is statically allocated at the same address in every process, since its location is nailed down in the kernel ABI. This static allocation compromises the benefit of the address space layout randomisation (ASLR) commonly used by Linux. An attacker, after compromising an application by exploiting a stack overflow, can invoke a system call from the vsyscall page with arbitrary parameters. All he needs is the address of the system call, which is easily predictable as it is statically allocated (if you run your command again, even with different applications, you'll notice that the address of the vsyscall page does not change).
It would be nice to remove, or at least randomize, the location of the vsyscall page to thwart this type of attack. Unfortunately, applications depend on the existence and exact address of that page, so the page itself cannot simply be removed or moved.
This security issue has been addressed by replacing the system call instructions at those fixed addresses with a special trap instruction. An application trying to call into the vsyscall page will trap into the kernel, which then emulates the desired virtual system call in kernel space. The result is a kernel system call emulating a virtual system call which was put there to avoid the kernel system call in the first place: the vsyscall takes longer to execute but, crucially, does not break the existing ABI. In any case, the slowdown will only be seen if the application is trying to use the vsyscall page instead of the vDSO. The vDSO offers the same functionality as the vsyscall while overcoming its limitations. The vDSO (virtual dynamically linked shared object) is a memory area allocated in user space which exposes some kernel functionality in a safe manner.
This has been introduced to solve the security threats caused by the vsyscall.
The vDSO is dynamically allocated, which solves the security concerns, and it can support more than four system calls. The vDSO bindings are provided via glibc: the linker links in the glibc vDSO functionality whenever a routine has an accompanying vDSO version, such as gettimeofday. When your program executes, if your kernel does not have vDSO support, a traditional syscall is made instead.
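You can observe the vDSO at work yourself. As a minimal sketch, the program below calls gettimeofday() in a tight loop; running it under strace shows no corresponding system calls, because glibc resolves the call through the vDSO:

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval tv;
    /* glibc routes this through the vDSO when available, so no
     * kernel entry happens per iteration */
    for (int i = 0; i < 1000000; i++)
        gettimeofday(&tv, NULL);
    printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
    return 0;
}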
Credits and useful links:
Awesome tutorial on how to create your own vDSO.
vsyscall and vDSO, a nice article.
Useful article and links.
What is linux-gate.so.1?
Ultimately I am just trying to figure out how to dynamically allocate heap memory from within assembly.
If I call Linux sbrk() from assembly code, can I use the returned address just as I would use the address of a statically declared chunk of memory (i.e. one in the .data section of my program)?
I know Linux uses the hardware MMU if present, so I am not sure whether sbrk returns a 'raw' pointer to real RAM or a 'cooked' pointer to RAM that may be remapped by Linux's VM system.
I read this: How are sbrk/brk implemented in Linux?. I suspect I cannot use the return value from sbrk() without worry: the MMU fault on accessing a non-allocated address must cause the VM to alter the real location in RAM being addressed. Thus my assembly, not being linked against libc or what-have-you, would not know the address had changed.
Does this make sense, or am I out to lunch?
Unix user processes live in virtual memory, no matter whether they are written in assembler or Fortran, and should not care about physical addresses. That's the kernel's business: the kernel sets up and manages the MMU. You don't have to worry about it; page faults are handled automatically and transparently.
sbrk(2) returns a virtual address specific to the process, if that's what you were asking.
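A minimal sketch in C of the point being made (the same holds for the raw brk/sbrk path from assembly): the value sbrk() returns is a virtual address you can dereference directly, and any page faults behind it are serviced transparently.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* grow the heap by one page; sbrk returns the previous break,
     * a virtual address usable immediately */
    char *p = sbrk(4096);
    if (p == (void *)-1) {
        perror("sbrk");
        return 1;
    }
    p[0] = 'A';   /* first touch may page-fault; the kernel maps a
                     physical page without the program noticing */
    printf("new memory at %p, p[0] = %c\n", (void *)p, p[0]);
    return 0;
}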