In my kernel module I'd like to create multiple FDs, and pass them later to the user-space via ioctl.
The user-space code will use these FDs to wait for an event using poll() or select().
If I were creating such FDs in user space, I'd call eventfd(), but how do I do that in kernel space?
Judging by the system call expansion macro (#define SYSCALL_DEFINEx) in syscalls.h, you may be able to call sys_eventfd or sys_eventfd2 from kernel space.
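For the user-space half of this setup, a minimal sketch of waiting on such a descriptor with poll() might look like the following. It assumes the descriptor behaves like an eventfd (an 8-byte counter that becomes readable when signalled); wait_for_event is a hypothetical helper name, and in a real program the fd would come from the module's ioctl rather than from a local eventfd() call.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then consume the 8-byte eventfd counter into *count.
 * Returns 1 if an event was read, 0 on timeout, -1 on error.
 */
int wait_for_event(int fd, int timeout_ms, uint64_t *count)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret <= 0)
        return ret;                 /* 0 = timeout, -1 = poll() failed */
    if (!(pfd.revents & POLLIN))
        return -1;                  /* woke up for an error condition */
    /* An eventfd read always transfers exactly 8 bytes. */
    if (read(fd, count, sizeof(*count)) != (ssize_t)sizeof(*count))
        return -1;
    return 1;
}
```

To try it without the kernel module, create a stand-in descriptor with eventfd(0, 0), write a uint64_t to it, and pass it to wait_for_event().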
I am reading about the kprobes BPF program type, and am wondering whether it is possible not just to intercept a function call for tracing purposes or to collect low-level information (registers, stack, etc.), but to substitute the call and execute my own code instead of the actual function.
Does kprobe provide this capability, or am I looking at the wrong tool?
No. kprobes BPF programs have only read access to the syscall parameters and return value; they cannot modify registers and therefore cannot intercept function calls. This is a limitation imposed by the BPF verifier.
Kernel modules, however, can intercept function calls using kprobes.
I'm making an emulation driver that requires me to call schedule() in ATOMIC contexts in order to make the emulation part work. For now I have this hack that allows me to call schedule() inside ATOMIC (e.g. spinlock) context:
int p_count = current_thread_info()->preempt_count;
current_thread_info()->preempt_count = 0;
schedule();
current_thread_info()->preempt_count = p_count;
But that doesn't work inside IRQs; the system just stops after calling schedule().
Is there any way to hack the kernel to allow me to do it? I'm using Linux kernel 4.2.1 with User Mode Linux.
In kernel code you can be either in interrupt context or in process context.
When you are in interrupt context, you cannot call any blocking function (e.g., schedule()) or access the current pointer. That follows from how the kernel is designed, and there is no way to get that functionality in interrupt context. See also this answer.
Depending on your purpose, you may find a strategy that lets you reach your goal. To me, it sounds strange that you have to call schedule() explicitly instead of relying on the natural kernel flow.
One possible approach follows (but, again, it depends on your specific goal). From the IRQ you can schedule the work on a work queue through schedule_work(). The work queue, by design, executes kernel code in process context. From there, you are allowed to call blocking functions and access the current process data.
I am writing a Linux kernel module using Kprobes to trace specific system calls, and I need to write to a file from within a KProbe handler (specifically, a Kretprobe). I know this is generally not advised, but I need to write the output to a very specific location, so I can't use any standard logging mechanisms.
I can open/write fine from the init() function in the module, but when I try to do so from within a probe handler, the kernel crashes.
From Documentation/kprobes.txt:
Probe handlers are run with preemption disabled. Depending on the
architecture and optimization state, handlers may also run with
interrupts disabled (e.g., kretprobe handlers and optimized kprobe
handlers run without interrupt disabled on x86/x86-64). In any case,
your handler should not yield the CPU (e.g., by attempting to acquire
a semaphore).
In other words, you cannot sleep inside a probe handler. Because file read/write operations normally involve disk I/O, which can sleep, you cannot use them inside the handler.
I need to write the output to a very specific location, so I can't use any standard logging mechanisms.
You can output the trace from the probe handler into, e.g., a special device file, and run (in parallel) a user-space program that simply reads that file and writes the data to a file at your very specific location.
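The user-space side of that arrangement can be sketched as a small copy loop. This is only an illustration: relay_fd is a hypothetical helper name, in_fd would be the opened device file exported by the module, and out_fd the file at the required location. The sketch also assumes the device eventually returns end-of-file; a real trace device may instead block indefinitely, in which case the loop simply keeps running.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything readable from in_fd to out_fd
 * until read() reports end-of-file.
 * Returns the total number of bytes copied, or -1 on error.
 */
long relay_fd(int in_fd, int out_fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        if (n == 0)
            break;                      /* end-of-file */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry the read */
            return -1;
        }
        /* write() may be partial, so loop until the chunk is flushed. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
    return total;
}
```

All blocking I/O happens in this ordinary process, so the probe handler itself never has to sleep.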
I have a kernel module that uses hrtimers to notify userspace when the timer has fired. I understand I can just use userspace timers, but it is emulating a driver that will actually talk to hardware in the future. Every once in a while I get a BUG: Scheduling while atomic. After doing some research, I am assuming that the hrtimer.function I register as a callback is being called from an interrupt routine by the kernel internals (which puts my callback function in an "atomic context"). Then, when I call sysfs_notify() within the callback, I get the kernel bug, because sysfs_notify() acquires a mutex.
1) Is this a correct assumption?
If this is correct, I have seen that there is a function called sysfs_notify_dirent() that I can use to notify userspace from an atomic context. But according to this source:
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-10/msg07510.html
It can only be called from a "process" context, and not an interrupt context (due to the spinlock).
2) Could someone explain the difference between process, interrupt, and atomic context?
3) If this cannot be used in an interrupt context, what is an alternative to notifying userspace in this context?
Correct, sysfs_notify() cannot be called from atomic context. And yes, sysfs_notify_dirent() appears to be safe to call from atomic context. The source you cite is a bug report pointing out that, in an old kernel version, that statement wasn't actually true, along with a patch to fix it. It now appears to be safe to call.
Follow the source code in gpiolib_sysfs.c, and you'll notice that sysfs_notify_dirent() eventually calls schedule_work(), which defers the actual call to sysfs_notify(), which is exactly what the comments to your question are advising you to do. It's just wrapped inside the convenience function.
Recently I was looking through the kernel at kobjects and sysfs.
I know/understand the following:
All kernel objects use addresses > 0x80000000
kobjects should be no exception to this rule
The sysfs is nothing but a hierarchy of kobjects (it may include ksets and other k* stuff; I'm not sure)
Given this information, I'm not sure I understand exactly what happens when I run echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
I can see that the cpufreq module has a function called store_scaling_governor which handles writes to this 'file', but how does user mode cross into kernel mode with this simple echo?
When you execute the command echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor, your shell calls the write system call, and the kernel then dispatches it to the corresponding handler.
cpufreq sets up struct kobj_type ktype_cpufreq with sysfs_ops, then registers it in cpufreq_add_dev_interface(). After that, the kernel can find the corresponding handler to execute on a write syscall.
I can tell you about one implementation I have used for accessing kernel-space variables through sysfs (from user space, at the shell prompt). Basically, each variable exposed to user space in the sys file system appears as a separate file under /sys/. When you issue echo value > /sys/file-path at the shell prompt (user space), the method that gets called in kernel space is the .store method. Likewise, when you issue cat /sys/file-path, the method that gets called in the kernel is .show. You can find more information here: http://lwn.net/Articles/31220/
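Seen from user space, the echo above amounts to one open() and one write() on the attribute file; the write() system call is the point where execution enters the kernel and reaches the .store method. A minimal sketch of what the shell does, with write_attr as a hypothetical helper name and an arbitrary 256-byte limit chosen for illustration:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: do roughly what `echo value > path` does,
 * i.e. open the file for writing and issue a single write() of
 * the value followed by a newline.
 * Returns 0 on success, -1 on failure.
 */
int write_attr(const char *path, const char *value)
{
    char buf[256];
    int len = snprintf(buf, sizeof(buf), "%s\n", value);

    if (len < 0 || (size_t)len >= sizeof(buf))
        return -1;                      /* value too long for this sketch */

    /* The shell's `>` redirection opens with these flags. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* For a sysfs file, this write() ends up in the attribute's .store. */
    ssize_t n = write(fd, buf, (size_t)len);
    int rc = close(fd);

    return (n == len && rc == 0) ? 0 : -1;
}
```

For example, write_attr("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "ondemand") would mirror the command from the question, given sufficient privileges.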