I finished reading https://www.redhat.com/en/blog/introduction-virtio-networking-and-vhost-net and couldn't fully understand it.
I understand how virtio works: how the host kernel can read/write to a virtqueue, and vice versa for the guest kernel.
I'm reading drivers/vhost/vhost.c and trying to understand what it does.
static int __init vhost_init(void)
{
    return 0;
}

static void __exit vhost_exit(void)
{
}

module_init(vhost_init);
module_exit(vhost_exit);
MODULE_VERSION("0.0.1");
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Michael S. Tsirkin");
MODULE_DESCRIPTION("Host kernel accelerator for virtio");
Its init function is empty, so I don't know how anything can interact with it.
All I know is that vhost-net (on the guest) talks to vhost (on the kernel).
According to the website:
vhost protocol - A protocol that allows the virtio dataplane
implementation to be offloaded to another element (user process or
kernel module) in order to enhance performance.
How can this dataplane implementation be offloaded? I don't see any way to interact with vhost.c module as the init function has nothing.
It has a vhost_init function, but the function does nothing except return 0. That means the vhost kernel module does not need to do anything during initialization. You can still interact with it by other means.
In the Linux kernel, one common way to interact with another part of the kernel or another kernel module is through exported symbols. In vhost.c you will see many exported symbols, e.g.
EXPORT_SYMBOL_GPL(vhost_vring_ioctl);
https://elixir.bootlin.com/linux/v5.15.3/source/drivers/vhost/vhost.c#L1718
Once a symbol is exported, other kernel code or kernel modules can call that function. If you search for vhost_vring_ioctl, you will see it is called from a few places in other files. That is how they interact with vhost.
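To make the chain concrete from the user-space end: modules such as vhost-net register a character device (normally /dev/vhost-net), and their ioctl handlers forward ring-related requests to exported helpers like vhost_vring_ioctl. The sketch below is only an illustration of that entry point, assuming the uapi header linux/vhost.h is installed; the helper name vhost_query_features is made up for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>

/*
 * Hypothetical helper: open a vhost character device and ask which
 * virtio features the in-kernel backend supports.
 * Returns 0 and fills *features on success, or a negative errno value.
 */
static int vhost_query_features(const char *dev_path, uint64_t *features)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return -errno;

    int ret = 0;
    /* The ioctl lands in the vhost-net module, not in vhost.c's init. */
    if (ioctl(fd, VHOST_GET_FEATURES, features) < 0)
        ret = -errno;

    close(fd);
    return ret;
}
```

A caller would pass "/dev/vhost-net" as dev_path. Ring setup requests such as VHOST_SET_VRING_NUM travel the same way and are the ones that end up in vhost_vring_ioctl.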
I'm trying to figure out how an ebpf program can change the outcome of a function (not a syscall, in my case) in kernel space. I've found numerous articles and blog posts about how ebpf turns the kernel into a programmable kernel, but it seems like every example is just read-only tracing and collecting statistics.
I can think of a few ways of doing this: 1) make a kernel application read memory from an ebpf program, 2) make ebpf change the return value of a function, 3) allow an ebpf program to call kernel functions.
The first approach does not seem like a good idea.
The second would be enough, but as far as I understand it's not easy. This question says syscalls are read-only. This bcc document says it is possible but the function needs to be whitelisted in the kernel. This makes me think that the whitelist is fixed and can only be changed by recompiling the kernel. Is this correct?
The third seems to be the most flexible one, and this blog post encouraged me to look into it. This is the one I'm going for.
I started with a brand new 5.15 kernel, which should have this functionality.
As the blog post says, I did something no one should do (security is not an issue since I'm just toying with this) and opened every function to ebpf by adding this to net/core/filter.c (which I'm not sure is the correct place to do so):
static bool accept_the_world(int off, int size,
                             enum bpf_access_type type,
                             const struct bpf_prog *prog,
                             struct bpf_insn_access_aux *info)
{
    return true;
}

bool export_the_world(u32 kfunc_id)
{
    return true;
}

const struct bpf_verifier_ops all_verifier_ops = {
    .check_kfunc_call = export_the_world,
    .is_valid_access = accept_the_world,
};
How does the kernel know of the existence of this struct? I don't know. None of the other bpf_verifier_ops declared are used anywhere else, so it doesn't seem like there is a register_bpf_ops.
Next I was able to install bcc (after a long fight due to many broken installation guides).
I had to checkout v0.24 of bcc. I read somewhere that pahole is required when compiling the kernel, so I updated mine to v1.19.
My python file is super simple, I just copied the vfs example from bcc and simplified it:
from bcc import BPF

bpf_text_kfunc = """
extern void hello_test_kfunc(void) __attribute__((section(".ksyms")));
KFUNC_PROBE(vfs_open)
{
    stats_increment(S_OPEN);
    hello_test_kfunc();
    return 0;
}
"""
b = BPF(text=bpf_text_kfunc)
Where hello_test_kfunc is just a function that does a printk, inserted as a module into the kernel (it is present in kallsyms).
When I try to run it, I get:
/virtual/main.c:25:5: error: cannot call non-static helper function
hello_test_kfunc();
^
And this is where I'm stuck. It seems like it's the JIT that is not allowing this, but who exactly is causing this issue? BCC, libbpf or something else? Do I need to manually write bpf code to call kernel functions?
Does anyone have an example with code of what the lwn blog post I linked talks about actually working?
eBPF is fundamentally made to extend kernel functionality in very specific, limited ways. It is essentially a very advanced plugin system. One of the main design principles of eBPF is that a program is not allowed to break the kernel. Therefore it is not possible to change the outcome of arbitrary kernel functions.
The kernel has facilities to call an eBPF program at any time the kernel wants and then use the return value, or the side effects of helper calls, to affect something. The key here is that the kernel always knows it is doing this.
One sort of exception is the BPF_PROG_TYPE_STRUCT_OPS program type, which can be used to replace function pointers in whitelisted structures.
But again, explicitly allowed by the kernel.
make a kernel application read memory from an ebpf program
This is not possible since the memory of an eBPF program is ephemeral, but you could define your own custom eBPF program type and pass some memory to be modified into the eBPF program via a custom context type.
make ebpf change the return value of a function
Not possible unless you explicitly call an eBPF program from that function.
allow an ebpf program to call kernel functions.
While possible for a number of purposes, this typically doesn't give you the ability to change the return values of arbitrary functions.
You are correct, certain program types are allowed to call some kernel functions. But these are again whitelisted as you discovered.
How does the kernel know of the existence of this struct?
Macro magic. The verifier builds a list of these structs. But only if the program type exists in the list of program types.
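The "macro magic" can be pictured with a small user-space model. As far as I know, the kernel keeps its list of program types in include/linux/bpf_types.h and expands it in kernel/bpf/verifier.c; the sketch below is not that code, only the same X-macro technique with made-up names (PROG_TYPE_LIST, lookup_ops and so on are assumptions for illustration).

```c
#include <stddef.h>
#include <string.h>

struct verifier_ops { const char *name; };

/* Hypothetical list of program types; one line per registered type. */
#define PROG_TYPE_LIST(X)                 \
    X(PROG_SOCKET_FILTER, sk_filter)      \
    X(PROG_KPROBE, kprobe)

/* First expansion: build the enum of program type ids. */
enum prog_type {
#define X(id, name) id,
    PROG_TYPE_LIST(X)
#undef X
    PROG_TYPE_MAX
};

/* Second expansion: define one <name>_verifier_ops object per type. */
#define X(id, name) \
    static const struct verifier_ops name##_verifier_ops = { #name };
PROG_TYPE_LIST(X)
#undef X

/* Third expansion: the lookup table that the "verifier" consults. */
static const struct verifier_ops *const verifier_ops_table[PROG_TYPE_MAX] = {
#define X(id, name) [id] = &name##_verifier_ops,
    PROG_TYPE_LIST(X)
#undef X
};

/* Defined, but not in the list, so no table entry ever points to it. */
static const struct verifier_ops all_verifier_ops = { "all" };

static const struct verifier_ops *lookup_ops(int type)
{
    if (type < 0 || type >= PROG_TYPE_MAX)
        return NULL;
    return verifier_ops_table[type];
}
```

In this model, all_verifier_ops plays the role of the struct added in the question: it compiles fine, but nothing can reach it until a matching entry is added to the list.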
/virtual/main.c:25:5: error: cannot call non-static helper function
This seems to be a limitation of BCC, so if you want to play with this stuff you will likely have to manually compile your eBPF program and load it with libbpf or cilium/ebpf.
I'm reading the gpiolib.c code in the linux kernel to understand how the GPIO driver works. But I didn't find any definition of "trace_gpio_value" function.
trace_gpio_value(desc_to_gpio(desc), 0, value);
Can anybody help me find the definition of trace_gpio_value?
trace_gpio_value(), and the trace_*() functions in general, are used by the kernel's ftrace static tracing facility to monitor internals such as GPIO or networking, for debugging and other purposes. These static tracepoints cause the collected data to be stored in a kernel buffer, whose contents you can inspect through the tracing virtual file system mounted at /sys/kernel/tracing if the Kconfig option CONFIG_FTRACE=y is set. In a nutshell, these tracepoints act as hooks that call the tracing functions you provide.
As for the actual definition and how it works: a tracepoint is declared with DECLARE_TRACE() in a header file. In your case, check include/trace/events/gpio.h, which specifies the tracing function and how it works:
#include <linux/tracepoint.h>

/* note the NAME (first parameter) */
TRACE_EVENT(gpio_value,
    ...<SNIP>...
);
NOTE: TRACE_EVENT() is a macro that gets expanded to DECLARE_TRACE().
Then, in your C file, you call trace_gpio_value() wherever you want to record the GPIO value, or the other trace_*() functions for other purposes.
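The reason a search for the function body finds nothing is that the name is pasted together by the preprocessor. Below is a heavily simplified user-space model of that idea. MY_TRACE_EVENT, MY_PROTO and MY_ARGS are stand-ins invented for this sketch, not the kernel's real macros, which additionally handle static keys, multiple probes and the ring buffer.

```c
#include <stddef.h>

/* Wrappers that let a comma-separated list travel as one macro argument. */
#define MY_PROTO(...) __VA_ARGS__
#define MY_ARGS(...)  __VA_ARGS__

/*
 * Simplified model: generates trace_<name>() and register_trace_<name>().
 * trace_<name>() does nothing until a probe has been registered.
 */
#define MY_TRACE_EVENT(name, proto, args)                     \
    static void (*probe_##name)(proto);                       \
    static void register_trace_##name(void (*probe)(proto))   \
    {                                                         \
        probe_##name = probe;                                 \
    }                                                         \
    static void trace_##name(proto)                           \
    {                                                         \
        if (probe_##name)                                     \
            probe_##name(args);                               \
    }

/* This single line is the only place where "gpio_value" is spelled out. */
MY_TRACE_EVENT(gpio_value,
               MY_PROTO(unsigned gpio, int get, int value),
               MY_ARGS(gpio, get, value))

/* Example probe that records the most recent event. */
static unsigned last_gpio;
static int last_value;
static int probe_hits;

static void record_gpio_value(unsigned gpio, int get, int value)
{
    (void)get;
    last_gpio = gpio;
    last_value = value;
    probe_hits++;
}
```

After register_trace_gpio_value(record_gpio_value), every trace_gpio_value(gpio, 0, value) call ends up in the probe; before that, the call is a no-op.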
I have created a kernel module. Within the module I have defined some functions, say function1(int n) and function2().
There was no error in compiling and inserting the module. What I don't understand is how to call function1(n) and function2() from a user space program.
I think there is no direct way to do it; you can't link userspace code against the kernel like you do with a library. First, you have to register your function as a syscall and then call that syscall with the syscall() function.
See here
An interface between kernel and user space is also possible using socket communication; see
this link
You can also find useful links related to this topic on the right side of the page.
You can make your driver react to writes to a /dev/file file or a /proc/file file.
EDIT
By "file" I mean that the kernel exposes the device as a file, and you can access it via ioctl().
A pretty good explanation is http://tldp.org/LDP/lkmpg/2.6/html/lkmpg.html#AEN885
See Link
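As a rough sketch of the ioctl() route from the user-space side: the device path, the magic number and the command numbers below are all hypothetical, and the module would have to register a character device and define exactly the same values in its own ioctl handler.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ioctl.h>

/* Hypothetical command numbers; they must match the module's header. */
#define MYMOD_IOC_MAGIC 'k'
#define MYMOD_FUNCTION1 _IOW(MYMOD_IOC_MAGIC, 1, int)
#define MYMOD_FUNCTION2 _IO(MYMOD_IOC_MAGIC, 2)

/*
 * Ask the module behind dev_path to run its function1(n).
 * Returns the ioctl result, or a negative errno value on failure.
 */
static int call_function1(const char *dev_path, int n)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return -errno;

    int ret = ioctl(fd, MYMOD_FUNCTION1, &n);
    if (ret < 0)
        ret = -errno;

    close(fd);
    return ret;
}
```

A call such as call_function1("/dev/mymodule", 42) would then reach the module's ioctl handler, which decodes MYMOD_FUNCTION1, copies n from user space and calls function1(n) itself.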
I have a little driver that handles a GPIO which, when enabled, should tell the system to sleep/wake when a button is pressed. If it's held down, it should power off.
On WinCE there is a very easy-to-use mechanism (SetSystemPowerState), but there doesn't appear to be something similar on Linux.
We also don't have dbus...
update:
I may have found the answer
Shutdown (embedded) linux from kernel-space
Though it doesn't really say how to sleep, I think I'll be able to figure the rest out. This doesn't seem like the proper way to handle a Linux kernel driver, since the module is built into the kernel. It doesn't appear that I have all the power states available to switch to without adding packages outside of the kernel.
If you want to suspend your system entirely, you can use the /sys/power/state interface as shown below.
echo "mem" > /sys/power/state
This calls the state_store() function in kernel/power/main.c to suspend your system to memory. Instead of "mem", you can use "standby" or "disk", but only if your system supports them.
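If a small user-space daemon is an option, the same write can be done from C without a shell. This is only a sketch; the helper name request_power_state is made up here, and the path is passed in so the function can be tried against an ordinary file first.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a power state name such as "mem" to a sysfs state file.
 * Returns 0 on success or a negative errno value on failure.
 */
static int request_power_state(const char *state_path, const char *state)
{
    int fd = open(state_path, O_WRONLY);
    if (fd < 0)
        return -errno;

    size_t len = strlen(state);
    ssize_t written = write(fd, state, len);
    int ret = 0;
    if (written < 0)
        ret = -errno;      /* e.g. the state is not supported */
    else if ((size_t)written != len)
        ret = -EIO;        /* short write */

    close(fd);
    return ret;
}
```

On a real system the call would be request_power_state("/sys/power/state", "mem"), which needs root privileges and does not return until the system resumes.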
The most general way would be to initiate the process from kernel space through a userspace helper. Note that echo on its own cannot redirect its output, so the write to /sys/power/state has to go through a shell:
static char *set_power_argv[] =
    { "/bin/sh", "-c", "echo mem > /sys/power/state", NULL };
call_usermodehelper(set_power_argv[0], set_power_argv, NULL, UMH_NO_WAIT);
However, the location of the shell and of the power state file can differ on your system.
I am interested in developing a kernel module that binds two block devices into a new block device in such a manner that the first block device contains data at mount time, and the other is considered empty. Every write is made to the second partition, so on the next mount the base filesystem remains unchanged. I know of solutions like UnionFS, but those are filesystem-based, while I want to develop it a layer lower, block-based.
Can anyone tell me how I could open and read/write a block device from a kernel module? Possibly without using a userspace program for reading/writing the merged block devices. I found a similar topic here, but the answer was rather unsatisfying because the filp_* functions are meant for reading small config files, not for (large) block device I/O.
Since the interface for creating block devices is standardized, I was thinking of direct (or almost direct) access to the functions implementing the source devices, as I will be required to export similar functions anyway. If I could do that, I would simply create some proxy functions calling the appropriate functions on the source devices. Can I somehow obtain a pointer to a gendisk structure that belongs to a different driver?
This serves only my own purposes (satisfying curiosity being the main one), so I am not worried about seriously messing up my kernel.
Or does somebody know if a module like that already exists?
The source code in the device mapper driver will suit your needs. Look at the code in the Linux source in Linux/drivers/md/dm-*.
You don't need to access the other device's gendisk structure, but rather its request queue. You can prepare I/O requests and push them down the other device's queue, and it will do the rest itself.
I have implemented a simple block device that opens another block device. Take a look in my post describing it:
stackbd: Stacking a block device over another block device
Here are some examples of functions that you need for accessing another device's gendisk.
The way to open another block device using its path ("/dev/"):
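/* Written against an older kernel block-layer API. */
struct block_device *bdev_raw = lookup_bdev(dev_path);

if (IS_ERR(bdev_raw))
{
    printk("stackbd: error opening raw device <%ld>\n", PTR_ERR(bdev_raw));
    return NULL;
}

/* Take a reference on the block device. */
if (!bdget(bdev_raw->bd_dev))
{
    printk("stackbd: error bdget()\n");
    return NULL;
}

/* Open the device, with &stackbd as the holder. */
if (blkdev_get(bdev_raw, STACKBD_BDEV_MODE, &stackbd))
{
    printk("stackbd: error blkdev_get()\n");
    bdput(bdev_raw);
    return NULL;
}

/* Report success only after every step has succeeded. */
printk("Opened %s\n", dev_path);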
The simplest example of passing an I/O request from one device to another is remapping it without modifying it. Notice in the following code that the bi_bdev entry is overwritten with a different device. One can also modify the block address (bi_sector) and the data itself.
static void stackbd_io_fn(struct bio *bio)
{
    bio->bi_bdev = stackbd.bdev_raw;

    trace_block_bio_remap(bdev_get_queue(stackbd.bdev_raw), bio,
            bio->bi_bdev->bd_dev, bio->bi_sector);

    /* No need to call bio_endio() */
    generic_make_request(bio);
}
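For the overlay described in the question, the interesting part is the decision of which device a request is remapped to. The sketch below models only that decision in user space, with in-memory arrays standing in for the two block devices. It is not kernel code, and every name in it (overlay_dev, overlay_read, overlay_write, the sizes) is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS 8
#define BLOCK_SIZE 16

/* In-memory stand-ins for the two underlying block devices. */
struct overlay_dev {
    uint8_t base[NUM_BLOCKS][BLOCK_SIZE];    /* first device, never written */
    uint8_t overlay[NUM_BLOCKS][BLOCK_SIZE]; /* second device, takes writes */
    bool    written[NUM_BLOCKS];             /* blocks that live in overlay */
};

/* Every write is redirected to the second device. */
static int overlay_write(struct overlay_dev *d, unsigned blk,
                         const uint8_t *buf)
{
    if (blk >= NUM_BLOCKS)
        return -1;
    memcpy(d->overlay[blk], buf, BLOCK_SIZE);
    d->written[blk] = true;
    return 0;
}

/* Reads come from the overlay if the block was written, else from base. */
static int overlay_read(const struct overlay_dev *d, unsigned blk,
                        uint8_t *buf)
{
    if (blk >= NUM_BLOCKS)
        return -1;
    memcpy(buf, d->written[blk] ? d->overlay[blk] : d->base[blk],
           BLOCK_SIZE);
    return 0;
}
```

In a real driver the same choice would select the target device in the remap function, and the written map would have to be stored on the second device so that it survives a remount.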
Consider examining the code for the dm / md block devices in drivers/md - these existing drivers create a block device that stores data on other block devices.
In fact, you could probably implement your idea as another "RAID personality" in md, and thereby make use of the existing userspace tools for setting up the devices.
You know, if you're a GPL'd kernel module you can just call open(), read(), write(), etc. from kernel mode right?
Of course this way has certain caveats including requiring forking from kernel mode to create a space for your handle to live.