who creates map in BPF - linux-kernel

After reading man bpf and a few other sources of documentation, I was under impression that a map can be only created by user process. However the following small program seems to magically create bpf map:
struct bpf_map_def SEC("maps") my_map = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(u32),
.value_size = sizeof(long),
.max_entries = 10,
};
SEC("sockops")
int my_prog(struct bpf_sock_ops *skops)
{
u32 key = 1;
long *value;
...
value = bpf_map_lookup_elem(&my_map, &key);
...
return 1;
}
So I load the program with the kernel's tools/bpf/bpftool and also verify that program is loaded:
$ bpftool prog show
1: sock_ops name my_prog tag f3a3583cdd82ae8d
loaded_at Jan 02/18:46 uid 0
xlated 728B not jited memlock 4096B
$ bpftool map show
1: array name my_map flags 0x0
key 4B value 8B max_entries 10 memlock 4096B
Of course the map is empty. However, removing bpf_map_lookup_elem from the program results in no map being created.
UPDATE
I debugged it with strace and found that in both cases, i.e. with bpf_map_lookup_elem and without it, bpftool does invoke bpf(BPF_MAP_CREATE, ...) and it apparently succeeds. Then, in case of bpf_map_lookup_elem left out, I strace on bpftool map show, and bpf(BPF_MAP_GET_NEXT_ID, ..) immediately returns ENOENT, and it never gets to dump a map. So obviously something is not completing the map creation.
So I wonder if this is expected behavior?
Thanks.

As explained by antiduh, and confirmed with your strace checks, bpftool is the user space program creating the maps in this case. It calls function bpf_prog_load() from libbpf (under tools/lib/bpf/), which in turn ends up performing the syscall. Then the program is pinned at the desired location (under a bpf virtual file system mount point), so that it is not unloaded when bpftool returns. Maps are not pinned.
Regarding map creation, the magic bits also take place in libbpf. When bpf_prog_load() is called, libbpf receives the name of the object file as an argument. bpftool does not ask to load this specific program or that specific map; instead, it provides the object file and libbpf has to deal with it. So the functions in libbpf parse this ELF object file, and eventually find a number of sections corresponding to maps and programs. Then it tries to load the first program.
Loading this program includes the following steps:
CHECK_ERR(bpf_object__create_maps(obj), err, out);
CHECK_ERR(bpf_object__relocate(obj), err, out);
CHECK_ERR(bpf_object__load_progs(obj), err, out);
In other words: start by creating all maps we found in the object file. Then perform map relocation (i.e. associate map index to eBPF instructions), and at last load program instructions.
So regarding your question: in both cases, with and without bpf_map_lookup_elem(), maps are created with a bpf(BPF_MAP_CREATE, ...) syscall. After that, relocation happens, and program instructions are adapted to point, if needed, to the newly created maps. Then once all steps are finished and the program is loaded, bpftool exits. The eBPF program should be pinned, and still loaded in the kernel. As far as I understand, if it does use the maps (if bpf_map_lookup_elem() was used), then maps are still referenced by a loaded program, and are kept in the kernel. On the other hand, if the program does not use the maps, then there is nothing more to hold them back, so the maps are destroyed when the file descriptors held by bpftool are closed, when bpftool returns.
So in the end, when bpftool has completed, you have a map loaded in the kernel if the program uses it, but no map if no program would rely on it. Sounds like expected behaviour in my opinion; but please do ping one way or another if you experience strange things with bpftool, I'm one of the guys working on the utility. One last generic observation: maps can also be pinned and remain in the kernel even if no program uses them, should one need to keep them around.

I was under impression that a map can be only created by user process.
You're completely right - user programs are the ones that invoke the bpf system call in order to load eBPF programs and create eBPF maps.
And you did just that:
So I load the program with tools/bpf/bpftool and ...
Your bpftool program is the user process that is invoking the bpf syscall, and thus is the user process that is creating the eBPF map.
BPF programs don't have to be unloaded when the user program that created it quits - bpftool likely uses this mechanism.
Some relevant bits from the man page to connect the dots:
A user process can create multiple maps ... and access them via file descriptors.
Generally, eBPF programs are loaded by the user process and automatically unloaded when the process exits. In some cases ... the program will continue to stay alive inside the kernel even after the process that loaded the program exits.
Each eBPF program is a set of instructions that is safe to run until its completion. ... During verification, the kernel increments reference counts for each of the maps that the eBPF program uses, so that the attached maps can't be removed until the program is unloaded.

Related

How can an ebpf program change kernel execution flow or call kernel functions?

I'm trying to figure out how an ebpf program can change the outcome of a function (not a syscall, in my case) in kernel space. I've found numerous articles and blog posts about how ebpf turns the kernel into a programmable kernel, but it seems like every example is just read-only tracing and collecting statistics.
I can think of a few ways of doing this: 1) make a kernel application read memory from an ebpf program, 2) make ebpf change the return value of a function, 3) allow an ebpf program to call kernel functions.
The first approach does not seem like a good idea.
The second would be enough, but as far as I understand it's not easy. This question says syscalls are read-only. This bcc document says it is possible but the function needs to be whitelisted in the kernel. This makes me think that the whitelist is fixed and can only be changed by recompiling the kernel, is this correct?
The third seems to be the most flexible one, and this blog post encouraged me to look into it. This is the one I'm going for.
I started with a brand new 5.15 kernel, which should have this functionality
As the blog post says, I did something no one should do (security is not an issue since I'm just toying with this) and opened every function to ebpf by adding this to net/core/filter.c (which I'm not sure is the correct place to do so):
static bool accept_the_world(int off, int size,
enum bpf_access_type type,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info)
{
return true;
}
bool export_the_world(u32 kfunc_id)
{
return true;
}
const struct bpf_verifier_ops all_verifier_ops = {
.check_kfunc_call = export_the_world,
.is_valid_access = accept_the_world,
};
How does the kernel know of the existence of this struct? I don't know. None of the other bpf_verifier_ops declared are used anywhere else, so it doesn't seem like there is a register_bpf_ops
Next I was able to install bcc (after a long fight due to many broken installation guides).
I had to checkout v0.24 of bcc. I read somewhere that pahole is required when compiling the kernel, so I updated mine to v1.19.
My python file is super simple, I just copied the vfs example from bcc and simplified it:
bpf_text_kfunc = """
extern void hello_test_kfunc(void) __attribute__((section(".ksyms")));
KFUNC_PROBE(vfs_open)
{
stats_increment(S_OPEN);
hello_test_kfunc();
return 0;
}
"""
b = BPF(text=bpf_text_kfunc)
Where hello_test_kfunc is just a function that does a printk, inserted as a module into the kernel (it is present in kallsyms).
When I try to run it, I get:
/virtual/main.c:25:5: error: cannot call non-static helper function
hello_test_kfunc();
^
And this is where I'm stuck. It seems like it's the JIT that is not allowing this, but who exactly is causing this issue? BCC, libbpf or something else? Do I need to manually write bpf code to call kernel functions?
Does anyone have an example with code of what the lwn blog post I linked talks about actually working?
eBPF is fundamentally made to extend kernel functionality in very specific limited ways. Essentially a very advanced plugin system. One of the main design principles of the eBPF is that a program is not allowed to break the kernel. Therefor it is not possible to change to outcome of arbitrary kernel functions.
The kernel has facilities to call a eBPF program at any time the kernel wants and then use the return value or side effects from helper calls to effect something. The key here is that the kernel always knows it is doing this.
One sort of exception is the BPF_PROG_TYPE_STRUCT_OPS program type which can be used to replace function pointers in whitelisted structures.
But again, explicitly allowed by the kernel.
make a kernel application read memory from an ebpf program
This is not possible since the memory of an eBPF program is ephemaral, but you could define your own custom eBPF program type and pass in some memory to be modified to the eBPF program via a custom context type.
make ebpf change the return value of a function
Not possible unless you explicitly call a eBPF program from that function.
allow an ebpf program to call kernel functions.
While possible for a number for purposes, this typically doesn't give you the ability to change return values of arbitrary functions.
You are correct, certain program types are allowed to call some kernel functions. But these are again whitelisted as you discovered.
How does the kernel know of the existence of this struct?
Macro magic. The verifier builds a list of these structs. But only if the program type exists in the list of program types.
/virtual/main.c:25:5: error: cannot call non-static helper function
This seems to be a limitation of BCC, so if you want to play with this stuff you will likely have to manually compile your eBPF program and load it with libbpf or cilium/ebpf.

Why does WriteProcessMemory need the handle value passed in, not the ID of the target process?

In the Windows system, we can modify the memory of another process across processes. For example, if process A wants to modify the memory of process B, A can call the system function WriteProcessMemory. The approximate form is as follows:
BOOL flag = WriteProcessMemory(handler, p_B_addr, &p_A_buff, write_size); ...
This function return a Boolean value, which represents whether the write operation is successful. It needs to pass four parameters, let's take a look at these four parameters:
handler. This is a process handle, and it can be used to find B process.
p_B_addr. In process B, the address offset to be written into memory.
p_A_buff. In process A, the pointer to the write data buffer.
write_size. The number of bytes to write.
I am confused about the first parameter handler, which is a variable of type HANDLE. For example, when our program is actually running, the ID of process B is 2680, and then I want to write memory to process B. First I need to use this 2680 to get the handle of process B in process A. The specific form is handler=OpenProcess(PROCESS_ALL_ACCESS, FALSE, 2680), and then you can use this handler to fall into the kernel to modify the memory of process B.
Since they are all trapped in kernel functions to modify memory across processes, why is the WriteProcessMemory function not designed to be in the form of WriteProcessMemory(B_procID, p_B_addr, &p_A_buff, write_size)?
Among them, B_procID is the ID of the B process, since each process they all have unique IDs. Can the system kernel not find the physical address that the virtual address of the B process can map through this B_procID? Why must the process handle index of the B process in the A process be passed in?
There are multiple reasons, all touched on in the comments.
Lifetime. The process id is simply a number, knowing the id does not keep the process alive. Having a open handle to a process means the kernel EPROCESS structure and the process address space will stay intact, even if said process finishes by calling ExitProcess. Windows tries to not re-use the id for a new process right away but it will happen some time in the future given enough time.
Security/Access control. In Windows NT, access control is performed when you open a object, not each time you interact with the object. In this case, the kernel needs to know that the caller has PROCESS_VM_WRITE and PROCESS_VM_OPERATION access to the process. This is related to point 3, efficiency.
Speed. Windows could of course implement a WriteProcessMemoryById function that calls OpenProcess+WriteProcessMemory+CloseHandle but this encourages sub optimal design as well as opening you up to race conditions related to point 1. The same applies to "why is there no WriteFileByFilename function" (and all other Read/Write functions).

Map sharing between different ebpf program types

Is it possible to share ebpf maps between different program types. I need to share a map between a tc-bpf program and a cgroup bpf program. This should be possible if the map is pinned to file system that act as global namespace. But, I haven't got this working.
The map is created by tc-bpf program and pinned to global namespace. Since it is tc-bpf program, the map is of type struct bpf_elf_map. This bpf program is loaded via iproute2.
Now, I have a cgroup bpf program that should be accessing this map, but since it is loaded via a user.c (libbpf) or bpftool and not iproute, the map that is defined here cannot be ‘bpf_elf_map’, but it is struct bpf_map_def. So in the cgroup bpf program, the same map is defined as struct bpf_map_def and not struct bpf_elf_map.
Probably because of this the cgroup program gets a new map_id when I dump the maps (and does not share the intended map), ideally when the same map is shared across bpf programs, these bpf programs would be having the same map_id associated with their unique prog_ids.
It is possible to share access to eBPF maps between programs of different types.
First, you can forget about those differences between struct bpf_elf_map and struct bpf_map_def. They are structs used in user space to build the objects to pass to the kernel. Iproute2 and libbpf may not use the same struct/attribute names, but they both end up passing the map metadata to the bpf() system call in the same format, or the kernel would not understand what map is to be created.
When they are loaded to the kernel, eBPF programs refer to a given map through file descriptors to this map. This means that the process calling the bpf() system call to load the program first has to retrieve file descriptors to the map to use. So the two following cases may happen:
User space application (ip, tc, bpftool...) parses the ELF object file and collects map-related metadata. It does not identify (and possibly did not even try to identify) any existing map that should be reused for the program. So it creates a new map with the bpf() syscall, which returns a file descriptor to this newly-created map. This file descriptor is used in the program instructions referring to map access (once the program is loaded in the kernel, those file descriptors will be replaced by the map address), and the program is then loaded with the bpf() syscall. This is what happens with your tc program, and in your case it seems, with your cgroup program which is creating a second map.
Or user application parses the ELF object file, and finds somehow that there is already an existing map that the program should use. For example, it finds a map of id 1337, or a pinned map under /sys/fs/bpf/. In that case it retrieves a file descriptor to that map (from the id with bpf() syscall, from a pinned path with open()). Then as in the first case, it uses this file descriptor to prepare and then load the program.
Libbpf provides a way to reuse a file descriptor for a given map to use with a program. See for example bpf_map__reuse_fd(). Bpftool uses it to support reusing existing maps, with the map argument for bpftool prog load (see man bpftool-prog). For example, load a program from foo.o and tell it to reuse map of id 27 for the first map found in the object file, then the map pinned at /sys/fs/bpf/foomap for the map named foomap in the object file:
# bpftool prog load foo.o /sys/fs/bpf/foo_prog \
map idx 0 id 27 \
map foomap stats pinned /sys/fs/bpf/foomap

why need linker script and startup code?

I've read this tutorial
I could follow the guide and run the code. but I have questions.
1) Why do we need both load-address and run-time address. As I understand it is because we have put .data at flash too; so why we don't run app there, but need start-up code to copy it into RAM?
http://www.bravegnu.org/gnu-eprog/c-startup.html
2) Why we need linker script and start-up code here. Can I not just build C source as below and run it with qemu?
arm-none-eabi-gcc -nostdlib -o sum_array.elf sum_array.c
Many thanks
Your first question was answered in the guide.
When you load a program on an operating system your .data section, basically non-zero globals, are loaded from the "binary" into the right offset in memory for you, so that when your program starts those memory locations that represent your variables have those values.
unsigned int x=5;
unsigned int y;
As a C programmer you write the above code and you expect x to be 5 when you first start using it yes? Well, if are booting from flash, bare metal, you dont have an operating system to copy that value into ram for you, somebody has to do it. Further all of the .data stuff has to be in flash, that number 5 has to be somewhere in flash so that it can be copied to ram. So you need a flash address for it and a ram address for it. Two addresses for the same thing.
And that begins to answer your second question, for every line of C code you write you assume things like for example that any function can call any other function. You would like to be able to call functions yes? And you would like to be able to have local variables, and you would like the variable x above to be 5 and you might assume that y will be zero, although, thankfully, compilers are starting to warn about that. The startup code at a minimum for generic C sets up the stack pointer, which allows you to call other functions and have local variables and have functions more than one or two lines of code long, it zeros the .bss so that the y variable above is zero and it copies the value 5 over to ram so that x is ready to go when the code your entry point C function is run.
If you dont have an operating system then you have to have code to do this, and yes, there are many many many sandboxes and toolchains that are setup for various platforms that already have the startup and linker script so that you can just
gcc -O myprog.elf myprog.c
Now that doesnt mean you can make system calls without a...system...printf, fopen, etc. But if you download one of these toolchains it does mean that you dont actually have to write the linker script nor the bootstrap.
But it is still valuable information, note that the startup code and linker script are required for operating system based programs too, it is just that native compilers for your operating system assume you are going to mostly write programs for that operating system, and as a result they provide a linker script and startup code in that toolchain.
1) The .data section contains variables. Variables are, well, variable -- they change at run time. The variables need to be in RAM so that they can be easily changed at run time. Flash, unlike RAM, is not easily changed at run time. The flash contains the initial values of the variables in the .data section. The startup code copies the .data section from flash to RAM to initialize the run-time variables in RAM.
2) Linker-script: The object code created by your compiler has not been located into the microcontroller's memory map. This is the job of the linker and that is why you need a linker script. The linker script is input to the linker and provides some instructions on the location and extent of the system's memory.
Startup code: Your C program that begins at main does not run in a vacuum but makes some assumptions about the environment. For example, it assumes that the initialized variables are already initialized before main executes. The startup code is necessary to put in place all the things that are assumed to be in place when main executes (i.e., the "run-time environment"). The stack pointer is another example of something that gets initialized in the startup code, before main executes. And if you are using C++ then the constructors of static objects are called from the startup code, before main executes.
1) Why do we need both load-address and run-time address.
While it is in most cases possible to run code from memory mapped ROM, often code will execute faster from RAM. In some cases also there may be a much larger RAM that ROM and application code may compressed in ROM, so the executable code may not simply be copied from ROM also decompressed - allowing a much larger application than the available ROM.
In situations where the code is stored on non-memory mapped mass-storage media such as NAND flash, it cannot be executed directly in any case and must be loaded into RAM by some sort of bootloader.
2) Why we need linker script and start-up code here. Can I not just build C source as below and run it with qemu?
The linker script defines the memory layout of you target and application. Since this tutorial is for bare-metal programming, there is no OS to handle that for you. Similarly the start-up code is required to at least set an initial stack-pointer, initialise static data, and jump to main. On an embedded system it is also necessary to initialise various hardware such as the PLL, memory controllers etc.

Memory mapping of binary to VAS

When a new process is created the Address space is created using fork() i.e new page table entries are created for the new process which are exactly same as the parent process.
After fork() the exec() is called. What happens during the exec() system call?
I read in the book "Operating system concepts " that when a new program is executed, the process is given a new empty VAS. Does that mean that the page table entries created during fork() would get deleted/modifeid ? What is the meaning of empty VAS?
How does the memory mapping of binary to VAS is performed? How does the loader knows that what addresses of the VAS should be mapped to the corresponding binary file?
I am really confused here.
when you call exec the kernel will load the binary and set up a whole new set of page tables (replacing the old ones).
The loader gets the address to load the binary at from the binary itself (basically it does read() to get the headers and stuff that's not code, then mmap() to actually load the code/data stuff in the binary)
so it looks at the binary and figures out how it should be loaded, the does mmap(), passing in an address to do the map at for each part of the binary that needs to be in a different place (ie code and data sections are probably two different calls to mmap() also the .bss section would be mapped from /dev/zero)
Note that depending on the OS and the binary being loaded some of this stuff may be handled by the kernel directly or by a userspace loader (on UNIXish systems ld would be the loader, it handles shared object loading)

Resources