I am writing a new syscall in which I want to send SIGSTOP, as kill(pid, SIGSTOP) would, to a process that I just created, in order to move it from the runqueue to the waitqueue. Then I can wake it up again using kill(pid, SIGCONT).
The problem is that kill is only usable from userspace. How can I send a signal from inside the kernel itself? Is there an equivalent function that can do so?
I found kill_pid, but I don't know which headers to include for it.
It seems you found the correct way to send a signal to a process from kernel space: kill_pid. It is also an exported symbol, which means it is available to kernel modules.
Looking at some usage examples in Elixir (the kernel source cross-referencer) shows which header file declares the symbol, so you should start by including linux/sched/signal.h, and repeat the process for any other dependencies you may have.
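A minimal sketch of how this could look inside the syscall, assuming a reasonably recent kernel where kill_pid() is declared in linux/sched/signal.h (on older kernels the declaration lived in linux/sched.h). The helper name my_signal_task is hypothetical; find_get_pid(), kill_pid() and put_pid() are the kernel functions I would expect to use here. This fragment only builds inside a kernel tree or module, so treat it as an illustration rather than tested code.

```c
#include <linux/errno.h>
#include <linux/pid.h>            /* find_get_pid(), put_pid() */
#include <linux/sched/signal.h>   /* kill_pid() */
#include <linux/signal.h>         /* SIGSTOP, SIGCONT */

/*
 * Hypothetical helper: send `sig` (e.g. SIGSTOP or SIGCONT) to the
 * process whose PID, as seen from the caller's PID namespace, is `nr`.
 */
static int my_signal_task(pid_t nr, int sig)
{
	struct pid *pid;
	int ret;

	pid = find_get_pid(nr);        /* takes a reference; NULL if no such PID */
	if (!pid)
		return -ESRCH;

	ret = kill_pid(pid, sig, 1);   /* priv = 1: signal is sent by the kernel */
	put_pid(pid);                  /* drop the reference taken above */

	return ret;
}
```

Calling my_signal_task(pid, SIGSTOP) and later my_signal_task(pid, SIGCONT) would then mirror the kill(pid, SIGSTOP) / kill(pid, SIGCONT) pair from the question.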
After reading man bpf and a few other sources of documentation, I was under the impression that a map can only be created by a user process. However, the following small program seems to magically create a bpf map:
struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_ARRAY,
    .key_size = sizeof(u32),
    .value_size = sizeof(long),
    .max_entries = 10,
};

SEC("sockops")
int my_prog(struct bpf_sock_ops *skops)
{
    u32 key = 1;
    long *value;
    ...
    value = bpf_map_lookup_elem(&my_map, &key);
    ...
    return 1;
}
So I load the program with the kernel's tools/bpf/bpftool and also verify that the program is loaded:
$ bpftool prog show
1: sock_ops name my_prog tag f3a3583cdd82ae8d
loaded_at Jan 02/18:46 uid 0
xlated 728B not jited memlock 4096B
$ bpftool map show
1: array name my_map flags 0x0
key 4B value 8B max_entries 10 memlock 4096B
Of course the map is empty. However, removing bpf_map_lookup_elem from the program results in no map being created.
UPDATE
I debugged it with strace and found that in both cases, i.e. with and without bpf_map_lookup_elem, bpftool does invoke bpf(BPF_MAP_CREATE, ...) and it apparently succeeds. Then, with bpf_map_lookup_elem left out, I ran strace on bpftool map show: bpf(BPF_MAP_GET_NEXT_ID, ..) immediately returns ENOENT, so it never gets to dump a map. So apparently something is not completing the map creation.
So I wonder if this is expected behavior?
Thanks.
As explained by antiduh, and confirmed with your strace checks, bpftool is the user space program creating the maps in this case. It calls function bpf_prog_load() from libbpf (under tools/lib/bpf/), which in turn ends up performing the syscall. Then the program is pinned at the desired location (under a bpf virtual file system mount point), so that it is not unloaded when bpftool returns. Maps are not pinned.
Regarding map creation, the magic bits also take place in libbpf. When bpf_prog_load() is called, libbpf receives the name of the object file as an argument. bpftool does not ask to load this specific program or that specific map; instead, it provides the object file and libbpf has to deal with it. So the functions in libbpf parse this ELF object file, and eventually find a number of sections corresponding to maps and programs. Then it tries to load the first program.
Loading this program includes the following steps:
CHECK_ERR(bpf_object__create_maps(obj), err, out);
CHECK_ERR(bpf_object__relocate(obj), err, out);
CHECK_ERR(bpf_object__load_progs(obj), err, out);
In other words: start by creating all the maps found in the object file. Then perform map relocation (i.e. patch the eBPF instructions that refer to a map so that they point to the map that was just created), and finally load the program instructions.
So regarding your question: in both cases, with and without bpf_map_lookup_elem(), maps are created with a bpf(BPF_MAP_CREATE, ...) syscall. After that, relocation happens, and program instructions are adapted to point, if needed, to the newly created maps. Then once all steps are finished and the program is loaded, bpftool exits. The eBPF program should be pinned, and still loaded in the kernel. As far as I understand, if it does use the maps (if bpf_map_lookup_elem() was used), then maps are still referenced by a loaded program, and are kept in the kernel. On the other hand, if the program does not use the maps, then there is nothing more to hold them back, so the maps are destroyed when the file descriptors held by bpftool are closed, when bpftool returns.
So in the end, when bpftool has completed, you have a map loaded in the kernel if the program uses it, but no map if no program would rely on it. Sounds like expected behaviour in my opinion; but please do ping one way or another if you experience strange things with bpftool, I'm one of the guys working on the utility. One last generic observation: maps can also be pinned and remain in the kernel even if no program uses them, should one need to keep them around.
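The lifetime reasoning above can be replayed with a toy user-space model. This is only a sketch of the reference-counting idea, not kernel or libbpf code: struct toy_map and all toy_* functions are made up for illustration, and the model assumes exactly one loader file descriptor and at most one program reference.

```c
#include <stdbool.h>

/* Toy stand-in for a kernel map object; NOT the kernel's struct bpf_map. */
struct toy_map {
    int refs;     /* holders: file descriptors and loaded programs */
    bool alive;   /* becomes false once the last reference is dropped */
};

static void toy_map_get(struct toy_map *m)
{
    m->refs++;
}

static void toy_map_put(struct toy_map *m)
{
    if (--m->refs == 0)
        m->alive = false;   /* last holder gone: the map is destroyed */
}

/*
 * Replays the sequence from the answer: the loader creates the map
 * (fd reference), optionally loads a program that uses it (program
 * reference), then exits, which closes its fd.
 * Returns true if the map is still alive afterwards.
 */
static bool toy_map_survives_loader_exit(bool program_uses_map)
{
    struct toy_map m = { .refs = 0, .alive = true };

    toy_map_get(&m);        /* BPF_MAP_CREATE: the loader holds an fd */
    if (program_uses_map)
        toy_map_get(&m);    /* the pinned program references the map */
    toy_map_put(&m);        /* the loader exits and its fd is closed */

    return m.alive;
}
```

With program_uses_map set, one reference is left after the loader exits and the map stays; without it, the count drops to zero and the map disappears, which matches what bpftool map show reports in the two cases.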
I was under the impression that a map can only be created by a user process.
You're completely right - user programs are the ones that invoke the bpf system call in order to load eBPF programs and create eBPF maps.
And you did just that:
So I load the program with tools/bpf/bpftool and ...
Your bpftool program is the user process that is invoking the bpf syscall, and thus is the user process that is creating the eBPF map.
BPF programs don't have to be unloaded when the user program that loaded them quits - bpftool likely uses this mechanism.
Some relevant bits from the man page to connect the dots:
A user process can create multiple maps ... and access them via file descriptors.
Generally, eBPF programs are loaded by the user process and automatically unloaded when the process exits. In some cases ... the program will continue to stay alive inside the kernel even after the process that loaded the program exits.
Each eBPF program is a set of instructions that is safe to run until its completion. ... During verification, the kernel increments reference counts for each of the maps that the eBPF program uses, so that the attached maps can't be removed until the program is unloaded.
Recently I was reading the source of leveldb; the source URL is https://leveldb.googlecode.com/files/leveldb-1.13.0.tar.gz
While reading db/db_impl.cc, I came across the following code:
mutex_.AssertHeld()
I followed it into the file port/port_posix.h, where I found the following:
void AssertHeld() { }
Then I grepped the source directory, but couldn't find any other implementation of AssertHeld().
So here is my question: what does mutex_.AssertHeld() do in db/db_impl.cc? Thanks.
As you have observed, it does nothing in the default implementation. The function seems to be a placeholder for checking whether the calling thread holds a mutex, optionally aborting if it doesn't. This would be equivalent to the normal asserts we use for variables, but applied to mutexes.
I think the reason it is not implemented yet is that pthread_mutex_t, which the default implementation uses, offers no equivalent lightweight function to assert that a thread holds a lock. Platforms that have that capability could fill in this implementation as part of the porting process. Searching online, I did find an implementation of this function in the Windows port of leveldb. I can see one way to implement it: a wrapper class over pthread_mutex_t that sets some sort of thread id variable to indicate which thread currently holds the mutex. It would have to be implemented carefully, though, given the race conditions that can arise.
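Here is a minimal sketch of that wrapper idea, written in C rather than as a C++ class. It is not leveldb code: the checked_mutex type, its functions and the thread_tag trick are all assumptions made for illustration. The owner field is an atomic so that a thread that does not hold the lock can read it without a data race, which is one way to deal with the races mentioned above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* The address of a thread-local object is unique among live threads. */
static _Thread_local char thread_tag;

static uintptr_t self_id(void)
{
    return (uintptr_t)&thread_tag;
}

/* Hypothetical wrapper: a pthread mutex that remembers its owner. */
struct checked_mutex {
    pthread_mutex_t mu;
    atomic_uintptr_t owner;   /* 0 when unlocked, else the holder's self_id() */
};

static void checked_mutex_init(struct checked_mutex *m)
{
    pthread_mutex_init(&m->mu, NULL);
    atomic_init(&m->owner, 0);
}

static void checked_mutex_lock(struct checked_mutex *m)
{
    pthread_mutex_lock(&m->mu);
    atomic_store(&m->owner, self_id());   /* record owner once we hold the lock */
}

static void checked_mutex_unlock(struct checked_mutex *m)
{
    atomic_store(&m->owner, 0);           /* clear owner before releasing */
    pthread_mutex_unlock(&m->mu);
}

/* True only if the calling thread currently holds the mutex. */
static bool checked_mutex_held_by_me(struct checked_mutex *m)
{
    return atomic_load(&m->owner) == self_id();
}

/* Counterpart of AssertHeld(): aborts unless the caller holds the mutex. */
static void checked_mutex_assert_held(struct checked_mutex *m)
{
    assert(checked_mutex_held_by_me(m));
}
```

The owner is only written while the mutex is held, so a thread that sees its own id there really is the holder. The cost is one extra atomic store on every lock and unlock, which may be why the default port leaves the check empty.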
Recently I was looking through the kernel at kobjects and sysfs.
I know/understand the following..
All kernel objects use addresses > 0x80000000
kobjects should be no exception to this rule
The sysfs is nothing but a hierarchy of kobjects (maybe includes ksets and other k* stuff..not sure)
Given this information, I'm not sure I understand exactly what happens when I run echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
I can see that the cpufreq module has a function called store_scaling_governor which handles writes to this 'file', but how does user mode cross into kernel mode with this simple echo?
When you execute the command echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor, your shell opens the file and calls the write system call; the kernel then dispatches the write to the corresponding handler.
cpufreq sets up struct kobj_type ktype_cpufreq with sysfs_ops and registers it in cpufreq_add_dev_interface(). After that, the kernel can look up the corresponding handler to execute on a write syscall.
I can describe one approach I have used for accessing kernel-space variables through sysfs from a shell prompt in user space. Basically, each variable (or set of variables) exposed to user space through the sys file system appears as a separate file under /sys/. When you issue echo value > /sys/file-path at the shell prompt, the method that gets called in kernel space is the attribute's .store method. Likewise, when you issue cat /sys/file-path, the method that gets called in the kernel is .show. You can find more information here: http://lwn.net/Articles/31220/
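To make the user-space half concrete, here is a small sketch of roughly what echo and cat boil down to: plain open, write and read system calls on the file. The helper names write_attr and read_attr are made up for this example; nothing in them is specific to sysfs, which is the point. The crossing into kernel mode happens inside write(2) and read(2), and the kernel then routes the call to .store or .show.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Roughly what `echo value > path` does: open the file for writing and
 * issue one write(2). On a sysfs attribute the kernel hands the buffer
 * to the attribute's .store method. Returns 0 on success, -1 on error.
 */
static int write_attr(const char *path, const char *value)
{
    size_t len = strlen(value);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, value, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/*
 * Roughly what `cat path` does: open and read(2). On a sysfs attribute
 * the data comes from the .show method. `size` must be at least 1.
 * Returns the number of bytes read (buf is NUL-terminated), or -1.
 */
static ssize_t read_attr(const char *path, char *buf, size_t size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, size - 1);
    close(fd);
    if (n < 0)
        return -1;

    buf[n] = '\0';
    return n;
}
```

Run as root, write_attr("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "ondemand\n") would be the equivalent of the echo command in the question (echo appends the newline itself).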
After doing some reading, I came to understand that adding a new syscall via an LKM has gotten harder in 2.6. It seems that the syscall table is no longer exported, which makes it hard (impossible?) to insert a new call at runtime.
The stuff I want to achieve is the following.
I have a kernel module which is doing a specific task.
This task depends on input which should be provided by a user land process.
This information needs to reach the module.
For this purpose I would introduce a new syscall which is implemented in the kernel module and callable from the user land process.
If I have to recompile the kernel in order to add my new syscall, I would also need to write the actual syscall logic outside of the kernel module, correct?
Is there another way to do this?
Cheers,
eeknay
Syscalls are not the correct interface for this sort of work. At least, that's the reason kernel developers made adding syscalls difficult.
There are lots of different ways to move data between userspace and a kernel module: the proc and sysfs pseudo-filesystems, the char device interface (using read, write or ioctl), or the local pseudo-network interface netlink.
Which one you choose depends on the amount and type of data you want to send. You should probably only use proc/sysfs if you intend to pass tiny amounts of data; for big bulk transfers, a char device or netlink is better suited.
Impossible -- no.
AV modules and rootkits do it all the time.