Linux 'socketcall' system call implementation - linux-kernel

In linux all socket related system calls are gated throw one system call named socketcall.Its handler is found in /net/socket.c. As one can expect there are a copy_from_user for the arguments and then a switch for all socket functions.
I expected to see in each case a call for a ordinary function , but it seems that there are callings to another system calls. For example the case for 'socket' :
case SYS_SOCKET:
err = sys_socket(a0, a1, a[2]);
break;
sys_socket is also defined in /net/socket.c as :
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
My question is why its defined like this. I guess its for backward compatibility, or I have a mistake somewhere?

man 2 socketcall says that
NOTES
On a few architectures, for example ia64, there is no socketcall() system call; instead socket(2), accept(2), bind(2), and so on really
are implemented as separate system calls
So, in the case of x86, the socketcall dispatcher is only for x86_32, while x86_64 uses separate system calls for each socket API.

Related

How can an ebpf program change kernel execution flow or call kernel functions?

I'm trying to figure out how an ebpf program can change the outcome of a function (not a syscall, in my case) in kernel space. I've found numerous articles and blog posts about how ebpf turns the kernel into a programmable kernel, but it seems like every example is just read-only tracing and collecting statistics.
I can think of a few ways of doing this: 1) make a kernel application read memory from an ebpf program, 2) make ebpf change the return value of a function, 3) allow an ebpf program to call kernel functions.
The first approach does not seem like a good idea.
The second would be enough, but as far as I understand it's not easy. This question says syscalls are read-only. This bcc document says it is possible but the function needs to be whitelisted in the kernel. This makes me think that the whitelist is fixed and can only be changed by recompiling the kernel, is this correct?
The third seems to be the most flexible one, and this blog post encouraged me to look into it. This is the one I'm going for.
I started with a brand new 5.15 kernel, which should have this functionality
As the blog post says, I did something no one should do (security is not an issue since I'm just toying with this) and opened every function to ebpf by adding this to net/core/filter.c (which I'm not sure is the correct place to do so):
static bool accept_the_world(int off, int size,
enum bpf_access_type type,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info)
{
return true;
}
bool export_the_world(u32 kfunc_id)
{
return true;
}
const struct bpf_verifier_ops all_verifier_ops = {
.check_kfunc_call = export_the_world,
.is_valid_access = accept_the_world,
};
How does the kernel know of the existence of this struct? I don't know. None of the other bpf_verifier_ops declared are used anywhere else, so it doesn't seem like there is a register_bpf_ops
Next I was able to install bcc (after a long fight due to many broken installation guides).
I had to checkout v0.24 of bcc. I read somewhere that pahole is required when compiling the kernel, so I updated mine to v1.19.
My python file is super simple, I just copied the vfs example from bcc and simplified it:
bpf_text_kfunc = """
extern void hello_test_kfunc(void) __attribute__((section(".ksyms")));
KFUNC_PROBE(vfs_open)
{
stats_increment(S_OPEN);
hello_test_kfunc();
return 0;
}
"""
b = BPF(text=bpf_text_kfunc)
Where hello_test_kfunc is just a function that does a printk, inserted as a module into the kernel (it is present in kallsyms).
When I try to run it, I get:
/virtual/main.c:25:5: error: cannot call non-static helper function
hello_test_kfunc();
^
And this is where I'm stuck. It seems like it's the JIT that is not allowing this, but who exactly is causing this issue? BCC, libbpf or something else? Do I need to manually write bpf code to call kernel functions?
Does anyone have an example with code of what the lwn blog post I linked talks about actually working?
eBPF is fundamentally made to extend kernel functionality in very specific limited ways. Essentially a very advanced plugin system. One of the main design principles of the eBPF is that a program is not allowed to break the kernel. Therefor it is not possible to change to outcome of arbitrary kernel functions.
The kernel has facilities to call a eBPF program at any time the kernel wants and then use the return value or side effects from helper calls to effect something. The key here is that the kernel always knows it is doing this.
One sort of exception is the BPF_PROG_TYPE_STRUCT_OPS program type which can be used to replace function pointers in whitelisted structures.
But again, explicitly allowed by the kernel.
make a kernel application read memory from an ebpf program
This is not possible since the memory of an eBPF program is ephemaral, but you could define your own custom eBPF program type and pass in some memory to be modified to the eBPF program via a custom context type.
make ebpf change the return value of a function
Not possible unless you explicitly call a eBPF program from that function.
allow an ebpf program to call kernel functions.
While possible for a number for purposes, this typically doesn't give you the ability to change return values of arbitrary functions.
You are correct, certain program types are allowed to call some kernel functions. But these are again whitelisted as you discovered.
How does the kernel know of the existence of this struct?
Macro magic. The verifier builds a list of these structs. But only if the program type exists in the list of program types.
/virtual/main.c:25:5: error: cannot call non-static helper function
This seems to be a limitation of BCC, so if you want to play with this stuff you will likely have to manually compile your eBPF program and load it with libbpf or cilium/ebpf.

How does Go make system calls?

As far as I know, in CPython, open() and read() - the API to read a file is written in C code. The C code probably calls some C library which knows how to make system call.
What about a language such as Go? Isn't Go itself now written in Go? Does Go call C libraries behind the scenes?
The short answer is "it depends".
Go compiles for multiple combinations of H/W and OS, and they all have different approaches to how syscalls are to be made when working with them.
For instance, Solaris does not provide a stable supported set of syscalls, so they go through the systems libc — just as required by the vendor.
Windows does support a rather stable set of syscalls but it is defined as a C API provided by a set of standard DLLs.
The functions exposed by those DLLs are mostly shims which use a single "make a syscall by number" function, but these numbers are not documented and are different between the kernel flavours and releases (perhaps, intentionally).
Linux does provide a stable and documented set of numbered syscalls and hence there Go just calls the kernel directly.
Now keep in mind that for Go to "call the kernel directly" means following the so-called ABI of the H/W and OS combo. For instance, on modern Linux on amd64 making a syscall requires filling a set of CPU registers with certain values, doing some other arrangements and then issuing the SYSENTER CPU instruction.
On Windows, you have to use its native calling convention (which is stdcall, not cdecl).
Yes go is now written in go. But, you don't need C to make syscalls.
An important thing to call out is that syscalls aren't "written in C." You can make syscalls from C on Unix because of <unistd.h>. In particular, how Linux defines this header is a little convoluted, but you can see from this file the general idea. Syscalls are defined with a name and a number. When you call read for example, what really happens behind the scenes is the parameters are setup in the proper registers/memory (linux expects the syscall number in eax) followed by the instruction syscall which fires interrupt 0x80. The OS has already setup the proper interrupt handlers that will receive this interrupt and the OS goes about doing whatever is needed for that syscall. So, you don't need something written in C (or a standard library for that matter) to make syscalls. You just need to understand the call ABI and know the interrupt numbers.
However, as #retgits points out golang's approach is to piggyback off the fact that libc already has all of the logic for handling syscalls. mksyscall.go is a CLI script that parses these libc files to extract the necessary information.
You can actually trace the life of a syscall if you compile a go script like:
package main
import (
"syscall"
)
func main() {
var buf []byte
syscall.Read(9, buf)
}
Run objdump -D on the resulting binary. The go runtime is rather large, so your best bet is to find the main function, see where it calls syscall.Read and then search for the offsets from there: syscall.Read calls syscall.syscall, syscall.syscall calls runtime.libcCall (which switches from the go ABI to C ABI compatibility so that arguments are located where the OS expects--you can see this in runtime, for darwin for example), runtime.libcCall calls runtime.asmcgocall, etc.
For extra fun, run that binary with gdb and continue stepping in until you hit the syscall.
The sys package takes care of the syscalls to the underlying OS. Depending on the OS you're using different packages are used to generate the appropriate calls. Here is a link to the README for Go running on Unix systems: https://github.com/golang/sys/blob/master/unix/README.md the parts on mksyscall.go, which are hand-written Go files which implement system calls that need special handling, and type files, should walk you through how it works.
The Go compiler (which translates the Go code to target CPU code) is written in Go but that is different to the run time support code which is what you are talking about. The standard library is mainly written in Go and probably knows how to directly make system calls with no C code involved. However, there may be a bit of C support code, depending on the target platform.

Regarding msgrcv in android-kernel?

I was running a test suite for testing IPC related functionality in android kernel. while I was testing msgrcv system call , it return error function not implemented.
So is it true msgrcv() system call not implemented in android-kernel, if so why and which system call in android kernel serve purpose of msgrcv() system call.
I got related statement which says System V IPCs (including message queues) are not implemented on Bionic. but not sure what does it mean.
Update : I am able to find definition of msgrcv in android kernel , but not sure why it is returning error function not implemented.
Below code snippet :
SYSCALL_DEFINE5(msgrcv, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
long, msgtyp, int, msgflg)
{
return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill);
}
Please comment if information seems incomplete or vague ,Help is appreciated.
System V IPC may be available in the kernel but system call interfaces are not implemented in Bionic lib C. For Example, /bionic/libc/arch-arm/syscalls/ contains all system call implementations with respect to ARM.

When to use network system calls vs. sk_buff within a KM

While trying to learn more about linux kernel networking ... I have a kernel module that contains a protocol which runs on top of TCP. Its almost an application layer protocol I'm experimenting with. The calls are passed in via the normal system call interface as executed from userspace.
So network calls from within my (layer above TCP) module generally look like this ...
ret = sock->ops->connect(sock, (struct sockaddr *) &myprot.daddr,
sizeof(myprot.daddr), flags);
I've used sendmsg/recvmsg successfully within my KM to send and receive data from a client to a server (from two separate kernel instances). The calls within the KM generally looks as follows:
ret = sock->ops->sendmsg(iocb, myprot.skt, &msg, sizeof(struct msghdr));
ret = sock->ops->recvmsg(iocb, sock, msg, total_len, flags);
What I'm trying to understand now is how and when to use sk_buff to do the same thing. I.e. when to use system calls such as what I use above, and when to directly access the network stack via sk_buff to send and receive data.
I've found many examples of how to send and receive data from within transport layers using sk_buff, but nothing from a layer above the transport that is also contained in a kernel module and using sk_buff.
Update for clarification.
I've overridden struct proto_ops and replaced the member methods for my own protocols use which do correspond to system calls from user space. I do understand that sk_buff is the buffer system for the kernel and is where packets are enqueued. However. I don't see any reason why I can't use the protocol-specific functions of struct proto_ops which also handles sockets and the data enqueued on them (though at a higher level). So it seems to me there are two ways to access sk_buffs depending upon where one wants to access them.
If I'm working in the transport layer and want to access data anywheres within the network stack (e.g. transport, ip, mac), I could directly access sk_buffs, but if I am working above the transport layer, I would use the abstracted protocol specific member functions that correspond to system calls. After all, they both eventually work on sk_buffs.
I guess my confusion, or what I'm trying to confirm that I'm doing right or wrong by knowing the difference in these two ways to access sk_buffs and from where, is ... if I'm sending data over a transport from TCP within the kernel, than I can just make use of the proto_ops system calls that relate to TCP unless I need more control in which I would then make use of the lower level skb functions to manage the queues.
Not sure to understand because you want to use to different things for the same purpose. The proto_ops in sock->ops are operations invoked during the correspondent system call. The sk_buff is the socket buffer system of the kernel; it is the place where packet are enqueued.
There is not the possibility to do the same thing of proto_ops with sk_buff, if it should be possible one of these structures is useless.

Problem with .release behavior in file_operations

I'm dealing with a problem in a kernel module that get data from userspace using a /proc entry.
I set open/write/release entries for my own defined /proc entry, and manage well to use it to get data from userspace.
I handle errors in open/write functions well, and they are visible to user as open/fopen or write/fwrite/fprintf errors.
But some of the errors can only be checked at close (because it's the time all the data is available). In these cases I return something different than 0, which I supposed to be in some way the value 'close' or 'fclose' will return to user.
But whatever the value I return my close behave like if all is fine.
To be sure I replaced all the release() code by a simple 'return(-1);' and wrote a program that open/write/close the /proc entry, and prints the close return value (and the errno). It always return '0' whatever the value I give.
Behavior is the same with 'fclose', or by using shell mechanism (echo "..." >/proc/my/entry).
Any clue about this strange behavior that is not the one claimed in many tutorials I found?
BTW I'm using RHEL5 kernel (2.6.18, redhat modified), on a 64bit system.
Thanks.
Regards,
Yannick
The release() isn't allowed to cause the close() to fail.
You could require your userspace programs to call fsync() on the file descriptor before close(), if they want to find out about all possible errors; then implement your final error checking in the fsync() handler.

Resources