Make a system call to get list of processes - linux-kernel

I'm new on modules programming and I need to make a system call to retrieve the system processes and show how much CPU they are consuming.
How can I make this call?

Why would you implement a system call for this? You don't want to add a syscall to the existing Linux API. This is the primary Linux interface to userspace and nobody touches syscalls except top kernel developers who know what they do.
If you want to get a list of processes and their parameters and real-time statuses, use /proc. Every directory that's an integer in there is an existing process ID and contains a bunch of useful dynamic files which ps, top and others use to print their output.
If you want to get a list of processes within the kernel (e.g. within a module), you should know that the processes are kept internally as a doubly linked list that starts with the init process (symbol init_task in the kernel). You should use macros defined in include/linux/sched.h to get processes. Here's an example:
#include <linux/module.h>
#include <linux/printk.h>
#include <linux/sched.h>
static int __init ex_init(void)
{
struct task_struct *task;
for_each_process(task)
pr_info("%s [%d]\n", task->comm, task->pid);
return 0;
}
static void __exit ex_fini(void)
{
}
module_init(ex_init);
module_exit(ex_fini);
This should be okay to gather information. However, don't change anything in there unless you really know what you're doing (which will require a bit more reading).

There are syscalls for that, called open, and read. The information of all processes are all kept in /proc/{pid} directories. You can gather process information by reading corresponding files.
More explained here: http://www.tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html

Related

How can an ebpf program change kernel execution flow or call kernel functions?

I'm trying to figure out how an ebpf program can change the outcome of a function (not a syscall, in my case) in kernel space. I've found numerous articles and blog posts about how ebpf turns the kernel into a programmable kernel, but it seems like every example is just read-only tracing and collecting statistics.
I can think of a few ways of doing this: 1) make a kernel application read memory from an ebpf program, 2) make ebpf change the return value of a function, 3) allow an ebpf program to call kernel functions.
The first approach does not seem like a good idea.
The second would be enough, but as far as I understand it's not easy. This question says syscalls are read-only. This bcc document says it is possible but the function needs to be whitelisted in the kernel. This makes me think that the whitelist is fixed and can only be changed by recompiling the kernel, is this correct?
The third seems to be the most flexible one, and this blog post encouraged me to look into it. This is the one I'm going for.
I started with a brand new 5.15 kernel, which should have this functionality
As the blog post says, I did something no one should do (security is not an issue since I'm just toying with this) and opened every function to ebpf by adding this to net/core/filter.c (which I'm not sure is the correct place to do so):
static bool accept_the_world(int off, int size,
enum bpf_access_type type,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info)
{
return true;
}
bool export_the_world(u32 kfunc_id)
{
return true;
}
const struct bpf_verifier_ops all_verifier_ops = {
.check_kfunc_call = export_the_world,
.is_valid_access = accept_the_world,
};
How does the kernel know of the existence of this struct? I don't know. None of the other bpf_verifier_ops declared are used anywhere else, so it doesn't seem like there is a register_bpf_ops
Next I was able to install bcc (after a long fight due to many broken installation guides).
I had to checkout v0.24 of bcc. I read somewhere that pahole is required when compiling the kernel, so I updated mine to v1.19.
My python file is super simple, I just copied the vfs example from bcc and simplified it:
bpf_text_kfunc = """
extern void hello_test_kfunc(void) __attribute__((section(".ksyms")));
KFUNC_PROBE(vfs_open)
{
stats_increment(S_OPEN);
hello_test_kfunc();
return 0;
}
"""
b = BPF(text=bpf_text_kfunc)
Where hello_test_kfunc is just a function that does a printk, inserted as a module into the kernel (it is present in kallsyms).
When I try to run it, I get:
/virtual/main.c:25:5: error: cannot call non-static helper function
hello_test_kfunc();
^
And this is where I'm stuck. It seems like it's the JIT that is not allowing this, but who exactly is causing this issue? BCC, libbpf or something else? Do I need to manually write bpf code to call kernel functions?
Does anyone have an example with code of what the lwn blog post I linked talks about actually working?
eBPF is fundamentally made to extend kernel functionality in very specific limited ways. Essentially a very advanced plugin system. One of the main design principles of the eBPF is that a program is not allowed to break the kernel. Therefor it is not possible to change to outcome of arbitrary kernel functions.
The kernel has facilities to call a eBPF program at any time the kernel wants and then use the return value or side effects from helper calls to effect something. The key here is that the kernel always knows it is doing this.
One sort of exception is the BPF_PROG_TYPE_STRUCT_OPS program type which can be used to replace function pointers in whitelisted structures.
But again, explicitly allowed by the kernel.
make a kernel application read memory from an ebpf program
This is not possible since the memory of an eBPF program is ephemaral, but you could define your own custom eBPF program type and pass in some memory to be modified to the eBPF program via a custom context type.
make ebpf change the return value of a function
Not possible unless you explicitly call a eBPF program from that function.
allow an ebpf program to call kernel functions.
While possible for a number for purposes, this typically doesn't give you the ability to change return values of arbitrary functions.
You are correct, certain program types are allowed to call some kernel functions. But these are again whitelisted as you discovered.
How does the kernel know of the existence of this struct?
Macro magic. The verifier builds a list of these structs. But only if the program type exists in the list of program types.
/virtual/main.c:25:5: error: cannot call non-static helper function
This seems to be a limitation of BCC, so if you want to play with this stuff you will likely have to manually compile your eBPF program and load it with libbpf or cilium/ebpf.

Is returning while holding a spinlock automatically unsafe?

The venerated book Linux Driver Development says that
The flags argument passed to spin_unlock_irqrestore must be the same variable passed to spin_lock_irqsave. You must also call spin_lock_irqsave and spin_unlock_irqrestore in the same function; otherwise your code may break on some architectures.
Yet I can't find any such restriction required by the official documentation bundled with the kernel code itself. And I find driver code that violates this guidance.
Obviously it isn't a good idea to call spin_lock_irqsave and spin_unlock_irqrestore from separate functions, because you're supposed to minimize the work done while holding a lock (with interrupts disabled, no less!). But have changes to the kernel made it possible if done with care, was it never actually against the API contract, or is it still verboten to do so?
If the restriction has been removed at some point, did it apply to version 3.10.17?
This is just a guess, but the might be unclearly referring to a potential bug which could happen if you try to use a nonlocal variable or storage location for flags.
Basically, flags has to be private to the current execution context, which is why spin_lock_irqsave is a macro which takes the name of the flags. While flags is being saved, you don't have the spinlock yet.
How this is related to locking and unlocking in a different function:
Consider two functions that some driver developer might write:
void my_lock(my_object *ctx)
{
spin_lock_irqsave(&ctx->mylock, ctx->myflags); /* BUG */
}
void my_unlock(my_object *ctx)
{
spin_unlock_irqrestore(&ctx->mylock, ctx->myflags);
}
This is a bug because at the time ctx->myflags is written, the lock is not yet held, and it is a shared variable visible to other contexts and processors. The local flags must be saved to a private location on the stack. Then when the lock is owned, by the caller, a copy of the flags can be saved into the exclusively owned object. In other words, it can be fixed like this:
void my_lock(my_object *ctx)
{
unsigned long flags;
spin_lock_irqsave(&ctx->mylock, flag);
ctx->myflags = flags;
}
void my_unlock(my_object *ctx)
{
unsigned long flags = ctx->myflags; /* probably unnecessary */
spin_unlock_irqrestore(&ctx->mylock, flags);
}
If it couldn't be fixed like that, it would be very difficult to implement higher level primitives which need to wrap IRQ spinlocks.
How it could be arch-dependent:
Suppose that spin_lock_irqsave expands into machine code which saves the current flags in some register, then acquires the lock, and then saves that register into specified flags destination. In that case, the buggy code is actually safe. If the expanded code saves the flags into the actual flags object designated by the caller and then tries to acquire the lock, then it's broken.
I have never see that constraint aside from the book. Probably, given information in the book is just outdated, or .. simply wrong.
In the current kernel(and at least since 2.6.32, which I start to work with) actual locking is done through many level of nested calls from spin_lock_irqsave(see, e.g. __raw_spin_lock_irqsave, which is called in the middle). So different function's context for lock and unlock may hardly be a reason for misfunction.

number of forks a shell script performs during execution

Is there a way to compute the number of forks a shell script performs while it is executing? I've been looking at maybe writing a C wrapper using getrusage(2) and analyzing the various fields of
struct rusage {
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
long ru_maxrss; /* max resident set size */
long ru_ixrss; /* integral shared text memory size */
long ru_idrss; /* integral unshared data size */
long ru_isrss; /* integral unshared stack size */
long ru_minflt; /* page reclaims */
long ru_majflt; /* page faults */
long ru_nswap; /* swaps */
long ru_inblock; /* block input operations */
long ru_oublock; /* block output operations */
long ru_msgsnd; /* messages sent */
long ru_msgrcv; /* messages received */
long ru_nsignals; /* signals received */
long ru_nvcsw; /* voluntary context switches */
long ru_nivcsw; /* involuntary context switches */
};
but the number of forks isn't available here. Next idea is to strace shell and children and look for the forks. Is there a simpler way with less overhead? Is there some shell with a nonstandard option/variable/mechanism to show the number of forks?
There are a few options:
the best multi-platform approach is likely strace or its equivalent (truss, ktrace), or dtrace. See below. This also lets you attach to a running process.
a workable, if slightly tricky multi-platform approach is to create a dynamic library with your own versions of fork/execve etc, which log the calls, and then call the real C library functions. Search SO for LD_PRELOAD to get some ideas. This won't work on statically linked binaries though.
on Linux and Solaris you can set the environment variable LD_DEBUG=files, and the dynamic linker will issue various diagnostics as both executables and libraries are loaded, it should be enlightening for such a simple step. On Linux each new process should output some or most of "initialize", "init" and "fini" entries, along with PIDs. This won't work on statically linked binaries though.
if you are on Linux or *BSD or Solaris, and have root access, and process accounting is available you can run your command(s), and then inspect the output of lastcomm or dump-acct. This might require accounting to be started (if it is not already running). This might not provide the details you need on some platforms. It can be done easily on RH/CentOS 6, and provides all the details needed. Other systems also have process accounting.
if you are on Linux and have auditd support you can use autrace myscript.sh to log system calls. (auditd should be running for this so the kernel data is logged to the audit file)
for completeness: you could use a debugger, but that's about the most tedious approach I can think of ;-)
On Linux you can trace execution (moderate performance penalty) with:
strace -f -o /tmp/myscript.trace -e trace=process ./myscript.sh
Then inspect the .trace file. The parameter -e trace=process filters to show only process related syscalls.
On Solaris you can trace with:
truss -f -o /tmp/myscript.trace \
-u libc:fork,execl,execv,execle,execve,execlp,execvp ./myscript.sh
Solaris truss lets you trace both userland libraries and kernel syscalls. You could also use dtrace, see here for some ideas: http://www.brendangregg.com/DTrace/lostcpu.html
Other platforms have variables similar to LD_DEBUG or LD_VERBOSE, see the linker documentation (e.g. man ld.so).
In the above cases you should understand that what programs (usually) call are C library functions, e.g. fork(), what is requested of the kernel depends on the OS at least, and it may result in a syscall of vfork, execve or clone.

Usage of the for_each_process macro in the Linux kernel

I want to iterate over each process in the kernel and modify some parameters in task_struct. I think I can use the for_each_process() macro to do so.
However, to do it safely, I have to ensure that the process is not being executed currently and also after I get reference to its task_struct, I want to lock it down so that no one else accesses it while I am modifying it.
How can I accomplish these two goals?
You can use:
int flags;
smp_wmb();
raw_spin_lock_irqsave(&task->pi_lock, flags);
do your stuff
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
to lock the task you are currently processing.

Fixed Memory I don't need to allocate?

I just need a fixed address in any win32 process, where I can store 8 bytes without using any winapi function. I also cannot use assembler prefixes like fs:. and I have no stack pointer.
What I need:
-8 bytes of memory
-constant address and present in any process
-read and write access (via pointer, from the same process)
-should not crash the application (at least not instantly) if modified.
Don't even ask, why I need it.
The only way I'm aware of to do this is to use a DLL with a shared section...
// This goes in a DLL loaded by all apps that want to share the data
#pragma data_seg (".sharedseg")
long long myShared8Bytes = 0; // has to be initialized or this fails
#pragma data_seg()
Then, you add the following to the link command for the dll:
/SECTION:sharedseg,RWS
I am also curious why you want this...
Not that I recommend this, but the PEB probably has some unused or inconsequential fields in it that you could overwrite. I still think this is a terrible idea, though.
constant address and present in any
process
You won't be able to achieve that. Win32 uses paged memory so different processes can access the same memory addresses even though it is different memory.

Resources