Clean halt in the Linux kernel

I need to execute a few machine instructions just before the kernel "halts".
The reason is that I need to inform a board controller that it can actually remove power.
The question is: what is the best practice to achieve this?
In an old (3.18) kernel for the same board I hacked .../arch/mips/ralink/reset.c to add some register settings in static void ralink_halt(void), but that function seems to be gone, together with static int __init mips_reboot_setup(void), so I guess the structure has changed a lot since then.
What is the correct hook to use in modern kernels?
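For what it's worth, one generic hook that still exists in modern kernels is a reboot notifier, which fires on halt/power-off/restart. A minimal sketch, where board_tell_controller_power_off() is a hypothetical stand-in for the register settings the old ralink_halt() hack performed:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/reboot.h>

static void board_tell_controller_power_off(void)
{
        /* hypothetical: poke the board controller's power-release registers */
}

static int board_poweroff_prep(struct notifier_block *nb,
                               unsigned long action, void *data)
{
        if (action == SYS_HALT || action == SYS_POWER_OFF)
                board_tell_controller_power_off();
        return NOTIFY_DONE;
}

static struct notifier_block board_poweroff_nb = {
        .notifier_call = board_poweroff_prep,
};

static int __init board_poweroff_init(void)
{
        return register_reboot_notifier(&board_poweroff_nb);
}
module_init(board_poweroff_init);
MODULE_LICENSE("GPL");

Whether a reboot notifier runs late enough for a given board is something to verify against the platform's power-off path.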

Related

How can an eBPF program change kernel execution flow or call kernel functions?

I'm trying to figure out how an eBPF program can change the outcome of a function (not a syscall, in my case) in kernel space. I've found numerous articles and blog posts about how eBPF turns the kernel into a programmable kernel, but it seems like every example is just read-only tracing and collecting statistics.
I can think of a few ways of doing this: 1) make a kernel application read memory from an eBPF program, 2) make eBPF change the return value of a function, 3) allow an eBPF program to call kernel functions.
The first approach does not seem like a good idea.
The second would be enough, but as far as I understand it's not easy. This question says syscalls are read-only. This bcc document says it is possible, but the function needs to be whitelisted in the kernel. This makes me think that the whitelist is fixed and can only be changed by recompiling the kernel; is this correct?
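(For illustration, that whitelisted path looks roughly like this in bcc. This is a sketch, not taken from the documents above: it assumes should_failslab, a function mainline tags with ALLOW_ERROR_INJECTION, and it requires CONFIG_BPF_KPROBE_OVERRIDE.)

from bcc import BPF

# hedged sketch: bpf_override_return() works only on functions the kernel
# has tagged ALLOW_ERROR_INJECTION, and needs CONFIG_BPF_KPROBE_OVERRIDE
prog = """
#include <uapi/linux/ptrace.h>

int kprobe__should_failslab(struct pt_regs *ctx)
{
    bpf_override_return(ctx, -12 /* -ENOMEM */);
    return 0;
}
"""
b = BPF(text=prog)
b.trace_print()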
The third seems to be the most flexible one, and this blog post encouraged me to look into it. This is the one I'm going for.
I started with a brand-new 5.15 kernel, which should have this functionality.
As the blog post suggests, I did something no one should do (security is not a concern since I'm just toying with this) and opened every function to eBPF by adding this to net/core/filter.c (which I'm not sure is the correct place to do so):
static bool accept_the_world(int off, int size,
                             enum bpf_access_type type,
                             const struct bpf_prog *prog,
                             struct bpf_insn_access_aux *info)
{
        return true;
}

bool export_the_world(u32 kfunc_id)
{
        return true;
}

const struct bpf_verifier_ops all_verifier_ops = {
        .check_kfunc_call = export_the_world,
        .is_valid_access  = accept_the_world,
};
How does the kernel know of the existence of this struct? I don't know. None of the other declared bpf_verifier_ops are referenced anywhere else either, so it doesn't seem like there is a register_bpf_ops().
Next I was able to install bcc (after a long fight with many broken installation guides).
I had to check out v0.24 of bcc. I read somewhere that pahole is required when compiling the kernel, so I updated mine to v1.19.
My Python file is super simple; I just copied the vfs example from bcc and simplified it:
bpf_text_kfunc = """
extern void hello_test_kfunc(void) __attribute__((section(".ksyms")));

KFUNC_PROBE(vfs_open)
{
    stats_increment(S_OPEN);
    hello_test_kfunc();
    return 0;
}
"""
b = BPF(text=bpf_text_kfunc)
Where hello_test_kfunc is just a function that does a printk, inserted as a module into the kernel (it is present in kallsyms).
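(For reference, a minimal sketch of such a module; the printk body is illustrative:)

/* hedged sketch of the hello_test_kfunc module described above */
#include <linux/module.h>
#include <linux/printk.h>

void hello_test_kfunc(void)
{
        printk(KERN_INFO "hello_test_kfunc called\n");
}
EXPORT_SYMBOL_GPL(hello_test_kfunc);

MODULE_LICENSE("GPL");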
When I try to run it, I get:
/virtual/main.c:25:5: error: cannot call non-static helper function
hello_test_kfunc();
^
And this is where I'm stuck. It seems to be the JIT that is disallowing this, but what exactly is causing the issue: BCC, libbpf, or something else? Do I need to write BPF code by hand to call kernel functions?
Does anyone have an example with code of what the lwn blog post I linked talks about actually working?
eBPF is fundamentally made to extend kernel functionality in very specific, limited ways; essentially, it is a very advanced plugin system. One of the main design principles of eBPF is that a program is not allowed to break the kernel. Therefore it is not possible to change the outcome of arbitrary kernel functions.
The kernel has facilities to call an eBPF program at any time the kernel wants and then use the return value or side effects from helper calls to affect something. The key here is that the kernel always knows it is doing this.
One sort of exception is the BPF_PROG_TYPE_STRUCT_OPS program type, which can be used to replace function pointers in whitelisted structures.
But again, this is explicitly allowed by the kernel.
make a kernel application read memory from an eBPF program
This is not possible, since the memory of an eBPF program is ephemeral, but you could define your own custom eBPF program type and pass some memory to be modified to the eBPF program via a custom context type.
make eBPF change the return value of a function
Not possible unless you explicitly call an eBPF program from that function.
allow an eBPF program to call kernel functions
While this is possible for a number of purposes, it typically doesn't give you the ability to change the return values of arbitrary functions.
You are correct that certain program types are allowed to call some kernel functions, but these are again whitelisted, as you discovered.
How does the kernel know of the existence of this struct?
Macro magic. The verifier builds a list of these structs, but only if the program type exists in the list of program types.
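Concretely, the per-program-type ops table in kernel/bpf/verifier.c is expanded from include/linux/bpf_types.h, which is why a struct that no BPF_PROG_TYPE() entry names (like your all_verifier_ops) is never picked up. Paraphrased from the 5.15-era source:

static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
#define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \
        [_id] = & _name ## _verifier_ops,
#define BPF_MAP_TYPE(_id, _ops)
#define BPF_LINK_TYPE(_id, _name)
#include <linux/bpf_types.h>
#undef BPF_PROG_TYPE
#undef BPF_MAP_TYPE
#undef BPF_LINK_TYPE
};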
/virtual/main.c:25:5: error: cannot call non-static helper function
This seems to be a limitation of BCC, so if you want to play with this stuff you will likely have to compile your eBPF program manually and load it with libbpf or cilium/ebpf.
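A rough sketch of what that route looks like: a standalone BPF C object declaring the kfunc with __ksym, built with clang and loaded with bpftool or a small libbpf loader. This assumes hello_test_kfunc has actually been whitelisted for the program type, which a stock kernel will not do:

/* prog.bpf.c: hedged sketch; build with
 *   clang -O2 -g -target bpf -c prog.bpf.c -o prog.bpf.o */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

extern void hello_test_kfunc(void) __ksym;

SEC("fentry/vfs_open")
int BPF_PROG(trace_vfs_open, const struct path *path, struct file *file)
{
        hello_test_kfunc();
        return 0;
}

char LICENSE[] SEC("license") = "GPL";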

Disable Lazy Evaluation in QEMU

Is it possible to disable lazy evaluation in QEMU (User-Mode)?
I found no flags for it when running qemu-i386.
Looking at the code, I found the function:
static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
in target/i386/translate.c, which converts one guest instruction into host instructions. Within this function, the function:
static void gen_compute_eflags(DisasContext *s)
is used to generate eflags for the specific instructions that will need them.
My first idea would be to add a gen_compute_eflags() call to every instruction, but I wonder if there is a more efficient and less error-prone way to do this.
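(Roughly what I have in mind, as a sketch only; where in disas_insn() it is actually safe to do this is an open question:)

/* hedged sketch: force eager flag computation after each translated
 * instruction, defeating the lazy-eflags optimisation */
static target_ulong disas_insn(DisasContext *s, CPUState *cpu)
{
    /* ... existing decode and code generation for one instruction ... */

    gen_compute_eflags(s);  /* materialise eflags unconditionally */

    return s->pc;
}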
What are you actually trying to achieve? There is no easy way to disable the optimisation that omits flag-register codegen when the flag value is never used before it is overwritten, and I don't know why you would want to do that. The optimisation makes execution faster, and QEMU applies it only in places where it is safe (i.e. where the flag value we don't compute is never used by the guest and is not visible to the user via the debug stub).

Linux kernel detecting the pre-boot environment for watchdog

So I'm developing for an embedded Linux system, and we had some trouble with an external watchdog chip which needed to be fed very early in the boot process.
More specifically, from what we could work out, this external watchdog would cause a reset while the kernel was decompressing its image in the pre-boot environment. There is not enough grace time before it starts needing to be fed; this should probably have been sorted in hardware, since the chip is external, but an internal software solution is wanted.
The solution from one of our developers was to put some extra code into...
int zlib_inflate(z_streamp strm, int flush) in the lib/zlib_inflate/inflate.c kernel code
This new code periodically toggles the watchdog pin during the decompression.
Now, besides the fact that this feels like a bit of a dirty hack, it does work, and it has raised an interesting point in my mind, because this lib is used after boot as well. Is there a nice way for a bit of code to detect whether it is running in the pre-boot environment, so it could perform this toggling only pre-boot and not when the lib is used later?
As an aside, I'm also interested in any ideas to avoid the hack in the first place.
Is there a nice way for a bit of code to detect whether it is running in the pre-boot environment?
You're asking an XY question.
The X problem can be solved cleanly if you are using U-Boot.
(BTW instead of "pre-boot", i.e. before boot, you probably mean "boot", i.e. before the kernel is started.)
If you're using U-Boot in the boot sequence, then you do not have to hack any boot or kernel code. Apparently you are booting a self-extracting compressed kernel in a zImage (or a zImage within a uImage) file. The hack-free solution is described by U-Boot's author/maintainer, Wolfgang Denk:
It is much better to use a normal (uncompressed) kernel image, compress it
using just gzip, and use this as payload for mkimage. This way
U-Boot does the uncompressing instead of including yet another
uncompressor with each kernel image.
So instead of make uImage, do a simple make.
Compress the Image file, then encapsulate it with the U-Boot wrapper using mkimage (specifying the compression algorithm that was applied, so that U-Boot can use its built-in decompressor) to produce your uImage file.
When U-Boot loads this uImage file, the wrapper will indicate that it is a compressed file.
U-Boot will then run its internal decompressor library, which (in recent versions) is already watchdog-aware.
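A sketch of that flow (the architecture and the load/entry addresses are board-specific placeholders):

# build an uncompressed kernel, compress it yourself, then wrap it for U-Boot
make vmlinux
${CROSS_COMPILE}objcopy -O binary vmlinux vmlinux.bin
gzip -9 vmlinux.bin
mkimage -A arm -O linux -T kernel -C gzip \
        -a 0x80008000 -e 0x80008000 \
        -n "Linux" -d vmlinux.bin.gz uImage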
Quick and dirty solution off the top of my head:
Make a global static variable in the file that's initialized to 1, and as long as it's 1, consider that "pre-boot".
Add a *_initcall (choose whichever level fits your needs; I'm not sure when the kernel is decompressed) to set it to 0, as in the sketch below.
See include/linux/init.h in the kernel tree for initcall levels.
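A sketch of that idea (names and the initcall level are illustrative):

#include <linux/init.h>

/* hedged sketch: treat everything before the initcalls as "pre-boot" */
static int preboot = 1;

static int __init preboot_over(void)
{
        preboot = 0;
        return 0;
}
early_initcall(preboot_over);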
See #sawdust's answer for how to achieve the watchdog feeding without having to hack the kernel code.
However, that does not fully address the original question of how to detect that code is being compiled for the "pre-boot environment", as it is called within the kernel source.
Files within the kernel such as ...
include/linux/decompress/mm.h
lib/decompress_inflate.c
And to a lesser extent (it isn't commented as clearly)...
lib/decompress_unlzo.c
seem to check the STATIC definition to handle "pre-boot environment" differences, as in this excerpt from include/linux/decompress/mm.h:
#ifdef STATIC
/* Code active when included from pre-boot environment: */
...
#else /* STATIC */
/* Code active when compiled standalone for use when loading ramdisk: */
...
#endif /* STATIC */
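Applied to the watchdog hack, that suggests guarding the toggle on STATIC; toggle_watchdog_pin() below is a hypothetical board-specific helper:

/* hedged sketch: only feed the watchdog when built into the pre-boot
 * decompressor, where STATIC is defined */
#ifdef STATIC
#define FEED_WATCHDOG() toggle_watchdog_pin()
#else
#define FEED_WATCHDOG() do { } while (0)
#endif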
Another idea would be to disable the watchdog from the bootloader and re-enable it from user space once the system has booted completely.

Linux UART driver - debugging time taken for __init call

I am a bit new to the Linux kernel, and our team is trying to optimize the boot-up time of the device. We observed that the 8250 UART driver takes more than 1 second to complete its __init call. Using printk()s, and going by the generated console time-stamps prefixed to every log message, I was able to narrow down the function call which takes the extra time:
ret = platform_driver_register(&serial8250_isa_driver);
Being a novice, I am unsure what more I could do from a debugging standpoint to track down the issue, and I am looking for pointers/suggestions from experienced kernel developers. Just curious: what other approaches would kernel developers use from their "debugging toolbox"?
If I understand correctly, the register function is doing stuff with that struct (maybe polling addresses or something). You would need to see whether any of the functions defined within it are being called by register.
To answer your question more directly: does the platform you're running on have an 8250 ISA UART? If not, that could well explain why it takes so long to init (it's timing out).
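Also, for timing initcalls in general, the kernel's built-in initcall_debug parameter logs the duration of every initcall without hand-placed printk()s; the output line below is illustrative, not captured from a real boot:

# append to the kernel command line:  initcall_debug ignore_loglevel
# then, after boot:
dmesg | grep 'initcall serial8250'
# initcall serial8250_init+0x0/0x1a4 returned 0 after 1042312 usecs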

How can I override malloc(), calloc(), free(), etc. under OS X?

Assuming the latest Xcode and GCC, what is the proper way to override the memory allocation functions (I guess operator new/delete as well)? The debugging memory allocators are too slow for a game; I just need some basic stats that I can collect myself with minimal impact.
I know it's easy on Linux due to the hooks, and this was trivial under CodeWarrior ten years ago when I wrote HeapManager.
Sadly, SmartHeap no longer has a Mac version.
I would use library preloading for this task, because it does not require modification of the running program. If you're familiar with the usual Unix way to do this, it's almost a matter of replacing LD_PRELOAD with DYLD_INSERT_LIBRARIES.
The first step is to create a library with code such as this, then build it using regular shared-library linking options (gcc -dynamiclib):
#include <dlfcn.h>   /* dlsym(), RTLD_NEXT */
#include <stdio.h>
#include <stdlib.h>

void *malloc(size_t size)
{
    static void *(*real_malloc)(size_t);

    if (real_malloc == NULL)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    fprintf(stderr, "allocating %lu bytes\n", (unsigned long)size);
    /* Do your stuff here */
    return real_malloc(size);
}
Note that if you also divert calloc() and its implementation calls malloc(), you may need additional code to check how you're being called. C++ programs should be pretty safe because the new operator calls malloc() anyway, but be aware that no standard enforces that (I have never encountered an implementation that didn't use malloc(), though).
Finally, set up the running environment for your program and launch it (might require adjustments depending on how your shell handles environment variables):
export DYLD_INSERT_LIBRARIES=./yourlibrary.dylib
export DYLD_FORCE_FLAT_NAMESPACE=1
yourprogram --yourargs
See the dyld manual page for more information about the dynamic linker environment variables.
This method is pretty generic. There are limitations, however:
You won't be able to divert direct system calls
If the application itself tricks you by using dlsym() to load malloc's address, the call won't be diverted. Unless, however, you trick it back by also diverting dlsym!
The malloc_default_zone technique mentioned at http://lists.apple.com/archives/darwin-dev/2005/Apr/msg00050.html appears to still work, see e.g. http://code.google.com/p/fileview/source/browse/trunk/fileview/fv_zone.cpp?spec=svn354&r=354 for an example use that seems to be similar to what you intend.
After much searching (here included) and issues with 10.7 I decided to write a blog post about this topic: How to set malloc hooks in OSX Lion
You'll find a few good links at the end of the post with more information on this topic.
The basic solution:
malloc_zone_t *dz = malloc_default_zone();

if (dz->version >= 8)
{
    /* remove the write protection; malloc_zones and protect_size are
       defined earlier in the linked post */
    vm_protect(mach_task_self(), (uintptr_t)malloc_zones, protect_size, 0,
               VM_PROT_READ | VM_PROT_WRITE);
}

original_free = dz->free;
dz->free = &my_free;  /* this line throws a bad-pointer exception without
                         the vm_protect call above */

if (dz->version == 8)
{
    /* put the write protection back */
    vm_protect(mach_task_self(), (uintptr_t)malloc_zones, protect_size, 0,
               VM_PROT_READ);
}
This is an old question, but I came across it while trying to do this myself. I got curious about the topic for a personal project I was working on, mainly to make sure that what I thought was being automatically deallocated really was deallocated. I ended up writing a C++ implementation that lets me track the amount of allocated heap and report it if I so choose.
https://gist.github.com/monitorjbl/3dc6d62cf5514892d5ab22a59ff34861
As the name notes, this is OSX-specific. However, I was able to do the same on Linux using the malloc_usable_size function.
Example
#define MALLOC_DEBUG_OUTPUT
#include "malloc_override_osx.hpp"
int main(){
    int* ip = (int*)malloc(sizeof(int));
    double* dp = (double*)malloc(sizeof(double));
    free(ip);
    free(dp);
}
Building
$ clang++ -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk \
-pipe -stdlib=libc++ -std=gnu++11 -g -o test test.cpp
$ ./test
0x7fa28a403230 -> malloc(16) -> 16
0x7fa28a403240 -> malloc(16) -> 32
0x7fa28a403230 -> free(16) -> 16
0x7fa28a403240 -> free(16) -> 0
Hope this helps someone else out in the future!
If the basic stats you need can be collected in a simple wrapper, a quick (and kinda dirty) trick is just using some #define macro replacement.
void* _mymalloc(size_t size)
{
    void* ptr = malloc(size);
    /* do your stat work? */
    return ptr;
}
and
#define malloc(sz_) _mymalloc(sz_)
Note: if the macro is defined before the _mymalloc definition, it will end up replacing the malloc call inside that function, leaving you with infinite recursion, so ensure this isn't the case. You might want to explicitly #undef it before that function definition and simply (re)define it afterward, depending on where you end up including it, to avoid this situation.
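A sketch of that dance:

/* hedged sketch of the #undef/(re)define approach described above */
#undef malloc                       /* see the real malloc below */

void* _mymalloc(size_t size)
{
    void* ptr = malloc(size);       /* calls the real allocator */
    /* do your stat work? */
    return ptr;
}

#define malloc(sz_) _mymalloc(sz_)  /* redirect subsequent calls */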
I think if you define your own malloc() and free() in a .c file included in the project, the linker will resolve that version.
Now then, how do you intend to implement malloc?
Check out the approach of Emery Berger, the author of the Hoard memory allocator, for replacing the allocator on OSX: https://github.com/emeryberger/Heap-Layers/blob/master/wrappers/macwrapper.cpp (and a few other files you can trace yourself by following the includes).
This is complementary to Alex's answer, but I thought this example was more to the point of replacing the system-provided allocator.
