How to get cgroup path of task in an eBPF program? - linux-kernel

I have been trying to play with tcptop BCC tool by Brendan Gregg to learn more about how eBPF programs work. I am trying to get it to print the CGROUP path of the tasks.
With my rusty knowledge of Linux systems programming, I thought I could use functions from linux/cgroup.h, especially task_cgroup_path() looked promising as I can pass current task_struct * (obtained from bpf_get_current_task()) to it. I am using a CentOS7 machine with 4.19.59 kernel.
However, when I try to executed modified tcptop, the verifier fails with last insn is not an exit or jmp error message. I am trying to understand why this is happening.
Here's the diff of modified tcptop: patch
Here's the verifier output

I thought I could use functions from linux/cgroup.h
No, you cannot.
The only functions that may be called from an eBPF program are:
Other functions defined in your eBPF code (they can be inlined but this is no longer necessary, eBPF supports function calls),
eBPF helpers, which are kernel functions specifically exposed to eBPF programs (see their documentation in the source or as a man page),
Compiler built-ins, such as __builtin_memcpy() for example.
Other kernel functions, even if the symbols are exported and exposed to kernel modules, or user functions from the standard library for that matter, are not callable from an eBPF program (they would be traceable with eBPF, but this is different).
Regarding your use case, I am not sure what the best way to get the cgroup path from its id is. You could do it in user space, although I don't know how to get a path from the id. I know the reverse is possible (getting the id from the path, this can be done with the name_to_handle_at() syscall), so worst case you could iterate on all existing paths and get the ids. If you have a strong use case and some time ahead, a longer-term solution would be to submit a new eBPF helper that would call into task_cgroup_path().
[Edit] Regarding the error message returned by the verifier (last insn is not an exit or jmp): BPF programs support function calls, so you can have several functions (“subprograms”) in your program. Each of those subprograms is a list of instructions, possibly containing forward or backward (under certain conditions) jumps. From the verifier's point of view, the execution of the program must end with an exit or unconditional backward jump instruction (a “return” from the subprogram), this is to avoid fall-through from one subprog into another. It wouldn't make sense to do nothing specific (return or exit from whole program) at the end of a function, and it would be dangerous (once JIT-ed, programs are not necessarily contiguous in memory so you couldn't fall through safely, even if that made sense).
Now if we look at how clang compiled your program, we see this for the call to task_cgroup_path():
; task_cgroup_path(t, (char *) &cgpath, sizeof(cgpath)); // Line 50
20: bf 01 00 00 00 00 00 00 r1 = r0
21: b7 03 00 00 10 00 00 00 r3 = 16
22: 85 10 00 00 ff ff ff ff call -1
The call -1 is a call to a BPF function (not a kernel helper, but a function expected to be found in your program, you can tell with the source register being set at 1 (BPF_PSEUDO_CALL) on the second byte of the instruction). So the verifier considers that the target for this jump is a subprogram. Except that because task_cgroup_path() is not a BPF subprogram, clang could not find it to set a relevant offset and used -1 instead, marking the call -1 as the first instruction of this would-be subprogram. But the last instruction before the call -1 is neither an exit nor a jump, so the verifier eventually catches that something is wrong with the program, and rejects it. All this logic happens in function check_subprogs().

Related

Threads spawned by a detoured pthread_create do not execute instructions

I've got a custom implementation of detours on macOS and a test application using it, which is written in C, compiled for macOS x86_64, running on an Intel i9 processor.
The implemention works fine with a multitude of functions. However, if I detour pthread_create, I encounter strange behaviour: threads that have been spawned via a detoured pthread_create do not execute instructions. I can step through instructions one by one but as soon as I continue it does not progress. There are no mutexes or synchronisation involved and the result of the function is 0 (success). The exact same application with detours turned off works fine so it's unlikely to be the culprit.
This does not happen all the time - sometimes they are fine but at other times the test applications stalls in the following state:
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff7296f55e libsystem_kernel.dylib`__ulock_wait + 10
frame #1: 0x00007fff72a325c2 libsystem_pthread.dylib`_pthread_join + 347
frame #2: 0x0000000100001186 DetoursTestApp`main + 262
frame #3: 0x00007fff7282ccc9 libdyld.dylib`start + 1
frame #4: 0x00007fff7282ccc9 libdyld.dylib`start + 1
thread #2
frame #0: 0x00007fff72a2cb7c libsystem_pthread.dylib`thread_start
Relevant memory pages have the executable flag set. The detour function that intercepts the thread creation looks like this:
static int pthread_create_detour(pthread_t* thread,
const pthread_attr_t* attr,
void* (*start_routine)(void*),
void* arg)
{
detour_count++;
pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
return original(thread, attr, start_routine, arg);
}
Where detour_original retrieves the pointer to [original function + size of function's prologue].
Tracing through the instructions, everything seems to be working correctly and pthread_create terminates successfully. Tracing the application's system calls via dtruss does show calls to
bsdthread_create(0x10DB964B0, 0x0, 0x7000080DB000) = 29646848 0
With what I have confirmed are the correct arguments.
This behaviour is only observed in release builds - debug works fine but the disassembly and execution of a detoured pthread_create and associated detours code seems to be identical in both cases.
Workarounds
I found a couple of odd workarounds for this issue that don't make much sense. Given the detour function, a number of things can be substituted into the following:
static int pthread_create_detour(pthread_t* thread,
const pthread_attr_t* attr,
void* (*start_routine)(void*),
void* arg)
{
detour_count++;
pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
<...> <== SUBSTITUTE HERE
return original(thread, attr, start_routine, arg);
}
A cache flush.
__asm__ __volatile__("" ::: "memory");
_mm_clflush(real_pthread_create);
A sleep of any duration - usleep(1)
A printf statement.
A memory allocation larger than 32768 bytes, e.g. void *data = malloc(40000);.
Cache?
All of these seem to point to a stale instruction cache. However, the Intel manual states the following:
A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.
What's even more interesting is that those workarounds have to be executed for every new thread created, with the execution happening on the main thread, so it's very unlikely to be the cache. I have also tried putting in cache flushes at every memory write that writes instructions but that did not help. I've also written a memcpy that bypasses the cache with the use of Intel's intrinsic _mm_stream_si32 and swapped it out for every instruction memory write in my implementation without any success.
Race condition?
The next suspect in line is a race condition. However, it's not clear what would be racing as at first there are no other threads. I have put in a fibonacci sequence calculation for a randomly-generated number and that would still stall the newly-spawned threads.
The question
What is causing this issue? What other mechanisms could be responsible for this?
At this point I have run out of things to check so any suggestions will be welcome.
I found that the reason why the spawned thread was not executing instructions was that the r8 register wasn't being cleared at the right time in the execution of pthread_create due to an issue with my detours implementation.
If we look at the disassembly of the function, it is split up to two parts - the "head" and the "body" that's found in an internal _pthread_create function. The head does two things - zeroes out r8 and jumps to the body:
libsystem_pthread.dylib`pthread_create:
0x7fff72a2e236 <+0>: 45 31 c0 xor r8d, r8d
0x7fff72a2e239 <+3>: e9 40 37 00 00 jmp 0x7fff72a3197e ; _pthread_create
libsystem_pthread.dylib`_pthread_create:
0x7fff72a3197e <+0>: 55 push rbp
0x7fff72a3197f <+1>: 48 89 e5 mov rbp, rsp
0x7fff72a31982 <+4>: 41 57 push r15
<...> // the rest of the 1409 instructions
My implementation would detour the internal _pthread_create function instead of the head containing the actual entry point which meant that the r8 would get cleared at the wrong time (before the detour). Since the detour function would contain some could, the execution would go something like:
pthread_create (r8 gets cleared) -> _pthread_create -> chain of jumps -> pthread_create_detour -> trampoline (containing the beginning of _pthread_create) -> _pthread_create + 6
Which meant that depending on the contents of the pthread_create_detour function the r8 would not always end up with a 0 when it returned to the internal function.
It's not yet clear why having r8 set to something other than 0 before _pthread_create would not crash but instead start up a thread in a locked up state. An important detail is that the stalled thread would have the rflags register set to 0x200 which should never be the case according to Intel's manual. This is what lead me to inspecting the CPU state more closely, leading to the answer.

Setting a random address breakpoint in gdb

I am learning some mechanism of breakpoint and I learned that 'In x86, there exist a instruction called int3 for debugger to interrupt the CPU. And then CPU will interrupt the running program by signal'.
For example:
8048e20: 55 push %ebp
8048e21: 89 e5 mov %esp,%ebp
When the user input
b *0x8048e21
The instruction will be replaced by int3(opcode 0xcc) and become this:
8048e20: 55 push %ebp
8048e21: cc e5 mov %esp,%ebp
And it will stop at the right place.
Then comes the question:
What would happen if I set the breakpoint not at the beginning of a instruction? ie, if I input:
b *0x8048e22
will debugger still replace the e5 with cc? So I write a simple example and run it with gdb.
As you can see above, I set two break points and the second is at the middle of a break points. I Input r and stop at the first breakpoint and input c and run to the end.
So it seems that the gdb ignore the second breakpoint. (For if it really repalce it with a int3 the program would be totally wrong).
Question: What happen to the second breakpoint, more specifically, what does gdb deal with it( or what I learn is wrong?)
Edit: #dbrank already give a great example about altering the data field of a instruction, I will try to make it more comprehensive with a similar example (it seems the register).
(Any reference about mechanism of breakpoint is appreciated!)
Inserting breakpoint in the middle of instruction will alter the instruction.
See this example of a program, where inserting a breakpoint overwrites original value assigned to variable (42 (0x2a)) with breakpoint instruction (0xcc (204)).
You can find more about how breakpoints work here.
You can also look into GDB sources (breakpoint.c & infrun.c mostly).

Assembly syntax to distinguish two forms of near jump

I'm assembling the same source with two different assemblers. I expect to get two identical results (modulo memory offsets, exact value of NOPs and such). Yet I've suddenly encountered the weirdest issue: there are two possible encodings of JZ:
74 cb
and
0F 84 cw/cd
The displacement, in my case, fits into one byte, and one assembler (a flavor of GAS, I think) emits the former while another (MASM) emits the latter. Since I perform some validation by matching outputs, this throws the validation off.
I have rather little control over the options of GAS, but I have complete control over MASM. Question - is there an option, a directive, or a specific command syntax to force one encoding over the other?
If all of the code, except for this one instruction is the exact same when assembled, this looks like a bug in MASM. These instructions resolve to:
74: jz rel8
0F 84: jz rel16/32
So, MASM is improperly using more space for that opcode than it should. You may be able to remedy this by using a more explicit form of the instruction in MASM, like
jz byte my_label
However, if your machine code is different at all, this may be the proper behavior of MASM. Ensure that the signed word/dword argument to jz rel16 would fit into a signed byte

More Null Free Shellcode

I need to find null-free replacements for the following instructions so I can put the following code in shellcode.
The first instruction I need to convert to null-free is:
mov ebx, str ; the string containing /dev/zero
The string str is defined in my .data section.
The second is:
mov eax,0x5a
Thanks!
Assuming what you want to learn is how assembly code is made up, what type of instruction choices ends up in assembly code with specific properties, then (on x86/x64) do the following:
Pick up Intel's instruction set reference manuals (four volumes as of this writing, I think). They contain opcode tables (instruction binary formats), and detailed lists of all allowed opcodes for a specific assembly mnemonic (instruction name).
Familiarize yourself with those and mentally divide them into two groups - those that match your expected properties (like, not containing the 'x' character ... or any other specific one), and those that don't. The 2nd category you need to eliminate from your code if they're present.
Compile your code telling the compiler not to discard compile intermediates:gcc -save-temps -c csource.c
Disassemble the object file:objdump -d csource.o
The disassembly output from objdump will contain the binary instructions (opcodes) as well as the instruction names (mnemonics), i.e. you'll see exactly which opcode format was chosen. You can now check whether any opcodes in there are from the 2nd set as per 1. above.
The creative bit of the work comes in now. When you've found an instruction in the disassembly output that doesn't match the expectations/requirements you have, look up / create a substitute (or, more often, a substitute sequence of several instructions) that gives the same end result but is only made up from instructions that do match what you need.
Go back to the compile intermediates from above, find the csource.s assembly, make changes, reassemble/relink, test.
If you want to make your assembly code standalone (i.e. not using system runtime libraries / making system calls directly), consult documentation on your operating system internals (how to make syscalls), and/or disassemble the runtime libraries that ordinarily do so on your behalf, to learn how it's done.
Since 5. is definitely homework, of the same sort like create a C for() loop equivalent to a given while() loop, don't expect too much help there. The instruction set reference manuals and experiments with the (dis)assembler are what you need here.
Additionally, if you're studying, attend lessons on how compilers work / how to write compilers - they do cover how assembly instruction selection is done by compilers, and I can well imagine it to be an interesting / challenging term project to e.g. write a compiler whose output is guaranteed to contain the character '?' (0x3f) but never '!' (0x21). You get the idea.
You mention the constant load via xor to clear plus inc and shl to get any set of bits you want.
The least fragile way I can think of to load an unknown constant (your unknown str) is to load the constant xor with some value like 0xAAAAAAAA and then xor that back out in a subsequent instruction. For example to load 0x1234:
0: 89 1d 9e b8 aa aa mov %ebx,0xaaaab89e
6: 31 1d aa aa aa aa xor %ebx,0xaaaaaaaa
You could even choose the 0xAAAAAAAA to be some interesting ascii!

What are segfault rip/rsp numbers and how to use them

When my linux application crashes, it produces a line in the logs something like:
segfault at 0000000 rip 00003f32a823 rsp 000123ade323 error 4
What are those rip and rsp addresses? How do I use them to pinpoint the problem? Do they correspond to something in the objdump or readelf outputs? Are they useful if my program gets its symbols stripped out (to a separate file, which can be used using gdb)?
Well the rip pointer tells you the instruction that caused the crash. You need to look it up in a map file.
In the map file you will have a list of functions and their starting address. When you load the application it is loaded to a base address. The rip pointer - the base address gives you the map file address. If you then search through the map file for a function that starts at an address slightly lower than your rip pointer and is followed, in the list, by a function with a higher address you have located the function that crashed.
From there you need to try and identify what went wrong in your code. Its not much fun but it, at least, gives you a starting point.
Edit: The "segfault at" bit is telling you, i'd wager, that you have dereferenced a NULL pointer. The rsp is the current stack pointer. Alas its probably not all that useful. With a memory dump you "may" be able to figure out more accurately where you'd got to in the function but it can be really hard to work out, exactly, where you are in an optimised build
I got the error, too. When I saw:
probe.out[28503]: segfault at 0000000000000180 rip 00000000004450c0 rsp 00007fff4d508178 error 4
probe.out is an app which using libavformat (ffmpeg). I disassembled it.
objdump -d probe.out
The rip is where the instruction will run:
00000000004450c0 <ff_rtp_queued_packet_time>:
4450c0: 48 8b 97 80 01 00 00 mov 0x180(%rdi),%rdx
44d25d: e8 5e 7e ff ff callq 4450c0 <ff_rtp_queued_packet_time>
finally, I found the app crashed in the function ff_rtp_queued_packet_time.
PS. sometimes the address doesn't exactly match, but it is almost there.

Resources