Storing Per-Process Data in Kernel Module / Passing Data Between sys_enter and sys_exit Probe - linux-kernel

Familiarity with how Linux Kernel Tracepoints work is not necessarily required to help with this question, it is just what is motivating this problem. In essence, I am looking for a way to store per-process data for a kernel module, without modifying the Linux source (e.g. struct task_struct), and ideally without using locks. Here is my specific question:
I have a kernel module that hooks into the sys_enter (defined here for x86_64, aarch64) and sys_exit (x86_64, aarch64) tracepoints. For each system call issued, I need to pass some data between the enter probe and the exit probe.
Some things I have considered: I could ...
...use one global variable -- but that will be shared between concurrently executing system calls on different CPUs, creating a race.
...use one global map from PID (of the process issuing the system call) to my data, together with locks -- but that will unnecessarily require synchronization between all CPUs on each system call. I would like to avoid this, since the data is "local" to each issued system call, so I feel like there should be a way to keep it local and not add costly synchronization.
...use a per-CPU global variable -- but (it is my understanding that) a process may move to another CPU during the system call execution, making this approach incorrect.
...kmalloc some memory for my custom data upon each system call entry, then pass the address of that memory by clobbering one of the registers in struct pt_regs (both the entry and exit probe receive a pointer to said struct) -- but then I will have a memory leak for system calls that never trigger the exit probe (such as the exit system call, which never returns).
I am open to any suggestions how these ideas could be refined to address the problems I listed, or any completely different ideas that I am not thinking of.

I'd use an RCU-enabled hashtable, for safety.
The first option isn't actually doable, as you stated.
The third one requires you to track which process is using which CPU, which seems unnecessary.
The leaking problem of the fourth option can probably be solved somehow, but allocating memory on each system call can introduce a serious delay.
Of course, accessing the hashtable will also slow down the system, but it won't trigger a memory allocation for each system call, so I assume it'll be less harmful.
Also, I may be wrong here, but if you assume that only process creation/destruction will introduce changes to the table itself (not to the data within each entry, but to the location and hash value of each row), then maybe you won't even have to synchronize on each system call, but only on the ones that cause process creation/destruction.
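To make that suggestion concrete, here is a minimal sketch of the approach, assuming one entry per process keyed by PID: lookups from the probes take only rcu_read_lock(), and the spinlock is taken only when the table itself changes. The names (task_data, task_table, on_sys_enter, on_sys_exit) are made up for illustration, and a real module would also remove entries from a process-exit hook such as the sched_process_exit tracepoint.

#include <linux/hashtable.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct task_data {
    pid_t pid;
    unsigned long enter_payload;   /* written in sys_enter, read in sys_exit */
    struct hlist_node node;
    struct rcu_head rcu;
};

static DEFINE_HASHTABLE(task_table, 8);     /* 256 buckets */
static DEFINE_SPINLOCK(task_table_lock);    /* taken only when the table changes */

/* Lookup under RCU; callers must hold rcu_read_lock(). */
static struct task_data *find_task_data(pid_t pid)
{
    struct task_data *d;

    hash_for_each_possible_rcu(task_table, d, node, pid) {
        if (d->pid == pid)
            return d;
    }
    return NULL;
}

/* sys_enter probe: create the entry lazily, then stash the payload. */
static void on_sys_enter(unsigned long payload)
{
    struct task_data *d;

    rcu_read_lock();
    d = find_task_data(current->pid);
    rcu_read_unlock();

    if (!d) {
        d = kzalloc(sizeof(*d), GFP_ATOMIC);
        if (!d)
            return;
        d->pid = current->pid;
        spin_lock(&task_table_lock);
        hash_add_rcu(task_table, &d->node, d->pid);
        spin_unlock(&task_table_lock);
    }
    /* Safe without a lock: only this task ever writes or removes its own entry. */
    d->enter_payload = payload;
}

/* sys_exit probe: read back what the matching sys_enter stored. */
static void on_sys_exit(void)
{
    struct task_data *d;

    rcu_read_lock();
    d = find_task_data(current->pid);
    if (d)
        pr_debug("sys_enter stored %lu\n", d->enter_payload);
    rcu_read_unlock();
}

/* On process exit (e.g. from a sched_process_exit probe):
 *   spin_lock(&task_table_lock); hash_del_rcu(&d->node);
 *   spin_unlock(&task_table_lock); kfree_rcu(d, rcu);
 */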

Related

boost interprocess does not clean up after itself

I am using boost::interprocess to attempt to share a block of memory between >2 processes. I am allocating the memory using:
std::unique_ptr<boost::interprocess::managed_shared_memory> tableStorage_;
When running the code inside docker/podman, I have to run with --ipc=host to be able to execute the code, else it will just happily sit there waiting forever. Not sure for what though.
I am seeing the same behavior inside and outside docker/podman. Sometimes when the code exits it doesn't seem to clean up /dev/shm, even if it is the last process with a hold on that memory. Is there a way to make sure that /dev/shm gets cleaned out when the process exits and it is the last process holding onto the file in /dev/shm?
Thanks!
That's something your program can/should take care of.
Boost Interprocess (famously) doesn't have a portable robust-lock implementation, meaning that unless you shut down gracefully, locks might still be held, leading to potential deadlock.
I'd suggest using a timed open, guarded with an unconditional T::remove. Since that is a destructive operation, perhaps you want to only do it when a certain flag is set (e.g. --force)
To detect whether your process is last, you could use shared pointers.
See also e.g. Boost interprocess shared memory delete object without destroy
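As a rough illustration of what that remove boils down to: Boost.Interprocess shared memory on Linux is backed by a file under /dev/shm, and removing the segment essentially unlinks that file. This plain-POSIX sketch shows the same "remove unconditionally, then create" pattern suggested above; the segment name is made up, and the destructive unlink is exactly the part you would guard behind a --force flag.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Shows up as /dev/shm/my_table_storage on Linux. */
    const char *name = "/my_table_storage";

    /* Destructive: wipe whatever a crashed previous run left behind. */
    shm_unlink(name);

    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0) {
        perror("shm_open");
        return 1;
    }
    if (ftruncate(fd, 4096) < 0) {
        perror("ftruncate");
        return 1;
    }

    /* ... map and use the segment ... */

    close(fd);
    /* The last process out should unlink again so /dev/shm stays clean. */
    shm_unlink(name);
    return 0;
}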

Force malloc to pre-fault/MAP_POPULATE/MADV_WILLNEED all allocations for an entire program/process

For the sake of some user-space performance profiling, I'd like to cleanly separate the costs of allocating memory from operations that access it. The application does no over-allocation, so every page that gets mapped will be faulted in, probably in code that runs shortly after its allocation.
What I'd like to do is set some flag, environment variable, something, to tell malloc that it should uniformly do the equivalent of calling mmap(..., MAP_POPULATE) or madvise(..., MADV_WILLNEED) or just touching every page of whatever it allocated itself. I haven't found any documentation, on any platform(!), that describes a way to do this. Is there some existing technique that's utterly undocumented, up to my ability to search? Is this a fundamentally misguided or bad idea?
If I wanted to implement this myself, I'm thinking of an LD_PRELOAD including just a reimplementation of malloc that calls the underlying malloc and then does the madvise thing (to be at least somewhat agnostic to huge pages behavior). Any reason that shouldn't work?
malloc is one of the most heavily used, yet relatively slow, library functions. As a result, it has received a lot of optimization attention over the years. I seriously doubt that any serious implementation of malloc does anything as slow as the string parsing that would be required to check an environment variable on every call.
LD_PRELOAD is not a bad idea; considering what you're doing, you wouldn't even need to recompile to switch between profile and release builds. If you're open to recompiling, I would suggest #define-ing malloc to a small wrapper that calls the underlying malloc and then does the mmap/madvise step. You could even do this on the compile command line via -Dmalloc=... (so long as the system malloc is not itself a define, which would overwrite the one from the command line).
Another option would be to find/implement a program that uses the debug interface to intercept and redirect calls to malloc. You could theoretically do this by messing with the post-compiled (or post-load) program's import section to point to your dll/so file.
Edit: On second thought, the define might not catch every allocation, since allocations are often generated implicitly by the compiler (e.g. new).
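A rough sketch of the LD_PRELOAD interposer described in the question might look like the following (the file and soname are made up). Real interposers also have to deal with the re-entrancy dlsym itself can cause, which is glossed over here, and the same wrapper could simply touch each page instead of calling madvise if MADV_WILLNEED turns out not to fault anonymous pages in aggressively enough.

/* Build:  gcc -shared -fPIC -o prefault.so prefault.c -ldl
 * Run:    LD_PRELOAD=./prefault.so ./your_program            */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static void *(*real_malloc)(size_t);

void *malloc(size_t size)
{
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    void *p = real_malloc(size);
    if (p && size > 0) {
        /* madvise wants a page-aligned start, so round the range outward. */
        uintptr_t pagesz = (uintptr_t)sysconf(_SC_PAGESIZE);
        uintptr_t start  = (uintptr_t)p & ~(pagesz - 1);
        size_t len       = ((uintptr_t)p + size) - start;
        madvise((void *)start, len, MADV_WILLNEED);
    }
    return p;
}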

How to identify a process in Windows? Kernel and User mode

In Windows, what is the formal way of identifying a process uniquely? I am not talking about the PID, which is allocated dynamically, but a unique ID or name that is permanent for that process. I know that every program/process has a security descriptor, but it seems to hold SIDs for the logged-in user and group (not the process). We cannot use the path and name of the executable the process was started from, as that can change.
My aim is to identify a process in the kernel mode and allow it to perform certain operation. What is the easiest and best way of doing this?
Your question is too vague to answer properly. For example how could the path possibly change (without poking around in kernel memory) after creation of a process? And yes, I am aware that one could hook into the memory-mapping process during process creation to replace the image originally destined to be loaded with another. Point is that a process is merely one instance of running a given executable. And it's not clear what exact tampering attempts you want to counter here.
But from kernel mode you do have the ability to simply use the pointer to the EPROCESS structure. No need to use the PID, although that will be unique while the process is still alive.
So assuming your process uses an IRP to communicate to the driver (whether it be WriteFile, ReadFile, DeviceIoControl or something more exotic), in order to register itself, you can use IoGetCurrentProcess to get the PEPROCESS value which will be unique to the process.
While the structure itself is not officially documented, hints can be gleaned from the "Windows Internals" book (in its various incarnations), the dt (Display Type) command in WinDbg (and friends) as well as from third-party resources on the internet (e.g. here, specific to Vista).
The process objects are kept in several linked lists. So if you know the (officially undocumented!!!) layout for a particular OS version, you may traverse the lists to get from one to the next process object (i.e. EPROCESS structure).
Cautionary notes
Make sure to reference the process object, using the respective object manager routines. Otherwise you cannot be certain it's safe either to reach into these structures (which is unsafe anyway, since you cannot rely on their layout across OS versions) or to pass the pointer to functions that expect a PEPROCESS.
As a side-note: Harry Johnston is of course right to assert that a privileged user can insert arbitrary (well almost arbitrary) code into the TCB in order to thwart your protective measures. In the end it is going to be an arms race.
Also keep in mind that similar to PIDs, theoretically the value of the PEPROCESS may be recycled. But in both cases you can simply counter this by invalidating whatever internal state you keep in your driver that allows the process to do its magic, whenever the process goes down. Using something like PsSetCreateProcessNotifyRoutine would seem to be a good method here. In order to translate your process handle from the callback to a PEPROCESS value, use ObReferenceObjectByHandle.
An alternative way of countering recycling of the PID/PEPROCESS is to keep a reference to the process object and thus keep it in a kind of undead state (similar to not closing a handle in user mode), even though its main thread may have finished.
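A bare-bones sketch of the registration side, under the assumptions above: the routine names RegisterCallingProcess and IsCallerAllowed are invented for illustration, while IoGetCurrentProcess, ObReferenceObject, PsGetProcessId and PsSetCreateProcessNotifyRoutine are the real WDK routines. Synchronization around the global is omitted for brevity.

#include <ntddk.h>

static PEPROCESS g_TrackedProcess;   /* the one process we allow */

/* Call from the IOCTL dispatch handling the "register me" request;
 * the current process is then the one that sent the IRP. */
VOID RegisterCallingProcess(VOID)
{
    PEPROCESS proc = IoGetCurrentProcess();

    ObReferenceObject(proc);         /* keep the EPROCESS from being recycled */
    g_TrackedProcess = proc;
}

/* In any later dispatch routine the check is a plain pointer compare. */
BOOLEAN IsCallerAllowed(VOID)
{
    return (BOOLEAN)(g_TrackedProcess != NULL &&
                     IoGetCurrentProcess() == g_TrackedProcess);
}

/* Registered with PsSetCreateProcessNotifyRoutine(ProcessNotify, FALSE)
 * in DriverEntry: drop our state when the registered process exits. */
VOID ProcessNotify(HANDLE ParentId, HANDLE ProcessId, BOOLEAN Create)
{
    UNREFERENCED_PARAMETER(ParentId);

    if (!Create && g_TrackedProcess != NULL &&
        ProcessId == PsGetProcessId(g_TrackedProcess)) {
        ObDereferenceObject(g_TrackedProcess);
        g_TrackedProcess = NULL;
    }
}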

What does __rcu stand for in Linux?

I am new to the Linux kernel. My question is about task_struct.
I know that each task_struct has a reference to its parent process via a pointer to the task_struct of the parent.
After looking at the task_struct definition in sched.h, I noticed the following:
struct task_struct __rcu *real_parent; /* real parent process */
I found that it is defined in compiler.h. I guess that "__rcu" stands for "read copy update".
Can someone clarify the syntax?
Read-copy-update is an algorithm that lets multiple readers access a data structure concurrently without having to lock it. It can be read about here.
If the kernel is built with the CONFIG_SPARSE_RCU_POINTER config option, __rcu is defined in include/linux/compiler.h as
# define __rcu __attribute__((noderef, address_space(4)))
This is an annotation for the Sparse code analysis tool, which can warn about certain things the programmer may have overlooked. How this is relevant to RCU is explained in Documentation/RCU/checklist.txt:
__rcu sparse checks: tag the pointer to the RCU-protected data
structure with __rcu, and sparse will warn you if you
access that pointer without the services of one of the
variants of rcu_dereference().
rcu_dereference() returns a pointer that can be safely dereferenced by the code and documents the programmer's intention to protect the pointer with the RCU mechanism, enabling tools like Sparse to check for programming errors and omissions.
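As a small illustration, kernel code reading the __rcu-annotated real_parent pointer would do something along these lines; without rcu_dereference(), Sparse flags the access:

#include <linux/rcupdate.h>
#include <linux/sched.h>

static pid_t real_parent_pid(struct task_struct *task)
{
    struct task_struct *parent;
    pid_t ppid;

    rcu_read_lock();                              /* enter read-side critical section */
    parent = rcu_dereference(task->real_parent);  /* strips the __rcu annotation safely */
    ppid = parent->pid;                           /* valid while the read lock is held */
    rcu_read_unlock();

    return ppid;
}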
RCU stands for "read, copy, update". It is an algorithm that allows multiple readers to access data which can be updated or even deleted at the same time by writers.
Under RCU, writers still have to ensure mutual exclusion with regard to one another, but readers do not acquire a lock. Care has to be taken that the shared data structure is updated in ways that do not violate read integrity. If something has to be removed or deleted, the unlinking of that item from the data structure can be done in parallel with the readers but the actual deletion of the memory has to wait until the last reader has finished.
Rather than making the readers acquire a lock, the whereabouts of the readers are inferred in other ways. Threads can announce their intent to browse the data structure by joining a "read side critical section" which is not really a lock but a kind of global phase.
For instance, suppose that some threads entered the RCU read-side critical section in phase 0. An updater has performed a deletion and wants to free a piece of memory. It simply has to wait for all threads in the system to vacate phase 0. Meanwhile, other readers are already looking at the data structure, but when they declare their intent to RCU, they do so by entering the RCU read-side critical section under phase 1. Only the phase 0 threads can possibly still have a pointer to the object that was removed, and so when the last thread leaves phase 0, the memory can safely be freed. Newly arriving threads in phase 1 do not see the object, because it has been removed from the data structure, so they have no way to find it.
RCU takes advantage of the idea that we do not need lock objects that are "owned" in order to know information like "no thread can be accessing this object any more".
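A minimal sketch of that idea, using a made-up list of item structs: find_value walks the list inside a read-side critical section, while remove_item unlinks an element under the writers' lock and waits for the grace period before freeing it.

#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct item {
    int value;
    struct list_head list;
};

static LIST_HEAD(item_list);
static DEFINE_SPINLOCK(item_lock);       /* writers still exclude each other */

/* Reader: no lock, just the read-side critical section. */
static int find_value(int value)
{
    struct item *it;
    int found = 0;

    rcu_read_lock();
    list_for_each_entry_rcu(it, &item_list, list) {
        if (it->value == value) {
            found = 1;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}

/* Writer: unlink, then wait for every pre-existing reader before freeing. */
static void remove_item(struct item *it)
{
    spin_lock(&item_lock);
    list_del_rcu(&it->list);             /* readers already in "phase 0" may still see it */
    spin_unlock(&item_lock);

    synchronize_rcu();                   /* wait until every such reader has left */
    kfree(it);                           /* now nobody can be holding a pointer to it */
}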

free mem as function of command 'purge'

One of my apps needs a function that frees inactive/used/wired memory, just like the 'purge' command.
I have checked and googled a lot, but cannot get any hits.
Any comments are welcome.
Purge doesn't do what you seem to think it does. It doesn't "free inactive/used/wired memory". As the manpage says:
It does not affect anonymous memory that has been allocated through malloc, vm_allocate, etc.
All it does is purge the disk cache. This is only useful if you're running performance tests and want to simulate the effects of "first run after cold boot" without actually cold booting. Again, from the manpage:
Purge can be used to approximate initial boot conditions with a cold disk buffer cache for performance analysis.
There is no public API for this, although a quick scan of the symbols shows that it seems to call a function CPOSXPurgeAllDiskBuffers from the CoreProfile private framework. I believe the underlying kernel and userland disk cache code is all or mostly available on http://www.opensource.apple.com, so you could probably implement the same thing yourself, if you really want to.
As iMysak says, you can just exec (or NSTask, etc.) the tool if you want to.
As a side note, if you could free used/wired memory, presumably that memory is used by something; even if you don't have pointers into it in your own data structures, malloc probably does. Are you trying to segfault your code?
Freeing inactive memory is a different story. Just freeing something up to malloc doesn't necessarily make malloc return it to the OS. And there's no way you can force it to. If you think about the way traditional UNIX works, it makes sense: When you ask it to allocate more memory, it uses sbrk to expand your data segment; if you free up memory at the top, it can sbrk back down, but if you free up memory in the middle, there's no way it can do that. Of course modern UNIX systems don't work that way, but the POSIX and C APIs are all designed to be compatible with systems that do. So, if you want to make sure memory gets freed, you have to handle memory allocation directly.
The simplest and most portable way to do this is to create and mmap a temporary backing file, or just MAP_ANON, and explicitly unmap pages when you're done with them. (This works on all POSIX systems—and, with a pretty simple wrapper, even Windows.) If you need even more control (e.g., to manually handle flushing pages to disk, etc.), you can use the mach/mach_vm.h APIs.
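For example, a minimal MAP_ANON version of that idea looks like this; unlike free()ing a malloc'd block, the munmap hands the pages straight back to the OS:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 1024 * 1024;              /* 16 MiB */

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(buf, 0, len);                        /* use the memory */

    munmap(buf, len);                           /* pages go back to the OS immediately */
    return 0;
}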
You can directly run it from your app with the exec() function.
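In other words, something as simple as this (assuming the purge tool is on the PATH and you have the required privileges):

#include <stdlib.h>

int main(void)
{
    return system("purge");   /* runs the same tool you would run by hand */
}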
