How to write to protected pages in the Linux kernel? - linux-kernel

I am trying to add a syscall in a module. My rationale is:
This is for a research project, so the exact implementation does not matter.
Adding syscalls in the kernel-core takes a prohibitively long time to re-compile. I can suck up compiling once with an expanded syscall table, but not every time. Even with incremental compiling, linking and archiving the final binary takes a long time.
Since the project is timing sensitive, using kprobes to intercept the syscall handler would slow down the syscall handler.
I am still open to other means of adding a syscall, but for the above reasons, I think that writing to the sys_call_table in a module is the cleanest way to do what I am trying to do.
I've gotten the address of the syscall table from the System.map, disabled kaslr, and I am trying to clear the page protections, but some write-protection is still holding me back.
// following https://web.iiit.ac.in/~arjun.nath/random_notes/modifying_sys_call.html
// clear cr0 write protection
write_cr0 (read_cr0 () & (~ 0x10000));
// clear page write protection
sys_call_table_page = virt_to_page(&sys_call_table[__NR_execves]);
set_pages_rw(sys_call_table_page, 1);
// do write
sys_call_table[__NR_execves] = sys_execves;
However, I'm still getting a permission error, but I don't know the mechanism by which it is enforced:
[ 11.145647] ------------[ cut here ]------------
[ 11.148893] CR0 WP bit went missing!?
[ 11.151539] WARNING: CPU: 0 PID: 749 at arch/x86/kernel/cpu/common.c:386 native_write_cr0+0x3e/0x70
...
Here was a call trace pointing to the write of sys_call_table
...
[ 11.332825] ---[ end trace c20c95651874c08b ]---
[ 11.336056] CPA protect Rodata RO: 0xffff888002804000 - 0xffff888002804fff PFN 2804 req 8000000000000063 prevent 0000000000000002
[ 11.343934] CPA protect Rodata RO: 0xffffffff82804000 - 0xffffffff82804fff PFN 2804 req 8000000000000163 prevent 0000000000000002
[ 11.351720] BUG: unable to handle page fault for address: ffffffff828040e0
[ 11.356418] #PF: supervisor write access in kernel mode
[ 11.359908] #PF: error_code(0x0003) - permissions violation
[ 11.363665] PGD 3010067 P4D 3010067 PUD 3011063 PMD 31e29063 PTE 8000000002804161
[ 11.368701] Oops: 0003 [#1] SMP KASAN PTI
Any guesses on how to disable it?

There is a way that does not require recompiling the kernel. Since write_cr0 detects whether the WP bit has been modified, you can provide your own function that writes CR0 without that check.
inline void mywrite_cr0(unsigned long cr0) {
asm volatile("mov %0,%%cr0" : "+r"(cr0), "+m"(__force_order));
}
Here are the functions that enable and disable write protection. They use mywrite_cr0 instead of write_cr0:
void enable_write_protection(void) {
unsigned long cr0 = read_cr0();
set_bit(16, &cr0);
mywrite_cr0(cr0);
}
void disable_write_protection(void) {
unsigned long cr0 = read_cr0();
clear_bit(16, &cr0);
mywrite_cr0(cr0);
}
In your mod_init function, you can use kallsyms_lookup_name("sys_call_table") to figure out the address of sys_call_table at runtime, instead of compile time. Fortunately, we can now directly write to sys_call_table without dealing with pageattr.
The code below is tested on Linux Kernel 5.1.4
inline void mywrite_cr0(unsigned long cr0) {
asm volatile("mov %0,%%cr0" : "+r"(cr0), "+m"(__force_order));
}
void enable_write_protection(void) {
unsigned long cr0 = read_cr0();
set_bit(16, &cr0);
mywrite_cr0(cr0);
}
void disable_write_protection(void) {
unsigned long cr0 = read_cr0();
clear_bit(16, &cr0);
mywrite_cr0(cr0);
}
static struct {
void **sys_call_table;
void *orig_fn;
} tinfo;
static int __init mod_init(void) {
printk(KERN_INFO "Init syscall hook\n");
tinfo.sys_call_table = (void **)kallsyms_lookup_name("sys_call_table");
tinfo.orig_fn = tinfo.sys_call_table[your_syscall_num];
disable_write_protection();
// modify sys_call_table directly
tinfo.sys_call_table[your_syscall_num] = sys_yourcall;
enable_write_protection();
return 0;
}
static void __exit mod_cleanup(void) {
printk(KERN_INFO "Cleaning up syscall hook.\n");
// restore the original syscall
disable_write_protection();
tinfo.sys_call_table[your_syscall_num] = tinfo.orig_fn;
enable_write_protection();
printk(KERN_INFO "Cleaned up syscall hook.\n");
}
module_init(mod_init);
module_exit(mod_cleanup);

The kernel has code to protect against this sort of action.
First, the kernel by default does not allow you to remove write protection from the cr0 register. It checks that in arch/x86/kernel/cpu/common.c:native_write_cr0
if (static_branch_likely(&cr_pinning)) {
if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
bits_missing = X86_CR0_WP;
val |= bits_missing;
goto set_register;
}
/* Warn after we've set the missing bits. */
WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
}
Second, the page-attribute code does not allow you to change a page that should be read-only to read-write. That check is in arch/x86/mm/pageattr.c:static_protections
/* Check the PFN directly */
res = protect_rodata(pfn, pfn + npg - 1);
check_conflict(warnlvl, prot, res, start, end, pfn, "Rodata RO");
forbidden |= res;
If you disable these two checks by removing both blobs, the code to change the pagetable works.

It is possible to just remap the sys_call_table as read-write using the set_memory_rw function, so it's possible to write to it without disabling write protection for the whole kernel. Used this method myself on aarch64, not sure if it works on x86.

Related

how to read a register in device driver?

In a Linux device driver, in the init function for the device, I tried reading an address (belonging to an SMMUv3 device on arm64) like below.
uint8_t *addr1;
addr1 = ioremap(0x09050000, 0x20000);
printk("SMMU_AIDR : 0x%X\n", *(addr1 + 0x1c));
but I get Internal error: synchronous external abort: 96000010 [#1] SMP error.
Is it not permitted to map an address to virtual address using ioremap and just reading that address?
I gave a fixed value 0x78789a9a to SMMU IDR[2] register. (at offset 0x8, 32 bit register. This is possible because it's qemu.)
SMMU starts at 0x09050000 and it has address space 0x20000.
__iomem uint32_t *addr1 = NULL;
static int __init my_driver_init(void)
{
...
addr1 = ioremap(0x09050000, 0x20000); // smmuv3
printk("SMMU_IDR[2] : 0x%X\n", readl(addr1 +0x08/4));
..}
This is the output when the driver is initialized. (The value is read OK.)
[ 453.207261] SMMU_IDR[2] : 0x78789A9A
The first problem was that the access width was wrong for that address. Before, the pointer was defined as uint8_t *addr1; and I used printk("SMMU_AIDR : 0x%X\n", *(addr1 + 0x1c)), so it was reading a single byte, which the SMMU model does not allow.
The second problem (I think this didn't cause the trap, because arm64 uses memory-mapped I/O) was that I used plain memory accesses (pointer dereferencing) for memory-mapped I/O registers. As people commented, I should have used the readl function, mainly to make the code portable: readl also works on I/O-mapped platforms like x86_64, where using the MMIO address as a pointer will not work. I later found that readl takes care of the memory barrier problem too.
ADD: I changed volatile to __iomem for the variable addr1 (thanks @0andriy).

Kernel Error while using memcpy in the kernel module to change the content of the buffer passed as argument to the exported function

I have two kernel modules: the first module exports one function, and the second module uses this function to read SPI data. A sample program is given below.
Module-1:
int spi_fun(uint8_t *tx_buf, uint8_t *rx_buf,int len)
{
spi_sync_txrx(tx_buf,rx_buf,len);
}
Module-2:
void dummy_fun()
{
uint8_t tx[4]={0};
uint8_t rx[4]={0};
spi_fun(tx,rx,4);
}
The above scenario works fine. If I declare a local rx buffer (spi_data[4]) inside spi_fun() and use memcpy to copy the spi_data contents to rx_buf, the kernel crashes with the error given below.
New Module-1 function:
int spi_fun(uint8_t *tx_buf, uint8_t *rx_buf,int len)
{
uint8_t spi_data[4];
spi_sync_txrx(tx_buf,spi_data,len);
memcpy(rx_buf, spi_data, len); //here error
}
Kernel Error:
Internal error: Accessing user space memory outside uaccess.h routines: 96000045 [#1] PREEMPT SMP
I have tried the copy_from_user/copy_to_user functions, but then the target buffer contained only zeroes.
Has anyone experienced this issue?

Trap memory accesses inside a standard executable built with MinGW

So my problem sounds like this.
I have some platform dependent code (embedded system) which writes to some MMIO locations that are hardcoded at specific addresses.
I compile this code with some management code inside a standard executable (mainly for testing) but also for simulation (because it takes longer to find basic bugs inside the actual HW platform).
To work around the hardcoded pointers, I just redefine them to point at variables inside a memory pool. And this works really well.
The problem is that there is specific hardware behavior on some of the MMIO locations (w1c for example) which makes "correct" testing hard to impossible.
These are the solutions I thought of:
1 - Somehow redefine the accesses to those registers and try to insert some immediate function to simulate the dynamic behavior. This is not really usable since there are various ways to write to the MMIO locations (pointers and stuff).
2 - Somehow leave the addresses hardcoded and trap the illegal access through a seg fault, find the location that triggered, extract exactly where the access was made, handle and return. I am not really sure how this would work (and even if it's possible).
3 - Use some sort of emulation. This will surely work, but it will void the whole purpose of running fast and native on a standard computer.
4 - Virtualization ?? Probably will take a lot of time to implement. Not really sure if the gain is justifiable.
Does anyone have any idea if this can be accomplished without going too deep? Maybe is there a way to manipulate the compiler in some way to define a memory area for which every access will generate a callback. Not really an expert in x86/gcc stuff.
Edit: It seems that it's not really possible to do this in a platform-independent way, and since it will be Windows only, I will use the available API (which seems to work as expected). I found this question here:
Is set single step trap available on win 7?
I will put the whole "simulated" register file inside a number of pages, guard them, and trigger a callback from which I will extract all the necessary info, do my stuff, then continue execution.
Thanks all for responding.
I think #2 is the best approach. I routinely use approach #4, but I use it to test code that is running in the kernel, so I need a layer below the kernel to trap and emulate the accesses. Since you have already put your code into a user-mode application, #2 should be simpler.
The answers to this question may provide help in implementing #2. How to write a signal handler to catch SIGSEGV?
What you really want to do, though, is to emulate the memory access and then have the segv handler return to the instruction after the access. This sample code works on Linux. I'm not sure if the behavior it is taking advantage of is undefined, though.
#include <stdint.h>
#include <stdio.h>
#include <signal.h>
#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)
static uint32_t read_reg(volatile uint32_t *reg_addr)
{
uint32_t r;
asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
return r;
}
static void segv_handler(int, siginfo_t *, void *);
int main()
{
struct sigaction action = { 0, };
action.sa_sigaction = segv_handler;
action.sa_flags = SA_SIGINFO;
sigaction(SIGSEGV, &action, NULL);
// force sigsegv
uint32_t a = read_reg(REG_ADDR);
printf("after segv, a = %d\n", a);
return 0;
}
static void segv_handler(int, siginfo_t *info, void *ucontext_arg)
{
ucontext_t *ucontext = static_cast<ucontext_t *>(ucontext_arg);
ucontext->uc_mcontext.gregs[REG_RAX] = 1234;
ucontext->uc_mcontext.gregs[REG_RIP] += 2;
}
The code to read the register is written in assembly to ensure that both the destination register and the length of the instruction are known.
This is what the Windows version of prl's answer could look like:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>
#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)
static uint32_t read_reg(volatile uint32_t *reg_addr)
{
uint32_t r;
asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
return r;
}
static LONG WINAPI segv_handler(EXCEPTION_POINTERS *);
int main()
{
SetUnhandledExceptionFilter(segv_handler);
// force sigsegv
uint32_t a = read_reg(REG_ADDR);
printf("after segv, a = %d\n", a);
return 0;
}
static LONG WINAPI segv_handler(EXCEPTION_POINTERS *ep)
{
// only handle read access violation of REG_ADDR
if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION ||
ep->ExceptionRecord->ExceptionInformation[0] != 0 ||
ep->ExceptionRecord->ExceptionInformation[1] != (ULONG_PTR)REG_ADDR)
return EXCEPTION_CONTINUE_SEARCH;
ep->ContextRecord->Rax = 1234;
ep->ContextRecord->Rip += 2;
return EXCEPTION_CONTINUE_EXECUTION;
}
So, the solution (code snippet) is as follows:
First of all, I have a variable:
__attribute__ ((aligned (4096))) int g_test;
Second, inside my main function, I do the following:
AddVectoredExceptionHandler(1, VectoredHandler);
DWORD old;
VirtualProtect(&g_test, 4096, PAGE_READWRITE | PAGE_GUARD, &old);
The handler looks like this:
LONG WINAPI VectoredHandler(struct _EXCEPTION_POINTERS *ExceptionInfo)
{
static ULONG_PTR last_addr; /* ExceptionInformation[] is ULONG_PTR; DWORD would truncate on 64-bit */
if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) {
last_addr = ExceptionInfo->ExceptionRecord->ExceptionInformation[1];
ExceptionInfo->ContextRecord->EFlags |= 0x100; /* Single step to trigger the next one */
return EXCEPTION_CONTINUE_EXECUTION;
}
if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) {
DWORD old;
VirtualProtect((PVOID)(last_addr & ~PAGE_MASK), 4096, PAGE_READWRITE | PAGE_GUARD, &old);
return EXCEPTION_CONTINUE_EXECUTION;
}
return EXCEPTION_CONTINUE_SEARCH;
}
This is only a basic skeleton for the functionality. Basically, I guard the page on which the variable resides, and I keep linked lists holding pointers to the callback functions and values for the addresses in question. I check that the fault-generating address is in my list, then I trigger the callback.
On the first guard hit, the system disables the page protection, but I can call my PRE_WRITE callback, where I can save the variable state. Because a single step is requested through EFlags, a single-step exception follows immediately (which means that the variable was written), and I can trigger a WRITE callback. All the data required for the operation is contained in the ExceptionInformation array.
When someone tries to write to that variable:
*(int *)&g_test = 1;
A PRE_WRITE followed by a WRITE will be triggered.
When I do:
int x = *(int *)&g_test;
A READ will be issued.
In this way i can manipulate the data flow in a way that does not require modifications of the original source code.
Note: This is intended to be used as part of a test framework and any penalty hit is deemed acceptable.
For example, W1C (Write 1 to clear) operation can be accomplished:
void MYREG_hook(reg_cbk_t type)
{
/** We need to save the pre-write state
* This is safe since we are assured to be called with
* both PRE_WRITE and WRITE in the correct order
*/
static int pre;
switch (type) {
case REG_READ: /* Called pre-read */
break;
case REG_PRE_WRITE: /* Called pre-write */
pre = g_test;
break;
case REG_WRITE: /* Called after write */
g_test = pre & ~g_test; /* W1C */
break;
default:
break;
}
}
This was also possible with seg-faults on illegal addresses, but I had to issue one for each read or write and keep track of a "virtual register file", so the penalty was bigger. With guard pages I can guard only specific areas of memory, or none, depending on the registered monitors.

Why is this loop not executed?

I compile with GCC 5.3 2016q1 for STM32 microcontroller.
Right at the beginning of main I placed a small routine to fill stack with a pattern. Later I search the highest address that still holds this pattern to find out about stack usage, you surely know this. Here is my routine:
uint32_t* Stack_ptr = 0;
uint32_t Stack_bot;
uint32_t n = 0;
asm volatile ("str sp, [%0]" :: "r" (&Stack_ptr));
Stack_bot = (uint32_t)(&_estack - &_Min_Stack_Size);
//*
n = 0;
while ((uint32_t)(Stack_ptr) > Stack_bot)
{
Stack_ptr--;
n++;
*Stack_ptr = 0xAA55A55A;
} // */
After that I initialize hardware, also a UART and print out values of Stack_ptr, Stack_bot and n and then stack contents.
The results are 0x20007FD8 0x20007C00 0
Stack_bot is the expected value because I have 0x400 Bytes in 32k RAM starting at 0x20000000. But I would expect Stack_ptr to be 0x20008000 and n somewhat under 0x400 after the loop is finished. Also stack contents shows no entries of 0xAA55A55A. This means the loop is not executed.
I could only manage to get it executed by creating a small function that holds the above routine and disable optimization for this function.
Does anybody know why that is? And the strangest thing about it is that I could swear it worked a few days ago. I saw a lot of 0xAA55A55A in the stack dump.
Thanks a lot
Martin
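The problem is probably the inline assembly. The asm statement stores to Stack_ptr, but it declares neither an output operand nor a "memory" clobber, so the compiler may assume Stack_ptr is still 0 and drop the loop. In my code I use this: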
// defined by linker script, pointing to end of static allocation
extern unsigned _heap_start;
void fill_heap(unsigned fill=0x55555555) {
unsigned *dst = &_heap_start;
register unsigned *msp_reg;
__asm__("mrs %0, msp\n" : "=r" (msp_reg) );
while (dst < msp_reg) {
*dst++ = fill;
}
}
it will fill memory between _heap_start and current SP.

Problems doing syscall hooking

I use the following module code to hook a syscall (the code is credited to someone else, e.g., Linux Kernel: System call hooking example).
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/unistd.h>
#include <asm/semaphore.h>
#include <asm/cacheflush.h>
void **sys_call_table;
asmlinkage int (*original_call) (const char*, int, int);
asmlinkage int our_sys_open(const char* file, int flags, int mode)
{
printk(KERN_ALERT "A file was opened\n");
return original_call(file, flags, mode);
}
int set_page_rw(long unsigned int _addr)
{
struct page *pg;
pgprot_t prot;
pg = virt_to_page(_addr);
prot.pgprot = VM_READ | VM_WRITE;
return change_page_attr(pg, 1, prot);
}
int init_module()
{
// sys_call_table address in System.map
sys_call_table = (void*)0xffffffff804a1ba0;
original_call = sys_call_table[1024];
set_page_rw(sys_call_table);
sys_call_table[1024] = our_sys_open;
return 0;
}
void cleanup_module()
{
// Restore the original call
sys_call_table[1024] = original_call;
}
When I insmod the compiled .ko file, the terminal prints "Killed". When I look at the output of 'cat /proc/modules', the module is shown in the Loading state.
my_module 10512 1 - Loading 0xffffffff882e7000 (P)
As expected, I cannot rmmod this module, as it complains that it's in use. I rebooted the system to get a clean slate.
Later on, after commenting out two lines in the above source, sys_call_table[1024] = our_sys_open; and sys_call_table[1024] = original_call;, the module can be loaded with insmod successfully. More interestingly, after uncommenting these two lines (changing back to the original code), the compiled module can also be loaded successfully. I don't quite understand why this happens. Is there any way to compile the code and insmod it successfully on the first try?
I did all this on Redhat with linux kernel 2.6.24.6.
I think you should take a look at the kprobes API, which is well documented in Documentation/kprobes.txt. It gives you the ability to install a handler on any address (e.g. a syscall entry) so that you can do what you want. An added bonus is that your code would be more portable.
If you're only interested in tracing those syscalls you can use the audit subsystem, coding your own userland daemon which will be able to receive events on a NETLINK socket from the audit kthread. libaudit provides a simple API to register/read events.
If you do have a good reason for not using kprobes/audit, I would suggest that you check that the address you are trying to write to is not beyond the page that you set writable. A quick calculation shows that:
offset_in_sys_call_table * sizeof(*sys_call_table) = 1024 * 8 = 8192
which is two pages after the one you set writable if you are using 4K pages.
