I'm trying to play around with the local APIC functions in the 2.6.32.40 linux kernel, but I am having some issues. I want to try to send a Non-Maskable Interrupts (NMI) to all of the processors on my system (I am using a Intel i7 Q740). First I read the documentation in Intel's Software Developer's Manual Volume 3 related to the APIC functions. It states that interrupts can be broadcast to all processors through the use of the Interrupt Command Register (ICR) located at address 0xFEE00300. So I wrote a kernel module with the following init function to try to write to this register:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>
MODULE_LICENSE("GPL");
#define SUCCESS 0
#define ICR_ADDRESS 0xFEE00300
#define ICR_PROGRAM 0x000C4C89
static int icr_init(void){
int * ICR = (int *)ICR_ADDRESS;
printk(KERN_ALERT "Programing ICR\n");
*ICR = ICR_PROGRAM;
return SUCCESS;
}
static void icr_exit(void){
printk(KERN_ALERT "Removing ICR Programing module removed");
}
module_init(icr_init);
module_exit(icr_exit);
However, when I insmod this module the kernel crashes and complains about being unable to handle the paging request # address 00000000fee00300. Looking under /proc/iomem I see that this address is in a ranged marked as "reserved"
fee00000-fee00fff : reserved
I've also tried using the functions under :
static inline void __default_local_send_IPI_allbutself(int vector)
but the kernel is still throwing "unable to handle paging request" messages and crashing. Does anyone have any suggestions? Why is this memory range marked as "reserved" and not marked as being used by the local APIC? Thanks in advance.
The APIC address is a physical memory address, but you are trying to access it as a linear memory address - that's why your first approach doesn't work. The memory is marked as "reserved" precisely because it belongs to the APIC, rather than real memory.
You should use the internal kernel functions. To do so, you should include <asm/apic.h> and use:
apic->send_IPI_allbutself(vector);
Related
Coming from the Windows world, I assume that Vmlinuz is equivalent to ntoskrnl.exe, and this is the kernel executable that gets mapped in Kernel memory.
Now I want to figure out whether an address inside kernel belongs to the kernel executable or not. Is using core_kernel_text the correct way of finding this out?
Because core_kernel_text doesn't return true for some of the addresses that clearly should belong to Linux kernel executable.
For example the core_kernel_text doesn't return true when i give it the syscall entry handler address which can be found with the following code:
unsigned long system_call_entry;
rdmsrl(MSR_LSTAR, system_call_entry);
return (void *)system_call_entry;
And when I use this code snippet, the address of the syscall entry handler doesn't belong to the core kernel text or to any kernel module (using get_module_from_addr).
So how can an address for a handler that clearly belongs to Linux kernel executable such as syscall entry, don't belong to neither the core kernel or any kernel module? Then what does it belong to?
Which API do I need to use for these type of addresses that clearly belong to Linux kernel executable to assure me that the address indeed belongs to kernel?
I need such an API because I need to write a detection for malicious kernel modules that patch such handlers, and for now I need to make sure the address belongs to kernel, and not some third party kernel module or random kernel address. (Please do not discuss methods that can be used to bypass my detection, obviously it can be bypassed but that's another story)
The target kernel version is 4.15.0-112-generic, and is Ubuntu 16.04 as a VMware guest.
Reproducible code as requested:
typedef int(*core_kernel_text_t)(unsigned long addr);
core_kernel_text_t core_kernel_text_;
core_kernel_text_ = (core_kernel_text_t)kallsyms_lookup_name("core_kernel_text");
unsigned long system_call_entry;
rdmsrl(MSR_LSTAR, system_call_entry);
int isInsideCoreKernel = core_kernel_text_((unsigned long)system_call_entry);
printk("%d , 0x%pK ", isInsideCoreKernel, system_call_entry);
EDIT1: So in the MSR_LSTAR example that I gave above, it turns out that It's related to Kernel Page Table Isolation and CONFIG_RETPOLINE=y in config:
system_call value is different each time when I use rdmsrl(MSR_LSTAR, system_call)
And that's why I am getting the address 0xfffffe0000006000 aka SYSCALL64_entry_trampoline, the same as the question above.
So now the question remains, why this SYSCALL64_entry_trampoline address doesn't belong to anything? It doesn't belong to any kernel module, and it doesn't belong to the core kernel, so which executable this address belongs to and how can I check that with an API similar to core_kernel_text? It seems like it belongs to cpu_entry_area, but what is that and how can I check if an address belongs to that?
You are seeing this "weird" address in MSR_LSTAR (IA32_LSTAR) because of Kernel Page-Table Isolation (KPTI), which mitigates Meltdown. As other existing answers(1) you already found point out, the address you see is the one of a small trampoline (entry_SYSCALL_64_trampoline) that is dynamically remapped at boot time by the kernel for each CPU, and thus does not have an address within the kernel text.
(1)By the way, the answer linked above wrongly states that the corresponding config option for KPTI is CONFIG_RETPOLINE=y. This is wrong, the "retpoline" is a mitigation for Spectre, not Meltdown. The config to enable KPTI is CONFIG_PAGE_TABLE_ISOLATION=y.
You don't have many options. Either:
Tell VMWare to emulate a recent CPU that is not vulnerable to Meltdown.
Detect and implement support for the KPTI trampoline.
You can implement support for this by detecting whether the kernel supports KPTI (CONFIG_PAGE_TABLE_ISOLATION), and if so check whether current CPU has KPTI enabled. The code at kernel/cpu/bugs.c that provides information for /sys/devices/system/cpu/vulnerabilities/meltdown shows how this can be detected:
ssize_t cpu_show_meltdown(struct device *dev,
struct device_attribute *attr, char *buf)
{
if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
return sprintf(buf, "Not affected\n");
if (boot_cpu_has(X86_FEATURE_PTI))
return sprintf(buf, "Mitigation: PTI\n");
return sprintf(buf, "Vulnerable\n");
}
The actual trampoline is set up at boot and its address is stored in each CPU's "entry area" for later use (e.g. here when setting up IA32_LSTAR). This answer on Unix & Linux SE explains the purpose of the cpu entry area and its relation to KPTI.
In your module you can do the following detection:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kallsyms.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/cpufeature.h>
#include <asm/cpu_entry_area.h>
// ...
typedef int(*core_kernel_text_t)(unsigned long addr);
core_kernel_text_t core_kernel_text_;
bool syscall_entry_64_ok(void)
{
unsigned long entry;
rdmsrl(MSR_LSTAR, entry);
if (core_kernel_text_(entry))
return true;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
if (this_cpu_has(X86_FEATURE_PTI)) {
int cpu = smp_processor_id();
unsigned long trampoline = (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline;
if ((entry & PAGE_MASK) == trampoline)
return true;
}
#endif
return false;
}
static int __init modinit(void)
{
core_kernel_text_ = (core_kernel_text_t)kallsyms_lookup_name("core_kernel_text");
if (!core_kernel_text_)
return -EOPNOTSUPP;
pr_info("syscall_entry_64_ok() -> %d\n", syscall_entry_64_ok());
return 0;
}
I'm trying to disable all Interrupts. Most of them are easy, but I have problems with the Non-Maskable Interrupts (NMIs).
To disable them, I want to manipulate the LVT Registers in the Local APIC.
Currently I am testing inside a Kernel Module, cause that's the Environment, the final code should run.
How can I read/write to the memory-mapped registers of the APIC?
I've already read many articles and everyone suggested this procedure.
I also tried to directly access the *mapped pointer, which resolves in the same result.
Instead of the foo() Function I implemented a lookup for the correct address. But according to the Intel manual and my personal inspections, the APIC always get's mapped to the physical address 0xFEE00000, which is interesting, cause I also tried the program on a Virtual Machine with 2 GB RAM.
phys_addr_t apic_base_phys = foo(); // fee00000
void __iomem *mapped = ioremap(apic_base_phys + 0x20, 0x4);
if(mapped == NULL){
printk(KERN_INFO "nullpointer\n");
} else {
uint32_t value = ioread32(mapped);
printk(KERN_INFO "Value: %x\n", value); // 0xffffffff
}
iounmap(mapped);
Output:
[ 1329.743182] apic_base_phys: fee00000
[ 1329.743198] Value: ffffffff
Address 0xFEE00020 should output the Local APIC ID, which probably not is 0xFFFFFFFF.
I also tried to read 0xFEE00030 which should output the LAPIC Version.
Got the solution by myself: On my System runs the newer x2APIC. This uses a different transfer mode.
This can be disabled by adding nox2apic to the boot options.
I have the following code:
static DEFINE_PER_CPU_ALIGNED(cpu_clock_t, cpu_clock);
static void func(void *info)
{
uint64_t cpu_clock_pa = per_cpu_ptr_to_phys(get_cpu_ptr(&cpu_clock));
__asm__ __volatile__ ... //Giving the PA to VMware kernel which is supposed to write something to there
put_cpu_ptr(cpu_clock);
}
Problem is, when this code runs as part of the kernel initialization, I get a message in VMware workstation "The CPU is disabled on the guest operating system" which means some kernel panic occurred and when I use the same code after the kernel boots (Call it as part of a module initialization) it works fine...
My code was running before setup_per_cpu_areas as a3f pointed out.
I'm new to kernel programming and I couldn't find enough information to know why this happens. Basically I'm trying to replace the page fault handler in the kernel's IDT with something simple that calls the original handler in the end. I just want this function to print a notification that it is called, and calling printk() inside it always results in a kernel panic. It runs fine otherwise.
#include <asm/desc.h>
#include <linux/mm.h>
#include <asm/traps.h>
#include <linux/types.h>
#include <linux/errno.h>
#include <linux/sched.h>
#include <asm/uaccess.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/desc_defs.h>
#include <linux/moduleparam.h>
#define PAGEFAULT_INDEX 14
// Old and new IDT registers
static struct desc_ptr old_idt_reg, new_idt_reg;
static __attribute__((__used__)) unsigned long old_pagefault_pointer, new_page;
// The function that replaces the original handler
asmlinkage void isr_pagefault(void);
asm(" .text");
asm(" .type isr_pagefault,#function");
asm("isr_pagefault:");
asm(" callq print_something");
asm(" jmp *old_pagefault_pointer");
void print_something(void) {
// This printk causes the kernel to crash!
printk(KERN_ALERT "Page fault handler called\n");
return;
}
void my_idt_load(void *ptr) {
printk(KERN_ALERT "Loading on a new processor...\n");
load_idt((struct desc_ptr*)ptr);
return;
}
int module_begin(void) {
gate_desc *old_idt_addr, *new_idt_addr;
unsigned long idt_length;
store_idt(&old_idt_reg);
old_idt_addr = (gate_desc*)old_idt_reg.address;
idt_length = old_idt_reg.size;
// Get the pagefault handler pointer from the IDT's pagefault entry
old_pagefault_pointer = 0
| ((unsigned long)(old_idt_addr[PAGEFAULT_INDEX].offset_high) << 32 )
| ((unsigned long)(old_idt_addr[PAGEFAULT_INDEX].offset_middle) << 16 )
| ((unsigned long)(old_idt_addr[PAGEFAULT_INDEX].offset_low) );
printk(KERN_ALERT "Saved pointer to old pagefault handler: %p\n", (void*)old_pagefault_pointer);
// Allocate a new page for the new IDT
new_page = __get_free_page(GFP_KERNEL);
if (!new_page)
return -1;
// Copy the original IDT to the new page
memcpy((void*)new_page, old_idt_addr, idt_length);
// Set up the new IDT
new_idt_reg.address = new_idt_addr = new_page;
new_idt_reg.size = idt_length;
pack_gate(
&new_idt_addr[PAGEFAULT_INDEX],
GATE_INTERRUPT,
(unsigned long)isr_pagefault, // The interrupt written in assembly at the start of the code
0, 0, __KERNEL_CS
);
// Load the new table
load_idt(&new_idt_reg);
smp_call_function(my_idt_load, (void*)&new_idt_reg, 1); // Call load_idt on the rest of the cores
printk(KERN_ALERT "New IDT loaded\n\n");
return 0;
}
void module_end(void) {
printk( KERN_ALERT "Exit handler called now. Reverting changes and exiting...\n\n");
load_idt(&old_idt_reg);
smp_call_function(my_idt_load, (void*)&old_idt_reg, 1);
if (new_page)
free_page(new_page);
}
module_init(module_begin);
module_exit(module_end);
Many thanks to anyone who can tell me what I'm doing wrong here.
Sorry for resurrecting a dead post, but just for posterity:
I've run into similar issues when hooking IDT entries; one possibility is insufficient stack space. In 64-bit mode, when a trap or fault handler is called, the CPU determines a new stack pointer based on both the "interrupt stack table" (IST) field – bits 32 to 34 – of the corresponding interrupt descriptor, and the processor core's Task State Segment (TSS). From Volume 3A, section 6.14.5 of the Intel Software Developer's Manual:
In IA-32e mode, a new interrupt stack table (IST) mechanism is available as an alternative to the modified legacy stack-switching mechanism described above. This mechanism unconditionally switches stacks when it is enabled. It can be enabled on an individual interrupt-vector basis using a field in the IDT entry. This means that some interrupt vectors can use the modified legacy mechanism and others can use the IST mechanism.
The IST mechanism is only available in IA-32e mode. It is part of the 64-bit mode TSS. The motivation for the IST mechanism is to provide a method for specific interrupts (such as NMI, double-fault, and machine-check) to always execute on a known good stack. In legacy mode, interrupts can use the task-switch mechanism to set up a known good stack by accessing the interrupt service routine through a task gate located in the IDT. However, the legacy task-switch mechanism is not supported in IA-32e mode.
The IST mechanism provides up to seven IST pointers in the TSS. The pointers are referenced by an interrupt-gate descriptor in the interrupt-descriptor table (IDT); see Figure 6-8. The gate descriptor contains a 3-bit IST index field that provides an offset into the IST section of the TSS. Using the IST mechanism, the processor loads the value pointed by an IST pointer into the RSP.
... If the IST index is zero, the modified legacy stack-switching mechanism described above is used.
The "modified legacy stack-switching" mechanism is described in section 6.14.2 of the same chapter, and most importantly just loads the RSP0 entry of the TSS as the new stack pointer. Here is the figure that describes the TSS:
So, to summarize, if the IST field of the interrupt descriptor is 0, then the RSP0 entry of the TSS will be loaded as the new stack pointer, and if the IST field is non-zero, then the entry of the TSS indicated by it will be loaded as the new stack pointer. In x64 linux the IST field is 0 for page faults, so rsp is switched to the RSP0 entry of the TSS whenever a page fault occurs. Unfortunately, the stack space allocated here is rather small; playing around with a kernel debugger revealed that linux allocates only 512 bytes for this stack, and my suspicion is that printk perhaps requires greater stack space.
One possible solution to this is to, in the beginning of your page fault hook, manually switch the stack pointer to the RSP1 entry of the TSS, which should contain the current kernel stack and hence have ample room for printk. This is a very hacky and inelegant solution, but in my experience it does the trick. (To find the address of the TSS you should use str to get the Task Register (tr) and then get the base address from the corresponding entry of the GDT, which is called the "TSS Descriptor". See section 7.2.3 of Volume 3A for details.)
DISCLAIMER: there is however a major caveat today not relevant at the time you asked this question; the new Kernel Page Table Isolation mitigations introduced in response to Meltdown will cause a different fatal problem in this kind of hooking. In particular, your new Interrupt Descriptor Table will not be accessible from a user-mode value for cr3, so any fault in user-land will actually cause a triple fault once you've loaded in the new IDT (first the original fault, then a page fault because the IDT address is not present in the user-mode page tables, and then a triple fault because the double fault entry of the IDT will not be accessible either). Short of manually changing all of the user-mode page tables this renders your IDT hooking approach impossible.
The only solution is to manually overwrite an area of memory that you know will be present in user-mode page tables; for example, the original IRQ handler referenced in the IDT will point to a small segment of code that is always present in user-mode page tables and whose role is to change cr3 to the kernel-mode variant. Linux does this by clearing bits 11 and 12 of cr3, so you could overwrite this area of code with a small assembly stub that clears those bits and then jumps to your hook. As a proof of concept see here.
As far as I know, the printk() requires much resource and complexity(console/file system/storage) than ftrace.
If the crash only happens in case you have used printk(), why don't you use ftrace instead of printk()?
Many of Linux Kernel experts love ftrace.
I've started to learn Linux driver programs, but I'm finding it a little difficult.
I've been studying the i2c driver, and I got quite confused regarding the entry-point of the driver program. Does the driver program start at the MOUDULE_INIT() macro?
And I'd also like to know how I can know the process of how the driver program runs. I got the book, Linux Device Driver, but I'm still quite confused. Could you help me? Thanks a lot.
I'll take the i2c driver as an example. There are just so many functions in it, I just wanna know how I can get coordinating relation of the functions in the i2c drivers?
A device driver is not a "program" that has a main {} with a start point and exit point. It's more like an API or a library or a collection of routines. In this case, it's a set of entry points declared by MODULE_INIT(), MODULE_EXIT(), perhaps EXPORT_SYMBOL() and structures that list entry points for operations.
For block devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/blkdev.h):
struct block_device_operations {
int (*open) ();
int (*release) ();
int (*ioctl) ();
int (*compat_ioctl) ();
int (*direct_access) ();
unsigned int (*check_events) ();
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
int (*media_changed) ();
void (*unlock_native_capacity) ();
int (*revalidate_disk) ();
int (*getgeo)();
/* this callback is with swap_lock and sometimes page table lock held */
void (*swap_slot_free_notify) ();
struct module *owner;
};
For char devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/fs.h):
struct file_operations {
struct module *owner;
loff_t (*llseek) ();
ssize_t (*read) ();
ssize_t (*write) ();
ssize_t (*aio_read) ();
ssize_t (*aio_write) ();
int (*readdir) ();
unsigned int (*poll) ();
long (*unlocked_ioctl) ();
long (*compat_ioctl) ();
int (*mmap) ();
int (*open) ();
int (*flush) ();
int (*release) ();
int (*fsync) ();
int (*aio_fsync) ();
int (*fasync) ();
int (*lock) ();
ssize_t (*sendpage) ();
unsigned long (*get_unmapped_area)();
int (*check_flags)();
int (*flock) ();
ssize_t (*splice_write)();
ssize_t (*splice_read)();
int (*setlease)();
long (*fallocate)();
};
For platform devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/platform_device.h):
struct platform_driver {
int (*probe)();
int (*remove)();
void (*shutdown)();
int (*suspend)();
int (*resume)();
struct device_driver driver;
const struct platform_device_id *id_table;
};
The driver, especially char drivers, does not have to support every operation listed. Note that there are macros to facilitate the coding of these structures by naming the structure entries.
Does the driver program starts at the MOUDLUE_INIT() macro?
The driver's init() routine specified in MODULE_INIT() will be called during boot (when statically linked in) or when the module is dynamically loaded. The driver passes its structure of operations to the device's subsystem when it registers itself during its init().
These device driver entry points, e.g. open() or read(), are typically executed when the user app invokes a C library call (in user space) and after a switch to kernel space. Note that the i2c driver you're looking at is a platform driver for a bus that is used by leaf devices, and its functions exposed by EXPORT_SYMBOL() would be called by other drivers.
Only the driver's init() routine specified in MODULE_INIT() is guaranteed to be called. The driver's exit() routine specified in MODULE_EXIT() would only be executed if/when the module is dynamically unloaded. The driver's op routines will be called asynchronously (just like its interrupt service routine) in unknown order. Hopefully user programs will invoke an open() before issuing a read() or an ioctl() operation, and invoke other operations in a sensible fashion. A well-written and robust driver should accommodate any order or sequence of operations, and produce sane results to ensure system integrity.
It would probably help to stop thinking of a device driver as a program. They're completely different. A program has a specific starting point, does some stuff, and has one or more fairly well defined (well, they should, anyway) exit point. Drivers have some stuff to do when the first get loaded (e.g. MODULE_INIT() and other stuff), and may or may not ever do anything ever again (you can forcibly load a driver for hardware your system doesn't actually have), and may have some stuff that needs to be done if the driver is ever unloaded. Aside from that, a driver generally provides some specific entry points (system calls, ioctls, etc.) that user-land applications can access to request the driver to do something.
Horrible analogy, but think of a program kind of like a car - you get in, start it up, drive somewhere, and get out. A driver is more like a vending machine - you plug it in and make sure it's stocked, but then people just come along occasionaly and push buttons to make it do something.
Actually you are taking about (I2C) platform (Native)driver first you need to understand how MOUDULE_INIT() of platform driver got called versus other loadable modules.
/*
* module_init() - driver initialization entry point
* #x: function to be run at kernel boot time or module insertion
* module_init() will either be called during do_initcalls() (if
* builtin) or at module insertion time (if a module). There can only
* be one per module.*/
and for i2c driver you can refer this link http://www.linuxjournal.com/article/7136 and
http://www.embedded-bits.co.uk/2009/i2c-in-the-2632-linux-kernel/
Begin of a kernel module is starting from initialization function, which mainly addressed with macro __init just infront of the function name.
The __init macro indicate to linux kernel that the following function is an initialization function and the resource that will use for this initialization function will be free once the code of initialization function is executed.
There are other marcos, used for detect initialization and release function, named module_init() and module_exit() [as described above].
These two macro are used, if the device driver is targeted to operate as loadable and removeable kernel module at run time [i.e. using insmod or rmmod command]
IN short and crisp way : It starts from .probe and go all the way to init as soon you do insmod .This also registers the driver with the driver subsystem and also initiates the init.
Everytime the driver functionalities are called from the user application , functions are invoked using the call back.
"Linux Device Driver" is a good book but it's old!
Basic example:
#include <linux/module.h>
#include <linux/version.h>
#include <linux/kernel.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Name and e-mail");
MODULE_DESCRIPTION("my_first_driver");
static int __init insert_mod(void)
{
printk(KERN_INFO "Module constructor");
return 0;
}
static void __exit remove_mod(void)
{
printk(KERN_INFO "Module destructor");
}
module_init(insert_mod);
module_exit(remove_mod);
An up-to-date tutorial, really well written, is "Linux Device Drivers Series"