I am looking at the Linux kernel flow, where I came across early_printk,
which is used for printk before console is initialized.
In my case early_printk is not enabled and still I am getting prints with printk before console_init.
From where i am getting these prints?
If you see the kernel source code as below, even though early_printk is disabled, kernel message needs to prints in scenario like panic etc.
#ifdef CONFIG_EARLY_PRINTK
extern asmlinkage __printf(1, 2)
void early_printk(const char *fmt, ...);
#else
static inline __printf(1, 2) __cold
void early_printk(const char *s, ...) { }
From the above, its enable or not early_printk() is called and in disable case it will be called if any early_panic() is there.
early_panic() means if kernel is trying to recover some problem like below and if not then panic() will happen and it will die.
[40462.661285] Attribute myarray: Invalid permissions 0700
|
here I tried to give permission to `variable` to verify whether `panic` happens or not.
Related
Coming from the Windows world, I assume that Vmlinuz is equivalent to ntoskrnl.exe, and this is the kernel executable that gets mapped in Kernel memory.
Now I want to figure out whether an address inside kernel belongs to the kernel executable or not. Is using core_kernel_text the correct way of finding this out?
Because core_kernel_text doesn't return true for some of the addresses that clearly should belong to Linux kernel executable.
For example the core_kernel_text doesn't return true when i give it the syscall entry handler address which can be found with the following code:
unsigned long system_call_entry;
rdmsrl(MSR_LSTAR, system_call_entry);
return (void *)system_call_entry;
And when I use this code snippet, the address of the syscall entry handler doesn't belong to the core kernel text or to any kernel module (using get_module_from_addr).
So how can an address for a handler that clearly belongs to Linux kernel executable such as syscall entry, don't belong to neither the core kernel or any kernel module? Then what does it belong to?
Which API do I need to use for these type of addresses that clearly belong to Linux kernel executable to assure me that the address indeed belongs to kernel?
I need such an API because I need to write a detection for malicious kernel modules that patch such handlers, and for now I need to make sure the address belongs to kernel, and not some third party kernel module or random kernel address. (Please do not discuss methods that can be used to bypass my detection, obviously it can be bypassed but that's another story)
The target kernel version is 4.15.0-112-generic, and is Ubuntu 16.04 as a VMware guest.
Reproducible code as requested:
typedef int(*core_kernel_text_t)(unsigned long addr);
core_kernel_text_t core_kernel_text_;
core_kernel_text_ = (core_kernel_text_t)kallsyms_lookup_name("core_kernel_text");
unsigned long system_call_entry;
rdmsrl(MSR_LSTAR, system_call_entry);
int isInsideCoreKernel = core_kernel_text_((unsigned long)system_call_entry);
printk("%d , 0x%pK ", isInsideCoreKernel, system_call_entry);
EDIT1: So in the MSR_LSTAR example that I gave above, it turns out that It's related to Kernel Page Table Isolation and CONFIG_RETPOLINE=y in config:
system_call value is different each time when I use rdmsrl(MSR_LSTAR, system_call)
And that's why I am getting the address 0xfffffe0000006000 aka SYSCALL64_entry_trampoline, the same as the question above.
So now the question remains, why this SYSCALL64_entry_trampoline address doesn't belong to anything? It doesn't belong to any kernel module, and it doesn't belong to the core kernel, so which executable this address belongs to and how can I check that with an API similar to core_kernel_text? It seems like it belongs to cpu_entry_area, but what is that and how can I check if an address belongs to that?
You are seeing this "weird" address in MSR_LSTAR (IA32_LSTAR) because of Kernel Page-Table Isolation (KPTI), which mitigates Meltdown. As other existing answers(1) you already found point out, the address you see is the one of a small trampoline (entry_SYSCALL_64_trampoline) that is dynamically remapped at boot time by the kernel for each CPU, and thus does not have an address within the kernel text.
(1)By the way, the answer linked above wrongly states that the corresponding config option for KPTI is CONFIG_RETPOLINE=y. This is wrong, the "retpoline" is a mitigation for Spectre, not Meltdown. The config to enable KPTI is CONFIG_PAGE_TABLE_ISOLATION=y.
You don't have many options. Either:
Tell VMWare to emulate a recent CPU that is not vulnerable to Meltdown.
Detect and implement support for the KPTI trampoline.
You can implement support for this by detecting whether the kernel supports KPTI (CONFIG_PAGE_TABLE_ISOLATION), and if so check whether current CPU has KPTI enabled. The code at kernel/cpu/bugs.c that provides information for /sys/devices/system/cpu/vulnerabilities/meltdown shows how this can be detected:
ssize_t cpu_show_meltdown(struct device *dev,
struct device_attribute *attr, char *buf)
{
if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
return sprintf(buf, "Not affected\n");
if (boot_cpu_has(X86_FEATURE_PTI))
return sprintf(buf, "Mitigation: PTI\n");
return sprintf(buf, "Vulnerable\n");
}
The actual trampoline is set up at boot and its address is stored in each CPU's "entry area" for later use (e.g. here when setting up IA32_LSTAR). This answer on Unix & Linux SE explains the purpose of the cpu entry area and its relation to KPTI.
In your module you can do the following detection:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kallsyms.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
#include <asm/cpufeature.h>
#include <asm/cpu_entry_area.h>
// ...
typedef int(*core_kernel_text_t)(unsigned long addr);
core_kernel_text_t core_kernel_text_;
bool syscall_entry_64_ok(void)
{
unsigned long entry;
rdmsrl(MSR_LSTAR, entry);
if (core_kernel_text_(entry))
return true;
#ifdef CONFIG_PAGE_TABLE_ISOLATION
if (this_cpu_has(X86_FEATURE_PTI)) {
int cpu = smp_processor_id();
unsigned long trampoline = (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline;
if ((entry & PAGE_MASK) == trampoline)
return true;
}
#endif
return false;
}
static int __init modinit(void)
{
core_kernel_text_ = (core_kernel_text_t)kallsyms_lookup_name("core_kernel_text");
if (!core_kernel_text_)
return -EOPNOTSUPP;
pr_info("syscall_entry_64_ok() -> %d\n", syscall_entry_64_ok());
return 0;
}
I am not a driver writer and have a question about what functions are actually called within the bsg driver when one does a write(2)/read(2) from user-land. My CentOS system is using Linux 2.6.32. Surprisingly, though I have the sources for the build used by this CentOS system installed, the bsg.c file isn't there (huh?). So, I downloaded from kernel.org the 2.6.32 sources.
I'm looking in .../linux-2.6.32.61/block/bsg.c. For that source version, my question, is this function (on line 661) called when I call write(2) from user land?
static ssize_t
bsg_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
I'm trying to track down why I'm getting EINVAL when calling write(2) in some cases but not in others when attempting to get SCSI Log Sense data. If I'm on the right track in the driver sources, the only time that EINVAL is returned to the caller is the size of the data being written to the descriptor is not evenly divisible by sizeof(sg_io_v4) (defined in /usr/include/linux/bsg.h).
Andy
Yes it is the right function. In the same file you can find this static const struct file_operations bsg_fops which is the definition of the function to use when userspace does something with the device
This is my code that works only on Xcode (version 4.5):
#include <stdio.h>
#include <mach/mach_init.h>
#include <mach/mach_vm.h>
#include <sys/types.h>
#include <mach/mach.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <Security/Authorization.h>
int main(int argc, const char * argv[]) {
char test[14] = "Hello World! "; //0x7fff5fbff82a
char value[14] = "Hello Hacker!";
char test1[14];
pointer_t buf;
uint32_t sz;
task_t task;
task_for_pid(current_task(), getpid(), &task);
if (vm_write(current_task(), 0x7fff5fbff82a, (pointer_t)value, 14) == KERN_SUCCESS) {
printf("%s\n", test);
//getchar();
}
if (vm_read(task, 0x7fff5fbff82a, sizeof(char) * 14, &buf, &sz) == KERN_SUCCESS) {
memcpy(test1, (const void *)buf, sz);
printf("%s", test1);
}
return 0;
}
I was trying also ptrace and other things, this is why I include other libraries too.
The first problem is that this works only on Xcode, I can find with the debugger the position (memory address) of a variable (in this case of test), so I change the string with the one on value and then I copy the new value on test on test1.
I actually don't understand how vm_write works (not completely) and the same for task_for_pid(), the 2° problem is that I need to read and write on another process, this is only a test for see if the functions works on the same process, and it works (only on Xcode).
How I can do that on other processes? I need to read a position (how I can find the address of "something"?), this is the first goal.
For your problems, there are solutions:
The first problem: OS X has address space layout randomization. If you want to make your memory images fixed and predictable, you have to compile your code with NOPIE setting. This setting (PIE = Position Independent Executable), is responsible for allowing ASLR, which "slides" the memory by some random value, which changes on every instance.
I actually don't understand how vm_write works (not completely) and the same for task_for_pid():
The Mach APIs operate on the lower level abstractions of "task" and "Thread" which correspond roughly to that of the BSD "process" and "(u)thread" (there are some exceptions, e.g. kernel_task, which does not have a PID, but let's ignore that for now). task_for_pid obtains the task port (think of it as a "handle"), and if you get the port - you are free to do whatever you wish. Basically, the vm_* functions operate on any task port - you can use it on your own process (mach_task_self(), that is), or a port obtained from task_for_pid.
Task for PID actually doesn't necessarily require root (i.e. "sudo"). It requires getting past taskgated on OSX, which traditionally verified membership in procmod or procview groups. You can configure taskgated ( /System/Library/LaunchDaemons/com.apple.taskgated.plist) for debugging purposes. Ultimately, btw, getting the task port will require an entitlement (the same as it now does on iOS). That said, the easiest way, rather than mucking around with system authorizations, etc, is to simply become root.
Did you try to run your app with "sudo"?
You can't read/write other app's memory without sudo.
I've started to learn Linux driver programs, but I'm finding it a little difficult.
I've been studying the i2c driver, and I got quite confused regarding the entry-point of the driver program. Does the driver program start at the MOUDULE_INIT() macro?
And I'd also like to know how I can know the process of how the driver program runs. I got the book, Linux Device Driver, but I'm still quite confused. Could you help me? Thanks a lot.
I'll take the i2c driver as an example. There are just so many functions in it, I just wanna know how I can get coordinating relation of the functions in the i2c drivers?
A device driver is not a "program" that has a main {} with a start point and exit point. It's more like an API or a library or a collection of routines. In this case, it's a set of entry points declared by MODULE_INIT(), MODULE_EXIT(), perhaps EXPORT_SYMBOL() and structures that list entry points for operations.
For block devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/blkdev.h):
struct block_device_operations {
int (*open) ();
int (*release) ();
int (*ioctl) ();
int (*compat_ioctl) ();
int (*direct_access) ();
unsigned int (*check_events) ();
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
int (*media_changed) ();
void (*unlock_native_capacity) ();
int (*revalidate_disk) ();
int (*getgeo)();
/* this callback is with swap_lock and sometimes page table lock held */
void (*swap_slot_free_notify) ();
struct module *owner;
};
For char devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/fs.h):
struct file_operations {
struct module *owner;
loff_t (*llseek) ();
ssize_t (*read) ();
ssize_t (*write) ();
ssize_t (*aio_read) ();
ssize_t (*aio_write) ();
int (*readdir) ();
unsigned int (*poll) ();
long (*unlocked_ioctl) ();
long (*compat_ioctl) ();
int (*mmap) ();
int (*open) ();
int (*flush) ();
int (*release) ();
int (*fsync) ();
int (*aio_fsync) ();
int (*fasync) ();
int (*lock) ();
ssize_t (*sendpage) ();
unsigned long (*get_unmapped_area)();
int (*check_flags)();
int (*flock) ();
ssize_t (*splice_write)();
ssize_t (*splice_read)();
int (*setlease)();
long (*fallocate)();
};
For platform devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/platform_device.h):
struct platform_driver {
int (*probe)();
int (*remove)();
void (*shutdown)();
int (*suspend)();
int (*resume)();
struct device_driver driver;
const struct platform_device_id *id_table;
};
The driver, especially char drivers, does not have to support every operation listed. Note that there are macros to facilitate the coding of these structures by naming the structure entries.
Does the driver program starts at the MOUDLUE_INIT() macro?
The driver's init() routine specified in MODULE_INIT() will be called during boot (when statically linked in) or when the module is dynamically loaded. The driver passes its structure of operations to the device's subsystem when it registers itself during its init().
These device driver entry points, e.g. open() or read(), are typically executed when the user app invokes a C library call (in user space) and after a switch to kernel space. Note that the i2c driver you're looking at is a platform driver for a bus that is used by leaf devices, and its functions exposed by EXPORT_SYMBOL() would be called by other drivers.
Only the driver's init() routine specified in MODULE_INIT() is guaranteed to be called. The driver's exit() routine specified in MODULE_EXIT() would only be executed if/when the module is dynamically unloaded. The driver's op routines will be called asynchronously (just like its interrupt service routine) in unknown order. Hopefully user programs will invoke an open() before issuing a read() or an ioctl() operation, and invoke other operations in a sensible fashion. A well-written and robust driver should accommodate any order or sequence of operations, and produce sane results to ensure system integrity.
It would probably help to stop thinking of a device driver as a program. They're completely different. A program has a specific starting point, does some stuff, and has one or more fairly well defined (well, they should, anyway) exit point. Drivers have some stuff to do when the first get loaded (e.g. MODULE_INIT() and other stuff), and may or may not ever do anything ever again (you can forcibly load a driver for hardware your system doesn't actually have), and may have some stuff that needs to be done if the driver is ever unloaded. Aside from that, a driver generally provides some specific entry points (system calls, ioctls, etc.) that user-land applications can access to request the driver to do something.
Horrible analogy, but think of a program kind of like a car - you get in, start it up, drive somewhere, and get out. A driver is more like a vending machine - you plug it in and make sure it's stocked, but then people just come along occasionaly and push buttons to make it do something.
Actually you are taking about (I2C) platform (Native)driver first you need to understand how MOUDULE_INIT() of platform driver got called versus other loadable modules.
/*
* module_init() - driver initialization entry point
* #x: function to be run at kernel boot time or module insertion
* module_init() will either be called during do_initcalls() (if
* builtin) or at module insertion time (if a module). There can only
* be one per module.*/
and for i2c driver you can refer this link http://www.linuxjournal.com/article/7136 and
http://www.embedded-bits.co.uk/2009/i2c-in-the-2632-linux-kernel/
Begin of a kernel module is starting from initialization function, which mainly addressed with macro __init just infront of the function name.
The __init macro indicate to linux kernel that the following function is an initialization function and the resource that will use for this initialization function will be free once the code of initialization function is executed.
There are other marcos, used for detect initialization and release function, named module_init() and module_exit() [as described above].
These two macro are used, if the device driver is targeted to operate as loadable and removeable kernel module at run time [i.e. using insmod or rmmod command]
IN short and crisp way : It starts from .probe and go all the way to init as soon you do insmod .This also registers the driver with the driver subsystem and also initiates the init.
Everytime the driver functionalities are called from the user application , functions are invoked using the call back.
"Linux Device Driver" is a good book but it's old!
Basic example:
#include <linux/module.h>
#include <linux/version.h>
#include <linux/kernel.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Name and e-mail");
MODULE_DESCRIPTION("my_first_driver");
static int __init insert_mod(void)
{
printk(KERN_INFO "Module constructor");
return 0;
}
static void __exit remove_mod(void)
{
printk(KERN_INFO "Module destructor");
}
module_init(insert_mod);
module_exit(remove_mod);
An up-to-date tutorial, really well written, is "Linux Device Drivers Series"
On my Windows 7 box, this simple program causes the memory use of the application to creep up continuously, with no upper bound. I've stripped out everything non-essential, and it seems clear that the culprit is the Microsoft Iphlpapi function "GetIpAddrTable()". On each call, it leaks some memory. In a loop (e.g. checking for changes to the network interface list), it is unsustainable. There seems to be no async notification API which could do this job, so now I'm faced with possibly having to isolate this logic into a separate process and recycle the process periodically -- an ugly solution.
Any ideas?
// IphlpLeak.cpp - demonstrates that GetIpAddrTable leaks memory internally: run this and watch
// the memory use of the app climb up continuously with no upper bound.
#include <stdio.h>
#include <windows.h>
#include <assert.h>
#include <Iphlpapi.h>
#pragma comment(lib,"Iphlpapi.lib")
void testLeak() {
static unsigned char buf[16384];
DWORD dwSize(sizeof(buf));
if (GetIpAddrTable((PMIB_IPADDRTABLE)buf, &dwSize, false) == ERROR_INSUFFICIENT_BUFFER)
{
assert(0); // we never hit this branch.
return;
}
}
int main(int argc, char* argv[]) {
for ( int i = 0; true; i++ ) {
testLeak();
printf("i=%d\n",i);
Sleep(1000);
}
return 0;
}
#Stabledog:
I've ran your example, unmodified, for 24 hours but did not observe that the program's Commit Size increased indefinitely. It always stayed below 1024 kilobyte. This was on Windows 7 (32-bit, and without Service Pack 1).
Just for the sake of completeness, what happens to memory usage if you comment out the entire if block and the sleep? If there's no leak there, then I would suggest you're correct as to what's causing it.
Worst case, report it to MS and see if they can fix it - you have a nice simple test case to work from which is more than what I see in most bug reports.
Another thing you may want to try is to check the error code against NO_ERROR rather than a specific error condition. If you get back a different error than ERROR_INSUFFICIENT_BUFFER, there may be a leak for that:
DWORD dwRetVal = GetIpAddrTable((PMIB_IPADDRTABLE)buf, &dwSize, false);
if (dwRetVal != NO_ERROR) {
printf ("ERROR: %d\n", dwRetVal);
}
I've been all over this issue now: it appears that there is no acknowledgment from Microsoft on the matter, but even a trivial application grows without bounds on Windows 7 (not XP, though) when calling any of the APIs which retrieve the local IP addresses.
So the way I solved it -- for now -- was to launch a separate instance of my app with a special command-line switch that tells it "retrieve the IP addresses and print them to stdout". I scrape stdout in the parent app, the child exits and the leak problem is resolved.
But it wins "dang ugly solution to an annoying problem", at best.