I want a user process in a guest machine to invoke a custom hypercall that QEMU receives, without modifying the guest kernel.
From this answer and other materials, I know that the vmcall instruction causes a VM exit and that the VMM receives the exit reason and arguments.
According to the Intel® 64 and IA-32 Architectures Software Developer’s Manual (p. 1201), the vmcall instruction triggers an exception when CPL > 0.
So I conclude that I need a (guest) kernel interface to invoke a hypercall.
I found that arch/x86/include/asm/kvm_para.h in the Linux kernel has kvm_hypercallx functions (where x is the number of arguments), but I can't find a call site for them.
Is it possible to invoke a hypercall without modifying the guest kernel? If so, how? If not, is there an alternative?
VMCALL causes a VM exit at any CPL when it is executed in a guest (VMX non-root operation). The CPL check applies only when the instruction is executed in VMX root operation.
Another way to cause an unconditional VM exit is with the CPUID instruction. The VMM can distinguish a hypercall from a regular CPUID invocation by the value in EAX.
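The CPUID approach can be sketched from the guest side roughly as follows. This is only an illustration: the leaf value DEMO_HYPERCALL_LEAF and the helper names are made up for this example, and the VMM side (recognizing the leaf and acting on it) is not shown and would need matching support in the hypervisor.

```c
#include <stdint.h>

/* Hypothetical leaf number for this sketch. A real VMM would define its own
 * value; the 0x40000000 range is conventionally used by hypervisors. */
#define DEMO_HYPERCALL_LEAF 0x40001234u

/* Execute CPUID with the given leaf (EAX) and subleaf (ECX) and store
 * EAX, EBX, ECX, EDX in regs[0..3].
 * Returns 0 on success, -1 when not built for x86. */
int cpuid_query(uint32_t leaf, uint32_t subleaf, uint32_t regs[4])
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t a = leaf, b = 0, c = subleaf, d = 0;
    __asm__ volatile("cpuid"
                     : "+a"(a), "=b"(b), "+c"(c), "=d"(d));
    regs[0] = a;
    regs[1] = b;
    regs[2] = c;
    regs[3] = d;
    return 0;
#else
    (void)leaf;
    (void)subleaf;
    (void)regs;
    return -1;
#endif
}

/* Issue the made-up "hypercall": the argument travels in ECX and whatever
 * the VMM chooses to return comes back in the four registers. */
int demo_hypercall(uint32_t arg, uint32_t regs[4])
{
    return cpuid_query(DEMO_HYPERCALL_LEAF, arg, regs);
}
```

CPUID is unprivileged, so this runs from an ordinary user process; on bare metal or under a VMM that does not know the leaf, the call simply returns whatever the CPU or hypervisor reports for an unknown leaf.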
Is it possible to invoke a hypercall without any modification of a guest kernel?
A hypercall is just a way to pass messages between guest and host. You can trigger one (virtio, for example, used hypercall2), but is that useful for you?
W^X ("write xor execute", pronounced W xor X) is a security feature in operating systems and virtual machines. It is a memory protection policy whereby every page in a process's or kernel's address space may be either writable or executable, but not both.
My basic perspective on why this is a good security feature is that the owner of the system theoretically has an opportunity, within the kernel or specifically within the VirtualAlloc function, to hook in some analysis function that performs security validation before newly written code is allowed to execute on the machine.
I was already familiar with DEP, but am only now realizing that it has something to do with W^X on Windows:
Executable space protection on Windows is called "Data Execution Prevention" (DEP).
Under Windows XP or Server 2003 NX protection was used on critical Windows services exclusively by default. If the x86 processor supported this feature in hardware, then the NX features were turned on automatically in Windows XP/Server 2003 by default. If the feature was not supported by the x86 processor, then no protection was given.
Early implementations of DEP provided no address space layout randomization (ASLR), which allowed potential return-to-libc attacks that could have been feasibly used to disable DEP during an attack.
It was my impression that W^X applied to Windows in general, without requiring configuration of the process. But I just noticed that VirtualProtect allows the option PAGE_EXECUTE_READWRITE, which is documented as:
Enables execute, read-only, or read/write access to the committed region of pages.
This seems to entirely defy the concept of W^X. So is W^X not an enforced security policy on Windows, except when DEP is enabled?
If you turn DEP off, W^X is not enforced. When DEP is on, W^X is enforced for all memory pages that ask for it (when the hardware supports it). The mechanism is bit 63 of the page-table entry on x86, known as the NX bit.
Now the question becomes, when is this bit set?
The PE header has a bit indicating whether DEP/W^X is supported (IMAGE_DLLCHARACTERISTICS_NX_COMPAT). If it is set, sections in the file without the execute attribute get the NX bit set when they are mapped into memory.
For memory dynamically allocated at run-time, the developer gets to choose. Pages allocated with PAGE_EXECUTE_READWRITE deliberately do not get the NX bit set. This is useful if they have legacy code that dynamically alters executable code while still having the DEP bit set on the PE so the majority of their code is W^X.
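To make the opt-in bit concrete, here is a minimal sketch that checks IMAGE_DLLCHARACTERISTICS_NX_COMPAT in an image already loaded into a byte buffer. The function name pe_is_nx_compat is invented for this example; the offsets follow the published PE/COFF layout (e_lfanew at 0x3C, DllCharacteristics at offset 70 of the optional header), and only the minimum validation needed to read the field safely is done.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IMAGE_DLLCHARACTERISTICS_NX_COMPAT 0x0100u

/* Read little-endian integers without alignment assumptions. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if the image opts in to DEP, 0 if it does not,
 * -1 if the buffer does not look like a PE image. */
int pe_is_nx_compat(const uint8_t *img, size_t len)
{
    if (len < 0x40 || img[0] != 'M' || img[1] != 'Z')
        return -1;

    uint32_t pe_off = rd32(img + 0x3C);            /* e_lfanew */
    /* Need: signature (4) + COFF header (20) + optional header up to and
     * including DllCharacteristics (offset 70, 2 bytes). */
    if (pe_off > len || len - pe_off < 4 + 20 + 72)
        return -1;
    if (memcmp(img + pe_off, "PE\0\0", 4) != 0)
        return -1;

    const uint8_t *opt = img + pe_off + 4 + 20;    /* optional header */
    uint16_t magic = rd16(opt);
    if (magic != 0x10B && magic != 0x20B)          /* PE32 / PE32+ */
        return -1;

    return (rd16(opt + 70) & IMAGE_DLLCHARACTERISTICS_NX_COMPAT) ? 1 : 0;
}
```

The field sits at the same optional-header offset in both PE32 and PE32+ images, which is why one code path covers 32-bit and 64-bit executables.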
Early x86 CPUs had no support for pages without eXec permission. In legacy 32-bit x86 page tables, there was only a bit for write permission, the R/W bit. (Read permission is always implicit in the page being valid, whether the page is writeable or not). The PAE format for page-table entries, which x86-64 also uses, added an NX bit ("no exec"), aka XD (eXecute Disable).
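As a rough illustration of the bits involved, the helpers below decode a single 64-bit page-table entry in the PAE / x86-64 format. The helper names are made up for this example, and the sketch is deliberately simplified: real effective permissions combine the entries at every level of the page-table walk and also depend on the NX feature being enabled by the OS.

```c
#include <stdint.h>

#define PTE_PRESENT  (1ULL << 0)   /* page is valid (implies readable) */
#define PTE_WRITABLE (1ULL << 1)   /* R/W bit */
#define PTE_NX       (1ULL << 63)  /* NX / XD bit */

/* A present page is writable when the R/W bit is set. */
int pte_writable(uint64_t pte)
{
    return (pte & PTE_PRESENT) && (pte & PTE_WRITABLE);
}

/* A present page is executable unless the NX bit is set. */
int pte_executable(uint64_t pte)
{
    return (pte & PTE_PRESENT) && !(pte & PTE_NX);
}

/* W^X is violated when one page is both writable and executable. */
int pte_violates_wx(uint64_t pte)
{
    return pte_writable(pte) && pte_executable(pte);
}
```

Note the asymmetry: execute permission is the default and has to be taken away with NX, whereas write permission has to be granted explicitly.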
An OS still had to decide which pages to make non-executable.
Windows seems to use DEP to describe the feature of actually mapping logical page permissions to the hardware page tables, to be enforced by the CPU.
Some programs written in the bad old days when every readable page was executable may have been sloppy about telling the OS that they wanted a page to be executable. Especially ones that only targeted 32-bit x86. This is what Windows caters for by requiring executables to opt in to DEP, to indicate that they're aware of and compatible with not having exec permission for pages that aren't explicitly marked that way.
Some OSes, notably OpenBSD, truly enforce W^X. For example, mmap(..., PROT_WRITE | PROT_EXEC, ...) will return an error on OpenBSD. Their mmap(2) man page documents that such an mmap or mprotect system call will return
[ENOTSUP] The accesses requested in the prot argument are not allowed. In particular, PROT_WRITE | PROT_EXEC mappings are not permitted unless the filesystem is mounted wxallowed and the process is link-time tagged with wxneeded. (See also kern.wxabort in sysctl(2) for a method to diagnose failure).
Most other OSes (including Linux and Windows) allow user-space to create pages that are writeable and executable at the same time. But the standard toolchains and dynamic linking mechanisms aim for W^X compliance by default, if you don't use any options like gcc -zexecstack that will get the OS to create a process image with some R|W|X pages.
Older 32-bit x86 Linux for example used to use PLT entries (dynamic linking stubs) with jmp rel32 direct jumps, and rewrite the machine code to have the right displacement to reach wherever the shared library got loaded in memory. But these days, the PLT code uses indirect jumps (through the GOT = global offset table), so the executable PLT code can be in read-only page(s).
Changes like this have weeded out any need for write+exec pages in a normal process built with the standard tools.
But on Windows, macOS, and Linux, W^X is not enforced by the OS. System calls like Windows VirtualAlloc / VirtualProtect and their POSIX equivalents mmap / mprotect will work just fine when asked for pages that are both writable and executable.
@Anders's answer says DEP does not enforce W^X; it just gets the OS to respect the exec permission settings in the executable when creating the initial mappings for .text / .data / .bss, stack space, and the like during process startup.
I know about the system calls that the OS provides to protect programs from accessing other programs' memory. But that can only help if I use the system call library provided by the OS. What if I write assembly code myself that sets the CPU bit for kernel mode and executes a privileged instruction (say, modifying the OS's program segment in memory)? Can the OS protect against that?
P.S. This is a curiosity question. If any good blog or book reference can be provided, that would be helpful as I want to study OS in as much detail as possible.
The processor protects against such malicious mischief by (1) requiring you to be in an elevated mode (for our example here, KERNEL) and (2) limiting access to kernel mode.
In order to enter kernel mode from user mode there has to be either an interrupt (not applicable here) or an exception. Usually both are handled the same way, but there are some bizarre processors (did anyone say Intel?) that do things a bit differently.
The operating system's exception and interrupt handlers must limit what the user-mode program can do.
What if I write assembly code myself that sets the CPU bit for kernel mode and executes a privileged instruction
You can't just set the kernel mode bit in the processor status register to enter kernel mode.
Can the OS protect against that?
The CPU protects against that.
If any good blog or book reference can be provided, that would be helpful as I want to study OS in as much detail as possible.
The VAX/VMS Systems Internals book is old but it is cheap and shows how a real OS has been implemented.
This blog clearly explains what my confusion was.
http://minnie.tuhs.org/CompArch/Lectures/week05.html
User programs can switch to kernel mode, but they have to do it through an interrupt instruction (int in the case of x86), and the handler for this interrupt is written by the OS (probably installed at boot time, while the OS was running in kernel mode). This way, privileged instructions can only be executed by OS code.
I have an ARM board at a remote location. Sometimes it hits a kernel panic. When that happens there is no way to do a hardware restart, because no one is available on site to do it.
I want the board to restart automatically after a kernel panic. What should I do in the kernel?
If your hardware contains a watchdog timer, compile the kernel with watchdog support and configure it. I suggest following this blog: http://www.jann.cc/2013/02/02/linux_watchdog.html
Caution: I have never tried this. If it solves the problem, please post an update here.
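For orientation, the user-space side of a Linux watchdog setup usually looks something like the sketch below. It assumes the driver exposes the standard watchdog character device (normally /dev/watchdog) and supports the WDIOC_SETTIMEOUT and WDIOC_KEEPALIVE ioctls from linux/watchdog.h; the function names are invented for this example. Be careful when trying it: opening the device normally arms the watchdog, so the board will reset if the feeding stops.

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Open the watchdog device and request a timeout in seconds.
 * Returns the file descriptor, or -1 if the device cannot be opened. */
int watchdog_arm(const char *dev, int timeout_s)
{
    int fd = open(dev, O_WRONLY);
    if (fd < 0)
        return -1;

    /* Some drivers have a fixed timeout; if this fails, the driver's
     * default timeout stays in effect. */
    (void)ioctl(fd, WDIOC_SETTIMEOUT, &timeout_s);
    return fd;
}

/* Feed the watchdog forever. interval_s must be shorter than the timeout.
 * If the kernel panics or this process stops being scheduled, the feeding
 * stops and the hardware resets the board. */
void watchdog_feed_forever(int fd, unsigned interval_s)
{
    for (;;) {
        ioctl(fd, WDIOC_KEEPALIVE, 0);
        sleep(interval_s);
    }
}
```

A typical use would be watchdog_arm("/dev/watchdog", 30) followed by watchdog_feed_forever(fd, 10) in a small daemon started at boot.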
You can modify the panic() function in kernel/panic.c to call kernel_restart(*cmd) at the point where you want it to restart (probably after printing the required debug information).
I am assuming you are bringing up a board, so please note that you need to supply the ops for the associated functions in machine_restart() (called by kernel_restart) according to the MACH. If you are just using the board as is, then I guess rebuilding the kernel with the kernel_restart(*cmd) call should do.
The panic() is usually due to events that the kernel can not recover from. If you do not have a watchdog, you need to look at your hardware to see if a GPIO, etc is connected to the RESET line. If so, you can toggle this pin to reboot the CPU. Trying to alter panic() may just make things worse, depending on the root cause and the type of features you use.
You may hook arm_pm_restart with your custom restart functionality. You can test it with the shell command reboot, if present. panic() should call the same routine. With current ARM Linux versions
You may wish to turn off the MMU and block interrupts in this routine. It will make it more resilient when called from panic(). As you are going to reset, you can copy the routine to any physical address you like.
The watchdog may be better; it may catch cases where even panic() is not called. You may have a watchdog and not realize it. Many Cortex-A CPUs have one built in. It is fairly rare for hardware not to have a watchdog.
However, if you don't have a watchdog, you can use the GPIO mechanism above; hardware should usually provide some way for software to restart the device (and peripherals). The panic() may be due to some misbehaving device stomping on memory, latched-up DRAM/Flash, etc. Toggling a RESET line may be better than a watchdog in this case, if the RESET is also connected to other hardware besides the CPU.
Related: How to debug kernel freeze, How to change watchdog timer
AFAIK, a simple way to restart the board after kernel panic is to pass a kernel parameter (from the bootloader usually)
panic=1
The board will then auto-reboot '1' second(s) after a panic.
Search the Documentation for more.
Some examples from the documentation:
...
panic=      [KNL] Kernel behaviour on panic: delay <timeout>
            timeout > 0: seconds before rebooting
            timeout = 0: wait forever
            timeout < 0: reboot immediately
            Format: <timeout>
...
oops=panic  Always panic on oopses. Default is to just kill the
            process, but there is a small probability of
            deadlocking the machine.
            This will also cause panics on machine check exceptions.
            Useful together with panic=30 to trigger a reboot.
...
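The same timeout can also be changed on a running system through the kernel.panic sysctl, which is exposed as the file /proc/sys/kernel/panic. A minimal sketch follows; the function names are made up for this example, and writing the real file requires root.

```c
#include <stdio.h>

/* Write a panic timeout in seconds to the given sysctl file.
 * On a real system the path would be /proc/sys/kernel/panic.
 * Returns 0 on success, -1 on failure. */
int set_panic_timeout(const char *path, int seconds)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%d\n", seconds) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the current timeout back. Returns 0 on success, -1 on failure. */
int get_panic_timeout(const char *path, int *seconds)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fscanf(f, "%d", seconds) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

Unlike the boot parameter, a value set this way does not survive a reboot, so it is mostly useful for testing before changing the bootloader configuration.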
As suggested in previous comments, a watchdog timer is your friend here. If your hardware contains a watchdog timer, enable it in the kernel configuration and configure it.
Another alternative is to use a Phidget, if a USB connection is available at the remote location. Phidget controllers and software can be used to control your board over USB. Check whether your board is supported.
I want to be able to monitor kernel panics - know if and when they have happened.
Is there a way to know, after the machine has booted, that it went down due to a kernel panic (and not, for example, an ordered reboot or a power failure)?
The machine may be configured with KDUMP and/or KDB, but I prefer not to assume that either is or is not installed.
Patching the kernel is an option, though I prefer to avoid it. But even if I do, I'm not sure what the patch could do.
I'm using kernel 2.6.18 (ancient, I know). Solutions for newer kernels may be interesting too.
Thanks.
The kernel module 'netconsole' may help you log kernel printk messages over UDP.
You can view the log messages on a remote syslog server, even if the machine has rebooted.
Introduction:
=============
This module logs kernel printk messages over UDP allowing debugging of
problem where disk logging fails and serial consoles are impractical.
It can be used either built-in or as a module. As a built-in,
netconsole initializes immediately after NIC cards and will bring up
the specified interface as soon as possible. While this doesn't allow
capture of early kernel panics, it does capture most of the boot
process.
Check the kernel documentation for more information: https://www.kernel.org/doc/Documentation/networking/netconsole.txt
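If you don't want to set up a full syslog server, the receiving side can be as small as a UDP socket that looks for the panic banner. The sketch below is only an illustration: the function names are invented, it assumes netconsole on the monitored machine is configured to send to this host and port (6666 is netconsole's default target port), and it relies on the kernel's usual "Kernel panic" message prefix.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Bind a UDP socket on all interfaces. Returns the socket or -1. */
int open_log_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Receive one datagram into buf (NUL-terminated on return).
 * Returns 1 if it contains the panic banner, 0 if it does not,
 * -1 on error or timeout. A timeout of 0 means wait indefinitely. */
int recv_and_check_panic(int fd, char *buf, size_t cap, int timeout_s)
{
    if (cap == 0)
        return -1;

    struct timeval tv = { .tv_sec = timeout_s, .tv_usec = 0 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

    ssize_t n = recv(fd, buf, cap - 1, 0);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return strstr(buf, "Kernel panic") != NULL;
}
```

A monitoring daemon would call recv_and_check_panic in a loop, append every datagram to a log file with a timestamp, and raise an alert when it returns 1; that gives both the "if" and the "when" of a panic without touching the monitored kernel.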
http://www.makelinux.net/ldd3/chp-2-sect-3#chp-2-ITERM-4135 describes communication between user space and kernel space.
Could anyone explain it with a simple user-space application program in C that links to and communicates with (sends values to and receives values from) the kernel object?
The program insmod, available on most Linux machines (but requiring sudo privileges to run) instructs the kernel to load a specified module (kernel object) through the system call init_module.
More generally, user-space programs communicate with the kernel through these system calls, which are essentially requests to the kernel from user space. Any application you write in C must use system calls in some way to interact with the system (for example, printf uses the write system call under the hood to put characters on the screen).
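To see the system call underneath printf, you can invoke write directly. This is a minimal sketch, with raw_write being a name made up for the example; it uses the generic syscall(2) entry point so that no buffering or wrapper logic from stdio is involved.

```c
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Send a string to a file descriptor by issuing the write system call
 * directly. Returns the number of bytes written, or -1 on error. */
ssize_t raw_write(int fd, const char *s)
{
    return (ssize_t)syscall(SYS_write, fd, s, strlen(s));
}
```

Calling raw_write(1, "hello\n") puts the text on standard output. The same descriptor-based calls work for a device node created by a kernel module, which is how a user-space program usually exchanges values with a driver.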
Just open a file with open(2). The C library wrapper for this call places the function arguments where the kernel expects them (in registers on most architectures) and then makes the program "crash" in a controlled way by executing a special trap instruction (see system call). The kernel catches all such traps and handles them.
Since this is a "good" crash, the kernel looks up which function to invoke, fetches the arguments, and invokes the function.
The reason for this complicated approach is security: by "crashing", the application completely relinquishes control. The CPU switches to a different mode, too. In this mode, code can access the hardware (in "application" mode, a direct access to the hardware leads to an "illegal access" crash, which terminates your app).
The open(2) function itself can't do much. Instead, it checks which file system can handle the request and invokes the open function of that file system. File systems are implemented inside the kernel, often as kernel modules.