I'm implementing a simple device driver. The program that uses this driver takes an argument from the user saying whether to use demand paging or prefetching (which fetches only the next page). When the user requests prefetching, it should send this information to the driver. The problem is that the fault handler has a standard signature, as follows:
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
So how can I pass this additional prefetching information through that interface, so that I can write a different routine for prefetching?
Or is there another way to achieve this?
[EDIT]
To give a clearer picture:
This is how the program takes its input:
./user_prog [filename] --prefetch
The user_prog sets some flags internally; now how do I send this flag information to dev.c (the driver file), given that the arguments of functions like the fault() above are fixed? I hope this makes things clearer.
You can use the flags argument of mmap() to pass your custom flags too.
void *mmap(void *addr, size_t length, int prot, int flags,
int fd, off_t offset);
Make sure your custom flag values use bits different from the flag values used by mmap(). From the manpage, the macros are defined in sys/mman.h. Find the exact values (they may vary across systems) with echo '#include <sys/mman.h>' | gcc -E - -dM | grep MAP_*. My system has these:
#define MAP_32BIT 0x40
#define MAP_TYPE 0x0f
#define MAP_EXECUTABLE 0x01000
#define MAP_FAILED ((void *) -1)
#define MAP_PRIVATE 0x02
#define MAP_ANON MAP_ANONYMOUS
#define MAP_LOCKED 0x02000
#define MAP_STACK 0x20000
#define MAP_NORESERVE 0x04000
#define MAP_HUGE_SHIFT 26
#define MAP_POPULATE 0x08000
#define MAP_DENYWRITE 0x00800
#define MAP_FILE 0
#define MAP_SHARED 0x01
#define MAP_GROWSDOWN 0x00100
#define MAP_HUGE_MASK 0x3f
#define MAP_HUGETLB 0x40000
#define MAP_FIXED 0x10
#define MAP_ANONYMOUS 0x20
#define MAP_NONBLOCK 0x10000
Some non-clashing flags would be 0x200 and 0x400.
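On the user-space side that could look roughly like this (a minimal sketch; MAP_PREFETCH is a made-up name, and how your driver actually retrieves the extra bit depends on how its mmap handler is written):
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>

/* Hypothetical custom flag: bit 0x200 does not clash with the MAP_* values above. */
#define MAP_PREFETCH 0x200

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file [--prefetch]\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    int flags = MAP_SHARED | (argc > 2 ? MAP_PREFETCH : 0);   /* --prefetch given? */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... access the mapping; the driver decides per fault whether to prefetch ... */
    return 0;
}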
I have a problem catching the performance monitoring interrupt (PMI, specifically from the instruction counter) on qemu-kvm. The code below works fine on a real machine (an Intel Core i5-4300U), but on qemu-kvm (qemu-system-x86_64 -cpu host) I do not see even one PMI, although the counter itself works normally and I can verify that it increments.
However, I have tested with the Linux kernel, and it catches the overflow interrupt just fine on the same qemu-kvm, so there is obviously a step I am missing when configuring the performance monitoring counter on qemu-kvm.
Can someone point it out to me?
Here is the pseudo-code:
#define LAPIC_SVR 0xF0
#define LAPIC_LVT_PERFM 0x340
#define CPU_LOCAL_APIC 0xFFFFFFFFBFFFE000
#define NMI_DELIVERY_MODE (0x4 << 8) // NMI
#define MSR_PERF_GLOBAL_CTRL 0x38F
#define MSR_PERF_FIXED_CTRL 0x38D
#define MSR_PERF_FIXED_CTR0 0x309
#define MSR_PERF_GLOBAL_OVF_CTRL 0x390
/*Configure LAPIC*/
apic_base = Msr::read<Paddr>(Msr::IA32_APIC_BASE)
map(CPU_LOCAL_APIC, apic_base & 0xFFFFF000) // No caching, etc.
Msr::write (Msr::IA32_APIC_BASE, apic_base | 0x800);
write (LAPIC_SVR, read (LAPIC_SVR) | 0x100);
*reinterpret_cast<uint32 volatile *>(CPU_LOCAL_APIC + LAPIC_LVT_PERFM) = NMI_DELIVERY_MODE;
/*Configure MSR_PERF_FIXED_CTR0 to have overflow interrupt*/
Msr::write(Msr::MSR_PERF_GLOBAL_CTRL, Msr::read<uint64>(Msr::MSR_PERF_GLOBAL_CTRL) | (1ull<<32)); // enable IA32_PERF_FIXED_CTR0
Msr::write(Msr::MSR_PERF_FIXED_CTRL, 0xa); // configure IA32_PERF_FIXED_CTR0 to count in user mode and interrupt on overflow
Msr::write(Msr::MSR_PERF_FIXED_CTR0, (1ull << 48) - 0x1000); // overflow after 0x1000 instructions
Msr::write(Msr::MSR_PERF_GLOBAL_OVF_CTRL, Msr::read<uint64>(Msr::MSR_PERF_GLOBAL_OVF_CTRL) & ~(1UL<<32)); // clear overflow condition
I'm compiling a 32-bit binary but want to embed some 64-bit assembly in it.
void method() {
    asm("...64 bit assembly...");
}
Of course, when I compile I get errors about bad register names, because the registers are 64-bit:
evil.c:92: Error: bad register name `%rax'
Is it possible to add some annotation so gcc will process the asm sections using the 64-bit assembler instead? I have a workaround, which is to compile separately, map in a page with PROT_EXEC|PROT_WRITE, and copy in my code, but this is very awkward.
No, this isn't possible. You can't run 64-bit assembly from a 32-bit binary, as the processor will not be in long mode while running your program.
Copying 64-bit code to an executable page will result in that code being interpreted incorrectly as 32-bit code, which will have unpredictable and undesirable results.
Don't try to put 64-bit machine-code inside a compiler-generated function. It might work since the encoding for function prologue/epilogue is the same in 32 and 64-bit, but it would be cleaner to just have a separate block of 64-bit code.
The easiest thing is probably to assemble that block in a separate file, using GAS .code64 or NASM BITS 64 to get 64-bit code in an object file you can link into a 32-bit executable.
You said in a comment you're thinking of using this for a kernel exploit against a 64-bit kernel from a 32-bit user-space process, so you just need some code bytes in an executable part of your process's memory and a way to get a pointer to that block. This is certainly plausible; if you can gain control of the kernel's RIP from a 32-bit process, this is what you want, because kernel code will always be running in long mode.
If you were doing something with 64-bit userspace code in a process that started in 32-bit mode, you could maybe far jmp to the block of 64-bit code (as #RossRidge suggests), using a known value for the kernel's __USER_CS 64-bit code segment descriptor. syscall from 64-bit code should return in 64-bit mode, but if not, try the int 0x80 ABI. It always returns to the mode you were in, saving/restoring cs and ss along with rip and rflags. (What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?)
.rodata is part of the text segment of your executable, so just get the compiler to put bytes in a const array. Fun fact: const int main = 195; compiles to a program that exits without segfaulting, because 195 = 0xc3 = the x86 encoding for ret (and x86 is little-endian). For an arbitrary-length machine-code sequence, const char funcname[] = { 0x90, 0x90, ..., 0xc3 }; will work. The const is necessary, otherwise it will go in .data (read/write/noexec) instead of .rodata.
You could use const char funcname[] __attribute__((section(".text"))) = { ... }; to control what section it goes in (e.g. .text along with compiler-generated functions), or even a linker script to get more control.
If you really want to do it all in one .c file, instead of using the easier solution of a separately-assembled pure asm source:
To assemble some 64-bit code along with compiler-generated 32-bit code, use the .code64 GAS directive in an asm statement outside of any functions. IDK if there's any guarantee about how gcc will mix that asm with its own asm output, but it won't put it in the middle of a function.
asm(".pushsection .text \n\t" // AFAIK, there's no guarantee how this will mix with compiler asm output
".code64 \n\t"
".p2align 4 \n\t"
".globl my_codebytes \n\t" // optional
"my_codebytes: \n\t"
"inc %r10d \n\t"
"my_codebytes_end: \n\t"
//"my_codebytes_len: .long . - my_codebytes\n\t" // store the length in memory. Optional
".popsection \n\t"
#ifdef __i386
".code32" // back to 32-bit interpretation for gcc's code
// "\n\t inc %r10" // uncomment to check that it *doesn't* assemble
#endif
);
#ifdef __cplusplus
extern "C" {
#endif
// put C names on the labels.
// They are *not* pointers, their addresses are link-time constants
extern char my_codebytes[], my_codebytes_end[];
//extern const unsigned my_codebytes_len;
#ifdef __cplusplus
}
#endif
// This expression for the length isn't a compile-time constant, so this isn't legal C
//static const unsigned len = &my_codebytes_end - &my_codebytes;
#include <stddef.h>
#include <unistd.h>
int main(void) {
    size_t len = my_codebytes_end - my_codebytes;
    const char* bytes = my_codebytes;

    // do whatever you want. Writing it to stdout is one option!
    write(1, bytes, len);
}
This compiles and assembles with gcc and clang (compiler explorer).
I tried it on my desktop to double check:
peter#volta$ gcc -m32 -Wall -O3 /tmp/foo.c
peter#volta$ ./a.out | hd
00000000 41 ff c2 |A..|
00000003
This is the correct encoding for inc %r10d :)
The program also works when compiled without -m32, because I used #ifdef to decide whether to use .code32 at the end or not. (There's no push/pop mode directive like there is for sections.)
Of course, disassembling the binary will show you:
00000580 <my_codebytes>:
580: 41 inc ecx
581: ff c2 inc edx
because the disassembler doesn't know to switch to 64-bit disassembly for that block. (I wonder if ELF has attributes for that... I didn't use any assembler directives or linker scripts to generate such attributes, if such a thing exists.)
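(If you need a readable listing of that block, forcing 64-bit disassembly of the whole file with objdump -d -M x86-64 should show it correctly, at the cost of mis-decoding the surrounding 32-bit code.)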
Switching between long mode and compatibility mode is done by changing CS. User mode code cannot modify the descriptor table, but it can perform a far jump or far call to a code segment that is already present in the descriptor table. In Linux the required descriptor is present (in my experience; this may not be true for all installations).
Here is sample code for 64-bit Linux (Ubuntu) that starts in 32-bit mode, switches to 64-bit mode, runs a function, and then switches back to 32-bit mode. Build with gcc -m32.
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
extern bool switch_cs(int cs, bool (*f)());
extern bool check_mode();
int main(int argc, char **argv)
{
    int cs = 0x33;
    if (argc > 1)
        cs = strtoull(argv[1], 0, 16);
    printf("switch to CS=%02x\n", cs);
    bool r = switch_cs(cs, check_mode);
    if (r)
        printf("cs=%02x: 64-bit mode\n", cs);
    else
        printf("cs=%02x: 32-bit mode\n", cs);
    return 0;
}
.intel_syntax noprefix
.text
.code32
.globl switch_cs
switch_cs:
    mov eax, [esp+4]        # eax = cs argument (selector to switch to)
    mov edx, [esp+8]        # edx = f (function to call in the other mode)
    push 0                  # upper half of an 8-byte slot for f, so the
    push edx                #  64-bit call below reads f zero-extended
    push eax                # segment part of the far pointer (low 16 bits used)
    push offset .L2         # offset part of the far pointer
    lea eax, [esp+8]        # eax = address of the 8-byte slot holding f
    lcall [esp]             # far call through the pointer just built: loads CS
    add esp, 16             # back in the original mode; discard the four pushes
    ret
.L2:                        # executed with the new CS (64-bit mode for a long-mode selector)
    call [eax]              # call f; in 64-bit mode this decodes as call qword ptr [rax]
    lret                    # far return: restores the caller's CS (and thus mode) and EIP
.code64
.globl check_mode
check_mode:
    xor eax, eax
    // In 32-bit mode, the bytes of the next instruction (48 85 c0) execute as
    // dec eax; test eax, eax, so ZF is clear and check_mode returns false
    test rax, rax
    setz al
    ret
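For reference, on a typical 64-bit Ubuntu install (where selector 0x33 is the 64-bit user code segment and 0x23 the 32-bit one), running ./a.out with no argument should print cs=33: 64-bit mode, and ./a.out 23 should print cs=23: 32-bit mode.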
Microsoft's documentation for PE/COFF says of the type field in the symbol table:
"The most significant byte specifies whether the symbol is a pointer to, function returning, or array of the base type that is specified in the LSB. Microsoft tools use this field only to indicate whether the symbol is a function, so that the only two resulting values are 0x0 and 0x20 for the Type field."
However, the documentation and winnt.h both specify that IMAGE_SYM_DTYPE_FUNCTION = 2, not 0x20. Even if this is taken to be the value of the MSB, that would give a value for the entire field of 0x200, not 0x20.
What am I missing?
Check winnt.h for the following lines:
// type packing constants
#define N_BTMASK 0x000F
#define N_TMASK 0x0030
#define N_TMASK1 0x00C0
#define N_TMASK2 0x00F0
#define N_BTSHFT 4
#define N_TSHIFT 2
// MACROS
// Basic Type of x
#define BTYPE(x) ((x) & N_BTMASK)
// Is x a pointer?
#ifndef ISPTR
#define ISPTR(x) (((x) & N_TMASK) == (IMAGE_SYM_DTYPE_POINTER << N_BTSHFT))
#endif
// Is x a function?
#ifndef ISFCN
#define ISFCN(x) (((x) & N_TMASK) == (IMAGE_SYM_DTYPE_FUNCTION << N_BTSHFT))
#endif
So it seems the official MSB/LSB description is wrong: they are not bytes but nibbles. 0x20 is then a function (most-significant nibble = 2) returning the base type IMAGE_SYM_TYPE_NULL (least-significant nibble = 0).
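You can check this with a small test program (a minimal sketch; the constants are copied from winnt.h so it also builds outside Windows):
#include <stdio.h>

/* values as defined in winnt.h */
#define IMAGE_SYM_TYPE_NULL      0
#define IMAGE_SYM_DTYPE_FUNCTION 2
#define N_BTMASK 0x000F
#define N_TMASK  0x0030
#define N_BTSHFT 4

#define BTYPE(x) ((x) & N_BTMASK)
#define ISFCN(x) (((x) & N_TMASK) == (IMAGE_SYM_DTYPE_FUNCTION << N_BTSHFT))

int main(void)
{
    unsigned type = 0x20;                              /* the value Microsoft tools emit */
    printf("ISFCN(0x%02x) = %d\n", type, ISFCN(type)); /* prints 1 */
    printf("BTYPE(0x%02x) = %d\n", type, BTYPE(type)); /* prints 0 = IMAGE_SYM_TYPE_NULL */
    return 0;
}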
Linux 3.4.6 defines the following macros in arch/x86/include/asm/segment.h. Can anybody explain why the __USER macros add 3 to the defined constant and why this is not done for __KERNEL macros?
#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
#define __KERNEL_DS (GDT_ENTRY_KERNEL_DS*8)
#define __USER_DS (GDT_ENTRY_DEFAULT_USER_DS*8+3)
#define __USER_CS (GDT_ENTRY_DEFAULT_USER_CS*8+3)
These four symbols are segment selector values. The two least-significant bits of a selector contain the privilege level associated with it, and the third least-significant bit contains the descriptor table type (GDT or LDT). This is made clearer by code that appears a little later:
/* User mode is privilege level 3 */
#define USER_RPL 0x3
/* LDT segment has TI set, GDT has it cleared */
#define SEGMENT_LDT 0x4
#define SEGMENT_GDT 0x0
/* Bottom two bits of selector give the ring privilege level */
#define SEGMENT_RPL_MASK 0x3
/* Bit 2 is table indicator (LDT/GDT) */
#define SEGMENT_TI_MASK 0x4
To build a selector, the descriptor table index is multiplied by 8, which shifts it three bits to the left, and is then ORed with the table type and privilege level (written here as an addition):
/* GDT, ring 0 (kernel mode) */
#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
/* GDT, ring 3 (user mode) */
#define __USER_CS (GDT_ENTRY_DEFAULT_USER_CS*8+3)
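As a concrete example, here is a small sketch that decomposes the two CS selectors; the GDT indices used (GDT_ENTRY_KERNEL_CS = 2, GDT_ENTRY_DEFAULT_USER_CS = 6) are the usual x86-64 values and may differ on 32-bit kernels:
#include <stdio.h>

#define SEGMENT_RPL_MASK 0x3            /* bottom two bits: privilege level */
#define SEGMENT_TI_MASK  0x4            /* bit 2: 0 = GDT, 1 = LDT */

/* assumed x86-64 GDT indices */
#define GDT_ENTRY_KERNEL_CS       2
#define GDT_ENTRY_DEFAULT_USER_CS 6

#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
#define __USER_CS   (GDT_ENTRY_DEFAULT_USER_CS*8+3)

static void dump(const char *name, unsigned sel)
{
    printf("%s = 0x%02x: index %u, %s, ring %u\n", name, sel,
           sel >> 3,                                  /* descriptor table index */
           (sel & SEGMENT_TI_MASK) ? "LDT" : "GDT",   /* table indicator */
           sel & SEGMENT_RPL_MASK);                   /* privilege level */
}

int main(void)
{
    dump("__KERNEL_CS", __KERNEL_CS);   /* 0x10: index 2, GDT, ring 0 */
    dump("__USER_CS",   __USER_CS);     /* 0x33: index 6, GDT, ring 3 */
    return 0;
}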