Running code at memory location in my OS - gcc

I am developing an OS in C (and some assembly of course) and now I want to allow it to load/run external (placed in the RAM-disk) programs. I have assembled a test program as raw machine code with nasm using '-f bin'. Everything else i found on the subject is loading code while running Windows or Linux. I load the program into memory using the following code:
#define BIN_ADDR 0xFF000
int run_bin(char *file) //Too many hacks at the moment
{
u32int size = 0;
char *bin = open_file(file, &size);
printf("Loaded [%d] bytes of [%s] into [%X]\n", size, file, bin);
char *reloc = (char *)BIN_ADDR; //no malloc because of the org statement in the prog
memset(reloc, 0, size);
memcpy(reloc, bin, size);
jmp_to_bin();
}
and the code to jump to it:
[global jmp_to_bin]
jmp_to_bin:
jmp [bin_loc] ;also tried a plain jump
bin_loc dd 0xFF000
This caused a GPF when I ran it. I could give you the registers at the GPF and/or a screenshot if needed.
Code for my OS is at https://github.com/farlepet/retro-os
Any help would be greatly appreciated.

You use identity mapping and flat memory space, hence address 0xff000 is gonna be in the BIOS ROM range. No wonder you can't copy stuff there. Better change that address ;)

Related

Cortex M0 HardFault_Handler and getting the fault address

I'm having a HardFault when executing my program. I've found dozens of ways to get PC's value, but I'm using Keil uVision 5 and none of them has worked.
As far as I know I'm not in a multitasking context, and PSP contains 0xFFFFFFF1, so adding 24 to it would cause overflow.
Here's what I've managed to get working (as in, it compiles and execute):
enum { r0, r1, r2, r3, r12, lr, pc, psr};
extern "C" void HardFault_Handler()
{
uint32_t *stack;
__ASM volatile("MRS stack, MSP");
stack += 0x20;
pc = stack[pc];
psr = stack[psr];
__ASM volatile("BKPT #01");
}
Note the "+= 0x20", which is here to compensate for C function stack.
Whenever I read the PC's value, it's 0.
Would anyone have working code for that?
Otherwise, here's how I do it manually:
Put a breakpoint on HardFault_Handler (the original one)
When it breaks, look as MSP
Add 24 to its value.
Dump memory at that address.
And there it is, 0x00000000.
What am I doing wrong?
A few problems with your code
uint32_t *stack;
__ASM volatile("MRS stack, MSP");
MRS supports register destinations only. Your assembler migt be clever enough to transfer it to a temporary register first, but I'd like to see the machine code generated from that.
If you are using some kind of multitasking system, it might use PSP instead of MSP. See the linked code below on how one can distinguish that.
pc = stack[pc];
psr = stack[psr];
It uses the previous values of pc and psr as an index. Should be
pc = stack[6];
psr = stack[7];
Whenever I read the PC's value, it's 0.
Your program might actually have jumped to address 0 (e.g. through a null function pointer), tried to execute the value found there, which was probably not a valid instruction but the initial SP value from the vector table, and faulted on that. This code
void (*f)(void) = 0;
f();
does exactly that, I'm seeing 0x00000000 at offset 24.
Would anyone have working code for that?
This works for me. Note the code choosing between psp and msp, and the __attribute__((naked)) directive. You could try to find some equivalent for your compiler, to prevent the compiler from allocating a stack frame at all.

Writing to .text section of a user process from kernel space

I'm writing a kernel space component for a research project which requires me to intercept and checkpoint a user space process at different points in its execution (specific instructions.) For various reasons I cannot modify the user-space program or ptrace that process.
To accomplish this goal I'm attempting to insert an breakpoint (INT 3 instruction) in the user-space process at the point I need to checkpoint it, and then intercept the SIGTRAP in kernel space. Unfortunately, I can't seem to figure out how to properly modify the read-only text section of the user-space code from the kernel space of that process. I'm currently attempting to use the get_user_pages API to force the pages writable, and modify them, but the text data doesn't seem to change. The relevant portions of the code I'm attempting to use are below. user_addr is the user-space address to insert a breakpoint at (unsigned long); page is a struct page *.
char *addr;
unsigned long aligned_user_addr = user_addr & ~((unsigned long)PAGE_SIZE - 1);
down_read(&current->mm->mmap_sem);
rc = get_user_pages(current, current->mm, aligned_user_addr,
1, 1, 1, &page, &vma);
up_read(&current->mm->mmap_sem);
BUG_ON(rc != 1);
addr = kmap(page);
BUG_ON(!addr);
offs = user_addr % PAGE_SIZE;
/* NOTE: INT3_INSTR is defined to be 0xCC */
addr[offs] = INT3_INSTR;
BUG_ON(addr[offs] != INT3_INSTR); // Assertion fails
set_page_dirty(page);
kunmap(page);
page_cache_release(page);
I'm hoping someone with more kernel knowledge and experience will be able to tell me what I'm doing wrong, or the proper way to go about accomplishing my task.
Thank you for your help.
It turns out that my issue was actually with C sign extension. INT3_INSTR was defined as:
#define INT3_INSTR 0xCC
Which makes it an integer, and the line:
BUG_ON(addr[offs] != INT3_INSTR);
evaluated addr[offs] to be a signed char. In c when a signed char is compared to an int its type is elevated to that of int, and since its signed it will be signed extended if its MSB is 1. As 0xCC's MSB is always 1 the comparison always evaluated to:
BUG_ON(0xFFFFFFCC != 0xCC);
Which evaluated as false. Changing addr to a unsigned char * resolves the issue. and then the above code works.

To understand the concept of the loader in LINUX by simple example?

As I understand, the core of a boot loader is a loader program. By loader, I mean the program that will load another program. Or to be more specific first it will load itself then the high level image - for example kernel. Instead of making a bootloader, I thought to clear my doubts on loader by running on an OS that will load another program. I do understand that every process map is entirely independent to another. So, what I am trying to do is make a simple program hello_world.c this will print the great "hello world". Now, I want to make a loader program that will load this program hello world. As I understand the crux is in two steps
Load the hello world program on the RAM - loader address.
JMP to the Entry Address.
Since, this is to understand the concept, I am using the readymade utility readelf to read the address of the hello world binary. The intention here is not to make a ELF parser.
As all the process are independent and use virtual memory. This will fail, If I use the virtual memory addresses. Now, I am stuck over here, how can I achieve this?
#include "stdio.h"
#include <sys/mman.h>
int main( int argc, char **argv)
{
char *mem_ptr;
FILE *fp;
char *val;
char *exec;
mem_ptr = (char*) malloc(10*1024);
fp = fopen("./hello_world.out","rb");
fread(mem_ptr, 10240, 1, fp);
//val = mem_ptr + 0x8048300;
printf("The mem_ptr is %p\r\n",mem_ptr);
exec = mmap(NULL, 10240, PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_PRIVATE | MAP_ANONYMOUS, 0x9c65008, 0);
memcpy(mem_ptr,exec,10240);
__asm__("jmp 0x9c65008");
fclose(fp);
return 0;
}
my rep is not enough to let me add comments.
As Chris Stratton said, your problem sounds ambiguous(still after editing!). Do you want to
Write a bootloader, that will load "Hello, World" instead of real OS? <--Actual Problem is saying this OR
Write a program, that will be running on OS(so full fledged OS will be there), and load another executable using this program?<--Comments are saying this
Answers will vary a lot depending on this.
In first case, bootloader is present on BIOS, that will fetch some predefined memory block to RAM. So what u need to do is just place your Hello, World at this place. There are many things regarding this, such as chain loading and all, but not sure if this is what you want achieve. If this is NOT something you wanted, why is bootstrap tag used?
In second case, fork() + exec() will do it for you. But be sure that this way, there will be two different address spaces. If you want them in the same address space, I am doubtful about daily used OS(for normal guys). Most of the your part sounds like this is what you want to do.
If you want to ask something different than this, please edit almost entire question and ask ONLY that part.(Avoid telling why you are trying to do something, what you think you already understand etc)

How can I force MacOS to release MADV_FREE'd pages?

My program has a custom allocator which gets memory from the OS using mmap(MAP_ANON | MAP_PRIVATE). When it no longer needs memory, the allocator calls either munmap or madvise(MADV_FREE). MADV_FREE keeps the mapping around, but tells the OS that it can throw away the physical pages associated with the mapping.
Calling MADV_FREE on pages you're going to need again eventually is much faster than calling munmap and later calling mmap again.
This almost works perfectly for me. The only problem is that, on MacOS, MADV_FREE is very lazy about getting rid of the pages I've asked it to free. In fact, it only gets rid of them when there's memory pressure from another application. Until it gets rid of the pages I've freed, MacOS reports that my program is still using that memory; in the Activity Monitor, its "Real Memory" column doesn't reflect the freed memory.
This makes it difficult for me to measure how much memory my program is actually using. (This difficulty in measuring RSS is keeping us from landing the custom allocator on 10.5.)
I could allocate a whole bunch of memory to force the OS to free up these pages, but in addition to taking a long time, that could have other side-effects, such as causing parts of my program to be paged out to disk.
On a lark, I tried the purge command, but that has no effect.
How can I force MacOS to clean out these MADV_FREE'd pages? Or, how can I ask MacOS how many MADV_FREE'd pages my process has in memory?
Here's a test program, if it helps. The Activity Monitor's "Real Memory" column shows 512MB after the program goes to sleep. On my Linux box, top shows 256MB of RSS, as desired.
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#define SIZE (512 * 1024 * 1024)
// We use MADV_FREE on Mac and MADV_DONTNEED on Linux.
#ifndef MADV_FREE
#define MADV_FREE MADV_DONTNEED
#endif
int main()
{
char *x = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
// Touch each page we mmap'ed so it gets a physical page.
int i;
for (i = 0; i < SIZE; i += 1024) {
x[i] = i;
}
madvise(x, SIZE / 2, MADV_FREE);
fprintf(stderr, "Sleeping. Now check my RSS. Hopefully it's %dMB.\n", SIZE / (2 * 1024 * 1024));
sleep(1024);
return 0;
}
mprotect(addr, length, PROT_NONE);
mprotect(addr, length, PROT_READ | PROT_WRITE);
Note as you say, madvise is lazier, and that is probably better for performance (just in case anyone is tempted to use this for performance rather than measurement).
Use MADV_FREE_REUSABLE on macOS. According to Apple's magazine_malloc implementation:
On OS X we use MADV_FREE_REUSABLE, which signals the kernel to remove the given pages from the memory statistics for our process. However, on returning that memory to use we have to signal that it has been reused.
https://opensource.apple.com/source/libmalloc/libmalloc-53.1.1/src/magazine_malloc.c.auto.html
Chromium, for example, also uses it:
MADV_FREE_REUSABLE is similar to MADV_FREE, but also marks the pages with the reusable bit, which allows both Activity Monitor and memory-infra to correctly track the pages.
https://github.com/chromium/chromium/blob/master/base/memory/discardable_shared_memory.cc#L377
I've looked and looked, and I don't think this is possible. :\
We're solving the problem by adding code to the allocator which explicitly decommits MADV_FREE'd pages when we ask it to.

How to align stack at 32 byte boundary in GCC?

I'm using MinGW64 build based on GCC 4.6.1 for Windows 64bit target. I'm playing around with the new Intel's AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx.
But I started running into segmentation fault errors when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack for Windows 64bit has only 16 byte alignment.
How can I change the GCC's stack alignment to 32 bytes?
I have tried using -mstackrealign but to no avail, since that aligns only to 16 bytes. I couldn't make __attribute__((force_align_arg_pointer)) work either, it aligns to 16 bytes anyway. I haven't been able to find any other compiler options that would address this. Any help is greatly appreciated.
EDIT:
I tried using -mpreferred-stack-boundary=5, but GCC says that 5 is not supported for this target. I'm out of ideas.
I have been exploring the issue, filed a GCC bug report, and found out that this is a MinGW64 related problem. See GCC Bug#49001. Apparently, GCC doesn't support 32-byte stack alignment on Windows. This effectively prevents the use of 256-bit AVX instructions.
I investigated a couple ways how to deal with this issue. The simplest and bluntest solution is to replace of aligned memory accesses VMOVAPS/PD/DQA by unaligned alternatives VMOVUPS etc. So I learned Python last night (very nice tool, by the way) and pulled off the following script that does the job with an input assembler file produced by GCC:
import re
import fileinput
import sys
# fix aligned stack access
# replace aligned vmov* by unaligned vmov* with 32-byte aligned operands
# see Intel's AVX programming guide, page 39
vmova = re.compile(r"\s*?vmov(\w+).*?((\(%r.*?%ymm)|(%ymm.*?\(%r))")
aligndict = {"aps" : "ups", "apd" : "upd", "dqa" : "dqu"};
for line in fileinput.FileInput(sys.argv[1:],inplace=1):
m = vmova.match(line)
if m and m.group(1) in aligndict:
s = m.group(1)
print line.replace("vmov"+s, "vmov"+aligndict[s]),
else:
print line,
This approach is pretty safe and foolproof. Though I observed a performance penalty on rare occasions. When the stack is unaligned, the memory access crosses the cache line boundary. Fortunately, the code performs as fast as aligned accesses most of the time. My recommendation: inline functions in critical loops!
I also attempted to fix the stack allocation in every function prolog using another Python script, trying to align it always at the 32-byte boundary. This seems to work for some code, but not for other. I have to rely on the good will of GCC that it will allocate aligned local variables (with respect to the stack pointer), which it usually does. This is not always the case, especially when there is a serious register spilling due to the necessity to save all ymm register before a function call. (All ymm registers are callee-save). I can post the script if there's an interest.
The best solution would be to fix GCC MinGW64 build. Unfortunately, I have no knowledge of its internal workings, just started using it last week.
You can get the effect you want by
Declaring your variables not as variables, but as fields in a struct
Declaring an array that is larger than the structure by an appropriate amount of padding
Doing pointer/address arithmetic to find a 32 byte aligned address in side the array
Casting that address to a pointer to your struct
Finally using the data members of your struct
You can use the same technique when malloc() does not align stuff on the heap appropriately.
E.g.
void foo() {
struct I_wish_these_were_32B_aligned {
vec32B foo;
char bar[32];
}; // not - no variable definition, just the struct declaration.
unsigned char a[sizeof(I_wish_these_were_32B_aligned) + 32)];
unsigned char* a_aligned_to_32B = align_to_32B(a);
I_wish_these_were_32B_aligned* s = (I_wish_these_were_32B_aligned)a_aligned_to_32B;
s->foo = ...
}
where
unsigned char* align_to_32B(unsiged char* a) {
uint64_t u = (unit64_t)a;
mask_aligned32B = (1 << 5) - 1;
if (u & mask_aligned32B == 0) return (unsigned char*)u;
return (unsigned char*)((u|mask_aligned_32B) + 1);
}
I just ran in the same issue of having segmentation faults when using AVX inside my functions. And it was also due to the stack misalignment. Given the fact that this is a compiler issue (and the options that could help are not available in Windows), I worked around the stack usage by:
Using static variables (see this issue). Given the fact that they are not stored in the stack, you can force their alignment by using __attribute__((align(32))) in your declaration. For example: static __m256i r __attribute__((aligned(32))).
Inlining the functions/methods receiving/returning AVX data. You can force GCC to inline your function/method by adding inline and __attribute__((always_inline)) to your function prototype/declaration. Inlining your functions increase the size of your program, but they also prevent the function from using the stack (and hence, avoids the stack-alignment issue). Example: inline __m256i myAvxFunction(void) __attribute__((always_inline));.
Be aware that the usage of static variables is no thread-safe, as mentioned in the reference. If you are writing a multi-threaded application you may have to add some protection for your critical paths.

Resources