Shared memory changes aren't visible in different process - windows

I have a problem. As far as I know, processes on Windows share dynamically linked libraries, so that only one instance of each library exists in memory at a time. Knowing that, I wrote a small C program that changes some data in this shared section. In my example, I chose to overwrite the beginning of the MessageBoxW function. This is the code:
#include <Windows.h>
#include <stdio.h>
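#include <tchar.h>
#include <string.h> // strcmp, memcpy
#define SIZE 12 // size of JMP byte array defined below
int WINAPI CustomMessageBoxW(HWND, LPCWSTR, LPCWSTR, UINT);
void BeginRedirect(LPVOID, LPVOID); // (oldFunction, newFunction), matching the definition below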
// JMP bytes translated to assembly (x64):
// mov rax, 0x1234567890ABCDEF - this value will be changed to newFunction address, in BeginRedirect procedure
// jmp rax - jump to newFunction
BYTE JMP[SIZE] = { 0x48, 0xB8, 0xEF, 0xCD, 0xAB, 0x90, 0x78, 0x56, 0x34, 0x12, 0xFF, 0xE0 };
DWORD oldProtect, myProtect = PAGE_EXECUTE_READWRITE;
int main()
{
printf("MessageBoxW address: %p\n", MessageBoxW);
printf("redirect? ");
char res[4];
scanf_s("%3s", res, (unsigned)_countof(res)); // width 3 leaves room for the terminating NUL
if (strcmp(res, "yes") == 0) // redirect
{
printf("redirecting...\n");
BeginRedirect(MessageBoxW, CustomMessageBoxW);
}
while (1)
{
MessageBoxW(NULL, L"This is original MessageBoxW", L"Caption", MB_OKCANCEL);
printf("MessageBoxW address: %p\nBytes:\n", MessageBoxW);
for (int i = 0; i < 20; i++)
{
printf("0x%hhX ", *((char*)MessageBoxW + i));
}
puts("\n");
SleepEx(1000, FALSE);
}
}
void BeginRedirect(LPVOID oldFunction, LPVOID newFunction)
{
BYTE tempJMP[SIZE];
memcpy(tempJMP, JMP, SIZE);
BOOL result = VirtualProtect(oldFunction, SIZE, PAGE_EXECUTE_READWRITE, &oldProtect);
printf("\tVirtualProtect result: %u\n", result);
memcpy(tempJMP + 2, &newFunction, 8); // change the basic 0x1234... address to the actual function address
memcpy(oldFunction, tempJMP, SIZE);
result = VirtualProtect(oldFunction, SIZE, oldProtect, &myProtect);
printf("\tVirtualProtect result: %u\n", result);
}
int WINAPI CustomMessageBoxW(HWND hWnd, LPCWSTR lpText, LPCWSTR lpCaption, UINT uiType)
{
printf("MyMessageBoxW: Custom message\n");
return IDOK; // the function is declared int, so it must return a value
}
The program lets me choose whether to redirect the function in the current instance. Here comes the interesting part.
I run the first instance. When asked whether to redirect, I type "yes", so the program does. It overwrites the beginning of MessageBoxW so that it jumps to my CustomMessageBoxW. Then, in the while loop, the program calls MessageBoxW every second and prints some debugging information (the first 20 bytes of the function). In this instance, the redirection works properly: instead of showing a popup, the program prints "MyMessageBoxW: Custom message" every second (as expected from CustomMessageBoxW).
Then I run a second instance of the program (while the first is still running). This time I decide not to redirect the function (I type anything other than "yes"). From the information printed by both instances, I can see that their MessageBoxW addresses are identical. At that point, I expected that, since the addresses are the same (both instances share one copy of user32.dll, which contains MessageBoxW), the second instance, which did not modify MessageBoxW itself, would still try to jump to CustomMessageBoxW, probably causing a memory access violation. But no. The second instance works just fine and pops up a standard Windows message box, while the first instance keeps executing the redirected function (remember that the address of MessageBoxW is the same in both instances). On top of that, the bytes printed by
printf("MessageBoxW address: %p\nBytes:\n", MessageBoxW);
for (int i = 0; i < 20; i++)
{
printf("0x%hhX ", *((char*)MessageBoxW + i));
}
differ completely between the two instances, even though the function address is the same.
I even debugged both instances at the same time using WinDbg, and it also showed that the two instances stored different values at the same address. I'd really appreciate it if someone could figure out what is actually going on here. Thanks!
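The observation above (same address, different bytes in each process) is what one would expect if the library's pages are mapped copy-on-write, so that a write gives the writing process its own private copy of the page. As a minimal sketch of that mechanism, here is a POSIX analogue, not Windows code: it assumes a Linux-like system with mmap and fork, and the function name parent_view_after_child_write is made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the byte the parent still sees at the shared
 * virtual address after the child has overwritten it, or -1 on error. */
int parent_view_after_child_write(void)
{
    /* Private mapping: parent and child share the page until one writes. */
    unsigned char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    page[0] = 0x48;                 /* the "original" first byte */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(page, 4096);
        return -1;
    }
    if (pid == 0) {
        page[0] = 0xE9;             /* the write lands in the child's private copy */
        _exit(page[0] == 0xE9 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(page, 4096);
        return -1;
    }
    int seen = page[0];             /* same virtual address, parent's copy */
    munmap(page, 4096);
    return seen;
}
```

Calling parent_view_after_child_write() from main returns 0x48: the child patched the byte at the same virtual address, but only in its own copy of the page.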

Related

JTAG debugging adding pointer variable to Expressions

I am learning about debugging. I use JTAG to debug my ESP32-S3 microcontroller and created a very simple program:
uint8_t counter = 0;
void remote_task(void* param){
uint8_t * ptr;
ptr = &counter;
for (;;)
{
counter ++;
printf("hello from remote task \n");
printf("ptr value = %u \n",*ptr);
vTaskDelay(1000/portTICK_PERIOD_MS);
}
}
I have created a variable counter that is incremented once per second. Inside my remote task I created a pointer that points to counter. I have set up Expressions to monitor both (counter and ptr), but there seems to be an issue with pointers in Expressions.
Could someone help me understand why I cannot see the value and the address that ptr points to?

mmap() RWX page on MacOS (ARM64 architecture)?

I've been trying to map a page that is both writable AND executable.
mov x0, 0 // start address
mov x1, 4096 // length
mov x2, 7 // rwx
mov x3, 0x1001 // flags
mov x4, -1 // file descriptor
mov x5, 0 // offset
movl x16, 0x200005c // mmap
svc 0
This gives me error code 0xD (EACCES, which the documentation unhelpfully blames on an invalid file descriptor, although the same documentation says to use '-1'). I think the code is correct: it returns a valid mapping if I pass just 'r--' for the permissions.
I know the same code works on Catalina on x64. I also checked that the same error occurs with SIP disabled.
For more context, I'm trying to port a FORTH implementation to macOS/ARM64, and this FORTH, like many others, relies heavily on self-modifying code and on assembling code at runtime. The code that does the assembling/compiling resides in the middle of the newly created code (in fact, part of the compiler will be generated in machine language as part of running FORTH), so it's very hard, if not infeasible, to separate the FORTH JIT compiler (if you can call it that) from the generated code.
Now, I really don't want to end up with the answer "Apple thinks they know better than you, no FORTH for you!", but that is what it looks like so far. Thanks for any help!
You need to toggle the mapping between writable and executable for the current thread; it cannot be both at the same time. I think it is actually possible to have both on the same memory by using 2 different threads, but I haven't tried.
Before you write to the memory you mmap, call this:
pthread_jit_write_protect_np(0);
sys_icache_invalidate(addr, size);
Then when you are done writing to it you can switch back again like this:
pthread_jit_write_protect_np(1);
sys_icache_invalidate(addr, size);
This is the full code I am using right now
#include <stdio.h>
#include <sys/mman.h>
#include <pthread.h>
#include <libkern/OSCacheControl.h>
#include <stdlib.h>
#include <stdint.h>
uint32_t* c_get_memory(uint32_t size) {
int prot = PROT_READ | PROT_WRITE | PROT_EXEC;
int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT;
int fd = -1;
int offset = 0;
uint32_t* addr = 0;
addr = (uint32_t*)mmap(0, size, prot, flags, fd, offset);
if (addr == MAP_FAILED){
printf("failure detected\n");
exit(-1);
}
pthread_jit_write_protect_np(0);
sys_icache_invalidate(addr, size);
return addr;
}
void c_jit(uint32_t* addr, uint32_t size) {
pthread_jit_write_protect_np(1);
sys_icache_invalidate(addr, size);
void (*foo)(void) = (void (*)())addr;
foo();
}
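For comparison, the same write-then-execute toggle can be sketched with plain POSIX calls. This is an assumption-laden illustration, not the Apple Silicon mechanism from the answer above: it uses mmap plus mprotect instead of MAP_JIT and pthread_jit_write_protect_np, it may be refused on systems that restrict executable mappings, and load_code_wx is a hypothetical name. The sketch only copies the bytes and flips the protection; it does not execute them.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: copies `len` bytes into a fresh mapping while it is
 * writable, then flips the mapping to read+execute. The memory is never
 * writable and executable at the same time. Returns NULL on failure. */
void *load_code_wx(const void *code, size_t len)
{
    if (code == NULL || len == 0)
        return NULL;

    /* Phase 1: readable and writable, not executable. */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    memcpy(mem, code, len);

    /* Phase 2: drop write permission, add execute permission. */
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return NULL;
    }

    /* GCC/Clang builtin: keeps the instruction cache coherent on
     * architectures such as ARM64 (effectively a no-op on x86). */
    __builtin___clear_cache((char *)mem, (char *)mem + len);
    return mem;
}
```

To write to the region again, the caller would have to mprotect it back to PROT_READ | PROT_WRITE first, which plays the role of the pthread_jit_write_protect_np(0) call in the macOS code.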

In Windows, why are handle values multiples of 4?

If I'm not wrong, a handle is an index into a table maintained on a per-process basis.
On 64-bit Windows, each entry in this table consists of an 8-byte address of the kernel object plus a 4-byte access mask, making the entry 12 bytes long. However, as I understand it, each entry is padded to 16 bytes for alignment.
But when you look at the handles opened by a process using Process Explorer, the handle values are multiples of 4. Shouldn't they be multiples of 16 instead?
A Windows handle is just an index per se; in principle it could be a multiple of 1. It was probably more efficient to implement word (16-bit value) alignment than the byte alignment you're implying.
The lowest two bits of a kernel handle are called "tag bits" and are available for application use. This has nothing to do with the size of an entry in the handle table.
The comment in ntdef.h (in Include\10.0.x.x\shared) says:
//
// Low order two bits of a handle are ignored by the system and available
// for use by application code as tag bits. The remaining bits are opaque
// and used to store a serial number and table index.
//
#define OBJ_HANDLE_TAGBITS 0x00000003L
My guess is that it's a misuse similar to using the most significant bit of 32-bit pointers as a boolean flag, which is why we have LAA (large address aware) and non-LAA applications.
You could (but should not) add 1, 2 or 3 to a HANDLE and it should not affect other Windows API methods. E.g. WaitForSingleObject():
#include <iostream>
#include <windows.h>
int main()
{
STARTUPINFO si;
PROCESS_INFORMATION pi;
ZeroMemory(&si, sizeof(si));
si.cb = sizeof(si);
ZeroMemory(&pi, sizeof(pi));
auto created = CreateProcess(L"C:\\Windows\\System32\\cmd.exe",
nullptr, nullptr, nullptr, FALSE, 0, nullptr, nullptr, &si, &pi
);
if (created)
{
pi.hProcess = static_cast<byte*>(pi.hProcess) + 3;
const auto result = WaitForSingleObject(pi.hProcess, INFINITE);
if (result == 0)
std::cout << "Completed!\n";
else
std::cout << "Failed!\n" << result << "\n";
CloseHandle(pi.hProcess);
CloseHandle(pi.hThread);
}
else
std::cout << "Not created";
}
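The tag-bit arithmetic described above can be sketched in portable C on plain integers. This is only an illustration of the masking, assuming the two-bit mask from the ntdef.h comment; it does not call any Windows API, and the names HANDLE_TAG_MASK, handle_strip_tag, handle_get_tag and handle_with_tag are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the value of OBJ_HANDLE_TAGBITS quoted above (low two bits). */
#define HANDLE_TAG_MASK ((uintptr_t)0x3)

/* The part of the value that the system actually uses. */
uintptr_t handle_strip_tag(uintptr_t h)
{
    return h & ~HANDLE_TAG_MASK;
}

/* The application-defined tag stored in the low two bits. */
unsigned handle_get_tag(uintptr_t h)
{
    return (unsigned)(h & HANDLE_TAG_MASK);
}

/* Replaces the tag bits of h with `tag` (must be 0..3). */
uintptr_t handle_with_tag(uintptr_t h, unsigned tag)
{
    assert(tag <= HANDLE_TAG_MASK);
    return handle_strip_tag(h) | (uintptr_t)tag;
}
```

With a handle value of 0x1A4, adding 3 as in the WaitForSingleObject example gives 0x1A7; handle_strip_tag(0x1A7) is 0x1A4 again, which is why handle values that differ only in the low two bits refer to the same table entry.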

Trap memory accesses inside a standard executable built with MinGW

So my problem sounds like this.
I have some platform dependent code (embedded system) which writes to some MMIO locations that are hardcoded at specific addresses.
I compile this code with some management code inside a standard executable (mainly for testing) but also for simulation (because it takes longer to find basic bugs inside the actual HW platform).
To work around the hardcoded pointers, I just redefine them to variables inside a memory pool. This works really well.
The problem is that some of the MMIO locations have specific hardware behavior (W1C, write 1 to clear, for example), which makes "correct" testing hard to impossible.
These are the solutions I thought of:
1 - Somehow redefine the accesses to those registers and insert a function that simulates the dynamic behavior. This is not really usable, since there are various ways to write to the MMIO locations (pointers and so on).
2 - Leave the addresses hardcoded and trap the illegal access through a segfault, find the location that triggered it, work out exactly where the access was made, handle it, and return. I am not really sure how this would work (or even if it's possible).
3 - Use some sort of emulation. This will surely work, but it defeats the whole purpose of running fast and natively on a standard computer.
4 - Virtualization? This would probably take a lot of time to implement. I'm not sure the gain justifies it.
Does anyone have any idea whether this can be accomplished without going too deep? Maybe there is a way to make the compiler define a memory area for which every access generates a callback. I'm not really an expert in x86/gcc stuff.
Edit: It seems that it's not really possible to do this in a platform-independent way, and since it will only run on Windows, I will use the available API (which seems to work as expected). I found this question here:
Is set single step trap available on win 7?
I will put the whole "simulated" register file inside a number of pages, guard them, and trigger a callback from which I will extract all the necessary info, do my stuff, and then continue execution.
Thanks all for responding.
I think #2 is the best approach. I routinely use approach #4, but I use it to test code that is running in the kernel, so I need a layer below the kernel to trap and emulate the accesses. Since you have already put your code into a user-mode application, #2 should be simpler.
The answers to this question may provide help in implementing #2. How to write a signal handler to catch SIGSEGV?
What you really want to do, though, is to emulate the memory access and then have the segv handler return to the instruction after the access. This sample code works on Linux. I'm not sure if the behavior it is taking advantage of is undefined, though.
#include <stdint.h>
#include <stdio.h>
#include <signal.h>
#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)
static uint32_t read_reg(volatile uint32_t *reg_addr)
{
uint32_t r;
asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
return r;
}
static void segv_handler(int, siginfo_t *, void *);
int main()
{
struct sigaction action = { 0, };
action.sa_sigaction = segv_handler;
action.sa_flags = SA_SIGINFO;
sigaction(SIGSEGV, &action, NULL);
// force sigsegv
uint32_t a = read_reg(REG_ADDR);
printf("after segv, a = %d\n", a);
return 0;
}
static void segv_handler(int, siginfo_t *info, void *ucontext_arg)
{
ucontext_t *ucontext = static_cast<ucontext_t *>(ucontext_arg);
ucontext->uc_mcontext.gregs[REG_RAX] = 1234;
ucontext->uc_mcontext.gregs[REG_RIP] += 2;
}
The code to read the register is written in assembly to ensure that both the destination register and the length of the instruction are known.
This is what the Windows version of prl's answer could look like:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>
#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)
static uint32_t read_reg(volatile uint32_t *reg_addr)
{
uint32_t r;
asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
return r;
}
static LONG WINAPI segv_handler(EXCEPTION_POINTERS *);
int main()
{
SetUnhandledExceptionFilter(segv_handler);
// force sigsegv
uint32_t a = read_reg(REG_ADDR);
printf("after segv, a = %d\n", a);
return 0;
}
static LONG WINAPI segv_handler(EXCEPTION_POINTERS *ep)
{
// only handle read access violation of REG_ADDR
if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION ||
ep->ExceptionRecord->ExceptionInformation[0] != 0 ||
ep->ExceptionRecord->ExceptionInformation[1] != (ULONG_PTR)REG_ADDR)
return EXCEPTION_CONTINUE_SEARCH;
ep->ContextRecord->Rax = 1234;
ep->ContextRecord->Rip += 2;
return EXCEPTION_CONTINUE_EXECUTION;
}
So, the solution (code snippet) is as follows:
First of all, I have a variable:
__attribute__ ((aligned (4096))) int g_test;
Second, inside my main function, I do the following:
AddVectoredExceptionHandler(1, VectoredHandler);
DWORD old;
VirtualProtect(&g_test, 4096, PAGE_READWRITE | PAGE_GUARD, &old);
The handler looks like this:
LONG WINAPI VectoredHandler(struct _EXCEPTION_POINTERS *ExceptionInfo)
{
static ULONG_PTR last_addr; /* ULONG_PTR so the address is not truncated on 64-bit builds */
if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) {
last_addr = ExceptionInfo->ExceptionRecord->ExceptionInformation[1];
ExceptionInfo->ContextRecord->EFlags |= 0x100; /* Single step to trigger the next one */
return EXCEPTION_CONTINUE_EXECUTION;
}
if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) {
DWORD old;
VirtualProtect((PVOID)(last_addr & ~PAGE_MASK), 4096, PAGE_READWRITE | PAGE_GUARD, &old);
return EXCEPTION_CONTINUE_EXECUTION;
}
return EXCEPTION_CONTINUE_SEARCH;
}
This is only a basic skeleton of the functionality. Basically, I guard the page on which the variable resides, and I keep linked lists holding the callback function pointers and values for each address in question. I check that the faulting address is in my list, then I trigger the callback.
On the first guard hit, the system removes the page's guard protection, but I can call my PRE_WRITE callback, where I save the variable's state. Because a single step is requested through EFlags, a single-step exception follows immediately (which means the variable has been written), and I can trigger a WRITE callback. All the data required for the operation is contained in the ExceptionInformation array.
When someone tries to write to that variable:
*(int *)&g_test = 1;
a PRE_WRITE followed by a WRITE will be triggered.
When I do:
int x = *(int *)&g_test;
A READ will be issued.
In this way i can manipulate the data flow in a way that does not require modifications of the original source code.
Note: This is intended to be used as part of a test framework and any penalty hit is deemed acceptable.
For example, W1C (Write 1 to clear) operation can be accomplished:
void MYREG_hook(reg_cbk_t type)
{
/** We need to save the pre-write state
* This is safe since we are assured to be called with
* both PRE_WRITE and WRITE in the correct order
*/
static int pre;
switch (type) {
case REG_READ: /* Called pre-read */
break;
case REG_PRE_WRITE: /* Called pre-write */
pre = g_test;
break;
case REG_WRITE: /* Called after write */
g_test = pre & ~g_test; /* W1C */
break;
default:
break;
}
}
This was also possible with segfaults on illegal addresses, but I had to trigger one for each read/write and keep track of a "virtual register file", so the penalty was bigger. This way I can guard only specific areas of memory, or none, depending on the registered monitors.
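The W1C rule used in the hook above can be isolated as a pure function, which makes it easy to check without any page guarding. This is a minimal sketch; the name w1c_apply and the 32-bit register width are assumptions made for illustration.

```c
#include <stdint.h>

/* Write-1-to-clear: every bit set in the written value clears the
 * corresponding bit of the register; bits written as 0 are left unchanged.
 * `before` is the register state saved at PRE_WRITE, `written` is the raw
 * value the code under test stored into it. */
uint32_t w1c_apply(uint32_t before, uint32_t written)
{
    return before & ~written;
}
```

For example, with the register at 0xB (binary 1011), writing 0x2 clears only bit 1 and leaves 0x9 (binary 1001), while writing 0x0 changes nothing. This is the same expression as g_test = pre & ~g_test in the REG_WRITE case, where g_test holds the value that was just written.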

How to calculate required stack size for tree of called functions in GCC

Is it possible to determine the stack demand of a non-recursive function without external computation, right in the text of the program? I need this to allocate memory for a thread on very small microcontrollers, such as AVR, and I need to know it before calling the function. Unfortunately, the -fstack-usage option is not very informative. Or is there something I don't understand?
Taking the address of a passed argument yields its place on the stack. Therefore, running this:
#include <stdio.h>
void my_fun(int dummy);
int get_stack_space(int dummy);
int main(void)
{
int dummy = 0;
my_fun(dummy);
return 0;
}
void my_fun(int dummy)
{
// do stuff
printf("%d\n", get_stack_space((int)&dummy));
return;
}
int get_stack_space(int dummy)
{
return dummy - (int)&dummy;
}
should get you the distance in bytes on the stack between the point of calling my_fun() and calling get_stack_space(). Hope it helps.
Edit: on x86 you get the distance + a machine word for the push of the return address when calling my_fun() + a machine word for the push of ebp at the start of my_fun()
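The same address-difference idea can be written without casting pointers to int, which would truncate addresses on targets where pointers are wider than int. The following is a sketch under a few assumptions: a GCC-compatible compiler (for the noinline attribute), a target that provides uintptr_t, and hypothetical function names. Like the code above, it measures a runtime distance between two locals, not a compile-time bound.

```c
#include <stddef.h>
#include <stdint.h>

/* Absolute distance in bytes between two addresses, independent of the
 * direction in which the stack grows. */
static size_t address_distance(const volatile void *a, const volatile void *b)
{
    uintptr_t x = (uintptr_t)a;
    uintptr_t y = (uintptr_t)b;
    return (size_t)(x > y ? x - y : y - x);
}

/* Inner frame: compares the caller's marker with one of its own locals. */
__attribute__((noinline))
static size_t inner_probe(const volatile void *outer_marker)
{
    volatile char here = 0;
    return address_distance(outer_marker, &here);
}

/* Outer frame: distance between a local here and a local one call deeper. */
__attribute__((noinline))
size_t measure_call_depth(void)
{
    volatile char marker = 0;
    return inner_probe(&marker);
}
```

The value returned by measure_call_depth() depends on the compiler, the optimization level and the ABI, so it is only meaningful for the exact build it was measured on.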
