Why is this loop not executed? - gcc

I compile with GCC 5.3 2016q1 for STM32 microcontroller.
Right at the beginning of main I placed a small routine to fill stack with a pattern. Later I search the highest address that still holds this pattern to find out about stack usage, you surely know this. Here is my routine:
uint32_t* Stack_ptr = 0;
uint32_t Stack_bot;
uint32_t n = 0;
asm volatile ("str sp, [%0]" :: "r" (&Stack_ptr));
Stack_bot = (uint32_t)(&_estack - &_Min_Stack_Size);
//*
n = 0;
while ((uint32_t)(Stack_ptr) > Stack_bot)
{
Stack_ptr--;
n++;
*Stack_ptr = 0xAA55A55A;
} // */
After that I initialize hardware, also a UART and print out values of Stack_ptr, Stack_bot and n and then stack contents.
The results are 0x20007FD8 0x20007C00 0
Stack_bot is the expected value because I have 0x400 Bytes in 32k RAM starting at 0x20000000. But I would expect Stack_ptr to be 0x20008000 and n somewhat under 0x400 after the loop is finished. Also stack contents shows no entries of 0xAA55A55A. This means the loop is not executed.
I could only manage to get it executed by creating a small function that holds the above routine and disable optimization for this function.
Anybody knows why that is? And the strangest thing about it is that I could swear it worked a few days ago. I saw a lot of 0xAA55A55A in the stack dump.
Thanks a lot
Martin

Probably problem is with the assembler function, In my code I use this:
// defined by linker script, pointing to end of static allocation
extern unsigned _heap_start;
void fill_heap(unsigned fill=0x55555555) {
unsigned *dst = &_heap_start;
register unsigned *msp_reg;
__asm__("mrs %0, msp\n" : "=r" (msp_reg) );
while (dst < msp_reg) {
*dst++ = fill;
}
}
it will fill memory between _heap_start and current SP.

Related

Trap memory accesses inside a standard executable built with MinGW

So my problem sounds like this.
I have some platform dependent code (embedded system) which writes to some MMIO locations that are hardcoded at specific addresses.
I compile this code with some management code inside a standard executable (mainly for testing) but also for simulation (because it takes longer to find basic bugs inside the actual HW platform).
To alleviate the hardcoded pointers, i just redefine them to some variables inside the memory pool. And this works really well.
The problem is that there is specific hardware behavior on some of the MMIO locations (w1c for example) which makes "correct" testing hard to impossible.
These are the solutions i thought of:
1 - Somehow redefine the accesses to those registers and try to insert some immediate function to simulate the dynamic behavior. This is not really usable since there are various ways to write to the MMIO locations (pointers and stuff).
2 - Somehow leave the addresses hardcoded and trap the illegal access through a seg fault, find the location that triggered, extract exactly where the access was made, handle and return. I am not really sure how this would work (and even if it's possible).
3 - Use some sort of emulation. This will surely work, but it will void the whole purpose of running fast and native on a standard computer.
4 - Virtualization ?? Probably will take a lot of time to implement. Not really sure if the gain is justifiable.
Does anyone have any idea if this can be accomplished without going too deep? Maybe is there a way to manipulate the compiler in some way to define a memory area for which every access will generate a callback. Not really an expert in x86/gcc stuff.
Edit: It seems that it's not really possible to do this in a platform independent way, and since it will be only windows, i will use the available API (which seems to work as expected). Found this Q here:
Is set single step trap available on win 7?
I will put the whole "simulated" register file inside a number of pages, guard them, and trigger a callback from which i will extract all the necessary info, do my stuff then continue execution.
Thanks all for responding.
I think #2 is the best approach. I routinely use approach #4, but I use it to test code that is running in the kernel, so I need a layer below the kernel to trap and emulate the accesses. Since you have already put your code into a user-mode application, #2 should be simpler.
The answers to this question may provide help in implementing #2. How to write a signal handler to catch SIGSEGV?
What you really want to do, though, is to emulate the memory access and then have the segv handler return to the instruction after the access. This sample code works on Linux. I'm not sure if the behavior it is taking advantage of is undefined, though.
#include <stdint.h>
#include <stdio.h>
#include <signal.h>
#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)
static uint32_t read_reg(volatile uint32_t *reg_addr)
{
uint32_t r;
asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
return r;
}
static void segv_handler(int, siginfo_t *, void *);
int main()
{
struct sigaction action = { 0, };
action.sa_sigaction = segv_handler;
action.sa_flags = SA_SIGINFO;
sigaction(SIGSEGV, &action, NULL);
// force sigsegv
uint32_t a = read_reg(REG_ADDR);
printf("after segv, a = %d\n", a);
return 0;
}
static void segv_handler(int, siginfo_t *info, void *ucontext_arg)
{
ucontext_t *ucontext = static_cast<ucontext_t *>(ucontext_arg);
ucontext->uc_mcontext.gregs[REG_RAX] = 1234;
ucontext->uc_mcontext.gregs[REG_RIP] += 2;
}
The code to read the register is written in assembly to ensure that both the destination register and the length of the instruction are known.
This is how the Windows version of prl's answer could look like:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>
#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)
static uint32_t read_reg(volatile uint32_t *reg_addr)
{
uint32_t r;
asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
return r;
}
static LONG WINAPI segv_handler(EXCEPTION_POINTERS *);
int main()
{
SetUnhandledExceptionFilter(segv_handler);
// force sigsegv
uint32_t a = read_reg(REG_ADDR);
printf("after segv, a = %d\n", a);
return 0;
}
static LONG WINAPI segv_handler(EXCEPTION_POINTERS *ep)
{
// only handle read access violation of REG_ADDR
if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION ||
ep->ExceptionRecord->ExceptionInformation[0] != 0 ||
ep->ExceptionRecord->ExceptionInformation[1] != (ULONG_PTR)REG_ADDR)
return EXCEPTION_CONTINUE_SEARCH;
ep->ContextRecord->Rax = 1234;
ep->ContextRecord->Rip += 2;
return EXCEPTION_CONTINUE_EXECUTION;
}
So, the solution (code snippet) is as follows:
First of all, i have a variable:
__attribute__ ((aligned (4096))) int g_test;
Second, inside my main function, i do the following:
AddVectoredExceptionHandler(1, VectoredHandler);
DWORD old;
VirtualProtect(&g_test, 4096, PAGE_READWRITE | PAGE_GUARD, &old);
The handler looks like this:
LONG WINAPI VectoredHandler(struct _EXCEPTION_POINTERS *ExceptionInfo)
{
static DWORD last_addr;
if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) {
last_addr = ExceptionInfo->ExceptionRecord->ExceptionInformation[1];
ExceptionInfo->ContextRecord->EFlags |= 0x100; /* Single step to trigger the next one */
return EXCEPTION_CONTINUE_EXECUTION;
}
if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) {
DWORD old;
VirtualProtect((PVOID)(last_addr & ~PAGE_MASK), 4096, PAGE_READWRITE | PAGE_GUARD, &old);
return EXCEPTION_CONTINUE_EXECUTION;
}
return EXCEPTION_CONTINUE_SEARCH;
}
This is only a basic skeleton for the functionality. Basically I guard the page on which the variable resides, i have some linked lists in which i hold pointers to the function and values for the address in question. I check that the fault generating address is inside my list then i trigger the callback.
On first guard hit, the page protection will be disabled by the system, but i can call my PRE_WRITE callback where i can save the variable state. Because a single step is issued through the EFlags, it will be followed immediately by a single step exception (which means that the variable was written), and i can trigger a WRITE callback. All the data required for the operation is contained inside the ExceptionInformation array.
When someone tries to write to that variable:
*(int *)&g_test = 1;
A PRE_WRITE followed by a WRITE will be triggered,
When i do:
int x = *(int *)&g_test;
A READ will be issued.
In this way i can manipulate the data flow in a way that does not require modifications of the original source code.
Note: This is intended to be used as part of a test framework and any penalty hit is deemed acceptable.
For example, W1C (Write 1 to clear) operation can be accomplished:
void MYREG_hook(reg_cbk_t type)
{
/** We need to save the pre-write state
* This is safe since we are assured to be called with
* both PRE_WRITE and WRITE in the correct order
*/
static int pre;
switch (type) {
case REG_READ: /* Called pre-read */
break;
case REG_PRE_WRITE: /* Called pre-write */
pre = g_test;
break;
case REG_WRITE: /* Called after write */
g_test = pre & ~g_test; /* W1C */
break;
default:
break;
}
}
This was possible also with seg-faults on illegal addresses, but i had to issue one for each R/W, and keep track of a "virtual register file" so a bigger penalty hit. In this way i can only guard specific areas of memory or none, depending on the registered monitors.

How to calculate required stack size for tree of called functions in GCC

Is it possible to determine the demand a non recursive function on the stack without external computation, right in the text of the program? I need this to allocate a memory resource for the thread in very small micro-controllers, such as AVR. And I need know this before function calling. Directive --stack-usage is very non informative, unfortunately. Or I something do not understand?
Getting the address of a passed argument yields it's place on the stack. Therefore running this:
#include <stdio.h>
void my_fun(int dummy);
int get_stack_space(int dummy);
int main(void)
{
int dummy = 0;
my_fun(dummy);
return 0;
}
void my_fun(int dummy)
{
// do stuff
printf("%d\n", get_stack_space((int)&dummy));
return;
}
int get_stack_space(int dummy)
{
return dummy - (int)&dummy;
}
should get you the distance in bytes on the stack between the point of calling my_fun() and calling get_stack_space(). Hope it helps.
Edit: on x86 you get the distance + a machine word for the push of the return address when calling my_fun() + a machine word for the push of ebp at the start of my_fun()

Need help understanding stack frame layout

While implementing a stack walker for a debugger I am working on I reached the point to extract the arguments to a function call and display them. To make it simple I started with the cdecl convention in pure 32-bit (both debugger and debuggee), and a function that takes 3 parameters. However, I cannot understand why the arguments in the stack trace are out of order compared to what cdecl defines (right-to-left, nothing in registers), despite trying to figure it out for a few days now.
Here is a representation of the function call I am trying to stack trace:
void Function(unsigned long long a, const void * b, unsigned int c) {
printf("a=0x%llX, b=%p, c=0x%X\n", a, b, c);
_asm { int 3 }; /* Because I don't have stepping or dynamic breakpoints implemented yet */
}
int main(int argc, char* argv[]) {
Function(2, (void*)0x7A3FE8, 0x2004);
return 0;
}
This is what the function (unsurprisingly) printed to the console:
a=0x2, c=0x7a3fe8, c=0x2004
This is the stack trace generated at the breakpoint (the debugger catches the breakpoint and there I try to walk the stack):
0x3EF5E0: 0x10004286 /* previous pc */
0x3EF5DC: 0x3EF60C /* previous fp */
0x3EF5D8: 0x7A3FE8 /* arg b --> Wait... why is b _above_ c here? */
0x3EF5D4: 0x2004 /* arg c */
0x3EF5D0: 0x0 /* arg a, upper 32 bit */
0x3EF5CC: 0x2 /* arg a, lower 32 bit */
The code that's responsible for dumping the stack frames (implemented using the DIA SDK, though, I don't think that is relevant to my problem) looks like this:
ULONGLONG stackframe_top = 0;
m_frame->get_base(&stackframe_top); /* IDiaStackFrame */
/* dump 30 * 4 bytes */
for (DWORD i = 0; i < 30; i++)
{
ULONGLONG address = stackframe_top - (i * 4);
DWORD value;
SIZE_T read_bytes;
if (ReadProcessMemory(m_process, reinterpret_cast<LPVOID>(address), &value, sizeof(value), &read_bytes) == TRUE)
{
debugprintf(L"0x%llX: 0x%X\n", address, value); /* wrapper around OutputDebugString */
}
}
I am compiling the test program without any optimization in vs2015 update 3.
I have validated that I am indeed compiling it as cdecl by looking in the pdb with the dia2dump sample application.
I do not understand what is causing the stack to look like this, it doesn't match anything I learned, nor does it match the documentation provided by Microsoft.
I also checked google a whole lot (including osdev wiki pages, msdn blog posts, and so on), and checked my (by now probably outdated) books on 32-bit x86 assembly programming (that were released before 64-bit CPUs existed).
Thank you very much in advance for any explanations or links!
I had somehow misunderstood where the arguments to a function call end up in memory compared to the base of the stack frame, as pointed out by Raymond. This is the fixed code snippet:
ULONGLONG stackframe_top = 0;
m_frame->get_base(&stackframe_top); /* IDiaStackFrame */
/* dump 30 * 4 bytes */
for (DWORD i = 0; i < 30; i++)
{
ULONGLONG address = stackframe_top + (i * 4); /* <-- Read before the stack frame */
DWORD value;
SIZE_T read_bytes;
if (ReadProcessMemory(m_process, reinterpret_cast<LPVOID>(address), &value, sizeof(value), &read_bytes) == TRUE)
{
debugprintf(L"0x%llX: 0x%X\n", address, value); /* wrapper around OutputDebugString */
}
}

How do I ask the assembler to "give me a full size register"?

I'm trying to allow the assembler to give me a register it chooses, and then use that register with inline assembly. I'm working with the program below, and its seg faulting. The program was compiled with g++ -O1 -g2 -m64 wipe.cpp -o wipe.exe.
When I look at the crash under lldb, I believe I'm getting a 32-bit register rather than a 64-bit register. I'm trying to compute an address (base + offset) using lea, and store the result in a register the assembler chooses:
"lea (%0, %1), %2\n"
Above, I'm trying to say "use a register, and I'll refer to it as %2".
When I perform a disassembly, I see:
0x100000b29: leal (%rbx,%rsi), %edi
-> 0x100000b2c: movb $0x0, (%edi)
So it appears the code being generated calculates and address using 64-bit values (rbx and rsi), but saves it to a 32-bit register (edi) (that the assembler chose).
Here are the values at the time of the crash:
(lldb) type format add --format hex register
(lldb) p $edi
(unsigned int) $3 = 1063330
(lldb) p $rbx
(unsigned long) $4 = 4296030616
(lldb) p $rsi
(unsigned long) $5 = 10
A quick note on the Input Operands below. If I drop the "r" (2), then I get a compiler error when I refer to %2 in the call to lea: invalid operand number in inline asm string.
How do I tell the assembler to "give me a full size register" and then refer to it in my program?
int main(int argc, char* argv[])
{
string s("Hello world");
cout << s << endl;
char* ptr = &s[0];
size_t size = s.length();
if(ptr && size)
{
__asm__ __volatile__
(
"%=:\n" /* generate a unique label for TOP */
"subq $1, %1\n" /* 0-based index */
"lea (%0, %1), %2\n" /* calcualte ptr[idx] */
"movb $0, (%2)\n" /* 0 -> ptr[size - 1] .. ptr[0] */
"jnz %=b\n" /* Back to TOP if non-zero */
: /* no output */
: "r" (ptr), "r" (size), "r" (2)
: "0", "1", "2", "cc"
);
}
return 0;
}
Sorry about these inline assembly questions. I hope this is the last one. I'm not really thrilled with using inline assembly in GCC because of pain points like this (and my fading memory). But its the only legal way I know to do what I want to do given GCC's interpretation of the qualifier volatile in C.
If interested, GCC interprets C's volatile qualifier as hardware backed memory, and anything else is an abuse and it results in an illegal program. So the following is not legal for GCC:
volatile void* g_tame_the_optimizer = NULL;
...
unsigned char* ptr = ...
size_t size = ...;
for(size_t i = 0; i < size; i++)
ptr[i] = 0x00;
g_tame_the_optimizer = ptr;
Interestingly, Microsoft uses a more customary interpretation of volatile (what most programmers expect - namely, anything can change the memory, and not just memory mapped hardware), and the code above is acceptable.
gcc inline asm is a complicated beast. "r" (2) means allocate an int sized register and load it with the value 2. If you just need an arbitrary scratch register you can declare a 64 bit early-clobber dummy output, such as "=&r" (dummy) in the output section, with void *dummy declared earlier. You can consult the gcc manual for more details.
As to the final code snippet looks like you want a memory barrier, just as the linked email says. See the manual for example.

atomic_inc and atomic_xchg in gcc assembly

I have written the following user-level code snippet to test two sub functions, atomic inc and xchg (refer to Linux code).
What I need is just try to perform operations on 32-bit integer, and that's why I explicitly use int32_t.
I assume global_counter will be raced by different threads, while tmp_counter is fine.
#include <stdio.h>
#include <stdint.h>
int32_t global_counter = 10;
/* Increment the value pointed by ptr */
void atomic_inc(int32_t *ptr)
{
__asm__("incl %0;\n"
: "+m"(*ptr));
}
/*
* Atomically exchange the val with *ptr.
* Return the value previously stored in *ptr before the exchange
*/
int32_t atomic_xchg(uint32_t *ptr, uint32_t val)
{
uint32_t tmp = val;
__asm__(
"xchgl %0, %1;\n"
: "=r"(tmp), "+m"(*ptr)
: "0"(tmp)
:"memory");
return tmp;
}
int main()
{
int32_t tmp_counter = 0;
printf("Init global=%d, tmp=%d\n", global_counter, tmp_counter);
atomic_inc(&tmp_counter);
atomic_inc(&global_counter);
printf("After inc, global=%d, tmp=%d\n", global_counter, tmp_counter);
tmp_counter = atomic_xchg(&global_counter, tmp_counter);
printf("After xchg, global=%d, tmp=%d\n", global_counter, tmp_counter);
return 0;
}
My 2 questions are:
Are these two subfunctions written properly?
Will this behave the same when I compile this on 32-bit or
64-bit platform? For example, could the pointer address have a different
length. or could incl and xchgl will conflict with the operand?
My understanding of this question is below, please correct me if I'm wrong.
All the read-modify-write instructions (ex: incl, add, xchg) need a lock prefix. The lock instruction is to lock the memory accessed by other CPUs by asserting LOCK# signal on the memory bus.
The __xchg function in Linux kernel implies no "lock" prefix because xchg always implies lock anyway. http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/cmpxchg_64.h#L15
However, the incl used in atomic_inc does not have this assumption so a lock_prefix is needed.
http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/atomic.h#L105
btw, I think you need to copy the *ptr to a volatile variable to avoid gcc optimization.
William

Resources