I have an ARM assembly function that is called from a C function.
At some point, I do something like this:
.syntax unified
.arm
.text
.globl myfunc
.extern printf
myfunc:
stmdb sp!, {r4-r11} // save stack from C call
... do stuff ...
// (NOT SHOWN): Load values into r1 and r2 to be printed by format string above
ldr r0, =message // Load format string above
push {lr} // me attempting to preserve my stack
bl printf // actual call to printf
pop {lr} // me attempting to recover my stack
ldmia sp!, {r4-r11} // recover stack from C call
mov r0, r2 // Move return value into r0
mov pc, lr // Return to C
.section data
message:
.asciz "Output: %d, %d\n"
.end
This runs sometimes, crashes sometimes, runs a few times then crashes, etc. It actually runs in a quasi bare-metal context, so I can't run a debugger. I'm 99% sure it's a stack (or alignment?) issue, as in these two questions: "Printf Change values in registers, ARM Assembly" and "Call C function from Assembly -- the application freezes at "call printf" and I have no idea why".
Can anyone provide some specific ideas for how to get the above chunk of code running, and perhaps general ideas for best practices here? Ideally I'd like to be able to call the same output function multiple times in my assembly file, to debug things as I go.
Thanks in advance!
I can see the following issues in that code:
Add .align 2 (3 or any higher value also works) before the function entry point (myfunc:):
.align 2 // guarantee that instruction address is 4B aligned
myfunc:
As was mentioned in the comments, the stack is expected to be 8-byte aligned at the call. push {lr} breaks that.
message: doesn't need to be in the 'data' section. It can be placed in the code section after 'myfunc'. Note that .section data creates a section literally named data, not the standard .data, so check the linker map to confirm that the data is actually present and that the address loaded into r0 is correct.
Since this is bare metal, check that the stack is set up properly and that enough room is reserved for it.
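The alignment point can be checked with simple arithmetic. The sketch below is a C model, not code from the question: it assumes the C caller entered myfunc with sp 8-byte aligned (which the AAPCS requires at public interfaces) and only tracks how far each push moves sp. The function names and the sample entry address are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* AAPCS: sp must be a multiple of 8 at a public call such as `bl printf`. */
bool sp_ok_for_call(uint32_t sp) {
    return (sp % 8u) == 0u;
}

/* Each pushed core register moves sp down by 4 bytes (full-descending stack). */
uint32_t push_regs(uint32_t sp, unsigned reg_count) {
    return sp - 4u * reg_count;
}
```

With an aligned entry sp, stmdb sp!, {r4-r11} pushes 8 registers (32 bytes) and keeps sp aligned; the extra push {lr} moves sp by 4 and breaks alignment right before bl printf. Pushing an even number of registers around the call, for example push {r12, lr} with a matching pop {r12, lr}, would keep sp a multiple of 8.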
This question is similar to Incorrect relative address using GNU LD w/ 16-bit x86, but I could not solve by building a cross-compiler.
The scenario is, I have a second stage bootloader that starts as 16bit, and brings itself up to 32 bit. As a result, I have mixed assembly and C code with 32 and 16 bit code mixed together.
I have included an assembly file which defines a global that I will call from C, basically with the purpose of dropping back to REAL mode to perform BIOS interrupts from the protected mode C environment on demand. So far, the function doesn't do anything except get called, push and pop some registers, and return:
[bits 32]
BIOS_Interrupt:
PUSHF
...
ret
global BIOS_Interrupt
This is included in my main bootloader.asm file, which is loaded by the stage 1 MBR.
In C, I have defined:
extern void BIOS_Interrupt(uint32_t intno, uint32_t ax, uint32_t bx, uint32_t cx, uint32_t dx, uint32_t es, uint32_t ds, uint32_t di, uint32_t si);
in a header, and
BIOS_Interrupt(0x15,0,0,0,0,0,0,0,0);
in code, just to test the call.
I can see in the resultant disassembled linked binary that the call is invariably set 2 bytes too low in RAM:
00000132 0100 add [bx+si],ax
00000134 009CFA9D add [si-0x6206],bl
00000138 6650 push eax
0000013A 6653 push ebx
...
00001625 6A00 push byte +0x0
00001627 6A00 push byte +0x0
00001629 6A00 push byte +0x0
0000162B 6A00 push byte +0x0
0000162D 6A00 push byte +0x0
0000162F 6A00 push byte +0x0
00001631 6A00 push byte +0x0
00001633 6A00 push byte +0x0
00001635 6A15 push byte +0x15
00001637 E8F9EA call 0x133
The instruction at 135 should be the first instruction reached (0x9C = PUSHF), but the call is for 2 bytes less in memory at 133, causing runtime errors.
I have noticed that by using the NASM .align keyword, the extra NOPs that are generated do compensate for the incorrect relative address.
Is this an issue with the linking process? I have LD running with -melf_i386, NASM with -f elf, and GCC with -m32 -ffreestanding -Os -fno-pie -nostdlib.
Edit: images added for @MichaelPetch. Code is loaded at 0x9000 by the MBR. Interestingly, the call shows a correct relative jump, to 0x135, but the executing disassembly at 0x135 looks like the code at 0x133 (0x00, 0x00).
[Image: Bochs about to call BIOS_Interrupt]
[Image: Bochs at call start]
Edit 2: correction to image 2 after refreshing the memdump after the call.
[Image: memdump and disassembly after calling BIOS_Interrupt (call 0x135)]
Thanks again to @MichaelPetch for giving me a few pointers.
I don't think there is an issue with the linker. The disassembly was "tricking" me: the binary mixes 16-bit and 32-bit code, so a listing produced in a single mode decodes part of it inaccurately.
In the end, it was due to memory values being overwritten by prior operations. In the code immediately before the BIOS_Interrupt label, I had defined a dword, dd IDT_REAL, designed to store the IDT for real mode processing. However, I did not realise (or forgot) that the SIDT/LIDT instructions take 6 bytes of data, so when I was calling SIDT, it was overwriting the first 2 bytes at the label's location in RAM, resulting in runtime errors. After increasing the size of the variable from dword to qword, I can run just fine without error.
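The 6-byte size can be made explicit in the C half of the loader. This is only a sketch: the struct and function names are hypothetical, and the packed attribute assumes GCC or Clang.

```c
#include <stdint.h>

/* Operand that SIDT/LIDT store or load in 16/32-bit modes:
   a 16-bit limit followed by a 32-bit base, 6 bytes with no padding. */
struct __attribute__((packed)) idt_descriptor {
    uint16_t limit;  /* size of the IDT in bytes, minus one */
    uint32_t base;   /* linear address of the IDT */
};

/* Compilation fails here if the layout is not exactly 6 bytes. */
_Static_assert(sizeof(struct idt_descriptor) == 6,
               "SIDT/LIDT operand must be 6 bytes");

/* Nonzero when a reserved buffer is large enough for SIDT to store into. */
int sidt_fits(unsigned reserved_bytes) {
    return reserved_bytes >= sizeof(struct idt_descriptor);
}
```

A dword reserves 4 bytes, so sidt_fits(4) is false and the last 2 bytes of the store land on whatever follows; a qword (8 bytes) is enough.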
The linker/compiler suggestion seems to be a red herring that I fell for courtesy of objdump. However, I've at least learned from this the benefits of Bochs and of double checking code before jumping to conclusions!
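To see how the listing can show call 0x133 while Bochs shows call 0x135, the displacement can be recomputed by hand for both decodings. The sketch below is a C model with made-up function names. It assumes the two bytes following E8 F9 EA in the image are FF FF; they are not visible in the listing, but a backward 32-bit displacement would need them.

```c
#include <stdint.h>

/* Target of `E8 rel16` when the bytes are decoded as 16-bit code:
   3-byte instruction, and IP wraps at 64 KiB. */
uint32_t call_target_16(uint32_t addr, uint16_t rel16) {
    return (addr + 3u + rel16) & 0xFFFFu;
}

/* Target of `E8 rel32` when the same bytes are decoded as 32-bit code:
   5-byte instruction, arithmetic wraps modulo 2^32. */
uint32_t call_target_32(uint32_t addr, uint32_t rel32) {
    return addr + 5u + rel32;
}
```

For the call at 0x1637, the 16-bit decoding gives call_target_16(0x1637, 0xEAF9) = 0x133, while the 32-bit decoding gives call_target_32(0x1637, 0xFFFFEAF9) = 0x135. The 2-byte difference comes from the instruction length (3 versus 5 bytes), not from the linker.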
I am trying to perform a live patch to some 64 bit intel code.
Basically, I would like to replace a few instructions with a larger set of instructions.
The patching will be performed by a program written in C, and I would like to make use of GCC's inline assembler for that.
What I am struggling with is how to calculate the jmp and call instructions, knowing only the absolute 64bit addresses of the original code and my replacement routine.
To give some background, I am trying to perform the patch that's outlined here: Hacking the OS X Kernel for Fun and Profiles.
But instead of modifying the kernel file, I am writing a KEXT that is already able to locate the relevant set of these 6 instructions that need to be changed:
mov %r15,%rdi
xor %esi,%esi
xor %edx,%edx
xor %ecx,%ecx
mov $0x1b,%r8d
callq <threadsignal+224>
I know the address of each of these above instructions.
Now, I would want to write the replacement code like this:
void patchcode ()
{
current_thread (); // leaves result in rax, apparently
__asm__ volatile (
"xor %edi,%edi \n"
"xor %esi,%esi \n"
"mov %rax,%rdx \n" // current_thread()'s result
"mov $0x4,%ecx \n"
"pushl $0x01234567 \n"
"pushl $0x89ABCDEF \n"
"ret \n"
);
}
The challenges for me are:
Apparently, the compiler puts some code at the start of the patchcode() function that changes the stack. That's not good because I leave the routine with a jmp, not by the function's usual return path. How do I deal with that, can I tell the compiler not to generate that lead-in code?
How do I code the two jmps, one jmp from the to-be-patched area to the above patchcode() function, and then the jmp inside patchcode() back to the instruction of which I only know the absolute address, at runtime? As you can see above, I imagine pushing two 32 bit values onto the stack (I'll patch those values at runtime once I know the actual address where I want to jump to) and then jump to them using a ret instruction. Will that work?
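For the jump encoding itself, here is a hedged sketch in C of how the bytes could be computed at patch time. It is not taken from the linked article, and emit_jmp and put_le are hypothetical names. It uses jmp rel32 (opcode E9) when the displacement fits in 32 bits and otherwise falls back to jmp *0(%rip) (FF 25 00 00 00 00) followed by the 8-byte absolute target. One caveat on the push/ret idea: in 64-bit mode a push of a 32-bit immediate is sign-extended and occupies 8 bytes of stack, so two such pushes do not combine into one 64-bit return address.

```c
#include <stddef.h>
#include <stdint.h>

/* Store the low n bytes of v in little-endian order, as x86-64 expects. */
static void put_le(uint8_t *p, uint64_t v, int n) {
    for (int i = 0; i < n; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Writes a jump from `site` to `target` into `out` (room for 14 bytes).
   `site` is the address the bytes will execute from. Returns bytes written. */
size_t emit_jmp(uint8_t *out, uint64_t site, uint64_t target) {
    /* rel32 is measured from the end of the 5-byte instruction. */
    int64_t rel = (int64_t)(target - (site + 5));
    if (rel >= INT32_MIN && rel <= INT32_MAX) {
        out[0] = 0xE9;                    /* jmp rel32 */
        put_le(out + 1, (uint64_t)rel, 4);
        return 5;
    }
    out[0] = 0xFF;                        /* jmp *0(%rip) */
    out[1] = 0x25;
    put_le(out + 2, 0, 4);                /* disp32 = 0: pointer follows directly */
    put_le(out + 6, target, 8);           /* 64-bit absolute address */
    return 14;
}
```

The same function covers both directions: once for the jump from the patched area into the replacement routine, and once for the jump back. A call rel32 (opcode E8) uses the same displacement arithmetic. Any bytes of the original instructions left over after the jump would still need to be filled, for example with nops.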
I am happy to ask questions in stack overflow due to prompt reply from experts world wide:-) I wish to explain clearly the issue I am facing.
What I wish to do?
I wish to evaluate the NEON instruction set through various examples available online, in order to write some algorithms of my own.
For evaluation purpose, I'm making use of memcpy samples available at ARM official website. Here is the link http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html.
My Environment
I am compiling NEON instruction set on Visual Studio 2008 with Platform Builder for Windows CE 7.0. Latest platform builder supports NEON instruction compilation.
I am running my code on OMAP3530 Mistral EVM board.
I have created a simple static library (NEONLIB.lib) that contains NEON instructions to perform the required operation. I have created simple Stream driver (stream_interface.dll) that uses this static library to perform memcpy operation on 1280X720X2 bytes buffer. I am loading and unloading this driver dynamically using a simple application (Neon_Test.exe).
Issue I'm facing
Once the OS boots, I launch this application manually and receive the following exception.
Exception 'Data Abort' (0x4): Thread-Id=047d002a(pth=c049c990), Proc-Id=00400002(pprc=8a3425e0) 'NK.EXE', VM-active=05420012(pprc=c04a1344) 'Neon_Test.exe'
PID:00400002 TID:047D002A PC=ef135120(stream_interface.dll+0x00005120) RA=ef133c18(stream_interface.dll+0x00003c18) SP=d0f3fc84, BVA=00000000
NeonMemcpy is the function in my driver that calls the NEON function.
Stream_Interface.map file
....
0001:000029f0 ?NeonInit@@YAHXZ 100039f0 f Neon_Process.obj
0001:00002bb4 ?NeonMemcpy@@YAXXZ 10003bb4 f Neon_Process.obj
0001:00002c58 NKDbgPrintfW 10003c58 f coredll:COREDLL.dll
0001:00002c68 SetLastError 10003c68 f coredll:COREDLL.dll
....
Neon_Process.cod file
.......
; 108 : MemcpyCustom((void*)g_pOUTVirtualAddr, (void*)g_pINPVirtualAddr, 1280 * 720 * 2);
00050 e5951000 ldr r1,[r5]
00054 e1a04000 mov r4,r0
00058 e5950004 ldr r0,[r5,#4]
0005c e3a02ae1 mov r2,#0xE1000
00060 eb000000 bl MemcpyCustom
; 109 : RETAILMSG(1, (L"Time for Copy using Neon %d\r\n", GetTickCount() - dwStartTime));
00064 eb000000 bl GetTickCount
00068 e1a03000 mov r3,r0
.......
My assembly source
AREA omap_neoncoding, CODE, READONLY
EXPORT MemcpyCustom
INCLUDE omap_neoncoding.inc
MemcpyCustom
stmfd sp!, {r4-r12,lr}
NEONCopyPLD
PLD [r1, #0xC0]
VLDM r1!,{d0-d7}
VSTM r0!,{d0-d7}
SUBS r2,r2,#0x40
BGE NEONCopyPLD
END
Based on the article by Bruce Eitman, http://geekswithblogs.net/BruceEitman/archive/2008/05/19/windows-ce--finding-the-cause-of-a-data-abort.aspx, the location where the exception occurs was
00064 eb000000 bl GetTickCount
But I am sure that there is no issue in GetTickCount(): if I remove the MemcpyCustom function, everything works fine. I hope I have given all the information needed to sort out this issue. Please help me find the exact reason for the exception. Do I need to perform any steps before calling NEON functions, or are there any special NEON instructions that should be used?
Thanks in advance for your help.
Spark
You are pushing registers in the function's prolog:
stmfd sp!, {r4-r12,lr}
But there is no corresponding pop at the end, and no return instruction. So the execution continues to whatever code happens to be after the function and what happens next is anyone's guess. The following, placed after the BGE should fix the problem:
ldmfd sp!, {r4-r12,pc}
EDIT: By the way, since you're not actually using r4-r12 in the function, you don't need to save them. You also don't need to save d0-d7 as they're considered volatile. So you can remove stmfd and replace ldmfd by just bx lr.
MemcpyCustom
PLD [r1, #0xC0]
VLDM r1!,{d0-d7}
VSTM r0!,{d0-d7}
SUBS r2,r2,#0x40
BGE MemcpyCustom
BX lr
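One more thing may be worth checking, separate from the missing return. The loop condition SUBS r2,r2,#0x40 followed by BGE keeps looping while the remaining count is zero or positive, so it appears to run one extra 64-byte iteration when the length is an exact multiple of 64. The C model below only counts bytes and does not touch memory; the function names are hypothetical, and BGT is shown as one possible alternative rather than as what the ARM sample uses.

```c
#include <stdint.h>

/* Bytes copied by: copy 64; SUBS n, n, #64; BGE loop */
uint32_t bytes_copied_bge(int32_t n) {
    uint32_t copied = 0;
    do {
        copied += 64;
        n -= 64;
    } while (n >= 0);   /* BGE: branch while the result is >= 0 */
    return copied;
}

/* Same loop with BGT: branch only while bytes remain. */
uint32_t bytes_copied_bgt(int32_t n) {
    uint32_t copied = 0;
    do {
        copied += 64;
        n -= 64;
    } while (n > 0);    /* BGT: branch while the result is > 0 */
    return copied;
}
```

For the 0xE1000 length seen in the .cod listing, the BGE form copies 0xE1040 bytes, 64 past the end of both buffers. Whether that matters depends on what lies after the buffers.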
I'm trying to write a firmware mod (to existing firmware, for which I don't have source code).
All Thumb code.
Does anybody have any idea how to do this in the GCC as (GAS) assembler:
Use BL without having to manually calculate offsets, when BL'ing to some existing function (not in my code, but I know its address).
Currently, if I want to use BL, I have to:
- go back in my code
- figure out and add up all the bytes that would result from assembling all the previous instructions in the function I'm writing
- add the beginning address of my function to that (I specify the starting address of what I'm writing in the linker script)
- and then subtract the address of the firmfunc function I want to call
All this, just to calculate the offset, to be able to write a bl offset, to call an existing firmware function?
And if I change any code before that BL, I have to do it all over again manually!
See, this is why I want to learn to use BX right, instead of BL.
Also, I don't quite understand BX. If I use BX to jump to an absolute address, do I have to increase the actual address by 1 when calling Thumb code from Thumb code (to keep the lsb set to 1), so that the CPU will know it's Thumb code?
BIG EDIT:
Changing the answer based on what I have learned recently and a better understanding of the question
First off, I don't know how to tell the linker to generate a bl to a hardcoded address that is not actually in this code. You might try to rig up an ELF file that has labels and such but dummy or no code; I don't know if that will fool the linker or not. You would have to modify the linker script as well. Not worth it.
your other question that was spawned from this one:
Arm/Thumb: using BX in Thumb code, to call a Thumb function, or to jump to a Thumb instruction in another function
For branching this works just fine:
LDR R6, =0x24000
ADD R6, #1 @ set lsb to 1
BX R6
or save an instruction and just do this
LDR R6, =0x24001
BX R6
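The address arithmetic behind those two variants can be written out in C. This is just an illustration with made-up function names: bit 0 of the register tells BX which state to enter and is not part of the fetch address.

```c
#include <stdint.h>

/* Value to load into the register for BX when the destination is Thumb code.
   OR is idempotent, unlike adding 1 to an address that is already odd. */
uint32_t thumb_bx_target(uint32_t addr) {
    return addr | 1u;
}

/* Nonzero if BX with this register value enters Thumb state. */
int bx_selects_thumb(uint32_t reg) {
    return (int)(reg & 1u);
}

/* Address execution continues at after BX: bit 0 is dropped. */
uint32_t bx_fetch_address(uint32_t reg) {
    return reg & ~1u;
}
```

So thumb_bx_target(0x24000) gives 0x24001, the constant used in the shorter version, and execution still continues at 0x24000.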
if you want to branch link and you know the address and you are in thumb mode and want to get to thumb code then
ldr r6,=0x24001
bl thumb_trampoline
@ returns here
...
.thumb_func
thumb_trampoline:
bx r6
And almost the exact same if you are starting in arm mode, and want to get to thumb code at an address you already know.
ldr r6,=0x24001
bl arm_trampoline
@ returns here
...
arm_trampoline:
bx r6
You have to know that you can trash r6 in this way (make sure r6 isn't holding some value being used by the code that called this code).
Very sorry for misleading you with the other answer. I could swear that mov lr,pc pulled in the lsbit as a mode, but it doesn't.
The accepted answer achieves the desired goal, but to address the question exactly as asked, you can use the .equ directive to associate a constant value with a symbol, which can then be used as an operand to instructions. This has the toolchain synthesise the trampoline if/when necessary:
.equ myFirmwareFunction, 0x12346570
.globl _start
mov r0, #42
b myFirmwareFunction
Which generates the following assembly[1]
01000000 <_start>:
1000000: e3a0002a mov r0, #42 ; 0x2a
1000004: eaffffff b 1000008 <__*ABS*0x12346570_veneer>
01000008 <__*ABS*0x12346570_veneer>:
__*ABS*0x12346570_veneer():
1000008: e51ff004 ldr pc, [pc, #-4] ; 100000c <__*ABS*0x12346570_veneer+0x4>
100000c: 12346570 .word 0x12346570
If the immediate value is close enough to PC that the offset will fit in the immediate field, then the veneer (trampoline) is skipped and you will get a single branch instruction to the specified constant address.
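"Close enough" can be made concrete for the ARM-state encoding used in the listing. The check below is a sketch with a hypothetical function name: an ARM B/BL holds a signed 24-bit word offset relative to the branch address plus 8, which gives a reach of roughly ±32 MB. Thumb branch encodings have different ranges and are not modelled here.

```c
#include <stdint.h>

/* Nonzero if an ARM-state B/BL at branch_addr can reach target directly. */
int arm_branch_reaches(uint32_t branch_addr, uint32_t target) {
    /* In ARM state, PC reads as the branch address + 8. */
    int64_t off = (int64_t)target - ((int64_t)branch_addr + 8);
    return (off % 4 == 0)
        && off >= -(1LL << 25)
        && off <= (1LL << 25) - 4;
}
```

For the branch at 0x1000004, the target 0x12346570 is about 275 MB away, so a veneer is needed. The veneer itself at 0x1000008 is in range with an offset of -4 bytes, which matches the eaffffff encoding shown.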
[1] using the CodeSourcery (2009q1) toolchain with:
arm-none-eabi-gcc -march=armv7-a -x assembler test.spp -o test.elf -Ttext=0x1000000 -nostdlib
Treat this more as pseudocode than anything. If there's some macro or other element that you feel should be included, let me know.
I'm rather new to assembly. I programmed on a pic processor back in college, but nothing since.
The problem here (segmentation fault) is the first instruction after "Compile function entrance, setup stack frame." or "push %ebp". Here's what I found out about those two instructions:
http://unixwiz.net/techtips/win32-callconv-asm.html
Save and update the %ebp :
Now that we're in the new function, we need a new local stack frame pointed to by %ebp, so this is done by saving the current %ebp (which belongs to the previous function's frame) and making it point to the top of the stack.
push ebp
mov ebp, esp // ebp <- esp
Once %ebp has been changed, it can now refer directly to the function's arguments as 8(%ebp), 12(%ebp). Note that 0(%ebp) is the old base pointer and 4(%ebp) is the old instruction pointer.
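Those offsets can be pictured as a C struct laid over the memory that %ebp points to. This is only an illustration of the 32-bit cdecl layout described in that article; the struct and field names are mine.

```c
#include <stddef.h>
#include <stdint.h>

/* What %ebp points at after `push %ebp; mov %esp,%ebp` in a 32-bit cdecl function. */
struct cdecl_frame {
    uint32_t saved_ebp;       /* 0(%ebp): caller's base pointer */
    uint32_t return_address;  /* 4(%ebp): pushed by `call` */
    uint32_t args[2];         /* 8(%ebp), 12(%ebp): first two arguments */
};
```

Here offsetof(struct cdecl_frame, args) is 8, so for a function called with a single argument, that argument is what sits at 8(%ebp).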
Here's the code. This is from a JIT compiler for a project I'm working on. I'm doing this more for the learning experience than anything.
IL_CORE_COMPILE(avs_x86_compiler_compile)
{
X86GlobalData *gd = X86_GLOBALDATA(ctx);
ILInstruction *insn;
avs_debug(print("X86: Compiling started..."));
/* Initialize X86 Assembler opcode context */
x86_context_init(&gd->ctx, 4096, 1024*1024);
/* Compile function entrance, setup stack frame*/
x86_emit1(&gd->ctx, pushl, ebp);
x86_emit2(&gd->ctx, movl, esp, ebp);
/* Setup floating point rounding mode to integer truncation */
x86_emit2(&gd->ctx, subl, imm(8), esp);
x86_emit1(&gd->ctx, fstcw, disp(0, esp));
x86_emit2(&gd->ctx, movl, disp(0, esp), eax);
x86_emit2(&gd->ctx, orl, imm(0xc00), eax);
x86_emit2(&gd->ctx, movl, eax, disp(4, esp));
x86_emit1(&gd->ctx, fldcw, disp(4, esp));
for (insn=avs_il_tree_base(tree); insn != NULL; insn = insn->next) {
avs_debug(print("X86: Compiling instruction: %p", insn));
compile_opcode(gd, obj, insn);
}
/* Restore floating point rounding mode */
x86_emit1(&gd->ctx, fldcw, disp(0, esp));
x86_emit2(&gd->ctx, addl, imm(8), esp);
/* Cleanup stack frame */
x86_emit0(&gd->ctx, emms);
x86_emit0(&gd->ctx, leave);
x86_emit0(&gd->ctx, ret);
/* Link machine */
obj->run = (AvsRunnableExecuteCall) gd->ctx.buf;
return 0;
}
And when obj->run is called, it's called with obj as its only argument:
obj->run(obj);
If it helps, here are the instructions for the entire function call. It's basically an assignment operation: foo=3*0.2;. foo is pointing to a float in C.
0x8067990: push %ebp
0x8067991: mov %esp,%ebp
0x8067993: sub $0x8,%esp
0x8067999: fnstcw (%esp)
0x806799c: mov (%esp),%eax
0x806799f: or $0xc00,%eax
0x80679a4: mov %eax,0x4(%esp)
0x80679a8: fldcw 0x4(%esp)
0x80679ac: flds 0x806793c
0x80679b2: fsts 0x805f014
0x80679b8: fstps 0x8067954
0x80679be: fldcw (%esp)
0x80679c1: add $0x8,%esp
0x80679c7: emms
0x80679c9: leave
0x80679ca: ret
Edit: Like I said above, in the first instruction in this function, %ebp is void. This is also the instruction that causes the segmentation fault. Is that because it's void, or am I looking for something else?
Edit: Scratch that. I keep typing edp instead of ebp. Here are the values of ebp and esp.
(gdb) print $esp
$1 = (void *) 0xbffff14c
(gdb) print $ebp
$3 = (void *) 0xbffff168
Edit: Those values above are wrong. I should have used the 'x' command, like below:
(gdb) x/x $ebp
0xbffff168: 0xbffff188
(gdb) x/x $esp
0xbffff14c: 0x0804e481
Here's a reply from someone on a mailing list regarding this. Anyone care to illuminate what he means a bit? How do I check to see how the stack is set up?
An immediate problem I see is that the stack pointer is not properly aligned. This is 32-bit code, and the Intel manual says that the stack should be aligned at 32-bit addresses. That is, the least significant digit in esp should be 0, 4, 8, or c.
I also note that the values in ebp and esp are very far apart. Typically, they contain similar values -- addresses somewhere in the stack.
I would look at how the stack was set up in this program.
He replied with corrections to the above comments. He was unable to see any problems after further input.
Another edit: Someone replied that the code page may not be marked executable. How can I ensure it is marked as such?
The problem had nothing to do with the generated code. Adding -z execstack to the linker flags fixed the problem.
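An alternative to the linker flag, for anyone who would rather mark only the JIT buffer executable, is to ask the OS for an executable mapping directly. The sketch below assumes a POSIX system with mmap and mprotect; jit_alloc_exec is a hypothetical helper and not part of the project's x86 assembler context.

```c
#define _DEFAULT_SOURCE       /* exposes MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON   /* older BSD/macOS spelling */
#endif

/* Copies len bytes of machine code into a fresh mapping, then switches the
   mapping from writable to executable. Returns NULL on failure. */
void *jit_alloc_exec(const void *code, size_t len) {
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    memcpy(buf, code, len);
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

Whether already-emitted code can simply be copied like this depends on how the assembler context resolves addresses, which isn't shown here; emitting directly into a buffer obtained this way (and calling mprotect once compilation is finished) avoids that question.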
If push %ebp is causing a segfault, then your stack pointer isn't pointing at valid stack. How does control reach that point? What platform are you on, and is there anything odd about the runtime environment? At the entry to the function, %esp should point to the return address in the caller on the stack. Does it?
Aside from that, the whole function is pretty weird. You go out of your way to set the rounding bits in the fp control word, and then don't perform any operations that are affected by rounding. All the function does is copy some data, but uses floating-point registers to do it when you could use the integer registers just as well. And then there's the spurious emms, which you need after using MMX instructions, not after doing x87 computations.
Edit: See the answer from Scott (the original questioner) for the actual reason for the crash.