Is it possible using GNU tools (gcc, binutils, etc) to modify all occurrences of an assembly instruction into a no-op? Specifically, gcc with the -pg option generates the following assembly (ARM):
0x0: e1a0c00d mov ip, sp
0x4: e92dd800 stmdb sp!, {fp, ip, lr, pc}
0x8: e24cb004 sub fp, ip, #4 ; 0x4
0xc: ebfffffe bl 0 <mcount>
I want to record the address of this last instruction, and then change it to a nop, as in the following code:
0x0: e1a0c00d mov ip, sp
0x4: e92dd800 stmdb sp!, {fp, ip, lr, pc}
0x8: e24cb004 sub fp, ip, #4 ; 0x4
0xc: e1a00000 nop (mov r0,r0)
The Linux kernel can do something similar to this at run-time, but I'm looking for a build-time solution.
You can compile the code with gcc -S to output an assembler listing, instead of compiling fully into an object file or executable. Then, just replace the desired instructions with no-ops (e.g. using sed), and continue compilation from there.
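As a rough illustration of the replace step, here is a minimal Python sketch that does the same job as a sed one-liner and also records which lines were changed. It assumes the listing calls the profiling hook as `bl mcount`, as in the question's disassembly (other toolchains may emit a different symbol, so check your own `gcc -S` output). The function name `nop_out_mcount` is made up for this example.

```python
import re

# Matches a line whose only instruction is a call to the profiling hook.
# The symbol name "mcount" is an assumption; adjust it to match your listing.
MCOUNT_CALL = re.compile(r"^(\s*)bl\s+mcount\s*$")

def nop_out_mcount(asm_text):
    """Return (patched_text, line_numbers) for an assembler listing."""
    patched = []
    hits = []
    for lineno, line in enumerate(asm_text.splitlines(), start=1):
        m = MCOUNT_CALL.match(line)
        if m:
            hits.append(lineno)                      # remember where the call was
            patched.append(m.group(1) + "mov\tr0, r0")  # the nop from the question
        else:
            patched.append(line)
    return "\n".join(patched) + "\n", hits
```

The recorded values are line numbers in the listing, not addresses; getting addresses would still need the assembled object (for example from objdump).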
If you also want to do this for object files or libraries that you don't have the original source code for, you'll instead have to use a tool such as objdump(1) to disassemble them and get the addresses of the instructions you wish to replace. Then, parse the object file headers to find the offsets within the file of those instructions, and then replace the machine instructions with no-ops directly in the object files. This is a little trickier, but doable.
This will certainly be easier with a RISC-ish fixed-length instruction format than for e.g. x86.
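With fixed-length ARM instructions, the patching step itself can be sketched in a few lines of Python. This is only a sketch under stated assumptions: `text` is the raw little-endian contents of the .text section, already extracted by some other means, and `offsets` are byte offsets read off an objdump listing. The name `patch_calls` is made up. In a relocatable object the call will most likely still have a relocation entry attached, which this sketch does not touch and which would need separate handling.

```python
import struct

ARM_BL_ALWAYS = 0xEB    # top byte of an unconditional BL (cond = AL)
ARM_NOP = 0xE1A00000    # mov r0, r0, as shown in the question

def patch_calls(text, offsets):
    """Replace the BL at each byte offset in a little-endian ARM code image.

    Raises ValueError if an offset does not hold an unconditional BL,
    so a wrong address cannot silently corrupt the code.
    """
    buf = bytearray(text)
    for off in offsets:
        (word,) = struct.unpack_from("<I", buf, off)
        if word >> 24 != ARM_BL_ALWAYS:
            raise ValueError(f"no BL at offset {off:#x}: {word:#010x}")
        struct.pack_into("<I", buf, off, ARM_NOP)
    return bytes(buf)
```

Applied to the four words from the question with `offsets=[0xC]`, only the last word changes, from `ebfffffe` to `e1a00000`.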
It should be relatively straightforward to use libelf (nice tutorial here: http://people.freebsd.org/~jkoshy/download/libelf/article.html) or libbfd (http://sourceware.org/binutils/docs-2.19/bfd/index.html) to open the object file, modify instructions within the .text section, and write it out again using provided APIs. Whether it's worth the effort or not will depend on non-technical considerations (I am a bit curious though...).
It's worth mentioning that there might be a few wrinkles with using libelf or libbfd if this needs to work in a cross-development environment.
I'm writing some ARM64 assembly code for macOS, and it needs to access a global variable.
I tried to use the solution in this SO answer, and it works fine if I just call the function as is. However, my application needs to patch some instructions of this function, and the way I'm doing it, the function gets moved somewhere else in memory in the process. Note the adrp/ldr pair is untouched during patching.
However, if I try to run the function after moving it elsewhere in memory, it no longer returns correct results. This happens even if I just memcpy() the code as is, without patching. After tracing with a debugger, I isolated the issue to the address of the global variable being incorrectly loaded by the adrp/ldr pair (and weirdly, the ldr is assembled as an add, as seen with objdump straight after compiling the binary; not sure if it's somehow related to the issue here).
What would be the correct way to load a global variable, so that it survives the function being copied somewhere else and run from there?
"Note the adrp/ldr pair is untouched during patching."
There's the issue. If you rip code out of the binary it's in, then you effectively need to re-link it.
There are two ways of dealing with this:
1. If you have complete control over the segment layout, then you could have one executable segment with all of your assembly in it, and right next to it one segment with all addresses that code needs, and make sure the assembly ONLY has references to things on that page. Then wherever you copy your assembly, you'd also copy the data page next to it. This would enable you to make use of static addresses that get rebased by the dynamic linker at the time your binary is loaded. This might look something like:
.section __ASM,__asm,regular
.globl _asm_stub
.p2align 2
_asm_stub:
adrp x0, _some_ref@PAGE
ldr x0, [x0, _some_ref@PAGEOFF]
ret
.section __REF,__ref
.globl _some_ref
.p2align 3
_some_ref:
.8byte _main
Compile that with -Wl,-segprot,__ASM,rx,rx and you'll get an executable __ASM and a writeable __REF segment. Those two would have to maintain their relative position to each other when they get copied around.
(Note that on arm64 macOS you cannot put symbol references into executable segments for the dynamic linker to rebase, because it will fault and crash while trying to do so, and even if it were able to do that, it would invalidate the code signature.)
2. You act as a linker, scanning for PC-relative instructions and re-linking them as you go. The list of PC-relative instructions in arm64 is quite short, so it should be a feasible amount of work:
adr and adrp
b and bl
b.cond (and bc.cond with FEAT_HBC)
cbz and cbnz
tbz and tbnz
ldr and ldrsw (literal)
ldr (SIMD & FP literal)
prfm (literal)
(You can look for the string PC[] in the ARMv8 Reference Manual to find all uses.)
For each of those you'd have to check whether their target address lies within the range that's being copied or not. If it does, then you'd leave the instruction alone (unless you copy the code to a different offset within the 4K page than it was before, in which case you have to fix up adrp instructions). If it doesn't, then you'll have to recalculate the offset and emit a new instruction. Some of the instructions have a really low maximum offset (tbz/tbnz: ±32KiB). But usually the only instructions that reference addresses across function boundaries are adr, adrp, b, bl and ldr. If all code on the page is written by you, then you can do adrp+add instead of adr and adrp+ldr instead of just ldr, and if you have compiler-generated code on there, then all adr and ldr instructions will have a nop before or after, which you can use to turn them into an adrp combo. That should get your maximum reference range up to ±128MiB.
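To make the "recalculate the offset and emit a new instruction" step concrete, here is a minimal Python sketch for the adrp case only. It relies on the A64 adrp layout (bit 31 set, immlo in bits 30:29, bits 28:24 equal to 10000, immhi in bits 23:5, Rd in bits 4:0). The function names are made up for this example, and the other instructions in the list would each need their own variant.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def is_adrp(insn):
    """True if the 32-bit word is an adrp instruction."""
    return (insn & 0x9F000000) == 0x90000000

def adrp_target(insn, pc):
    """Page address that an adrp located at `pc` computes."""
    immlo = (insn >> 29) & 0x3
    immhi = (insn >> 5) & 0x7FFFF
    imm = sign_extend((immhi << 2) | immlo, 21)
    return (pc & ~0xFFF) + (imm << 12)

def relocate_adrp(insn, old_pc, new_pc):
    """Re-encode an adrp so it yields the same page when run at new_pc."""
    if not is_adrp(insn):
        raise ValueError("not an adrp instruction")
    target = adrp_target(insn, old_pc)
    delta = (target >> 12) - (new_pc >> 12)      # distance in 4K pages
    if not -(1 << 20) <= delta < (1 << 20):
        raise ValueError("target page out of adrp range")
    imm = delta & 0x1FFFFF
    # Keep the opcode bits and Rd, replace only the immediate fields.
    return (insn & 0x9F00001F) | ((imm & 0x3) << 29) | ((imm >> 2) << 5)
```

For example, `adrp x0` encoded as `0x90000020` at address `0x100000000` refers to page `0x100004000`; after `relocate_adrp(0x90000020, 0x100000000, 0x100010ABC)` the new word still decodes to that page from the new location.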
I've got a program that I'm running on an ARM and I'm writing one function of it in assembly. I've made good progress on this, although I've sometimes found it difficult to figure out exactly how to write certain instructions for Go's assembler. For example, I didn't expect a right shift to be written like this:
MOVW R3>>8, R3
Now I want to do a multiply and accumulate (MLA). According to this doc, not all opcodes are supported, so maybe MLA isn't, but I don't know how to tell whether it is or not. I see mentions of MLA with regard to ARM in the golang repo, but I'm not really sure what to make of what I see there.
Is there anywhere that documents what instructions are supported and how to write them? Can anyone give me any useful pointers?
Here is a bit of a scrappy doc I wrote on how to write ARM assembler.
I wrote it from the point of view of an experienced ARM person trying to figure out how Go assembler works.
Here is an excerpt from the start. Feel free to email me if you have more questions!
The Go assembler is based on the Plan 9 assembler, which is documented here:
http://plan9.bell-labs.com/sys/doc/asm.html
Nice introduction to ARM
http://www.davespace.co.uk/arm/introduction-to-arm/index.html
Opcodes
http://simplemachines.it/doc/arm_inst.pdf
Instructions
Destination goes last not first
Parameters seem to be completely reversed
May be condensed to 2 operands, so
ADD r0, r0, r1 ; [ARM] r0 <- r0 + r1
is written as
ADD r1, r0, r0
or
ADD r1, r0
Constants denoted with '$' not '#'
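The two rules above (operands reversed, `$` instead of `#`) can be sketched as a tiny Python helper. It models only what this excerpt states and handles only plain register and immediate operands; it does not translate mnemonics, register names, shifted operands or memory operands, all of which the real Go assembler treats differently. The function name is made up for this sketch.

```python
def arm_to_go_operands(line):
    """Rewrite 'OP dst, src1, src2' as 'OP src2, src1, dst'.

    Only reverses the operand list and swaps '#' for '$'.
    """
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [op.strip().replace("#", "$") for op in rest.split(",")]
    operands.reverse()
    return mnemonic + " " + ", ".join(operands)
```

With this, `arm_to_go_operands("ADD r0, r0, r1")` gives `ADD r1, r0, r0`, matching the example above.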
I am happy to ask questions on Stack Overflow because of the prompt replies from experts worldwide :-) I wish to explain clearly the issue I am facing.
What I wish to do?
I wish to evaluate the NEON instruction set through various examples available online in order to write some algorithms of my own.
For evaluation purpose, I'm making use of memcpy samples available at ARM official website. Here is the link http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html.
My Environment
I am compiling NEON instructions in Visual Studio 2008 with Platform Builder for Windows CE 7.0. The latest Platform Builder supports NEON instruction compilation.
I am running my code on an OMAP3530 Mistral EVM board.
I have created a simple static library (NEONLIB.lib) that contains NEON instructions to perform the required operation. I have created a simple stream driver (stream_interface.dll) that uses this static library to perform a memcpy operation on a buffer of 1280x720x2 bytes. I am loading and unloading this driver dynamically using a simple application (Neon_Test.exe).
Issue I'm facing
Once the OS boots, I launch this application manually and receive the following exception:
Exception 'Data Abort' (0x4): Thread-Id=047d002a(pth=c049c990), Proc-Id=00400002(pprc=8a3425e0) 'NK.EXE', VM-active=05420012(pprc=c04a1344) 'Neon_Test.exe'
PID:00400002 TID:047D002A PC=ef135120(stream_interface.dll+0x00005120) RA=ef133c18(stream_interface.dll+0x00003c18) SP=d0f3fc84, BVA=00000000
NeonMemcpy is the function in my driver that calls the NEON function.
Stream_Interface.map file
....
0001:000029f0 ?NeonInit@@YAHXZ 100039f0 f Neon_Process.obj
0001:00002bb4 ?NeonMemcpy@@YAXXZ 10003bb4 f Neon_Process.obj
0001:00002c58 NKDbgPrintfW 10003c58 f coredll:COREDLL.dll
0001:00002c68 SetLastError 10003c68 f coredll:COREDLL.dll
....
Neon_Process.cod file
.......
; 108 : MemcpyCustom((void*)g_pOUTVirtualAddr, (void*)g_pINPVirtualAddr, 1280 * 720 * 2);
00050 e5951000 ldr r1,[r5]
00054 e1a04000 mov r4,r0
00058 e5950004 ldr r0,[r5,#4]
0005c e3a02ae1 mov r2,#0xE1000
00060 eb000000 bl MemcpyCustom
; 109 : RETAILMSG(1, (L"Time for Copy using Neon %d\r\n", GetTickCount() - dwStartTime));
00064 eb000000 bl GetTickCount
00068 e1a03000 mov r3,r0
.......
My assembly source
AREA omap_neoncoding, CODE, READONLY
EXPORT MemcpyCustom
INCLUDE omap_neoncoding.inc
MemcpyCustom
stmfd sp!, {r4-r12,lr}
NEONCopyPLD
PLD [r1, #0xC0]
VLDM r1!,{d0-d7}
VSTM r0!,{d0-d7}
SUBS r2,r2,#0x40
BGE NEONCopyPLD
END
Based on the article by Bruce Eitman, http://geekswithblogs.net/BruceEitman/archive/2008/05/19/windows-ce--finding-the-cause-of-a-data-abort.aspx, the location where the exception occurs is
00064 eb000000 bl GetTickCount
But I am sure that there is no issue in GetTickCount(); if I remove the MemcpyCustom call, everything works fine. I hope I have given all the information needed to sort out this issue. Please help me find the exact reason for the exception. Do I need to perform any steps before calling NEON functions, or are there any special NEON instructions that should be used?
Thanks in advance for your help.
Spark
You are pushing registers in the function's prolog:
stmfd sp!, {r4-r12,lr}
But there is no corresponding pop at the end, and no return instruction. So the execution continues to whatever code happens to be after the function and what happens next is anyone's guess. The following, placed after the BGE should fix the problem:
ldmfd sp!, {r4-r12,pc}
EDIT: By the way, since you're not actually using r4-r12 in the function, you don't need to save them. You also don't need to save d0-d7 as they're considered volatile. So you can remove stmfd and replace ldmfd by just bx lr.
MemcpyCustom
PLD [r1, #0xC0]
VLDM r1!,{d0-d7}
VSTM r0!,{d0-d7}
SUBS r2,r2,#0x40
BGE MemcpyCustom
BX lr
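One detail worth checking in a loop of this shape is how many times it runs. The following Python sketch models only the counter in r2 (not the NEON transfers) and assumes the size is a positive multiple of 64 that fits in a signed 32-bit value. Under those assumptions, `SUBS` followed by `BGE` makes one more pass than size/64, because the branch is still taken when the result is exactly zero, whereas `BGT` stops at zero. The helper name is made up for this sketch.

```python
def iterations(size, branch):
    """Count passes through a 'copy 64 bytes; SUBS r2, r2, #0x40; B<cond>' loop."""
    r2 = size
    count = 0
    while True:
        count += 1               # one VLDM/VSTM pair moves 64 bytes
        r2 -= 0x40               # SUBS r2, r2, #0x40
        if branch == "BGE":
            taken = r2 >= 0      # signed greater-than-or-equal
        elif branch == "BGT":
            taken = r2 > 0       # signed greater-than
        else:
            raise ValueError("unsupported branch: " + branch)
        if not taken:
            return count
```

For the 1280 * 720 * 2 byte buffer from the question, size/64 is 28800; `iterations(1280 * 720 * 2, "BGE")` returns 28801 and `iterations(1280 * 720 * 2, "BGT")` returns 28800.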
I'm trying to call a function that is coded in ARM NEON assembly in an .s file that looks like this:
AREA myfunction, code, readonly, ARM
global fun
align 4
fun
push {r4, r5, r6, r7, lr}
add r7, sp, #12
push {r8, r10, r11}
sub r4, sp, #64
bic r4, r4, #15
mov sp, r4
vst1.64 {d8, d9, d10, d11}, [r4]!
vst1.64 {d12, d13, d14, d15}, [r4]
[....]
and I'm assembling it like this:
armasm.exe -32 func.s func.obj
Unfortunately this doesn't work, and I'm getting an illegal instruction exception when I try to call the function. When I used dumpbin.exe to disassemble the .obj, it seemed to be disassembling as though it was Thumb code, despite the ARM directive in the assembly (see code above).
I suspect the function is being called in Thumb mode, and that all functions are assumed to be in Thumb mode by default on Windows. I can't seem to find any info on this though.
Does anyone know what is going on here?
EDIT: This happens on Microsoft Surface as well
VS 2012 by default produces Thumb code for both Windows RT and Windows Phone 8, so the error you got is probably caused by calling into ARM code from Thumb code. You have two options:
1. Switch from thumb mode to arm mode before calling your function (you can use BX asm instruction for it), or
2. You can try to rewrite your NEON code in C++ using ARM/NEON intrinsics - they are supported by VS 2012. Just include "arm_neon.h" and you're done.
For the ARM intrinsics reference check out the following link: http://msdn.microsoft.com/en-us/library/hh875058.aspx
For NEON intrinsics reference check out this link: http://infocenter.arm.com/help/topic/com.arm.doc.dui0491c/DUI0491C_arm_compiler_reference.pdf
The NEON intrinsics from the link above are generally supported by VS 2012, though there might be some small differences. If unsure, check the "arm_neon.h" include to find out.
You could start your assembly code with a bx instruction in Thumb mode and simply branch to the ARM part in the same source file.
And you don't have to switch back to Thumb mode at the end since you'll finish the ARM function in bx or pop {pc} anyway which does the switching automatically.
My answer is WAAAAAAY late, but I'm really curious if it works on a WP. (I don't have any)
I'm trying to write a firmware mod (to existing firmware, for which I don't have source code).
All Thumb code.
Does anybody have any idea how to do this in the gcc as (GAS) assembler:
Use BL without having to manually calculate offsets, when BL'ing to some existing function (not in my code, but I know its address).
Currently, if I want to use BL, I have to:
- go back in my code
- figure out and add up all the bytes that would result from assembling all the previous instructions in the function I'm writing
- add the beginning address of my function to that (I specify the starting address of what I'm writing in the linker script)
- and then subtract the address of the firmfunc function I want to call
All this, just to calculate the offset, to be able to write a bl offset to call an existing firmware function?
And if I change any code before that BL, I have to do it all over again manually!
See, this is why I want to learn to use BX right, instead of BL.
Also, I don't quite understand BX. If I use BX to jump to an absolute address, do I have to increase the actual address by 1 when calling Thumb code from Thumb code (to keep the lsb 1), so that the CPU will know it's Thumb code?
BIG EDIT:
Changing the answer based on what I have learned recently and a better understanding of the question
First off, I don't know how to tell the linker to generate a bl to an address that is hardcoded and not actually in this code. You might try to rig up an ELF file that has labels and such but dummy or no code; I don't know if that will fool the linker or not. You would have to modify the linker script as well. Not worth it.
Your other question that was spawned from this one:
Arm/Thumb: using BX in Thumb code, to call a Thumb function, or to jump to a Thumb instruction in another function
For branching this works just fine:
LDR R6, =0x24000
ADD R6, #1 @ (set lsb to 1)
BX R6
or save an instruction and just do this
LDR R6, =0x24001
BX R6
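What BX does with the low bit can be spelled out in a short Python sketch. It models only the two common cases (bit 0 set selects Thumb state, bit 0 clear selects ARM state) and ignores the unpredictable case of an ARM-state target that is not word-aligned. The function name is made up.

```python
def bx_destination(reg_value):
    """Return (address, state) that a BX to reg_value would select.

    Bit 0 is not part of the address; it only chooses the instruction set.
    """
    if reg_value & 1:
        return reg_value & ~1, "thumb"
    return reg_value, "arm"
```

So `bx_destination(0x24001)` gives address `0x24000` in Thumb state, which is why the literal is loaded as `0x24001` even though the code lives at `0x24000`.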
If you want to branch and link, you know the address, and you are in Thumb mode and want to get to Thumb code, then:
ldr r6,=0x24001
bl thumb_trampoline
;@returns here
...
.thumb_func
thumb_trampoline:
bx r6
And it is almost exactly the same if you are starting in ARM mode and want to get to Thumb code at an address you already know.
ldr r6,=0x24001
bl arm_trampoline
;@returns here
...
arm_trampoline:
bx r6
You have to know that you can trash r6 in this way (make sure r6 isn't holding some value being used by the code that called this code).
Very sorry for misleading you with the other answer. I could have sworn that mov lr,pc pulled in the lsbit as a mode, but it doesn't.
The accepted answer achieves the desired goal, but to address the question exactly as asked, you can use the .equ directive to associate a constant value with a symbol, which can then be used as an operand to instructions. This has the toolchain synthesise the trampoline if/when necessary:
.equ myFirmwareFunction, 0x12346570
.globl _start
_start:
mov r0, #42
b myFirmwareFunction
Which generates the following assembly[1]
01000000 <_start>:
1000000: e3a0002a mov r0, #42 ; 0x2a
1000004: eaffffff b 1000008 <__*ABS*0x12346570_veneer>
01000008 <__*ABS*0x12346570_veneer>:
__*ABS*0x12346570_veneer():
1000008: e51ff004 ldr pc, [pc, #-4] ; 100000c <__*ABS*0x12346570_veneer+0x4>
100000c: 12346570 data: #0x12346570
If the immediate value is close enough to PC that the offset will fit in the immediate field, then the veneer (trampoline) is skipped and you will get a single branch instruction to the specified constant address.
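Whether the offset fits can be checked by hand. Here is a minimal Python sketch for the ARM-state `b` instruction, which has a signed 24-bit word offset measured from the instruction address plus 8. The function name is made up, and returning `None` simply stands for "a veneer would be needed".

```python
def encode_arm_b(pc, target):
    """Return the ARM-state 'b target' word for an instruction at pc.

    Returns None when the target is outside the +/-32MiB branch range.
    """
    offset = target - (pc + 8)       # ARM state: PC reads as address + 8
    if offset % 4 != 0:
        raise ValueError("ARM branch targets must be word-aligned")
    imm24 = offset >> 2
    if not -(1 << 23) <= imm24 < (1 << 23):
        return None                  # does not fit in the immediate field
    return 0xEA000000 | (imm24 & 0xFFFFFF)
```

With the addresses from the listing above, `encode_arm_b(0x1000004, 0x1000008)` gives `0xEAFFFFFF`, the branch to the veneer, while `encode_arm_b(0x1000004, 0x12346570)` returns `None`, which is why a direct branch to the firmware address is not possible from there.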
[1] using the CodeSourcery (2009q1) toolchain with:
arm-none-eabi-gcc -march=armv7-a -x assembler test.spp -o test.elf -Ttext=0x1000000 -nostdlib