I've got a program that I'm running on an ARM and I'm writing one function of it in assembly. I've made good progress on this, although I've sometimes found it difficult to figure out exactly how to write certain instructions for Go's assembler. For example, I didn't expect a right shift to be written like this:
MOVW R3>>8, R3
Now I want to do a multiply and accumulate (MLA). According to this doc not all opcodes are supported, so maybe MLA isn't, but I don't know how to tell whether it is or not. I see mentions of MLA with regard to ARM in the golang repo, but I'm not really sure what to make of what I see there.
Is there anywhere that documents what instructions are supported and how to write them? Can anyone give me any useful pointers?
Here is a bit of a scrappy doc I wrote on how to write ARM assembler
I wrote it from the point of view of an experienced ARM person trying to figure out how Go assembler works.
Here is an excerpt from the start. Feel free to email me if you have more questions!
The Go assembler is based on the Plan 9 assembler, which is documented here.
http://plan9.bell-labs.com/sys/doc/asm.html
Nice introduction to ARM
http://www.davespace.co.uk/arm/introduction-to-arm/index.html
Opcodes
http://simplemachines.it/doc/arm_inst.pdf
Instructions
Destination goes last not first
Parameters seem to be completely reversed
May be condensed to 2 operands, so
ADD r0, r0, r1 ; [ARM] r0 <- r0 + r1
is written as
ADD r1, r0, r0
or
ADD r1, r0
Constants denoted with '$' not '#'
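To make the operand rules above concrete, here is a small sketch in Python that rewrites a simple ARM-order instruction into the order described in the excerpt. The helper name arm_to_go_operands is made up for illustration, and it only handles plain register and immediate operands (no shifts, no memory operands, no condition codes). It also upper-cases register names, which is an assumption based on the R3 spelling used in the question.

```python
def arm_to_go_operands(line):
    """Rewrite 'OP rd, rn, rm' (ARM order) as 'OP Rm, Rn, Rd' (destination last)."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [op.strip() for op in rest.split(",")]
    converted = []
    for op in reversed(operands):           # destination moves to the end
        if op.startswith("#"):
            converted.append("$" + op[1:])  # constants use '$' instead of '#'
        else:
            converted.append(op.upper())    # assumed register spelling: R0, R1, ...
    return mnemonic.upper() + " " + ", ".join(converted)


print(arm_to_go_operands("ADD r0, r0, r1"))
print(arm_to_go_operands("ADD r0, r0, #4"))
```

This is only a reading aid for the rules in the excerpt, not a translator for real code.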
I'm writing some ARM64 assembly code for macOS, and it needs to access a global variable.
I tried to use the solution in this SO answer, and it works fine if I just call the function as is. However, my application needs to patch some instructions of this function, and the way I'm doing it, the function gets moved somewhere else in memory in the process. Note the adrp/ldr pair is untouched during patching.
However, if I try to run the function after moving it elsewhere in memory, it no longer returns correct results. This happens even if I just memcpy() the code as is, without patching. After tracing with a debugger, I isolated the issue to the address of the global variable being incorrectly loaded by the adrp/ldr pair (and weirdly, the ldr is assembled as an add, as seen with objdump straight after compiling the binary; not sure if it's somehow related to the issue here).
What would be the correct way to load a global variable, so that it survives the function being copied somewhere else and run from there?
Note the adrp/ldr pair is untouched during patching.
There's the issue. If you rip code out of the binary it's in, then you effectively need to re-link it.
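To see why an untouched adrp/ldr pair breaks after a copy, here is a rough Python model of the address the pair computes. The function name and all the addresses are made up for illustration; the only architectural facts used are that adrp works relative to the 4 KiB page of the instruction's own address and that the ldr adds an offset within the page.

```python
PAGE_SIZE = 0x1000  # adrp works in units of 4 KiB pages


def adrp_ldr_address(pc, page_delta, page_offset):
    """Address read by an adrp/ldr pair whose adrp executes at `pc`."""
    page_base = (pc & ~(PAGE_SIZE - 1)) + page_delta * PAGE_SIZE
    return page_base + page_offset


# Same instruction bytes, two different locations (hypothetical addresses).
original = adrp_ldr_address(0x100003F80, page_delta=1, page_offset=0x10)
copied = adrp_ldr_address(0x280000F80, page_delta=1, page_offset=0x10)
print(hex(original), hex(copied))
```

The encoded page delta stays the same, so the computed address moves along with the code while the variable itself stays where it was.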
There are two ways of dealing with this:
First, if you have complete control over the segment layout, then you could have one executable segment with all of your assembly in it, and right next to it one segment with all addresses that code needs, and make sure the assembly ONLY has references to things on that page. Then wherever you copy your assembly, you'd also copy the data page next to it. This would enable you to make use of static addresses that get rebased by the dynamic linker at the time your binary is loaded. This might look something like:
.section __ASM,__asm,regular
.globl _asm_stub
.p2align 2
_asm_stub:
adrp x0, _some_ref@PAGE
ldr x0, [x0, _some_ref@PAGEOFF]
ret
.section __REF,__ref
.globl _some_ref
.p2align 3
_some_ref:
.8byte _main
Compile that with -Wl,-segprot,__ASM,rx,rx and you'll get an executable __ASM and a writeable __REF segment. Those two would have to maintain their relative position to each other when they get copied around.
(Note that on arm64 macOS you cannot put symbol references into executable segments for the dynamic linker to rebase, because it will fault and crash while trying to do so, and even if it were able to do that, it would invalidate the code signature.)
Alternatively, you act as a linker, scanning for PC-relative instructions and re-linking them as you go. The list of PC-relative instructions in arm64 is quite short, so it should be a feasible amount of work:
adr and adrp
b and bl
b.cond (and bc.cond with FEAT_HBC)
cbz and cbnz
tbz and tbnz
ldr and ldrsw (literal)
ldr (SIMD & FP literal)
prfm (literal)
(You can look for the string PC[] in the ARMv8 Reference Manual to find all uses.)
For each of those you'd have to check whether their target address lies within the range that's being copied or not. If it does, then you'd leave the instruction alone (unless you copy the code to a different offset within the 4K page than it was before, in which case you have to fix up adrp instructions). If it isn't then you'll have to recalculate the offset and emit a new instruction. Some of the instructions have a really low maximum offset (tbz/tbnz ±32KiB). But usually the only instructions that reference addresses across function boundaries are adr, adrp, b, bl and ldr. If all code on the page is written by you then you can do adrp+add instead of adr and adrp+ldr instead of just ldr, and if you have compiler-generated code on there, then all adr's and ldr's will have a nop before or after, which you can use to turn them into an adrp combo. That should get your maximum reference range up to ±128MiB.
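As a sketch of what re-linking one instruction involves, here is how the adrp case could look in Python. The function names are made up for illustration and this only covers adrp; the other instructions in the list need the same treatment with their own immediate fields and ranges. The bit layout (immlo in bits 30:29, immhi in bits 23:5, Rd in bits 4:0) is the A64 ADRP encoding.

```python
def is_adrp(insn):
    """True if the 32-bit word is an A64 adrp instruction."""
    return (insn & 0x9F000000) == 0x90000000


def adrp_decode(insn, pc):
    """Return the page address an adrp at `pc` produces."""
    immlo = (insn >> 29) & 0x3
    immhi = (insn >> 5) & 0x7FFFF
    imm = (immhi << 2) | immlo
    if imm & (1 << 20):          # sign-extend the 21-bit page delta
        imm -= 1 << 21
    return (pc & ~0xFFF) + (imm << 12)


def adrp_encode(rd, pc, target):
    """Encode an adrp at `pc` that yields the page containing `target`."""
    delta = (target >> 12) - (pc >> 12)
    if not -(1 << 20) <= delta < (1 << 20):
        raise ValueError("target page out of adrp range")
    imm = delta & 0x1FFFFF
    return 0x90000000 | ((imm & 0x3) << 29) | ((imm >> 2) << 5) | rd


def relocate_adrp(insn, old_pc, new_pc):
    """Re-encode an adrp so it still reaches its old target from `new_pc`."""
    target = adrp_decode(insn, old_pc)
    return adrp_encode(insn & 0x1F, new_pc, target)
```

Usage would be along the lines of: for every word in the copied range, if is_adrp(word) and the decoded target lies outside the copied range, replace it with relocate_adrp(word, old_address, new_address).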
We are building a custom Android board based on an i.MX6 SoC. The Android version used is quite old (KitKat 4.4.2), and so is the kernel (3.0.35).
We are dealing with an issue that we haven't figured out yet.
Usually, when everything works fine, the reboot of the board takes 5-6 seconds at most. But sometimes the reboot takes a long time, anywhere from 1 minute 30 seconds up to 2 minutes 30 seconds.
What we would like to know is, first, which module / function is the kernel stuck in.
We suspect this could be an eMMC problem, but this is a longshot guess and we really have no clue of what is going on at this point.
Do you guys know of ways to make the kernel extra verbose, like printing every function call? Could kgdb or similar debugging tools help us at this point?
Thanks,
Regards,
Vauteck
EDIT:
So we made progress in the search of the problem. Turns out the kernel is stuck in the arm_machine_restart() function in arch/arm/kernel/process.c.
Specifically, it's stuck after the call to the cpu_proc_fin() function, which for our board is defined as cpu_v7_proc_fin in arch/arm/mm/proc-v7.S. The code of this function is in assembly:
mrc p15, 0, r0, c1, c0, 0 @ ctrl register
bic r0, r0, #0x1000 @ ...i............
bic r0, r0, #0x0006 @ .............ca.
mcr p15, 0, r0, c1, c0, 0 @ disable caches
mov pc, lr
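For anyone reading along, the two bic masks can be decoded with a few lines of Python. This is just a reading aid: the bit names are the ARMv7 control register (SCTLR) bits the comments in the listing point at, and the helper names are made up.

```python
# ARMv7 SCTLR bits touched by the listing above.
SCTLR_BITS = {
    0: "M (MMU)",
    1: "A (alignment check)",
    2: "C (data cache)",
    12: "I (instruction cache)",
}


def cleared_bits(mask):
    """Names of the known SCTLR bits that `bic r0, r0, #mask` clears."""
    return [name for bit, name in sorted(SCTLR_BITS.items()) if mask & (1 << bit)]


def apply_bic(value, mask):
    """Model of a 32-bit `bic`: clear every bit set in `mask`."""
    return value & ~mask & 0xFFFFFFFF


print(cleared_bits(0x1000))
print(cleared_bits(0x0006))
```

So the first bic turns off the instruction cache, and the second turns off the data cache and alignment checking, while the MMU bit is left alone.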
We are not the only ones that encountered this issue. (thread on NXP forum here)
We tried commenting out the line
// bic r0, r0, #0x0006 @ .............ca.
Now the function never blocks but sometimes the board still doesn't reboot immediately.
We are still looking for insights and suggestions at this point.
Thanks for reading guys.
If you enable CONFIG_PRINTK_TIME in the kernel, dmesg will print the time before the logs (in seconds). This enables you to search for time gaps between lines and maybe you're able to find what is causing this problem.
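Searching for time gaps by eye gets tedious, so here is a minimal Python sketch that does it for you. It assumes the log was captured to a text file (for a hang during reboot that probably means a serial console capture) and that lines start with the usual "[   12.345678]" timestamp. The function name is made up.

```python
import re

# Matches the "[    1.234567] message" prefix added by CONFIG_PRINTK_TIME.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def largest_gaps(lines, top=3):
    """Return up to `top` (gap_seconds, message_before, message_after), largest first."""
    stamped = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:                                   # skip lines without a timestamp
            stamped.append((float(match.group(1)), match.group(2)))
    gaps = [
        (round(after[0] - before[0], 6), before[1], after[1])
        for before, after in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]
```

Call it as largest_gaps(open("console.log").read().splitlines()), where console.log is whatever file you captured the log into. The message just before the biggest gap is the place to start looking.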
If you've found out that the problem indeed exists in the kernel, it's likely that you can enable some CONFIG_DEBUG_* configuration item or define CONFIG_DEBUG in the driver to obtain more information. Otherwise, printk will be the best you've got.
Also, take a look at the following kernel configurations:
CONFIG_DEBUG_LL
CONFIG_DEBUG_IMX_UART
CONFIG_DEBUG_IMX6Q_UART
CONFIG_EARLY_PRINTK
CONFIG_EARLY_PRINTK_DIRECT
To be complete: You can make use of logcat to see whether or not some initialisation delays the boot. If your company builds the hardware, I think it pays off to see what the chip is doing with a scope (because I don't immediately think that Linux is delaying the boot), but not before you know for certain that multiple boards have the same problem.
I'm interested in what you will find. Keep me (us) updated ;-)
I'm trying to call a function that is coded in ARM NEON assembly in an .s file that looks like this:
AREA myfunction, code, readonly, ARM
global fun
align 4
fun
push {r4, r5, r6, r7, lr}
add r7, sp, #12
push {r8, r10, r11}
sub r4, sp, #64
bic r4, r4, #15
mov sp, r4
vst1.64 {d8, d9, d10, d11}, [r4]!
vst1.64 {d12, d13, d14, d15}, [r4]
[....]
and I'm assembling it like this:
armasm.exe -32 func.s func.obj
Unfortunately this doesn't work, and I'm getting an illegal instruction exception when I try to call the function. When I used dumpbin.exe to disassemble the .obj, it seems to be disassembling as though it was Thumb code, despite the ARM directive in the assembly (see code above).
I suspect the function is being called in Thumb mode, and that all functions are assumed to be in Thumb mode by default on Windows. Can't seem to find any info on this though.
Does anyone know what is going on here?
EDIT: This happens on Microsoft Surface as well
VS 2012 by default produces Thumb code for both Windows RT and Windows Phone 8, so the error you got is probably caused by calling into ARM code from Thumb code. You have two options:
1. Switch from thumb mode to arm mode before calling your function (you can use BX asm instruction for it), or
2. You can try to rewrite your NEON code in C++ using ARM/NEON intrinsics - they are supported by VS 2012. Just include "arm_neon.h" and you're done.
For the ARM intrinsics reference check out the following link: http://msdn.microsoft.com/en-us/library/hh875058.aspx
For NEON intrinsics reference check out this link: http://infocenter.arm.com/help/topic/com.arm.doc.dui0491c/DUI0491C_arm_compiler_reference.pdf
These NEON intrinsics from the link above are generally supported by VS 2012, there might be some small differences though - if unsure, check the "arm_neon.h" include to find out.
You could start your assembly code with a bx instruction in Thumb mode and simply branch to the ARM part in the same source file.
And you don't have to switch back to Thumb mode at the end since you'll finish the ARM function in bx or pop {pc} anyway which does the switching automatically.
My answer is WAAAAAAY late, but I'm really curious if it works on a WP. (I don't have any)
I'm trying to write a firmware mod (to existing firmware, for which I don't have source code).
All Thumb code.
Does anybody have any idea how to do this in the gcc assembler (GAS):
Use BL without having to manually calculate offsets, when BL'ing to some existing function (not in my code, but I know its address).
Currently, if I want to use BL, I have to:
-go back in my code
-figure out and add up all the bytes that would result from assembling all the previous instructions in the function I'm writing
-add the beginning address of my function to that (I specify the starting address of what I'm writing in the linker script)
-and then subtract the address of the firmfunc function I want to call
All this, just to calculate the offset, to be able to write a bl offset, to call an existing firmware function?
And if I change any code before that BL, I have to do it all over again manually!
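For reference, the manual calculation described above can be scripted. The Python sketch below computes the two halfwords of a Thumb BL from the address of the BL itself and the address of the target. It assumes the classic two-halfword BL encoding with a range of about ±4 MiB (newer Thumb-2 cores extend the range using extra bits that this sketch does not use), and the function name is made up.

```python
def thumb_bl_halfwords(bl_address, target):
    """Return (first, second) halfwords of a Thumb BL placed at `bl_address`."""
    # In Thumb state the PC reads as the instruction address + 4.
    offset = (target & ~1) - (bl_address + 4)
    if offset % 2 or not -(1 << 22) <= offset < (1 << 22):
        raise ValueError("target not reachable with this BL encoding")
    high = 0xF000 | ((offset >> 12) & 0x7FF)   # upper 11 bits of the offset
    low = 0xF800 | ((offset >> 1) & 0x7FF)     # lower 11 bits (halfword units)
    return high, low


# Hypothetical addresses: a BL at 0x8000 calling a function at 0x8004.
print([hex(h) for h in thumb_bl_halfwords(0x8000, 0x8004)])
```

This still has to be re-run whenever the code before the BL changes, which is exactly the chore the question is trying to avoid.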
See, this is why I want to learn to use BX right, instead of BL.
Also, I don't quite understand BX. If I use BX to jump to an absolute address, do I have to increase the actual address by 1 when calling Thumb code from Thumb code (to keep the lsb at 1), so that the CPU will know it's Thumb code?
BIG EDIT:
Changing the answer based on what I have learned recently and a better understanding of the question
First off, I don't know how to tell the linker to generate a bl to an address that is hardcoded and not actually in this code. You might try to rig up an elf file that has labels and such but dummy or no code; I don't know if that will fool the linker or not. You would have to modify the linker script as well. Not worth it.
your other question that was spawned from this one:
Arm/Thumb: using BX in Thumb code, to call a Thumb function, or to jump to a Thumb instruction in another function
For branching this works just fine:
LDR R6, =0x24000
ADD R6, #1 @ (set lsb to 1)
BX R6
or save an instruction and just do this
LDR R6, =0x24001
BX R6
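If it helps, here is a tiny Python model of what BX does with the register value in the two snippets above: bit 0 selects the instruction set and is not part of the branch address. The function name is made up, and the model ignores the corner cases the architecture leaves unpredictable.

```python
def bx(register_value):
    """Return (state, branch_address) for `bx` with the given register value."""
    state = "Thumb" if register_value & 1 else "ARM"
    return state, register_value & ~1   # bit 0 only selects the state


print(bx(0x24001))
print(bx(0x24000))
```

So 0x24001 lands on the Thumb code at 0x24000, while a plain 0x24000 would try to run that same code as ARM.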
if you want to branch link and you know the address and you are in thumb mode and want to get to thumb code then
ldr r6,=0x24001
bl thumb_trampoline
;@ returns here
...
.thumb_func
thumb_trampoline:
bx r6
And almost the exact same if you are starting in arm mode, and want to get to thumb code at an address you already know.
ldr r6,=0x24001
bl arm_trampoline
;@ returns here
...
arm_trampoline:
bx r6
You have to know that you can trash r6 in this way (make sure r6 isnt saving some value being used by some code that called this code).
Very sorry for misleading you with the other answer. I could swear that mov lr,pc pulled in the lsbit as a mode, but it doesn't.
The accepted answer achieves the desired goal, but to address the question exactly as asked, you can use the .equ directive to associate a constant value with a symbol, which can then be used as an operand to instructions. This has the toolchain synthesise the trampoline if/when necessary:
.equ myFirmwareFunction, 0x12346570
.globl _start
mov r0, #42
b myFirmwareFunction
Which generates the following assembly[1]
01000000 <_start>:
1000000: e3a0002a mov r0, #42 ; 0x2a
1000004: eaffffff b 1000008 <__*ABS*0x12346570_veneer>
01000008 <__*ABS*0x12346570_veneer>:
__*ABS*0x12346570_veneer():
1000008: e51ff004 ldr pc, [pc, #-4] ; 100000c <__*ABS*0x12346570_veneer+0x4>
100000c: 12346570 data: #0x12346570
If the immediate value is close enough to PC that the offset will fit in the immediate field, then the veneer (trampoline) is skipped and you will get a single branch instruction to the specified constant address.
[1] using the CodeSourcery (2009q1) toolchain with:
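The "close enough" rule can be checked by hand. Below is a small Python sketch (the function name is made up) that encodes an unconditional ARM-state b when the target is within the roughly ±32 MiB reach of the 24-bit word offset, and returns None when something like a veneer is needed instead.

```python
def arm_b_word(pc, target):
    """Encode an unconditional ARM `b` at `pc`, or return None if out of range."""
    offset = target - (pc + 8)            # ARM-state PC reads as address + 8
    if offset % 4 or not -(1 << 25) <= offset < (1 << 25):
        return None                       # needs a veneer or another indirect jump
    return 0xEA000000 | ((offset >> 2) & 0xFFFFFF)


# The branch at 0x1000004 in the listing above, to the veneer at 0x1000008.
print(hex(arm_b_word(0x1000004, 0x1000008)))
# The firmware address itself is too far away for a direct branch.
print(arm_b_word(0x1000004, 0x12346570))
```

The first call reproduces the eaffffff word from the disassembly, and the second shows why the veneer was generated at all.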
arm-none-eabi-gcc -march=armv7-a -x assembler test.spp -o test.elf -Ttext=0x1000000 -nostdlib
Is it possible using GNU tools (gcc, binutils, etc) to modify all occurrences of an assembly instruction into a no-op? Specifically, gcc with the -pg option generates the following assembly (ARM):
0x0: e1a0c00d mov ip, sp
0x4: e92dd800 stmdb sp!, {fp, ip, lr, pc}
0x8: e24cb004 sub fp, ip, #4 ; 0x4
0xc: ebfffffe bl 0 <mcount>
I want to record the address of this last instruction, and then change it to a nop like in the following code
0x0: e1a0c00d mov ip, sp
0x4: e92dd800 stmdb sp!, {fp, ip, lr, pc}
0x8: e24cb004 sub fp, ip, #4 ; 0x4
0xc: e1a00000 nop (mov r0,r0)
The Linux kernel can do something similar to this at run-time, but I'm looking for a build-time solution.
You can compile the code with gcc -S to output an assembler listing, instead of compiling fully into an object file or executable. Then, just replace the desired instructions with no-ops (e.g. using sed), and continue compilation from there.
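If sed feels too blunt, the same replacement is a few lines of Python. This is a sketch under assumptions: it expects the call to appear on its own line as "bl" followed by the symbol, and the symbol name is a parameter because it may differ between toolchains and ABIs. The function name is made up.

```python
import re


def nop_out_calls(asm_text, symbol="mcount"):
    """Replace every 'bl <symbol>' line with 'mov r0, r0'; return (text, count)."""
    pattern = re.compile(
        r"^([ \t]*)bl[ \t]+" + re.escape(symbol) + r"[ \t]*$",
        re.MULTILINE,
    )
    # Keep the original indentation so the listing stays readable.
    return pattern.subn(r"\1mov\tr0, r0", asm_text)
```

Run it over the .s file produced by gcc -S, write the result back out, and assemble that instead of the original.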
If you also want to do this for object files or libraries that you don't have the original source code for, you'll instead have to use a tool such as objdump(1) to disassemble them and get the addresses of the instructions you wish to replace. Then, parse the object file headers to find the offsets within the file of those instructions, and then replace the machine instructions with no-ops directly in the object files. This is a little trickier, but doable.
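For the binary route, here is a rough Python sketch of the patching step itself. It assumes you have already extracted the raw little-endian ARM code of a fully linked image into a bytearray, and that you know its load address and the address of mcount. In an unlinked object file the bl still has a relocation attached, which would also need to be dealt with. Scanning every word can in principle hit data that happens to look like a bl, so treat the result with care. The function name is made up.

```python
import struct

ARM_NOP = 0xE1A00000  # mov r0, r0


def nop_out_bl(code, base, target):
    """Replace every unconditional `bl target` in `code`; return patched addresses."""
    patched = []
    for offset in range(0, len(code) - 3, 4):
        (word,) = struct.unpack_from("<I", code, offset)
        if word >> 24 != 0xEB:                  # cond = always, opcode = BL
            continue
        imm24 = word & 0xFFFFFF
        if imm24 & 0x800000:                    # sign-extend the word offset
            imm24 -= 1 << 24
        address = base + offset
        if address + 8 + (imm24 << 2) == target:
            struct.pack_into("<I", code, offset, ARM_NOP)
            patched.append(address)             # record where the call used to be
    return patched
```

The returned list gives you the recorded call sites, and the bytearray can be written back over the original section contents.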
This will certainly be easier with a RISC-ish fixed-length instruction format than for e.g. x86.
It should be relatively straightforward to use libelf (nice tutorial here: http://people.freebsd.org/~jkoshy/download/libelf/article.html) or libbfd (http://sourceware.org/binutils/docs-2.19/bfd/index.html) to open the object file, modify instructions within the .text section, and write it out again using provided APIs. Whether it's worth the effort or not will depend on non-technical considerations (I am a bit curious though...).
It's worth mentioning that there might be a few wrinkles with using libelf or libbfd if this needs to work in a cross-development environment.