I am happy to ask questions on Stack Overflow because of the prompt replies from experts worldwide :-) I will try to explain the issue I am facing clearly.
What I wish to do?
I wish to evaluate the NEON instruction set through various examples available online, in order to write some algorithms of my own.
For evaluation purposes, I'm using the memcpy samples available on the official ARM website. Here is the link: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html.
My Environment
I am compiling NEON instructions with Visual Studio 2008 and Platform Builder for Windows CE 7.0. The latest Platform Builder supports NEON instruction compilation.
I am running my code on OMAP3530 Mistral EVM board.
I have created a simple static library (NEONLIB.lib) that contains the NEON instructions to perform the required operation. I have created a simple stream driver (stream_interface.dll) that uses this static library to perform a memcpy operation on a buffer of 1280x720x2 bytes. I am loading and unloading this driver dynamically using a simple application (Neon_Test.exe).
Issue I'm facing
Once the OS boots, I launch this application manually and receive the following exception.
Exception 'Data Abort' (0x4): Thread-Id=047d002a(pth=c049c990), Proc-Id=00400002(pprc=8a3425e0) 'NK.EXE', VM-active=05420012(pprc=c04a1344) 'Neon_Test.exe'
PID:00400002 TID:047D002A PC=ef135120(stream_interface.dll+0x00005120) RA=ef133c18(stream_interface.dll+0x00003c18) SP=d0f3fc84, BVA=00000000
NeonMemcpy is the function in my driver that calls the NEON function.
Stream_Interface.map file
....
0001:000029f0 ?NeonInit@@YAHXZ 100039f0 f Neon_Process.obj
0001:00002bb4 ?NeonMemcpy@@YAXXZ 10003bb4 f Neon_Process.obj
0001:00002c58 NKDbgPrintfW 10003c58 f coredll:COREDLL.dll
0001:00002c68 SetLastError 10003c68 f coredll:COREDLL.dll
....
Neon_Process.cod file
.......
; 108 : MemcpyCustom((void*)g_pOUTVirtualAddr, (void*)g_pINPVirtualAddr, 1280 * 720 * 2);
00050 e5951000 ldr r1,[r5]
00054 e1a04000 mov r4,r0
00058 e5950004 ldr r0,[r5,#4]
0005c e3a02ae1 mov r2,#0xE1000
00060 eb000000 bl MemcpyCustom
; 109 : RETAILMSG(1, (L"Time for Copy using Neon %d\r\n", GetTickCount() - dwStartTime));
00064 eb000000 bl GetTickCount
00068 e1a03000 mov r3,r0
.......
My assembly source
AREA omap_neoncoding, CODE, READONLY
EXPORT MemcpyCustom
INCLUDE omap_neoncoding.inc
MemcpyCustom
stmfd sp!, {r4-r12,lr}
NEONCopyPLD
PLD [r1, #0xC0]
VLDM r1!,{d0-d7}
VSTM r0!,{d0-d7}
SUBS r2,r2,#0x40
BGE NEONCopyPLD
END
Based on the article by Bruce Eitman, http://geekswithblogs.net/BruceEitman/archive/2008/05/19/windows-ce--finding-the-cause-of-a-data-abort.aspx, the location where the exception occurs is:
00064 eb000000 bl GetTickCount
But I am sure that there is no issue in GetTickCount(): if I remove the MemcpyCustom call, everything works fine. I hope I have given all the information needed to sort out this issue. Please help me find the exact reason for the exception. Do I need to perform any steps before calling NEON functions, or are there any special NEON instructions that must be used?
Thanks in advance for your help.
Spark
You are pushing registers in the function's prolog:
stmfd sp!, {r4-r12,lr}
But there is no corresponding pop at the end, and no return instruction. So the execution continues to whatever code happens to be after the function and what happens next is anyone's guess. The following, placed after the BGE should fix the problem:
ldmfd sp!, {r4-r12,pc}
EDIT: By the way, since you're not actually using r4-r12 in the function, you don't need to save them. You also don't need to save d0-d7 as they're considered volatile. So you can remove stmfd and replace ldmfd by just bx lr.
MemcpyCustom
PLD [r1, #0xC0]
VLDM r1!,{d0-d7}
VSTM r0!,{d0-d7}
SUBS r2,r2,#0x40
BGE MemcpyCustom
BX lr
Related
I've got a program that I'm running on an ARM and I'm writing one function of it in assembly. I've made good progress on this, although I've sometimes found it difficult to figure out exactly how to write certain instructions for Go's assembler. For example, I didn't expect a right shift to be written like this:
MOVW R3>>8, R3
Now I want to do a multiply and accumulate (MLA). According to this doc, not all opcodes are supported, so maybe MLA isn't, but I don't know how to tell whether it is or not. I see mentions of MLA with regard to ARM in the golang repo, but I'm not really sure what to make of what I see there.
Is there anywhere that documents what instructions are supported and how to write them? Can anyone give me any useful pointers?
Here is a bit of a scrappy doc I wrote on how to write ARM assembler.
I wrote it from the point of view of an experienced ARM person trying to figure out how Go assembler works.
Here is an excerpt from the start. Feel free to email me if you have more questions!
The Go assembler is based on the Plan 9 assembler, which is documented here:
http://plan9.bell-labs.com/sys/doc/asm.html
Nice introduction to ARM
http://www.davespace.co.uk/arm/introduction-to-arm/index.html
Opcodes
http://simplemachines.it/doc/arm_inst.pdf
Instructions
Destination goes last not first
Parameters seem to be completely reversed
May be condensed to 2 operands, so
ADD r0, r0, r1 ; [ARM] r0 <- r0 + r1
is written as
ADD r1, r0, r0
or
ADD r1, r0
Constants denoted with '$' not '#'
I have an ARM assembly function that is called from a C function.
At some point, I do something like this:
.syntax unified
.arm
.text
.globl myfunc
.extern printf
myfunc:
stmdb sp!, {r4-r11} // save stack from C call
... do stuff ...
// (NOT SHOWN): Load values into r1 and r2 to be printed by format string above
ldr r0, =message // Load format string above
push {lr} // me attempting to preserve my stack
bl printf // actual call to printf
pop {lr} // me attempting to recover my stack
ldmia sp!, {r4-r11} // recover stack from C call
mov r0, r2 // Move return value into r0
mov pc, lr // Return to C
.section data
message:
.asciz "Output: %d, %d\n"
.end
This runs sometimes, crashes sometimes, runs a few times then crashes, etc. It actually runs in a quasi bare-metal context, so I can't run a debugger. I'm 99% sure it's a stack (or alignment?) thing, as per "Printf Change values in registers, ARM Assembly" and "Call C function from Assembly -- the application freezes at 'call printf' and I have no idea why".
Can anyone provide some specific ideas for how to get the above chunk of code running, and perhaps general ideas for best practices here? Ideally I'd like to be able to call the same output function multiple times in my assembly file, to debug things as I go.
Thanks in advance!
I could see the following issues in that code:
Add .align 2 (could be 3 or any higher value) before the function entry point (myfunc:)
.align 2 // guarantee that instruction address is 4B aligned
myfunc:
As was mentioned in the comments, the stack is expected to be 8-byte aligned. push {lr} breaks that.
message: doesn't need to be in the 'data' section. It could be placed in the code section after 'myfunc'. Check the linker map to confirm that the data is actually present and that the address loaded into r0 is correct.
Since this is bare metal, check that the stack is set up properly and that enough room is reserved for it.
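The alignment point can be checked with simple arithmetic. The following is a small Python sketch, not part of the original code: the helper names and the entry value of sp are made up for illustration, and it assumes sp is 8-byte aligned when myfunc is entered.

```python
def sp_after_push(sp, reg_count):
    """Full-descending stack: each pushed core register takes 4 bytes."""
    return sp - 4 * reg_count


def is_call_aligned(sp):
    """True when sp is a multiple of 8, as expected when calling a C function."""
    return sp % 8 == 0


entry_sp = 0x20008000                       # assumed example value, 8-byte aligned
sp = sp_after_push(entry_sp, 8)             # stmdb sp!, {r4-r11}
sp_at_printf = sp_after_push(sp, 1)         # push {lr}
print(is_call_aligned(sp_at_printf))        # False: sp is only 4-byte aligned here

# Pushing an even number of registers in total keeps the alignment.
print(is_call_aligned(sp_after_push(entry_sp, 10)))   # True
```

Saving lr together with the other registers, and padding the list to an even count if needed, would keep sp aligned at the call to printf.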
I'm trying to call a function that is coded in ARM NEON assembly in an .s file that looks like this:
AREA myfunction, code, readonly, ARM
global fun
align 4
fun
push {r4, r5, r6, r7, lr}
add r7, sp, #12
push {r8, r10, r11}
sub r4, sp, #64
bic r4, r4, #15
mov sp, r4
vst1.64 {d8, d9, d10, d11}, [r4]!
vst1.64 {d12, d13, d14, d15}, [r4]
[....]
and I'm assembling it like this:
armasm.exe -32 func.s func.obj
Unfortunately this doesn't work, and I'm getting an illegal instruction exception when I try to call the function. When I used dumpbin.exe to disassemble the .obj, it seemed to disassemble it as though it were Thumb code, despite the ARM directive in the assembly (see code above).
I suspect the function is being called in Thumb mode, and that all functions are assumed to be in Thumb mode by default on Windows. I can't seem to find any info on this though.
Does anyone know what is going on here?
EDIT: This happens on Microsoft Surface as well
VS 2012 by default produces Thumb code for both Windows RT and Windows Phone 8, so the error you got is probably caused by calling into ARM code from Thumb code. You have two options:
1. Switch from thumb mode to arm mode before calling your function (you can use BX asm instruction for it), or
2. You can try to rewrite your NEON code in C++ using ARM/NEON intrinsics - they are supported by VS 2012. Just include "arm_neon.h" and you're done.
For the ARM intrinsics reference check out the following link: http://msdn.microsoft.com/en-us/library/hh875058.aspx
For NEON intrinsics reference check out this link: http://infocenter.arm.com/help/topic/com.arm.doc.dui0491c/DUI0491C_arm_compiler_reference.pdf
These NEON intrinsics from the link above are generally supported by VS 2012, there might be some small differences though - if unsure, check the "arm_neon.h" include to find out.
You could start your assembly code with a bx instruction in Thumb mode and simply branch to the ARM part in the same source file.
And you don't have to switch back to Thumb mode at the end since you'll finish the ARM function in bx or pop {pc} anyway which does the switching automatically.
My answer is WAAAAAAY late, but I'm really curious if it works on a WP. (I don't have any)
I'm trying to write a firmware mod (to existing firmware, for which I don't have source code).
All Thumb code.
Does anybody have any idea how to do this in the gcc as (GAS) assembler:
Use BL without having to manually calculate offsets, when BL'ing to some existing function (not in my code.. but i know its address)
Currently, if I want to use BL, I have to:
- go back in my code
- figure out and add up all the bytes that would result from assembling all the previous instructions in the function I'm writing
- add the beginning address of my function to that (I specify the starting address of what I'm writing in the linker script)
- and then subtract the address of the firmfunc function I want to call
All this, just to calculate the offset, to be able to write a BL offset to call an existing firmware function?
And if I change any code before that BL, I have to do it all over again manually!
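For reference, the manual calculation described above amounts to the following. This is a Python sketch of the arithmetic only: the function name and the example addresses are hypothetical, and it relies on the fact that in Thumb state the PC reads as the address of the current instruction plus 4.

```python
def thumb_bl_offset(func_start, bytes_before_bl, target):
    """Byte offset a Thumb BL at func_start + bytes_before_bl needs to reach target."""
    bl_address = func_start + bytes_before_bl
    pc_value = bl_address + 4            # Thumb state: PC = instruction address + 4
    return (target & ~1) - pc_value      # ignore the Thumb bit if it was included


# Hypothetical numbers: my function starts at 0x30000, the BL sits 0x10 bytes in,
# and the firmware function lives at 0x24000.
print(hex(thumb_bl_offset(0x30000, 0x10, 0x24000)))   # -0xc014
```

Every change to the code before the BL changes bytes_before_bl, which is why doing this by hand is so tedious.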
See, this is why I want to learn to use BX properly, instead of BL.
Also, I don't quite understand BX. If I use BX to jump to an absolute address, do I have to increase the actual address by 1 when calling Thumb code from Thumb code (to keep the lsb set to 1), so that the CPU will know it's Thumb code?
BIG EDIT:
Changing the answer based on what I have learned recently and a better understanding of the question
First off, I don't know how to tell the linker to generate a bl to a hardcoded address that is not actually in this code. You might try to rig up an ELF file that has labels and such but dummy or no code; I don't know whether that will fool the linker or not. You would have to modify the linker script as well. It's not worth it.
Your other question that was spawned from this one:
Arm/Thumb: using BX in Thumb code, to call a Thumb function, or to jump to a Thumb instruction in another function
For branching this works just fine:
LDR R6, =0x24000
ADD R6, #1 # (set lsb to 1)
BX R6
or save an instruction and just do this
LDR R6, =0x24001
BX R6
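To make the role of that lsb explicit: BX uses bit 0 of the register to pick the instruction set and branches to the address with that bit cleared. Here is a tiny Python model of that rule (the function name is made up for illustration):

```python
def bx_decode(register_value):
    """Return (state, destination) the way BX interprets a register value."""
    state = "thumb" if register_value & 1 else "arm"
    destination = register_value & ~1    # bit 0 is a mode flag, not part of the address
    return state, destination


print(bx_decode(0x24001))   # ('thumb', 147456), i.e. Thumb code at 0x24000
print(bx_decode(0x24000))   # ('arm', 147456), the same address entered in ARM state
```

So even when going from Thumb code to Thumb code, the value handed to BX needs the lsb set. Otherwise the core switches to ARM state at the destination.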
If you want to branch with link, you know the address, and you are in Thumb mode and want to get to Thumb code, then:
ldr r6,=0x24001
bl thumb_trampoline
;#returns here
...
.thumb_func
thumb_trampoline:
bx r6
And almost the exact same if you are starting in arm mode, and want to get to thumb code at an address you already know.
ldr r6,=0x24001
bl arm_trampoline
;#returns here
...
arm_trampoline:
bx r6
You have to know that you can trash r6 in this way (make sure r6 isn't holding some value being used by the code that called this code).
Very sorry for misleading you with the other answer. I could have sworn that mov lr,pc pulled in the lsbit as a mode, but it doesn't.
The accepted answer achieves the desired goal, but to address the question exactly as asked, you can use the .equ directive to associate a constant value with a symbol, which can then be used as an operand to instructions. This has the toolchain synthesise the trampoline if/when necessary:
.equ myFirmwareFunction, 0x12346570
.globl _start
_start:
mov r0, #42
b myFirmwareFunction
Which generates the following assembly[1]:
01000000 <_start>:
1000000: e3a0002a mov r0, #42 ; 0x2a
1000004: eaffffff b 1000008 <__*ABS*0x12346570_veneer>
01000008 <__*ABS*0x12346570_veneer>:
__*ABS*0x12346570_veneer():
1000008: e51ff004 ldr pc, [pc, #-4] ; 100000c <__*ABS*0x12346570_veneer+0x4>
100000c: 12346570 data: #0x12346570
If the immediate value is close enough to PC that the offset will fit in the immediate field, then the veneer (trampoline) is skipped and you will get a single branch instruction to the specified constant address.
[1] using the CodeSourcery (2009q1) toolchain with:
arm-none-eabi-gcc -march=armv7-a -x assembler test.spp -o test.elf -Ttext=0x1000000 -nostdlib
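The "close enough" condition mentioned above can be estimated from the encoding of the ARM-state B instruction, which holds a signed 24-bit word offset relative to the instruction address plus 8. The Python sketch below only models that range check. The function name is hypothetical, and the toolchain's actual veneer decision may take more into account.

```python
def arm_b_reaches(branch_address, target):
    """True if an ARM-state B at branch_address can encode a direct branch to target."""
    offset = target - (branch_address + 8)     # ARM state: PC = instruction address + 8
    word_aligned = offset % 4 == 0
    in_range = -(1 << 25) <= offset < (1 << 25)   # signed 24-bit field, in words
    return word_aligned and in_range


print(arm_b_reaches(0x1000004, 0x12346570))   # False, so a veneer is needed
print(arm_b_reaches(0x1000004, 0x1000008))    # True, the branch to the veneer itself
```

With the addresses from the listing, the firmware function is roughly 288 MB away, far beyond the roughly 32 MB reach of a direct branch.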
Is it possible using GNU tools (gcc, binutils, etc) to modify all occurrences of an assembly instruction into a no-op? Specifically, gcc with the -pg option generates the following assembly (ARM):
0x0: e1a0c00d mov ip, sp
0x4: e92dd800 stmdb sp!, {fp, ip, lr, pc}
0x8: e24cb004 sub fp, ip, #4 ; 0x4
0xc: ebfffffe bl 0 <mcount>
I want to record the address of this last instruction, and then change it to a nop like in the following code
0x0: e1a0c00d mov ip, sp
0x4: e92dd800 stmdb sp!, {fp, ip, lr, pc}
0x8: e24cb004 sub fp, ip, #4 ; 0x4
0xc: e1a00000 nop (mov r0,r0)
The Linux kernel can do something similar to this at run-time, but I'm looking for a build-time solution.
You can compile the code with gcc -S to output an assembler listing, instead of compiling fully into an object file or executable. Then, just replace the desired instructions with no-ops (e.g. using sed), and continue compilation from there.
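As a rough illustration of that replacement step, here is a Python sketch of the text substitution. It assumes the call appears in the listing as a line of the form "bl mcount"; the symbol name and exact spelling depend on the compiler version, so treat both as parameters to adjust. The function name is made up.

```python
import re


def nop_out_calls(asm_text, symbol="mcount", replacement="mov r0, r0"):
    """Replace every 'bl <symbol>' line in an assembler listing with a no-op."""
    pattern = re.compile(
        r"^([ \t]*)bl[ \t]+" + re.escape(symbol) + r"\b.*$",
        re.MULTILINE,
    )
    # Keep the original indentation, swap the instruction.
    return pattern.sub(lambda m: m.group(1) + replacement, asm_text)


listing = "\tmov ip, sp\n\tbl mcount\n\tbx lr\n"
print(nop_out_calls(listing))
```

The patched listing can then be passed back to gcc to assemble and link as usual. Recording the addresses of the replaced instructions would need an extra step, for example emitting a label at each replaced line.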
If you also want to do this for object files or libraries that you don't have the original source code for, you'll instead have to use a tool such as objdump(1) to disassemble them and get the addresses of the instructions you wish to replace. Then, parse the object file headers to find the offsets within the file of those instructions, and then replace the machine instructions with no-ops directly in the object files. This is a little trickier, but doable.
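The final patching step could look roughly like the Python sketch below. It assumes a little-endian ARM image and that the file offset of the instruction has already been worked out from the headers. The function name is made up, and the no-op encoding is the mov r0,r0 word shown in the question.

```python
import struct

ARM_NOP = 0xE1A00000   # mov r0, r0, as in the question's listing


def patch_word(image, file_offset, expected, replacement=ARM_NOP):
    """Return a copy of image with the 32-bit word at file_offset replaced."""
    (current,) = struct.unpack_from("<I", image, file_offset)
    if current != expected:
        raise ValueError("unexpected instruction word %#010x" % current)
    patched = bytearray(image)
    struct.pack_into("<I", patched, file_offset, replacement)
    return bytes(patched)


# The four instruction words from the question, standing in for a .text section.
text = struct.pack("<4I", 0xE1A0C00D, 0xE92DD800, 0xE24CB004, 0xEBFFFFFE)
patched = patch_word(text, 0xC, 0xEBFFFFFE)
print([hex(w) for w in struct.unpack("<4I", patched)])
```

Note that in an object file that has not been linked yet, the bl normally carries a relocation entry against mcount. This sketch does not touch relocations, so that entry would have to be removed or neutralised separately.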
This will certainly be easier with a RISC-ish fixed-length instruction format than for e.g. x86.
It should be relatively straightforward to use libelf (nice tutorial here: http://people.freebsd.org/~jkoshy/download/libelf/article.html) or libbfd (http://sourceware.org/binutils/docs-2.19/bfd/index.html) to open the object file, modify instructions within the .text section, and write it out again using provided APIs. Whether it's worth the effort or not will depend on non-technical considerations (I am a bit curious though...).
It's worth mentioning that there might be a few wrinkles with using libelf or libbfd if this needs to work in a cross-development environment.