In my embedded systems class, we were asked to re-code the given C function AbsVal in ARM assembly.
We were told that the best we could do was three lines. I was determined to find a two-line solution and eventually did, but the question I have now is whether I actually decreased performance or increased it.
The C-code:
unsigned long absval(signed long x){
unsigned long int signext;
signext = (x >= 0) ? 0 : -1; //This can be done with an ASR instruction
return (x + signext) ^ signext;
}
The TA/Professor's 3-line solution
ASR R1, R0, #31 ; R1 <- (x >= 0) ? 0 : -1
ADD R0, R0, R1 ; R0 <- R0 + R1
EOR R0, R0, R1 ; R0 <- R0 ^ R1
My 2-line solution
ADD R1, R0, R0, ASR #31 ; R1 <- x + ((x >= 0) ? 0 : -1)
EOR R0, R1, R0, ASR #31 ; R0 <- R1 ^ ((x >= 0) ? 0 : -1)
There are a couple of places I can see potential performance differences:
The addition of one extra Arithmetic Shift Right call
The removal of one memory fetch
So, which one is actually faster? Does it depend upon the processor or memory access speed?
Here is another two-instruction version:
cmp r0, #0
rsblt r0, r0, #0
Which translates to this simple code:
if (r0 < 0)
{
r0 = 0-r0;
}
That code should be pretty fast, even on modern ARM-CPU cores like the Cortex-A8 and A9.
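As a rough cross-check, the two formulations discussed so far (add-then-XOR with a sign mask, and conditional negate) can be written side by side in C. This is only a sketch: the function names abs_branchless and abs_conditional are made up for illustration, and the arithmetic is done on unsigned values so that the most negative input wraps instead of overflowing.

```c
#include <stdint.h>

/* Branchless form: mask is 0 for x >= 0 and all ones for x < 0,
   which is what ASR #31 produces on ARM. */
uint32_t abs_branchless(int32_t x)
{
    uint32_t ux = (uint32_t)x;
    uint32_t mask = (x < 0) ? UINT32_MAX : 0u;
    return (ux + mask) ^ mask;
}

/* Conditional-negate form, mirroring cmp / rsblt. */
uint32_t abs_conditional(int32_t x)
{
    uint32_t ux = (uint32_t)x;
    return (x < 0) ? (0u - ux) : ux;
}
```

Both functions should agree for every 32-bit input, including INT32_MIN, whose absolute value only fits in the unsigned return type.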
Dive over to ARM.com and grab the Cortex-M3 datasheet. Section 3.3.1 on page 3-4 has the instruction timings. Fortunately they're quite straightforward on the Cortex-M3.
We can see from those timings that in a perfect 'no wait state' system your professor's example takes 3 cycles:
ASR R1, R0, #31 ; 1 cycle
ADD R0, R0, R1 ; 1 cycle
EOR R0, R0, R1 ; 1 cycle
; total: 3 cycles
and your version takes two cycles:
ADD R1, R0, R0, ASR #31 ; 1 cycle
EOR R0, R1, R0, ASR #31 ; 1 cycle
; total: 2 cycles
So yours is, theoretically, faster.
You mention "The removal of one memory fetch", but is that true? How big are the respective routines? Since we're dealing with Thumb-2 we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:
Their version (adjusted for UAL syntax):
.syntax unified
.text
.thumb
abs:
asrs r1, r0, #31
adds r0, r0, r1
eors r0, r0, r1
Assembles to:
00000000 17c1 asrs r1, r0, #31
00000002 1840 adds r0, r0, r1
00000004 4048 eors r0, r1
That's 3x2 = 6 bytes.
Your version (again, adjusted for UAL syntax):
.syntax unified
.text
.thumb
abs:
add.w r1, r0, r0, asr #31
eor.w r0, r1, r0, asr #31
Assembles to:
00000000 eb0071e0 add.w r1, r0, r0, asr #31
00000004 ea8170e0 eor.w r0, r1, r0, asr #31
That's 2x4 = 8 bytes.
So instead of removing a memory fetch you've actually increased the size of the code.
But does this affect performance? My advice would be to benchmark.
I am trying to get the final accumulate in the code below to use the ARM M7 SMLAL 32*32->64 bit accumulate instruction. If I include the T3 = T3 + 1 then it does use this, but if I comment it out it does a full 64*64 bit multiply and accumulate using 3 multiply and 2 add instructions. I don't actually want to add 1 to T3, so it needs to go.
I've broken the code down so that I could analyse it in more detail, and it definitely seems that the cast of T3 to int32_t, which throws away the bottom 32 bits of the multiply, isn't being picked up by the compiler, so it thinks T3 still has 64 bits. But when I add the simple increment of T3 it gets it correct. I tried adding zero, but then it goes back to the full 64*64 bit multiply.
I'm using the -O2 optimisation level in STM's STM32CubeIDE, which uses a version of GCC. Other optimisation levels either never use SMLAL or unroll everything.
int64_t T4 = 0;
osc = key * NumHarmonics;
harmonic = 0;
do
{
if (OscLevel[osc] > 1)
{
OscPhase[osc] = OscPhase[osc] + (uint32_t)(T2);
int32_t T5 = Sine[(OscPhase[osc] >> 16) & 0x0000FFFF];
int64_t T6 = (int64_t)T1 * Tremelo[harmonic];
int32_t T3 = (int32_t)(T6 >> 32); // grab the most significant register
// T3 = T3 + 1; // needs the +1 to force use of SMLAL in next instruction ! (+0 doesn't help)
T4 = T4 + (int64_t)T3 * (int64_t)T5; // should be SMLAL but does a full 64*64 mult if no +1 above
}
osc++;
harmonic++;
}
while (harmonic < NumHarmonics);
OscTotal = T4;
Without the addition:
800054e: 4b13 ldr r3, [pc, #76] ; (800059c <main+0xd8>)
8000550: f853 1024 ldr.w r1, [r3, r4, lsl #2]
8000554: ea4f 79e1 mov.w r9, r1, asr #31
8000558: fba7 4501 umull r4, r5, r7, r1
800055c: fb07 f309 mul.w r3, r7, r9
8000560: fb01 3202 mla r2, r1, r2, r3
8000564: 4415 add r5, r2
8000566: e9dd 2300 ldrd r2, r3, [sp]
800056a: 1912 adds r2, r2, r4
800056c: 416b adcs r3, r5
800056e: e9cd 2300 strd r2, r3, [sp]
}
osc++;
8000572: 3001 adds r0, #1
harmonic++;
With the addition:
8000542: 4b0b ldr r3, [pc, #44] ; (8000570 <main+0xac>)
8000544: f853 3020 ldr.w r3, [r3, r0, lsl #2]
8000548: fbc3 6701 smlal r6, r7, r3, r1
}
osc++;
800054c: 3201 adds r2, #1
harmonic++;
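For reference, the two steps in question (take the high word of a 32*32 product, then do a 32*32->64 multiply-accumulate) can be isolated as below. This is only a sketch of the intended arithmetic: the helper names mul_high32 and mac32 are hypothetical, and whether a given GCC version turns mac32 into a single SMLAL depends on the compiler and flags, so it is worth checking the disassembly.

```c
#include <stdint.h>

/* High 32 bits of a signed 32*32 product. Right-shifting a negative
   int64_t is implementation-defined in C; GCC shifts arithmetically. */
int32_t mul_high32(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> 32);
}

/* 64-bit accumulate of a signed 32*32 product. Both operands are
   int32_t, so only a 32*32->64 multiply is needed. */
int64_t mac32(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}
```

Keeping the operands as int32_t function parameters makes the narrow operand width explicit, which is the property the T3 cast in the loop is meant to express.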
I'm a complete beginner; I've only been learning this for about 12 hours. I was wondering why my loop is not stopping. Can you help me find what is wrong? I know this code is garbage, so please bear with me.
Our task is to ask the user to input a string of at most 80 characters that ends with a period, since that is how we know we have reached the end of the string. The program should count the characters and words and display the counts, but in my case the program doesn't stop. Please help.
.ORIG X3000
LEA R0, PROMPT_ENTER ;Message for entering number.
PUTS
LEA R2, SENTENCE ;allocated memory
AND R3, R3, #0 ;setting R3 to zero for word counter.
ADD R3, R3, #1
AND R1, R1, #0 ;setting R1 to zero for char counter.
;---------ASKING USER TO INPUT A SENTENCE------
GET_USER_INPUT: ;loop for getting characters.
GETC
OUT
STR R0, R2, #0 ;r0 -> ( memory address stored in r2 + 0 )
PUT
ADD R2, R2, #1 ;increments the memory pointer
ADD R0, R0, #-10 ;decrements loop to proceed when pressed enter.
BRz COUNT_LENGTH
BRnp GET_USER_INPUT
;--------Element counter----
COUNT_LENGTH:
AND R0, R0, #0
LEA R4, SENTENCE
LDR R0, R4, #0
ADD R0, R0, #-10
BRz EMPTY
BRnp COUNT_ELEMENTS
EMPTY:
AND R0, R0, #0
LEA R0, PROMPT_NULL
PUTS
HALT
COUNT_ELEMENTS:
AND R0, R0, #0
LEA R4, SENTENCE
LDR R0, R4, #0
LD R6, TMNT
ADD R0, R0, R6
BRz END_OF_SENTENCE
LDR R0, R4, #0
LD R6, SPACE
ADD R0, R0, R6
BRz WORD_COUNT
ADD R4, R4, #1
ADD R1, R1, #1
BRnp COUNT_ELEMENTS
WORD_COUNT:
ADD R4, R4, #1
ADD R3, R3, #1
JSR COUNT_ELEMENTS
END_OF_SENTENCE:
AND R0, R0, #0
LDR R3, R3, #0
LD R5, ASCII
ADD R0, R0, R5
OUT
AND R0, R0, #0
LDR R1, R1, #0
ADD R0, R0, R1
OUT
HALT
SENTENCE .BLKW #80 ;initialize the array named sentence with length 80
TMNT .fill #-89
SPACE .fill #-32
ASCII .fill #48
;----MESSAGES------
PROMPT_ENTER .stringz "Enter the word(maximum 80 characters): \n"
PROMPT_AGAIN .stringz "Do you want to try again? Y/N: \n"
PROMPT_NULL .stringz "Error: Please enter a sentence!"
PROMPT_NOTMNT .stringz "Error: No terminating symbol (.) is expected at the end!"
PROMPT_DSPACE .stringz "Error: Multiple white space is not allowed!"
.END
I've only skimmed through this code.
It is an infinite loop because you reset R4 to point to the start of SENTENCE in each iteration of COUNT_ELEMENTS.
I can see in your code where you are incrementing R4 before going back to COUNT_ELEMENTS. (By the way, JSR is only used to call a subroutine; if you want to branch unconditionally, use BR.)
You'd want to set R4 to point to SENTENCE only once. I believe you can simply remove the LEA R4, SENTENCE within COUNT_ELEMENTS, since R4 was already set as part of COUNT_LENGTH.
In the future I would recommend loading your code into an LC-3 simulator and stepping through it, examining the values of the registers as you go.
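The same loop structure can be sketched in C to show the intended shape: the pointer is set once before the loop and only advanced inside it. The function name count_sentence is made up for illustration, and the counting rules (words start at 1, each space adds a word, every other character before the period adds to the character count) are an assumption based on the question's code.

```c
/* Count characters and words up to the terminating period. */
void count_sentence(const char *s, int *chars, int *words)
{
    const char *p = s;  /* set once, like a single LEA R4, SENTENCE */
    *chars = 0;
    *words = 1;
    while (*p != '\0' && *p != '.') {
        if (*p == ' ')
            (*words)++;   /* a space starts a new word */
        else
            (*chars)++;   /* any other character is counted */
        p++;              /* advance; never reset inside the loop */
    }
}
```

Resetting p to s at the top of the loop body would reproduce the infinite loop, because the first character would be examined forever.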
On ARM, GCC usually uses a PC-relative load to load constants into registers. The idea is that the constant is stored at a fixed offset from the instruction that loads it. E.g. the following instruction can be used to load a constant from the address PC+8+offset, where PC is the address of the ldr instruction itself:
ldr r0, [pc, #offset]
As a result, the .text segment interleaves instructions and data. The latter is usually stored at the end of the function's code. E.g.
00010860 <call_weak_fn>:
10860: e59f3014 ldr r3, [pc, #20] ; 1087c <call_weak_fn+0x1c>
10864: e59f2014 ldr r2, [pc, #20] ; 10880 <call_weak_fn+0x20>
10868: e08f3003 add r3, pc, r3
1086c: e7932002 ldr r2, [r3, r2]
10870: e3520000 cmp r2, #0
10874: 012fff1e bxeq lr
10878: e1a00000 nop ; (mov r0, r0)
1087c: 00089790 muleq r8, r0, r7
10880: 00000074 andeq r0, r0, r4, ror r0
For a research project, I would like to ensure that code and constants never reside on the same cache line (i.e. the same 64-byte-aligned block).
Is it possible to align the constants generated by GCC?
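The PC+8+offset rule and the cache-line condition can be checked against the listing with a little arithmetic. This is only a sketch: the helper names literal_address and same_cache_line are hypothetical, and the 64-byte line size is the one assumed in the question.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* line size assumed in the question */

/* Address read by an ARM-mode "ldr rX, [pc, #offset]" located at
   instr_addr: the PC reads as the instruction address plus 8. */
uint32_t literal_address(uint32_t instr_addr, uint32_t offset)
{
    return instr_addr + 8u + offset;
}

/* Returns 1 if both addresses fall in the same 64-byte-aligned block. */
int same_cache_line(uint32_t a, uint32_t b)
{
    return (a / CACHE_LINE_BYTES) == (b / CACHE_LINE_BYTES);
}
```

For the listing above, literal_address(0x10860, 20) gives 0x1087c, which shares a 64-byte block with the instruction at 0x10860, while the constant at 0x10880 starts a new block.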
I need to specify each boolean manually, like in a fixed table. Using
Array: .skip 400
I will be declaring an array of 400 bytes, so how can I set the boolean values?
ARM registers are 32 bits each. You only need a bit to represent a boolean. So you can use the following 'C' code to access an array,
uint32_t load_bool(uint32_t index)
{
return (bool_array[index>>2] & (1<<(index&3)));
}
void store_bool(uint32_t index, int value)
{
uint32_t target = bool_array[index>>2];
if(value)
target |= (1<<(index&3));
else
target &= ~(1<<(index&3));
bool_array[index>>2] = target;
}
Use a compiler to target your CPU; for instance tuning godbolt output on a Cortex-A5 gives,
load_bool(unsigned int):
ldr r3, =bool_array
mov r2, r0, lsr #2
ldr r3, [r3, r2, asl #2]
and r0, r0, #3
mov r2, #1
and r0, r3, r2, asl r0
bx lr
store_bool(unsigned int, int):
ldr r3, =bool_array
mov r2, r0, lsr #2
cmp r1, #0
ldr r1, [r3, r2, asl #2]
and r0, r0, #3
mov ip, #1
orrne r0, r1, ip, asl r0
biceq r0, r1, ip, asl r0
str r0, [r3, r2, asl #2]
bx lr
The instructions tst, bic, etc. might be useful if you choose a macro instead of a function call (bit index known at compile/assemble time). Also, ldrb or byte access might be better on older platforms/CPUs. Most ARM CPUs have a 32-bit bus, so the cycles for ldrb and ldr are equal.
Boolean variables in C and C++ are basically treated as a native integer assigned 1 for true and 0 for false; in ARM's case it would be a 32-bit integer. So if you need to access the structure as an array of Booleans in C/C++, you would need to access them as 32-bit integers aligned on a 4-byte boundary. However, if you only need to access it from other assembly code, you can use each byte as its own boolean variable and simply manipulate the array at the byte level.
In ARM assembly, this would be the difference between accessing the array with LDR vs with LDRB.
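The two layouts discussed in these answers can be compared in a short C sketch. All names here (byte_flags, packed_flags and the load/store helpers) are hypothetical. The packed variant assumes 32 flags per word, so it uses index >> 5 and index & 31, which differs from the shift amounts in the earlier snippet.

```c
#include <stdint.h>

#define NUM_FLAGS 400u

/* One byte per flag, matching a 400-byte ".skip 400" table. */
static uint8_t byte_flags[NUM_FLAGS];

/* One bit per flag, 32 flags per word (13 words for 400 flags). */
static uint32_t packed_flags[(NUM_FLAGS + 31u) / 32u];

void byte_store(uint32_t index, int value)
{
    byte_flags[index] = value ? 1u : 0u;  /* a single STRB on ARM */
}

int byte_load(uint32_t index)
{
    return byte_flags[index];             /* a single LDRB on ARM */
}

void packed_store(uint32_t index, int value)
{
    uint32_t bit = UINT32_C(1) << (index & 31u);
    if (value)
        packed_flags[index >> 5] |= bit;
    else
        packed_flags[index >> 5] &= ~bit;
}

int packed_load(uint32_t index)
{
    return (int)((packed_flags[index >> 5] >> (index & 31u)) & 1u);
}
```

The byte layout costs 400 bytes but needs no masking; the packed layout fits in 52 bytes at the price of a read-modify-write on every store.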
I'm trying to count the number of characters in the LC-3 simulator and keep getting "a trap was executed with an illegal vector number".
These are the object files I run:
charcount.obj:
0011000000000000
0101010010100000
0010011000010000
1111000000100011
0110001011000000
0001100001111100
0000010000001000
1001001001111111
0001001001100001
0001001001000000
0000101000000001
0001010010100001
0001011011100001
0110001011000000
0000111111110110
0010000000000100
0001000000000010
1111000000100001
1111000000100101
1110001011111111
0000000000110000
and verse:
.ORIG x3100
.STRINGZ "Simple Simon met a pieman,"
.STRINGZ "Going to the fair;"
.STRINGZ "Says Simple Simon to the pieman,"
.STRINGZ "Let me taste your ware."
.FILL x04
.END
Looks like we're going to need more information before we can help you much. I understand you've provided us with some binary, and I ran that through the LC-3 simulator. Here's where I'm a bit lost: which string would you like to count, and where is it stored?
After trying to piece together what you've provided, here's what I've found.
Registers:
R0 x0061 97
R1 x0000 0
R2 x0000 0
R3 xE2FF -7425
R4 x0000 0
R5 x0000 0
R6 x0000 0
R7 x3003 12291
PC x3004 12292
IR x62C0 25280
CC Z
Memory:
x3000 0101010010100000 x54A0 AND R2, R2, #0
x3001 0010011000010000 x2610 LD R3, x3012
x3002 1111000000100011 xF023 TRAP IN
x3003 0110001011000000 x62C0 LDR R1, R3, #0
x3004 0001100001111100 x187C ADD R4, R1, #-4
x3005 0000010000001000 x0408 BRZ x300E
x3006 1001001001111111 x927F NOT R1, R1
x3007 0001001001100001 x1261 ADD R1, R1, #1
x3008 0001001001000000 x1240 ADD R1, R1, R0
x3009 0000101000000001 x0A01 BRNP x300B
x300A 0001010010100001 x14A1 ADD R2, R2, #1
x300B 0001011011100001 x16E1 ADD R3, R3, #1
x300C 0110001011000000 x62C0 LDR R1, R3, #0
x300D 0000111111110110 x0FF6 BRNZP x3004
x300E 0010000000000100 x2004 LD R0, x3013
x300F 0001000000000010 x1002 ADD R0, R0, R2
x3010 1111000000100001 xF021 TRAP OUT
x3011 1111000000100101 xF025 TRAP HALT
x3012 1110001011111111 xE2FF LEA R1, x3112
x3013 0000000000110000 x0030 NOP
The values displayed in the registers are what I get when I stop after the instruction at x3003. The LD at x3001 loads the contents of memory location x3012, so the literal value xE2FF gets loaded into register R3. After that, the value of 0 at memory location xE2FF is loaded into register R1, and the problems mount from there.
I would recommend displaying your asm code and then commenting each line so we can better understand what you're trying to accomplish.
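To see where the x3012 in the disassembly comes from, the PC-relative fields of an LC-3 instruction word can be decoded by hand. The sketch below uses hypothetical helper names and relies only on the standard LC-3 encoding: a destination register in bits 11..9 and a 9-bit signed offset in bits 8..0, added to the incremented PC.

```c
#include <stdint.h>

/* Destination register field (bits 11..9) of an LD/LEA/LDI word. */
unsigned lc3_dest_register(uint16_t instr)
{
    return (instr >> 9) & 0x7u;
}

/* Target of a PC-relative instruction stored at instr_addr:
   (instr_addr + 1) + sign-extended PCoffset9. */
uint16_t lc3_pc_relative_target(uint16_t instr_addr, uint16_t instr)
{
    int32_t offset = instr & 0x1FF;
    if (offset & 0x100)
        offset -= 0x200;  /* sign-extend the 9-bit field */
    return (uint16_t)(instr_addr + 1 + offset);
}
```

For the word x2610 at x3001 this gives register 3 and target x3012, so LD copies whatever is stored at x3012 (here xE2FF) into R3 rather than an address.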