LC3, Store a Value of a Register to a Memory Location

I'm attempting to write a short LC-3 program that initializes R1=5 and R2=16, computes the sum of R1 and R2, and puts the result in memory location x4000. The program is supposed to start at x3000. Unfortunately, I have to write it in binary form.
This is what I have so far...
.orig x3000          ; Program starts at x3000
0101 001 001 1 00000 ;R1 <- R1 AND x0000
0001 001 001 1 00101 ;R1 <- R1 + x0005
0101 010 010 1 00000 ;R2 <- R2 AND x0000
0001 010 010 1 01000 ;R2 <- R2 + x0008
0001 010 010 1 01000 ;R2 <- R2 + x0008
0001 011 010 0 00 001 ;R3 <- R2 + R1
//This last step is where I'm struggling...
I was thinking of using ST, and I figured PCOFFSET9 to be 994, but I can't represent that using 8 bits... so how else would I do this? Is my code inefficient?
0011 011

The ST instruction can only reach addresses within a 9-bit signed PC-relative offset, that is, from -256 to +255 words from the incremented PC. For a target as far away as x4000 you will need to use the STI instruction (Store Indirect). The sample code below will help explain how to use STI.
.orig x3000
AND R1, R1, #0 ; Clear R1
ADD R1, R1, #5 ; Store 5 into R1
AND R2, R2, #0 ; Clear R2
ADD R2, R2, #8 ; R2 = 8 (the 5-bit immediate only holds -16..15)
ADD R2, R2, #8 ; R2 = 16
ADD R3, R2, R1 ; R3 = R2 + R1
STI R3, STORE_x4000 ; Store the value of R3 into mem[x4000]
HALT ; TRAP x25 end the program
; Variables
STORE_x4000 .FILL x4000
.END
You will need to make the appropriate conversions to binary, but if you plug the code into the LC-3 simulator it will give you the binary representation.

Related

#LC3 Trying to Subtract two number, but the result comes out is not correct

MySUB NOT R4,R3 | Flip R3 and load into r4
ADD R1,R4,#1 | ADD R4 with 1 and load into r1
ADD R3,R2,R1 | ADD R2 with R1 and load into R3
RET |Output R3
So R2 and R3 are given in the data, and R3's value is the output value.
My output:
009-001=008
004+004=008
008+002=010
003+002=005
001+000=001
005+001=006
009+003=012
008+004=012
000+000=000
002+001=003
The Output should look like:
009-001=008
004-004=000
008-002=006
003-002=001
001-000=001
005-001=004
009-003=006
008-004=004
000-000=000
002-001=001

Why is ARM gcc calling __udivsi3 when dividing by a constant?

I'm using the latest available version of ARM-packaged GCC:
arm-none-eabi-gcc (GNU Arm Embedded Toolchain 10-2020-q4-major) 10.2.1 20201103 (release)
Copyright (C) 2020 Free Software Foundation, Inc.
When I compile this code using "-mcpu=cortex-m0 -mthumb -Ofast":
int main(void) {
uint16_t num = (uint16_t) ADC1->DR;
ADC1->DR = num / 7;
}
I would expect that the division would be accomplished by a multiplication and a shift, but instead this code is being generated:
08000b5c <main>:
8000b5c: b510 push {r4, lr}
8000b5e: 4c05 ldr r4, [pc, #20] ; (8000b74 <main+0x18>)
8000b60: 2107 movs r1, #7
8000b62: 6c20 ldr r0, [r4, #64] ; 0x40
8000b64: b280 uxth r0, r0
8000b66: f7ff facf bl 8000108 <__udivsi3>
8000b6a: b280 uxth r0, r0
8000b6c: 6420 str r0, [r4, #64] ; 0x40
8000b6e: 2000 movs r0, #0
8000b70: bd10 pop {r4, pc}
8000b72: 46c0 nop ; (mov r8, r8)
8000b74: 40012400 .word 0x40012400
Using __udivsi3 instead of multiply and shift is terribly inefficient. Am I using the wrong flags, or missing something else, or is this a GCC bug?

The Cortex-M0 lacks instructions to perform a 32x32->64-bit multiply. Because num is an unsigned 16-bit quantity, a 32-bit multiply by a suitable constant followed by a shift could replace the division (multiplying by 9363 and shifting right 16 is correct for inputs below 0x3336, and the answer below works out constants that cover the full 16-bit range), but gcc does not include such optimizations, likely because a uint16_t is promoted to int before the multiply.
From what I've observed, gcc does a generally poor job of optimizing for the Cortex-M0, failing to employ some straightforward optimizations which would be appropriate for that platform, but sometimes employing "optimizations" which aren't. Given something like
void test1(uint8_t *p)
{
for (int i=0; i<32; i++)
p[i] = (p[i]*9363) >> 16; // Divide by 7
}
gcc happens to generate okay code for the Cortex-M0 at -O2, but if the multiplication were replaced with an addition the compiler would generate code which reloads the constant 9363 on every iteration of the loop. When using addition, even if the code were changed to:
void test2(uint16_t *p)
{
register unsigned u9363 = 9363;
for (int i=0; i<32; i++)
p[i] = (p[i]+u9363) >> 16;
}
gcc would still bring the load of the constant into the loop. Sometimes gcc's optimizations may also have unexpected behavioral consequences. For example, one might expect that on a platform like a Cortex-M0, invoking something like:
unsigned short test(register unsigned short *p)
{
register unsigned short temp = *p;
return temp - (temp >> 15);
}
while an interrupt changes the contents of *p might yield behavior consistent with the old value or the new value. The Standard wouldn't require such treatment, but most implementations intended to be suitable for embedded programming tasks will offer stronger guarantees than what the Standard requires. If either the old or new value would be equally acceptable, letting the compiler use whichever is more convenient may allow more efficient code than using volatile. As it happens, however, the "optimized" code from gcc will replace the two uses of temp with separate loads of *p.
If you're using gcc with the Cortex-M0 and are at all concerned about performance or the possibility of "astonishing" behaviors, get in the habit of inspecting the compiler's output. For some kinds of loop, it might even be worth considering testing out -O0. If code makes suitable use of the register keyword, its performance can sometimes beat that of identical code processed with -O2.

Expanding on supercat's answer.
Feed this:
unsigned short fun ( unsigned short x )
{
return(x/7);
}
to something with a larger multiply:
00000000 <fun>:
0: e59f1010 ldr r1, [pc, #16] ; 18 <fun+0x18>
4: e0832190 umull r2, r3, r0, r1
8: e0400003 sub r0, r0, r3
c: e08300a0 add r0, r3, r0, lsr #1
10: e1a00120 lsr r0, r0, #2
14: e12fff1e bx lr
18: 24924925 .word 0x24924925
1/7 in binary (long division):
    0.001001001001001
111)1.000000
      111
     ====
        1000
         111
         ===
           1
0.001001001001001001001001001001
0.0010 0100 1001 0010 0100 1001 001001
0x2492492492...
0x24924925>>32 (rounded up)
For this to work you need a 64 bit result, you take the top half and do some adjustments, so for example:
7 * 0x24924925 = 0x100000003
and you take the top 32 bits (it is not quite this simple, but for this value you can see it working).
The Thumb-only multiply is 32 bits = 32 bits * 32 bits, so the result would be 0x00000003 and that does not work.
So shorten the constant to 16 bits: 0x24924... becomes either 0x2493 (rounded up, as supercat did) or 0x2492 (truncated).
Now we can use the 32x32 = 32 bit multiply:
0x2492 * 7 = 0x0FFFE
0x2493 * 7 = 0x10005
Let's run with the one larger:
0x100000000/0x2493 is a number greater than 65536, so the 32-bit product cannot overflow for any 16-bit input. So that is fine.
but:
0x3335 * 0x2493 = 0x0750DB6F
0x3336 * 0x2493 = 0x07510002
0x3335 / 7 = 0x750
0x3336 / 7 = 0x750
So you can only get so far with that approach.
If we follow the model of the arm code:
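#include <stdio.h>

int main(void)
{
    unsigned int ra, rb, rc, rd;

    for(ra=0;ra<0x10000;ra++)
    {
        rb=0x2493*ra;      /* 32 = 32x32 bit multiply */
        rd=rb>>16;         /* estimate of ra/7, can be one too large */
        rb=ra-rd;
        rb=rd+(rb>>1);
        rb>>=2;            /* corrected quotient */
        rc=ra/7;           /* reference result */
        printf("0x%X 0x%X 0x%X \n",ra,rb,rc);
        if(rb!=rc) break;  /* stop at the first mismatch */
    }
    return 0;
}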
Then it works from 0x0000 to 0xFFFF, so you could write the asm to do that (note it needs to be 0x2493 not 0x2492).
If you know the operand is not going above a certain value then you can use more bits of 1/7th to multiply against.
In any case when the compiler does not do this optimization for you then you might still have a chance yourself.
Now that I think about it, I ran into this before, and now it makes sense. I was on a full-sized ARM and called a routine that I compiled in ARM mode (the other code was in Thumb mode). The routine was basically a switch statement: if denominator = 1 then result = x/1; if denominator = 2 then result = x/2; and so on. That avoided the libgcc function and generated the 1/x multiplies (I had 3 or 4 different constants to divide by). The same steps written out in C:
unsigned short udiv7 ( unsigned short x )
{
unsigned int r0;
unsigned int r3;
r0=x;
r3=0x2493*r0;
r3>>=16;
r0=r0-r3;
r0=r3+(r0>>1);
r0>>=2;
return(r0);
}
Assuming I made no mistakes:
00000000 <udiv7>:
0: 4b04 ldr r3, [pc, #16] ; (14 <udiv7+0x14>)
2: 4343 muls r3, r0
4: 0c1b lsrs r3, r3, #16
6: 1ac0 subs r0, r0, r3
8: 0840 lsrs r0, r0, #1
a: 18c0 adds r0, r0, r3
c: 0883 lsrs r3, r0, #2
e: b298 uxth r0, r3
10: 4770 bx lr
12: 46c0 nop ; (mov r8, r8)
14: 00002493 .word 0x00002493
That should be faster than a generic division library routine.
Edit
I think I see what supercat has done with the solution that works:
((i*37449 + 16384u) >> 18)
We have this as the 1/7th fraction:
0.001001001001001001001001001001
but we can only do a 32 = 32x32 bit multiply. The leading zeros give us some breathing room we might be able to take advantage of. So instead of 0x2492/0x2493 we can try:
1001001001001001
0x9249
0x9249*0xFFFF = 0x92486db7
And so far it won't overflow:
rb=((ra*0x9249) >> 18);
by itself it fails at 7 * 0x9249 = 0x3FFFF, 0x3FFFF>>18 is zero not 1.
So maybe
rb=((ra*0x924A) >> 18);
that fails at:
0xAAAD 0x1862 0x1861
So what about:
rb=((ra*0x9249 + 0x8000) >> 18);
and that works.
What about supercat's?
rb=((ra*0x9249 + 0x4000) >> 18);
and that runs clean for all values 0x0000 to 0xFFFF. With a smaller bias:
rb=((ra*0x9249 + 0x2000) >> 18);
and that fails here:
0xE007 0x2000 0x2001
So there are a couple of solutions that work.
unsigned short udiv7 ( unsigned short x )
{
unsigned int ret;
ret=x;
ret=((ret*0x9249 + 0x4000) >> 18);
return(ret);
}
00000000 <udiv7>:
0: 4b03 ldr r3, [pc, #12] ; (10 <udiv7+0x10>)
2: 4358 muls r0, r3
4: 2380 movs r3, #128 ; 0x80
6: 01db lsls r3, r3, #7
8: 469c mov ip, r3
a: 4460 add r0, ip
c: 0c80 lsrs r0, r0, #18
e: 4770 bx lr
10: 00009249 .word 0x00009249
Edit
As far as the "why" question goes, that is not a Stack Overflow question; if you want to know why gcc doesn't do this, ask the authors of that code. All we can do is speculate here and the speculation is they may either have chosen not to because of the number of instructions or they may have chosen not to because they have an algorithm that states because this is not a 64 = 32x32 bit multiply then do not bother.
Again the why question is not a Stack Overflow question, so perhaps we should just close this question and delete all of the answers.
I found this to be incredibly educational (once you know/understand what was being said).
Another WHY? question is why did gcc do it the way they did it when they could have done it the way supercat or I did it?

The compiler can only rearrange integer expressions if it knows that the result will be correct for any input allowed by the language.
Because 7 is co-prime to 2, 1/7 has no exact binary representation, so a multiply and shift can only approximate the division; the result is exact only if the constant carries enough bits for the range of inputs.
If you know that it is possible for the input that you intend to provide, then you have to do it yourself using the multiply and shift operators.
Depending on the size of the input, you will have to choose how much to shift so that the output is correct (or at least good enough for your application) and so that the intermediate doesn't overflow. The compiler has no way of knowing what is accurate enough for your application, or what your maximum input will be. If it has to allow any input up to the maximum of the type, the intermediate product may not fit in the width of multiply that the target provides.
GCC will always turn division by a power of two into a shift. For other constant divisors it generally uses a reciprocal multiply followed by shifts only when the target has a wide enough multiply, as in the umull output shown in the earlier answer.

Local variable location from DWARF info in ARM

I have a C program in file delay.c:
void delay(int num)
{
volatile int i;
for(i=0; i<num; i++);
}
Then I compile the program with gcc 4.6.3 on ARM emulator (armel, more specifically) with command gcc -g -O1 -o delay.o delay.c. The assembly in delay.o is:
00000000 <delay>:
0: e24dd008 sub sp, sp, #8
4: e3a03000 mov r3, #0
8: e58d3004 str r3, [sp, #4]
c: e59d3004 ldr r3, [sp, #4]
10: e1500003 cmp r0, r3
14: da000005 ble 30 <delay+0x30>
18: e59d3004 ldr r3, [sp, #4]
1c: e2833001 add r3, r3, #1
20: e58d3004 str r3, [sp, #4]
24: e59d3004 ldr r3, [sp, #4]
28: e1530000 cmp r3, r0
2c: bafffff9 blt 18 <delay+0x18>
30: e28dd008 add sp, sp, #8
34: e12fff1e bx lr
I want to figure out where the variable i is on the stack of function delay from debugging information. Below is the information about delay and i in .debug_info section:
<1><25>: Abbrev Number: 2 (DW_TAG_subprogram)
<26> DW_AT_external : 1
<27> DW_AT_name : (indirect string, offset: 0x19): delay
<2b> DW_AT_decl_file : 1
<2c> DW_AT_decl_line : 1
<2d> DW_AT_prototyped : 1
<2e> DW_AT_low_pc : 0x0
<32> DW_AT_high_pc : 0x38
<36> DW_AT_frame_base : 0x0 (location list)
<3a> DW_AT_sibling : <0x59>
...
<2><4b>: Abbrev Number: 4 (DW_TAG_variable)
<4c> DW_AT_name : i
<4e> DW_AT_decl_file : 1
<4f> DW_AT_decl_line : 3
<50> DW_AT_type : <0x60>
<54> DW_AT_location : 0x20 (location list)
It shows that the location of i is in the location list. So I output the location list:
Offset Begin End Expression
00000000 00000000 00000004 (DW_OP_breg13 (r13): 0)
00000000 00000004 00000038 (DW_OP_breg13 (r13): 8)
00000000 <End of list>
00000020 0000000c 00000020 (DW_OP_fbreg: -12)
00000020 00000024 00000028 (DW_OP_reg3 (r3))
00000020 00000028 00000038 (DW_OP_fbreg: -12)
00000020 <End of list>
From address 4 to 38, the frame base of delay should be r13 + 8. So from address c to 20 and from address 28 to 38, the location of i is r13 + 8 -12 = r13 - 4.
However, the assembly never accesses r13 - 4, and i is apparently at r13 + 4.
Am I missing a calculation step? Can anyone explain the difference between the location of i calculated from the debugging information and the one seen in the assembly?
Thanks in advance!

TL;DR The analysis in the question is correct and the discrepancy is a bug in one of the gcc components (GNU Arm Embedded Toolchain is an obvious place to log one).
As it stands, this other answer is incorrect because it erroneously conflates the value of the stack pointer on evaluation of a location expression with the earlier value of the stack pointer on entry to the function.
As far as the DWARF is concerned, the location of i varies with the program counter. Consider, for example, the text address delay+0x18. At this point, the location of i is given by DW_OP_fbreg(-12), i.e. 12 bytes below the frame base. The frame base is given by the parent DW_TAG_subprogram's DW_AT_frame_base attribute which, in this case, is also dependent on the program counter: for delay+0x18 its expression is DW_OP_breg13(8), i.e. r13 + 8. Importantly, this calculation uses the current value of r13, i.e. the value of r13 when the program counter is equal to delay+0x18.
Thus the DWARF asserts that, at delay+0x18, i is located at r13 + 8 - 12, i.e. 4 bytes below the bottom of the existing stack. Inspection of the assembly shows that, at delay+0x18, i should be found 4 bytes above the bottom of the stack. Therefore the DWARF is in error and whatever generated it is defective.
One can demonstrate the bug using gdb with a simple wrapper around the test case provided in the question:
$ cat delay.c
void delay(int num)
{
volatile int i;
for(i=0; i<num; i++);
}
$ gcc-4.6 -g -O1 -c delay.c
$ cat main.c
void delay(int);
int main(int argc, char **argv) {
delay(3);
}
$ gcc-4.6 -o test main.c delay.o
$ gdb ./test
.
.
.
(gdb)
Set a breakpoint at delay+0x18 and run to the second occurrence (where we expect i to be 1):
(gdb) break *delay+0x18
Breakpoint 1 at 0x103cc: file delay.c, line 4.
(gdb) run
Starting program: /home/pi/test
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb) cont
Continuing.
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb)
We know from the disassembly that i is four bytes above the stack pointer. Indeed, there it is:
(gdb) print *((int *)($r13 + 4))
$1 = 1
(gdb)
However, the bogus DWARF means that gdb looks in the wrong place:
(gdb) print i
$2 = 0
(gdb)
As explained above, the DWARF is incorrectly giving the location of i at four bytes below the stack pointer. There's a zero there, hence the reported value of i:
(gdb) print *((int *)($r13 - 4))
$3 = 0
(gdb)
This isn't a coincidence. A magic number written into this bogus location below the stack pointer reappears when gdb is asked to print i:
(gdb) set *((int *)($r13 - 4)) = 42
(gdb) print i
$6 = 42
(gdb)
Thus, at delay+0x18, the DWARF incorrectly encodes the location of i as r13 - 4 even though its true location is r13 + 4.
One can go a step further by editing the compilation unit by hand and replacing DW_OP_fbreg(-12) (bytes 0x91 0x74) with DW_OP_fbreg(-4) (bytes 0x91 0x7c). This gives
$ readelf --debug-dump=loc delay.modified.o
Contents of the .debug_loc section:
Offset Begin End Expression
00000000 00000000 00000004 (DW_OP_breg13 (r13): 0)
0000000c 00000004 00000038 (DW_OP_breg13 (r13): 8)
00000018 <End of list>
00000020 0000000c 00000020 (DW_OP_fbreg: -4)
0000002c 00000024 00000028 (DW_OP_reg3 (r3))
00000037 00000028 00000038 (DW_OP_fbreg: -4)
00000043 <End of list>
$
In other words, the DWARF has been corrected so that at, e.g., delay+0x18 the location of i is given as frame base - 4 = r13 + 8 - 4 = r13 + 4, matching the assembly. Repeating the gdb experiment with the corrected DWARF shows the expected value of i each time around the loop:
$ gcc-4.6 -o test.modified main.c delay.modified.o
$ gdb ./test.modified
.
.
.
(gdb) break *delay+0x18
Breakpoint 1 at 0x103cc: file delay.c, line 4.
(gdb) run
Starting program: /home/pi/test.modified
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb) print i
$1 = 0
(gdb) cont
Continuing.
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb) print i
$2 = 1
(gdb) cont
Continuing.
Breakpoint 1, 0x000103cc in delay (num=3) at delay.c:4
4 for(i=0; i<num; i++);
(gdb) print i
$3 = 2
(gdb) cont
Continuing.
[Inferior 1 (process 30954) exited with code 03]
(gdb)

I do not agree with the OP's asm analysis:
00000000 <delay>: ; so far, let's suppose sp = sp(0)
0: e24dd008 sub sp, sp, #8 ; sp = sp(0) - 8
4: e3a03000 mov r3, #0 ; r3 = 0
8: e58d3004 str r3, [sp, #4] ; store the value of r3 in (sp + 4)
c: e59d3004 ldr r3, [sp, #4] ; load (sp + 4) in r3
10: e1500003 cmp r0, r3 ; compare r3 and r0
14: da000005 ble 30 <delay+0x30> ; go to end of loop
18: e59d3004 ldr r3, [sp, #4] ; i is in r3, and it is being loaded from
; (sp + 4), that is,
; sp(i) = sp(0) - 8 + 4 = sp(0) - 4
1c: e2833001 add r3, r3, #1 ; r3 = r3 + 1, that is, increment i
20: e58d3004 str r3, [sp, #4] ; store i (which is in r3) in (sp + 4),
; being again sp(i) = sp(0) - 8 + 4 = \
; sp(0) - 4
24: e59d3004 ldr r3, [sp, #4] ; load sp + 4 in r3
28: e1530000 cmp r3, r0 ; compare r3 and r0
2c: bafffff9 blt 18 <delay+0x18> ; go to init of loop
30: e28dd008 add sp, sp, #8 ; sp = sp + 8
34: e12fff1e bx lr ;
So i is located at sp(0) - 4, which matches the DWARF analysis (which says that i is located at 0 + 8 - 12).
Edit in order to add information regarding my DWARF analysis:
According to the line 00000020 0000000c 00000020 (DW_OP_fbreg: -12), and given the definition of DW_OP_fbreg:
The DW_OP_fbreg operation provides a signed LEB128 offset from the address specified by the location description in the DW_AT_frame_base attribute of the current function. (This is typically a “stack pointer” register plus or minus some offset. On more sophisticated systems it might be a location list that adjusts the offset according to changes in the stack pointer as the PC changes.)
the address is frame_base + offset, where:
frame_base: the stack pointer +/- some offset. According to the earlier line (00000000 00000004 00000038 (DW_OP_breg13 (r13): 8)), from 00000004 to 00000038 it has an offset of +8 (r13 is SP).
offset: -12.
Given that, DWARF indicates that it is pointing to sp(0) + 8 - 12 = sp(0) - 4

LC3 trap executed illegal vector number

I'm trying to count the number of characters in LC3 simulator and keep getting "a trap was executed with an illegal vector number".
These are the objects I execute
charcount.obj:
0011000000000000
0101010010100000
0010011000010000
1111000000100011
0110001011000000
0001100001111100
0000010000001000
1001001001111111
0001001001100001
0001001001000000
0000101000000001
0001010010100001
0001011011100001
0110001011000000
0000111111110110
0010000000000100
0001000000000010
1111000000100001
1111000000100101
1110001011111111
0000000000110000
and verse:
.ORIG x3100
.STRINGZ "Simple Simon met a pieman,"
.STRINGZ "Going to the fair;"
.STRINGZ "Says Simple Simon to the pieman,"
.STRINGZ "Let me taste your ware."
.FILL x04
.END

Looks like we're going to need more information before we can help you much. I understand you've provided us with some binary and I ran that through the LC3 simulator. Here's where I'm a bit lost, which string would you like to count and where is it stored?
After trying to piece together what you've provided here's what I've found.
Registers:
R0 x0061 97
R1 x0000 0
R2 x0000 0
R3 xE2FF -7425
R4 x0000 0
R5 x0000 0
R6 x0000 0
R7 x3003 12291
PC x3004 12292
IR x62C0 25280
CC Z
Memory:
x3000 0101010010100000 x54A0 AND R2, R2, #0
x3001 0010011000010000 x2610 LD R3, x3012
x3002 1111000000100011 xF023 TRAP IN
x3003 0110001011000000 x62C0 LDR R1, R3, #0
x3004 0001100001111100 x187C ADD R4, R1, #-4
x3005 0000010000001000 x0408 BRZ x300E
x3006 1001001001111111 x927F NOT R1, R1
x3007 0001001001100001 x1261 ADD R1, R1, #1
x3008 0001001001000000 x1240 ADD R1, R1, R0
x3009 0000101000000001 x0A01 BRNP x300B
x300A 0001010010100001 x14A1 ADD R2, R2, #1
x300B 0001011011100001 x16E1 ADD R3, R3, #1
x300C 0110001011000000 x62C0 LDR R1, R3, #0
x300D 0000111111110110 x0FF6 BRNZP x3004
x300E 0010000000000100 x2004 LD R0, x3013
x300F 0001000000000010 x1002 ADD R0, R0, R2
x3010 1111000000100001 xF021 TRAP OUT
x3011 1111000000100101 xF025 TRAP HALT
x3012 1110001011111111 xE2FF LEA R1, x3112
x3013 0000000000110000 x0030 NOP
The values displayed in the registers are what I get when I stop after line x3003. The LD at x3001 loads the word stored at x3012, so the literal value xE2FF (the encoding of the LEA instruction placed there) ends up in register R3. After that, the value 0 at memory location xE2FF is loaded into register R1, and the problems mount from there.
I would recommend displaying your asm code and then commenting each line so we can better understand what you're trying to accomplish.

ARM Assembly: Absolute Value Function: Are two or three lines faster?

In my embedded systems class, we were asked to re-code the given C-function AbsVal into ARM Assembly.
We were told that the best we could do was 3-lines. I was determined to find a 2-line solution and eventually did, but the question I have now is whether I actually decreased performance or increased it.
The C-code:
unsigned long absval(signed long x){
unsigned long int signext;
signext = (x >= 0) ? 0 : -1; //This can be done with an ASR instruction
return (x + signext) ^ signext;
}
The TA/Professor's 3-line solution
ASR R1, R0, #31 ; R1 <- (x >= 0) ? 0 : -1
ADD R0, R0, R1 ; R0 <- R0 + R1
EOR R0, R0, R1 ; R0 <- R0 ^ R1
My 2-line solution
ADD R1, R0, R0, ASR #31 ; R1 <- x + (x >= 0) ? 0 : -1
EOR R0, R1, R0, ASR #31 ; R0 <- R1 ^ (x >= 0) ? 0 : -1
There are a couple of places I can see potential performance differences:
The addition of one extra Arithmetic Shift Right call
The removal of one memory fetch
So, which one is actually faster? Does it depend upon the processor or memory access speed?

Here is another two-instruction version:
cmp r0, #0
rsblt r0, r0, #0
Which translates to the simple code:
if (r0 < 0)
{
r0 = 0-r0;
}
That code should be pretty fast, even on modern ARM-CPU cores like the Cortex-A8 and A9.

Dive over to ARM.com and grab the Cortex-M3 datasheet. Section 3.3.1 on page 3-4 has the instruction timings. Fortunately they're quite straightforward on the Cortex-M3.
We can see from those timings that in a perfect 'no wait state' system your professor's example takes 3 cycles:
ASR R1, R0, #31 ; 1 cycle
ADD R0, R0, R1 ; 1 cycle
EOR R0, R0, R1 ; 1 cycle
; total: 3 cycles
and your version takes two cycles:
ADD R1, R0, R0, ASR #31 ; 1 cycle
EOR R0, R1, R0, ASR #31 ; 1 cycle
; total: 2 cycles
So yours is, theoretically, faster.
You mention "The removal of one memory fetch", but is that true? How big are the respective routines? Since we're dealing with Thumb-2 we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:
Their version (adjusted for UAL syntax):
.syntax unified
.text
.thumb
abs:
asrs r1, r0, #31
adds r0, r0, r1
eors r0, r0, r1
Assembles to:
00000000 17c1 asrs r1, r0, #31
00000002 1840 adds r0, r0, r1
00000004 4048 eors r0, r1
That's 3x2 = 6 bytes.
Your version (again, adjusted for UAL syntax):
.syntax unified
.text
.thumb
abs:
add.w r1, r0, r0, asr #31
eor.w r0, r1, r0, asr #31
Assembles to:
00000000 eb0071e0 add.w r1, r0, r0, asr #31
00000004 ea8170e0 eor.w r0, r1, r0, asr #31
That's 2x4 = 8 bytes.
So instead of removing a memory fetch you've actually increased the size of the code.
But does this affect performance? My advice would be to benchmark.
