Background:
I wrote a function in C and compiled it with arm-none-eabi-gcc (7-2018-q2-update).
The generated assembly code for the loop body looks like it should take 20 cycles per iteration,
including 2 wait states for load operations accessing constant data from non-volatile program memory.
However, the datasheet says that the cache miss penalty for my MCU's NVM controller cache is guaranteed to be zero,
so I'm not sure why it can't prefetch the data for the two NVM load operations.
Therefore, I think the loop should take 18 cycles per iteration.
Unfortunately, the measured performance is quite different from the expected performance.
If I change int8_t increment and int16_t patch_data_i so that both are int32_t,
then GCC generates effectively the same instructions in a slightly different order.
Let's call the original version (a) and this one version (b).
The interesting thing is that version (a) takes 21 cycles per iteration and version (b) takes 20 cycles per iteration!
This performance difference is highly repeatable.
I have measured it very precisely, by varying the number of iterations between (5, 6, 7, 8) for version (a) and version (b).
Timing measurements from a Tektronix 465 oscilloscope at a fixed B sweep setting:
T(a)[min, max, avg] = (20.0, 21.0, 20.3) c @ 48 MHz.
T(b)[min, max, avg] = (21.0, 22.0, 21.3) c @ 48 MHz.
(Performance of this loop body is crucial, since it executes 8 iterations,
and this function is called once every 2000 clock cycles.
For my application, even this single-cycle difference amounts to roughly 0.5 percent of total CPU time.)
My question has 4 parts:
1. What is going on here?
2. Why is it that version (a) takes 21 cycles and version (b) takes 20 cycles?
3. Why don't both versions take 18 cycles?
4. Is there any possible way to accurately predict the access latency for RAM and NVM on an Atmel SAMD21 microcontroller,
   other than trying random permutations of assembly operations and measuring everything on an oscilloscope?
(Answers to any 1 of these 4 parts would be extremely appreciated.)
Source code (version a)
__attribute__((used))
void enc_calc_transition(struct enc *enc, uint16_t old_state, uint16_t new_state)
{
    uint32_t transitions = enc_interleave_states(old_state, new_state);
    size_t j = 0;
    for (size_t i = 0; i < 8; i++, j += 4) {
        const size_t transition = (transitions >> j) & 0xf;
        const int8_t increment = enc_increment_for_transition[transition];
        int16_t patch_data_i = enc->patch_data[i];
        patch_data_i += increment;
        size_t patch_data_i_plus_one = patch_data_i + 1;
        patch_data_i = enc_constrain_8x258[patch_data_i_plus_one];
        enc->patch_data[i] = patch_data_i;
    }
}
Source code (version b)
__attribute__((used))
void enc_calc_transition(struct enc *enc, uint16_t old_state, uint16_t new_state)
{
    uint32_t transitions = enc_interleave_states(old_state, new_state);
    size_t j = 0;
    for (size_t i = 0; i < 8; i++, j += 4) {
        const size_t transition = (transitions >> j) & 0xf;
        const int32_t increment = enc_increment_for_transition[transition];
        int32_t patch_data_i = enc->patch_data[i];
        patch_data_i += increment;
        size_t patch_data_i_plus_one = patch_data_i + 1;
        patch_data_i = enc_constrain_8x258[patch_data_i_plus_one];
        enc->patch_data[i] = patch_data_i;
    }
}
Generated assembly (version a)
cyc addr code instr fields
x 894e: 2200 movs r2, #0
x 8950: 250f movs r5, #15
x 8952: 4f09 ldr r7, [pc, #36] ; (8978)
x 8954: 4e09 ldr r6, [pc, #36] ; (897c)
x 8956: 4460 add r0, ip
1 8958: 000b movs r3, r1
1 895a: 40d3 lsrs r3, r2
1 895c: 402b ands r3, r5
2 895e: 7804 ldrb r4, [r0, #0]
2 8960: 56f3 ldrsb r3, [r6, r3]
1 8962: 3204 adds r2, #4
1 8964: 191b adds r3, r3, r4
1 8966: 18fb adds r3, r7, r3
2 8968: 785b ldrb r3, [r3, #1]
2 896a: 7003 strb r3, [r0, #0]
1 896c: 3001 adds r0, #1
1 896e: 2a20 cmp r2, #32
2 8970: d1f2 bne.n 8958 <enc_calc_transition+0x38>
18
x 8972: bdf0 pop {r4, r5, r6, r7, pc}
x 8974: 000090a8 ; <enc_expand_16x256>
x 8978: 00008fa4 ; <enc_constrain_8x258>
x 897c: 00008f94 ; <enc_increment_for_transition> [signed, 8x16]
instruction cycles:
movs lsrs ands ldrb ldrsb adds adds adds ldrb strb adds cmp bne
= 1 + 1 + 1 + 2 + 2 + 1 + 1 + 1 + 2 + 2 + 1 + 1 + 2
= 18
Generated assembly (version b)
cyc addr code instr fields
x 894e: 2200 movs r2, #0
x 8950: 250f movs r5, #15
x 8952: 4f09 ldr r7, [pc, #36] ; (8978)
x 8954: 4e09 ldr r6, [pc, #36] ; (897c)
x 8956: 4460 add r0, ip
1 8958: 0021 movs r1, r4
1 895a: 40d1 lsrs r1, r2
2 895c: 7803 ldrb r3, [r0, #0]
1 895e: 4029 ands r1, r5
2 8960: 5671 ldrsb r1, [r6, r1]
1 8962: 18fb adds r3, r7, r3
1 8964: 185b adds r3, r3, r1
2 8966: 785b ldrb r3, [r3, #1]
1 8968: 3204 adds r2, #4
2 896a: 7003 strb r3, [r0, #0]
1 896c: 3001 adds r0, #1
1 896e: 2a20 cmp r2, #32
2 8970: d1f2 bne.n 8958
18
x 8972: bdf0 pop {r4, r5, r6, r7, pc}
x 8974: 000090a8 ; <enc_expand_16x256>
x 8978: 00008fa4 ; <enc_constrain_8x258>
x 897c: 00008f94 ; <enc_increment_for_transition> [signed, 8x16]
instruction cycles:
movs lsrs ldrb ands ldrsb adds adds ldrb adds strb adds cmp bne
= 1 + 1 + 2 + 1 + 2 + 1 + 1 + 2 + 1 + 2 + 1 + 1 + 2
= 18
My interpretation of the generated assembly (version a)
I have written out my "interpretation" of the generated assembly for each case.
This section might be unnecessary, but I thought I might as well include it, since it helped me understand the differences between (a) and (b).
As above, the portions before and after the loop are identical.
The only significant difference I can see is that the two versions execute the same instructions in a slightly different order.
In particular, version (b) (which takes 20 cycles per iteration),
has zero instances of consecutive load/store operations,
zero instances of consecutive load/load operations,
and zero instances of consecutive store/store operations.
(The documented number of wait states for each load operation is commented in brackets: 1 wait state would be indicated by // ^ ldrb [1].)
r2 size_t j = 0;
r5 uint32_t mask_0xf = 0xf;
r7 uint8_t *constrain = &enc_constrain_8x258[0]; // 0x8fa4
r6 uint8_t *increment_for_transition =
&enc_increment_for_transition[0]; // 0x8f94
r0 uint8_t *patch_data = &enc->patch_data[0]
do {
r3 uint32_t _transitions = transitions;
r3 uint32_t transitions_shifted = _transitions >> j;
r3 size_t transition = transitions_shifted & mask_0xf;
r4 int16_t patch_data_i = *(patch_data + 0); //
// ^ ldrb [0]
r3 int8_t _increment = *(increment_for_transition + transition);
// ^ ldrsb [1]
j += 4;
r3 int16_t inc_plus_pdata = _increment + patch_data_i;
r3 uint8_t *constrain_plus_inc_plus_pdata =
constrain + inc_plus_pdata;
r3 uint8_t constrained_pdata = *(constrain_plus_inc_plus_pdata + 1);
// ^ ldrb [1]
*(patch_data + 0) = constrained_pdata;
// ^ strb [0]
patch_data++;
} while (j < 32);
My interpretation of the generated assembly (version b)
r2 size_t j = 0;
r5 uint32_t mask_0xf = 0xf;
r7 uint8_t *constrain = &enc_constrain_8x258[0]; // 0x8fa4
r6 uint8_t *increment_for_transition =
&enc_increment_for_transition[0]; // 0x8f94
r0 uint8_t *patch_data = &enc->patch_data[0]
do {
r1 uint32_t _transitions = transitions;
r1 uint32_t transitions_shifted = _transitions >> j;
r3 int32_t patch_data_i = *(patch_data + 0);
// ^ ldrb [0]
r1 size_t transition = transitions_shifted & mask_0xf;
r1 int32_t _increment = *(increment_for_transition + transition);
// ^ ldrsb [1]
r3 uint8_t *constrain_plus_pdata = constrain + patch_data_i;
r3 uint8_t *constrain_plus_pdata_plus_inc =
constrain_plus_pdata + _increment;
r3 uint8_t constrained_pdata = *(constrain_plus_pdata_plus_inc + 1);
// ^ ldrb [1]
j += 4;
*(patch_data + 0) = constrained_pdata;
// ^ strb [0]
patch_data++;
} while (j < 32);
Platform information
The microcontroller is the Atmel/Microchip AT91SAMD21G18A.
The architecture is ARMv6-M.
The microarchitecture is ARM Cortex-M0+.
The master clock frequency of my MCU core is 48 MHz.
At 48 MHz, the SAMD21 [non-volatile] program memory requires 1 wait state if the cache is disabled.
At 48 MHz, the SAMD21 SRAM requires zero wait states.
However, I don't see any reason why it would be faster to execute the code from RAM.
I believe the NVM data path is separate from the RAM data path,
so instruction fetches from NVM should never contend with data fetches from RAM
(I'm not 100% sure about this, but I think it's true).
Therefore, if the NVM controller cache is working as documented,
it seems that running this loop from NVM should almost certainly be faster than running this loop from RAM.
The SAMD21 has a 64-byte cache for accesses to non-volatile memory.
The NVM controller cache "is a direct-mapped cache that implements 8 lines of 64 bits (i.e., 64 bytes)."
The NVM controller cache is enabled, in NO_MISS_PENALTY mode.
This is the datasheet description of NO_MISS_PENALTY mode:
"The NVM controller (cache system) does not insert wait states on a cache miss.
Gives the best system performance."
The datasheet does not provide any more information about NO_MISS_PENALTY mode.
The Cortex-M0+ uses a von Neumann architecture: instructions and data share a single bus interface. Instruction fetches therefore always contend with data accesses, whether the code is in zero-wait-state SRAM or in flash.
So I'm having trouble with my program. It's supposed to read in a text file
that has a number on each line. It then stores that in an array, sorts it using selection sort, and then outputs it to a new file. The reading of and writing to the file work perfectly fine but my code for the sort isn't working properly. When I run the program, it only seems to store some of the numbers
in the array and then a bunch of zeroes.
So if my input is 112323, 32, 12, 19, 2, 1, 23, the output is 0, 0, 0, 0, 2, 1, 23. I'm pretty sure the problem is with how I'm storing and loading from the array
into the registers because, assuming that part works, I can't find any reason why the selection sort algorithm shouldn't work.
OK, thanks to your help, I figured out that I needed to change the load and store instructions so that they match the specifier used (ldr -> ldrb and str -> strb). But I need to make a sorting algorithm that works for 32-bit numbers, so which combination of specifiers and load/store instructions would allow me to do that? Or would I have to load/store 8 bits at a time? And if so, how would I do that?
.data
.balign 4
readfile: .asciz "myfile.txt"
.balign 4
readmode: .asciz "r"
.balign 4
writefile: .asciz "output.txt"
.balign 4
writemode: .asciz "w"
.balign 4
return: .word 0
.balign 4
scanformat: .asciz "%d"
.balign 4
printformat: .asciz "%d\n"
.balign 4
a: .space 32
.text
.global main
.global fopen
.global fprintf
.global fclose
.global fscanf
.global printf
main:
ldr r1, =return
str lr, [r1]
ldr r0, =readfile
ldr r1, =readmode
bl fopen
mov r4, r0
mov r5, #0
ldr r6, =a
loop:
cmp r5, #7
beq sort
mov r0, r4
ldr r1, =scanformat
mov r2, r6
bl fscanf
add r5, r5, #1
add r6, r6, #1
b loop
sort:
mov r5,#0 /*array parser for first loop*/
mov r6, #0 /* #stores index of minimum*/
mov r7, #0 /* #temp*/
mov r8, #0 /*# array parser for second loop*/
mov r9, #7 /*# stores length of array*/
ldr r10, =a /*# the array*/
mov r11, #0 /*#used to obtain offset for min*/
mov r12, #0 /*# used to obtain offset for second parser access*/
loop3:
cmp r5, r9 /*# check if first parser reached end of array*/
beq write /* #if it did array is sorted write it to file*/
mov r6, r5 /*#set the min index to the current position*/
mov r8, r6 /*#set the second parser to where first parser is at*/
b loop4 /*#start looking for min in this subarray*/
loop4:
cmp r8, r9 /* #if reached end of list min is found*/
beq increment /* #get out of this loop and increment 1st parser**/
lsl r7, r6, #3 /*multiplies min index by 8 */
ADD r7, r10, r7 /* adds offset to r10 address storing it in r7 */
ldr r11, [r7] /* loads value of min in r11 */
lsl r7, r8, #3 /* multiplies second parse index by 8 */
ADD r7, r10, r7 /* adds offset to r10 address storing in r7 */
ldr r12, [r7] /* loads value of second parse into r12 */
cmp r11, r12 /* #compare current min to the current position of 2nd parser !!!!!*/
movgt r6, r8 /*# set new min to current position of second parser */
add r8, r8, #1 /*increment second parser*/
b loop4 /*repeat */
increment:
lsl r11, r5, #3 /* multiplies first parse index by 8 */
ADD r11, r10, r11 /* adds offset to r10 address stored in r11*/
ldr r8, [r11] /* loads value in memory address in r11 to r8*/
lsl r12, r6, #3 /*multiplies min index by 8 */
ADD r12, r10, r12 /*ads offset to r10 address stored in r12 */
ldr r7, [r12] /* loads value in memory address in r12 to r7 */
str r8, [r12] /* # stores value of first parser where min was !!!!!*/
str r7, [r11] /*# store value of min where first parser was !!!!!*/
add r5, r5, #1 /*#increment the first parser*/
ldr r0,=printformat
mov r1, r7
bl printf
b loop3 /*#go to loop1*/
write:
mov r0, r4
bl fclose
ldr r0, =writefile
ldr r1, =writemode
bl fopen
mov r4, r0
mov r5, #0
ldr r6, =a
loop2:
cmp r5, #7
beq end
mov r0, r4
ldr r1, =printformat
ldrb r2, [r6]
bl fprintf
add r5, r5, #1
add r6, r6, #1
b loop2
end:
mov r0, r4
bl fclose
ldr r0, =a
ldr r0, [r0]
ldr lr, =return
ldr lr, [lr]
bx lr
I figured out that I needed to change the load and store instruction
so that it matches the specifier used (ldr -> ldrb and str -> strb).
But I need to make a sorting algorithm that works for 32 bit numbers
so which combination of specifiers and load/store instructions would
allow me to do that?
If you want to read 32b (4-byte) values from memory, you have to have 4-byte values in memory to begin with. That should not be surprising :)
E.g. if your input is the numbers 1, 2, 3, 4, and each number is a 32b value, then in memory that would be:
0x00000000: 01 00 00 00 | 02 00 00 00 <- 32b values of 1 & 2
0x00000008: 03 00 00 00 | 04 00 00 00 <- 32b values of 3 & 4
In that case ldr would read 32b each time, and you would get 1, 2, 3, 4 in the register on successive reads.
Right now, you have byte values in memory (based on your statement that `ldrb` gives the right result), e.g.
0x00000000: 01
0x00000001: 02
0x00000002: 03
0x00000003: 04
or same in one line
0x00000000: 01 02 03 04
So reading 8 bits at a time with ldrb gives you the numbers 1, 2, 3, 4.
But ldr would read a 32b value from memory (all 4 bytes at once), and you would get the 32b value 0x04030201 in the register.
Note: these examples are for little-endian systems.
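The difference between the two memory layouts can be sketched in C. This is only an illustration under assumptions: `load_byte` and `load_word_le` are hypothetical helper names standing in for ldrb and for ldr on a little-endian system, and the byte arrays mirror the dumps above.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads one byte and zero-extends it, as ldrb would. */
static uint32_t load_byte(const uint8_t *mem, size_t addr)
{
    return mem[addr];
}

/* Reads four consecutive bytes as one 32-bit word, least significant byte
   first, as ldr would on a little-endian system. */
static uint32_t load_word_le(const uint8_t *mem, size_t addr)
{
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

/* Layout 1: each number stored as a 4-byte value (element stride 4). */
static const uint8_t as_words[16] = {
    1, 0, 0, 0,  2, 0, 0, 0,  3, 0, 0, 0,  4, 0, 0, 0
};

/* Layout 2: each number stored as a single byte (element stride 1). */
static const uint8_t as_bytes[4] = { 1, 2, 3, 4 };
```

With these definitions, `load_word_le(as_words, 4 * i)` gives 1, 2, 3, 4 for i = 0..3, while `load_word_le(as_bytes, 0)` gives 0x04030201. Note that the word layout needs the index scaled by 4 to form the address, both when storing and when loading.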
How can I write a simple LC-3 program that compares the two numbers in R1 and R2 and puts the value 0 in R0 if R1 = R2, 1 if R1 > R2 and -1 if R1 < R2?
The comparison is done using simple arithmetic.
In my example we compare 2 and 6, so the result in R0 is -1.
          .ORIG x3000
          LD    R1, NUMBER1    ;load NUMBER1 into R1
          LD    R2, NUMBER2    ;load NUMBER2 into R2
          AND   R0, R0, #0     ;initialize R0 with 0
          NOT   R3, R2         ;R3 = NOT R2
          ADD   R3, R3, #1     ;R3 = -R2 (two's complement negation of NUMBER2)
          ADD   R4, R3, R1     ;R4 = R1 - R2
          BRz   Equals         ;we jump to Equals if NUMBER1 = NUMBER2 (we can just jump directly to End)
          BRn   GreaterR2      ;we jump to GreaterR2 if NUMBER1 < NUMBER2
          BRp   GreaterR1      ;we jump to GreaterR1 if NUMBER1 > NUMBER2
Equals    BRnzp End            ;nothing to do, because R0 = 0 (THIS IS NOT NECESSARY)
GreaterR2 ADD   R0, R0, #-1    ;R0 = -1
          BRnzp End
GreaterR1 ADD   R0, R0, #1     ;R0 = 1
          BRnzp End
End       HALT                 ;THE END
NUMBER1   .FILL #2             ;/ Here we declare the numbers we want to compare
NUMBER2   .FILL #6             ;\
          .END
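The same comparison can be sketched in C to make the arithmetic explicit. This is an illustration only: `compare_by_subtraction` is a hypothetical name, and 16-bit unsigned arithmetic is used to imitate the LC-3's 16-bit registers. Negating a number takes two steps (NOT, then add 1), and the sign of R1 - R2 decides the result.

```c
#include <stdint.h>

/* Returns 0 if a == b, 1 if a > b, -1 if a < b, using the same steps as the
   LC-3 approach: negate b, add it to a, then test the sign of the result.
   Like the 16-bit LC-3 arithmetic it imitates, this gives the wrong answer
   if a - b overflows 16 bits (for example 30000 compared with -30000). */
static int compare_by_subtraction(int16_t a, int16_t b)
{
    uint16_t neg_b = (uint16_t)(~(uint16_t)b + 1u);      /* NOT, then ADD #1 */
    uint16_t diff  = (uint16_t)((uint16_t)a + neg_b);    /* a - b, 16 bits   */

    if (diff == 0)
        return 0;           /* BRz: the numbers are equal */
    if (diff & 0x8000u)
        return -1;          /* BRn: sign bit set, so a < b */
    return 1;               /* BRp: a > b */
}
```

For the values in the example, `compare_by_subtraction(2, 6)` returns -1.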
.ORIG x3000
AND R1, R1, x0
AND R2, R2, x0
LD R6, RESET
LEA R0, LINE1
PUTS
GETC
OUT
ADD R1, R6, R0
LEA R0, LINE2
PUTS
GETC
OUT
ADD R2, R6, R0
JSR COMPARE
HALT
;////////// COMPARE FUNCTION BEGINS /////////////
COMPARE
AND R3, R3, x0
NOT R2, R2
ADD R2, R2, x1
ADD R3, R1, R2
BRn NEG
ADD R3, R3, x0
BRp POS
ADD R3, R3, x0
BRz EQ
AND R5, R5, x0
ADD R5, R5, R1
RET
NEG LEA R0, N ; triggers when R3 IS NEGATIVE
PUTS
RET
POS LEA R0, P ; triggers when R3 IS POSITIVE
PUTS
RET
EQ LEA R0, E ; triggers when R3 IS ZERO
PUTS
RET
N .STRINGZ "\nX IS LESS THAN Y"
P .STRINGZ "\nX IS GREATER THAN Y"
E .STRINGZ "\nX IS EQUAL TO Y"
RESET .FILL xFFD0; RESET = -48 AS THIS IS ASCII RESETER FOR OUR PROGRAM
LINE1 .STRINGZ "ENTER X : "
LINE2 .STRINGZ "\nENTER Y : "
.END
Can anyone run this program, or help me understand why my counter isn't updating?
I am supposed to read in text from a prompt, find the length of the text, and output the length followed by a colon before printing the text that was entered.
The first time through, if I type "test", the length is 4. When the program loops back to start and asks for input again, it outputs the correct text, but the counter doesn't change unless the new text is longer.
So, if I type "I", it will output a length of 4, since "test" is longer and has length 4. But if I type "Control", which is 7 letters, it will update the counter to 7.
OUTPUT:
Enter: Hey
3:Hey
Enter: Test
4:Test
Enter: Control
7:Control
Enter: Hey
7:Hey <---- Length should be 3!
Thank you!
.orig x3000 ; Starting point of the program.
BR start ; Branch to the start routine.
newln: .stringz "\n"
msg1: .stringz "Enter: "
; Prints out the instructions to the user and reads some user input.
start:
lea r0, newln ; Load address of newline into R0.
puts ; Print newline.
lea r0, msg1 ; Load address of message3 into R0.
puts
lea r2, MESSAGE ; Load starting point address of MESSAGE.
and r1, r1, #0 ; Initialize R1 to zero.
input:
getc ; Read in a single character to R0.
out
add r5, r0, #-10 ; Subtract 10 because enter key is 10.
BRz printint ; If zero, branch to checkChar routine.
; Else continue the loop.
str r0, r2, #0 ; Store char in MESSAGE.
add r2, r2, #1 ; Increment index of MESSAGE.
add r1, r1, #1 ; Increment input counter.
BR input ; Unconditional branch to input.
checkChar:
lea r5, inv81 ; Load address of inv68 into R6.
ldr r5, r5, #0 ; Load contents of inv68 into R6 (R6 now holds -68).
add r0, r3, r5 ; Add -68 to the value in R3, to check if it's 'q'.
BRz quit ; If zero, branch to decrypt.
;
;print integer starts here
;
printint:
ld r3,psign
jsr STRLEN
ADD r7, r0, #0 ; get the integer to print
brzp nonneg
ld r3,nsign
not r7,r7
add r7,r7,1
nonneg:
lea r6,buffer ; get the address of o/p area
add r6,r6,#7 ; compute address of end of o/p
ld r5,char0 ; get '0' to add to int digits
loop1:
and r0,r0,#0 ; init quotient for each divide
loop2:
add r7,r7,#-10 ; add -10
brn remdr ; until negative
add r0,r0,#1 ; incr to compute quotient
br loop2 ; repeat
remdr:
add r7,r7,#10 ; add 10 to get remainder
add r7,r7,r5 ; convert to ascii
str r7,r6,0 ; place ascii in o/p
add r7,r0,#0 ; move quot for next divide
brz end ; if done then print
add r6,r6,#-1 ; move to prev o/p position
br loop1 ; repeat
end:
add r6,r6,#-1 ; move to prev o/p position
str r3,r6,0 ; place sign
add r0,r6,#0 ; move address of 1st char
puts ; into r0 and print
output:
ld r5, colon
and r3,r3, 0;
add r0, r3, r5;
out
lea r2, MESSAGE ; Load (starting) address of MESSAGE.
outputLoop:
ldr r0, r2, #0 ; Load contents of address at MESSAGE index into R0.
out ; Print character.
add r2, r2, #1 ; Increment MESSAGE index.
add r1, r1, #-1 ; Decrease counter.
BRp outputLoop ; If positive, loop.
br start
quit:
halt ; Halt execution.
STRLEN:
LEA R2, MESSAGE ;R1 is pointer to characters
AND R0, R0, #0 ;R0 is counter, initially 0
LD R5, char0
LOOP: ADD R2, R2, #1 ;POINT TO NEXT CHARACTER
LDR R4, R2, #0 ;R4 gets character input
BRz FINISH
ADD R0, R0, #1
BR LOOP
FINISH:
ADD R0, R0, #1
ret
MESSAGE: .blkw 99 ; MESSAGE of size 20.
inv48: .fill #-48 ; Constant for converting numbers from ASCII to decimal.
inv81: .fill #-81 ; Constant for the inverse of 'Q'.
buffer: .blkw 8 ; o/p area
null: .fill 0 ; null to end o/p area
char0: .fill x30
colon .fill x3A
nsign .fill x2d
psign .fill x20
.end
At the end of your example, the contents in memory starting at message are:
Heytrol0000000
It looks to me like the problem is that in STRLEN we compute the length of the string by counting until we reach the first character that is 0. There are 7 characters in "Heytrol".
However, when we are storing the message, we count the number of characters that we have read in (kept in r1). When we print the string later on, we use the value in r1, so we don't end up printing any of the "extra" characters.
To fix this, I'd either output the value in r1 that was computed while reading the string as its length (get rid of the STRLEN code completely) or make sure that when we read the enter in the input loop, we write a zero into the string before going and printing:
input:
getc ; Read in a single character to R0.
out
add r5, r0, #-10 ; Subtract 10 because enter key is 10.
BRz finishString ; If zero, branch to finishString routine.
; Else continue the loop.
str r0, r2, #0 ; Store char in MESSAGE.
add r2, r2, #1 ; Increment index of MESSAGE.
add r1, r1, #1 ; Increment input counter.
BR input ; Unconditional branch to input.
finishString:
and r0, r0, #0 ; set r0 to zero so we can store it
str r0, r2, #0 ; write zero (from r0) into the end of the string
BR printint ; Now, branch to printint routine.
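The stale-length effect can be reproduced with a small C model. This is a sketch under assumptions: `store_input` is a hypothetical stand-in for the input loop, the buffer starts zero-filled like the .blkw area, and the standard `strlen` stands in for "count until the first zero".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MESSAGE_SIZE 99

/* Copies the typed characters to the start of message, as the input loop
   does, and returns the count kept while reading (the value in r1).
   When terminate is nonzero, a 0 is written after the last character,
   which is what the finishString step adds. */
static size_t store_input(char message[MESSAGE_SIZE], const char *typed,
                          int terminate)
{
    size_t count = strlen(typed);
    assert(count < MESSAGE_SIZE);      /* leave room for the terminator */
    memcpy(message, typed, count);
    if (terminate)
        message[count] = '\0';
    return count;
}
```

Starting from a zeroed buffer, storing "Control" and then "Hey" without the terminator leaves "Heytrol" in memory, so counting to the first zero reports 7 even though the read counter says 3. Storing "Hey" with the terminator makes both agree on 3.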
I am trying to write a LC3 assembly language program that takes two input numbers and prints out x * y = z.
I can get it to work for the numbers 0-9; however, for any numbers above that I get weird letters or symbols.
Also, how can I make it accept multi-digit numbers instead of a single digit per GETC, e.g. 10 * 12 = 120?
Any help would be appreciated! :)
Here's what I have done so far
.ORIG x3000
AND R3, R3, #0 ;r3 stores the sum, set r3 to zero
AND R4, R4, #0 ;r4 is the counter
LD R5, INVERSE_ASCII_OFFSET ;inverse ascii offset
LD R6, DECIMAL_OFFSET ;decimal offset
;---------------------
;storing first input digits
LEA R0, display1 ;load the address of the 'display1' message string
PUTS ;Prints the message string
GETC ;get the first number
OUT ;print the first number
ADD R1, R0, #0 ;store input value(ascii) to r1
ADD R1, R1, R5 ;get real value of r1
;storing second input digits
LEA R0, display2 ;load the address of the 'display2' message string
PUTS ;Prints the message string
GETC ;get the first number
OUT ;print the first number
ADD R2, R0, #0 ;store input value(ascii) to r2
ADD R2, R2, R5 ;get real value of r2
;----------------------
ADD R4, R2, #0 ;fill counter with multiplier
MULTIPLICATION:
ADD R3, R3, R1 ;add to sum
ADD R4, R4, #-1 ;decrease counter by one
BRp MULTIPLICATION ;continue loop until multiplier is 0
LEA R0, stringResult
PUTS
ADD R0, R3, R6 ;move result to r0
OUT ;print result
HALT
display1 .STRINGZ "\nenter the 1st no.: "
display2 .STRINGZ "\nenter the 2nd no.: "
stringResult .STRINGZ "\nResult: "
INVERSE_ASCII_OFFSET .fill xFFD0 ; Negative of x0030.
DECIMAL_OFFSET .fill #48
.END
Your display function works by adding a number to the ASCII value of '0'. This works because the ASCII table places the digit characters consecutively. For instance, '0' + 1 = '1', which is equivalent to 0x30 + 1 = 0x31. However, you are probably finding that '0' + 12 = '<'. This is because '0' = 0x30, so 0x30 + 12 (0xC) = 0x3C, and the ASCII chart shows that 0x3C = '<'. That is, this method only works for printing a single digit.
The answer to both your questions lies in writing a routine that deals with digits one at a time and forms a number from them. In other words, you will need a loop that determines which character to print out next and prints it.
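The two loops can be sketched in C before translating them to LC-3. This is a minimal sketch with hypothetical names (`parse_decimal`, `format_decimal`). Since the LC-3 has no multiply or divide instruction, the `* 10`, `/ 10` and `% 10` steps would have to become repeated addition or subtraction loops there.

```c
#include <stddef.h>
#include <stdint.h>

/* Input: builds a number from consecutive digit characters, one at a time:
   value = value * 10 + digit. Stops at the first non-digit character. */
static uint32_t parse_decimal(const char *text)
{
    uint32_t value = 0;
    for (; *text >= '0' && *text <= '9'; text++)
        value = value * 10u + (uint32_t)(*text - '0');
    return value;
}

/* Output: writes value as decimal text into out and returns the number of
   digits written. The caller must provide room for at least 11 characters
   (10 digits plus the terminating 0). */
static size_t format_decimal(uint32_t value, char *out)
{
    char reversed[10];   /* digits come out least significant first */
    size_t n = 0;

    do {
        reversed[n++] = (char)('0' + value % 10u);   /* remainder -> ASCII */
        value /= 10u;                                /* quotient for next pass */
    } while (value != 0);

    for (size_t i = 0; i < n; i++)
        out[i] = reversed[n - 1 - i];                /* reverse into out */
    out[n] = '\0';
    return n;
}
```

For example, `format_decimal(parse_decimal("10") * parse_decimal("12"), buf)` writes "120" into `buf` and returns 3.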