Assuming a direct-mapped cache, how is LDR handled in the following cases?

I've been tasked to make a "timing" table for the following code (using a 5-stage pipeline):
LDR R1, = 0x345678
LDR R2, [R1]
LDR R3, =0xFC167B
LDR R4, [R3]
LDR R5, =0xD8967A
LDR R6, [R5]
Assuming these are all "cache misses", I am wondering where delays would be introduced if we assume a 1-cycle penalty in the event of a miss. My thinking is that the first line, LDR R1, =0x345678, wouldn't cause a penalty, since we are just loading a word into a register.
But for the second line, LDR R2, [R1], I am thinking that there will be a penalty, since we are trying to read from the cache, but it isn't in the cache. Hence, would there just be a delay after the "Memory" stage of the pipeline?
Regards
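A rough way to sanity-check such a timing table is to count total cycles. The sketch below is only a model under the question's own assumptions: an in-order 5-stage pipeline, a 1-cycle stall per miss, only the three register-indirect loads (LDR Rx, [Ry]) missing, and no extra stalls for load-use data hazards. The function name total_cycles is hypothetical. Note that on a real ARM core, LDR Rn, =constant is typically assembled as a PC-relative load from a literal pool, so it may read memory as well; the model follows the asker's assumption that it does not miss.

```python
def total_cycles(num_instructions, num_stages=5, misses=0, miss_penalty=1):
    """Cycle count for an in-order pipeline where each miss stalls it by miss_penalty cycles."""
    # Ideal case: the first instruction needs num_stages cycles,
    # and every later instruction completes one cycle after the previous one.
    ideal = num_stages + num_instructions - 1
    # Assumption: every miss delays all younger instructions by miss_penalty cycles.
    return ideal + misses * miss_penalty


# Six instructions; assumed misses only on the three LDR Rx, [Ry] loads.
print(total_cycles(6))            # ideal pipeline, no misses
print(total_cycles(6, misses=3))  # with three 1-cycle miss penalties
```

Under these assumptions each stall would show up in the table as an extra cycle spent in the Memory stage of the missing load, with every later instruction held back by one cycle.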

Related

How to make a sorting algorithm in Assembly code, in LC3

I have a simple loop that takes a name and prints the name without saving it.
looptext getc ;starts text get loop for name
;since name isn't re-used, we don't have to save it
add r1, r0, -10 ;Test for enter character
brz finishloop1 ;if enter, cancel the text loop
OUT ;If it's not enter, print out the character typed
br looptext ;Go back to loop
finishloop1
The program then asks for an ID number separated by spaces. All these values are saved into an array, and on each loop iteration it checks whether the new input is the new lowest or new highest value and stores it in the respective register.
[Deleted code for copyright sake]
At the end of the code, where I need to add a sorting algorithm, I am left with an array of only characters.
I need to go through each index of the array and rearrange the characters in the numerical order (smallest to largest).
Thank you all very much for the tips and tricks. Thank you specifically @Ped7g for linking me to that sorting algorithms page. I ended up searching around and actually found someone on GitHub who had a bubble sort algorithm already written in assembly code. So thanks for indirectly giving me the answer.
Note: For any future people coming here to find an answer, here is the link to the bubble sort code:
https://github.com/oc-cs360/s2014/blob/master/lc3/bubblesort.asm. This is part of the lecture notes for a university course.
; Implementing bubble sort algorithm
; R0 File item
; R1 File item
; R2 Work variable
; R3 File pointer
; R4 Outer loop counter
; R5 Inner loop counter
.ORIG x3000
; Count the number of items to be sorted and store the value in R2
AND R2, R2, #0 ; Initialize R2 <- 0 (counter)
LD R3, FILE ; Put file pointer into R3
COUNT LDR R0, R3, #0 ; Put next file item into R0
BRZ END_COUNT ; Loop until file item is 0
ADD R3, R3, #1 ; Increment file pointer
ADD R2, R2, #1 ; Increment counter
BRNZP COUNT ; Counter loop
END_COUNT ADD R4, R2, #0 ; Store total items in R4 (outer loop count)
BRZ SORTED ; Empty file
; Do the bubble sort
OUTERLOOP ADD R4, R4, #-1 ; loop n - 1 times
BRNZ SORTED ; Looping complete, exit
ADD R5, R4, #0 ; Initialize inner loop counter to outer
LD R3, FILE ; Set file pointer to beginning of file
INNERLOOP LDR R0, R3, #0 ; Get item at file pointer
LDR R1, R3, #1 ; Get next item
NOT R2, R1 ; Negate ...
ADD R2, R2, #1 ; ... next item
ADD R2, R0, R2 ; swap = item - next item
BRNZ SWAPPED ; Don't swap if in order (item <= next item)
STR R1, R3, #0 ; Perform ...
STR R0, R3, #1 ; ... swap
SWAPPED ADD R3, R3, #1 ; Increment file pointer
ADD R5, R5, #-1 ; Decrement inner loop counter
BRP INNERLOOP ; End of inner loop
BRNZP OUTERLOOP ; End of outer loop
SORTED HALT
FILE .FILL x3500 ; File location
.END
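To make the control flow of the listing easier to follow, here is a minimal Python model of the same algorithm. It is a sketch, not a translation of the LC-3 semantics: memory is modeled as a plain list, file_ptr is an index into it, and the function name bubble_sort_lc3_style is hypothetical. Like the assembly, it treats a 0 word as the end of the data.

```python
def bubble_sort_lc3_style(memory, file_ptr):
    """Sort the zero-terminated run of items starting at memory[file_ptr], in place."""
    # COUNT loop: count items until the 0 terminator.
    count = 0
    while memory[file_ptr + count] != 0:
        count += 1

    outer = count                      # R4: outer loop counter
    while True:
        outer -= 1                     # OUTERLOOP: ADD R4, R4, #-1
        if outer <= 0:                 # BRNZ SORTED
            break
        ptr = file_ptr                 # R3: reset file pointer
        inner = outer                  # R5: inner loop counter
        while True:
            item, nxt = memory[ptr], memory[ptr + 1]
            if item - nxt > 0:         # swap only when item > next item
                memory[ptr], memory[ptr + 1] = nxt, item
            ptr += 1
            inner -= 1
            if inner <= 0:             # BRP INNERLOOP falls through
                break
    return memory


data = [5, 3, 9, 1, 0]                 # trailing 0 is the terminator
print(bubble_sort_lc3_style(data, 0))
```

Because 0 is the terminator, this scheme cannot sort data that contains a 0 item; that limitation is shared with the assembly version.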

ARM ldrb is setting Wt to -1

I made a function in ARM that removes a duplicate character if the same character exists in a previous location. If it finds one, it will remove the duplicate character and shift all values over.
The first argument (r0) is the index of the character to check for duplicates, and the second argument (r1) is the array to look through. For example, if r0 is 3 and r1 points to "hello", it will look at the character at index 3 (the second 'l'); after running, the string r1 points to will become "helo" and the function will return 1 in r0.
The current problem I'm having is that the first ldrb r2, [r1, r0] is setting r2 to -1. I have checked the ARM manual for ldrb returning errors like this, but I haven't been able to find anything like that. I was wondering if anybody knew a solution.
Thanks!
check_if_duplicate:
push {r4, r5, lr}
# Get character at index i (r0) in r2
ldrb r2, [r1, r0]
mov r3, #0
duplicate_loop:
# Check if the target index is greater than the current index
cmp r0, r3
ble duplicate_done
# Check if current index is a duplicate
ldrb r4, [r1, r3]
cmp r4, r2
# Increment iterator and repeat loop if they are not equal
addne r3, r3, #1
bne duplicate_loop
# If they are equal, then shift all values over
duplicate_shift_loop:
# Find next value
add r3, r3, #1
# Load next value
ldrb r2, [r1, r3]
# Check if new value is the end of the array
cmp r2, #0
beq duplicate_shift_done
# Use previous index to store value
sub r5, r3, #1
# Store value in previous address
strb r2, [r1, r5]
# Do process again
b duplicate_shift_loop
duplicate_shift_done:
# Store string ending byte
sub r3, r3, #1
mov r2, #0
strb r2, [r1, r3]
# Set 1 as the return value
mov r0, #1
b d_done
duplicate_done:
mov r0, #0
d_done:
pop {r4, r5, pc}
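When debugging a routine like this, it can help to have a small reference model of the intended behavior to compare against in a debugger. The Python sketch below models only what the description says the function should do (remove the character at the given index if the same character appears earlier, and report 1 or 0); it is not a translation of the assembly, and it uses a Python list of characters in place of a NUL-terminated byte array.

```python
def check_if_duplicate(index, chars):
    """Remove chars[index] if the same character occurs earlier in chars.

    chars is a mutable list of single characters. Returns 1 if a duplicate
    was removed (later characters shift left by one), otherwise 0.
    """
    target = chars[index]
    if target in chars[:index]:
        del chars[index]   # shifts all following characters over
        return 1
    return 0


s = list("hello")
result = check_if_duplicate(3, s)
print(result, "".join(s))
```

Stepping through the assembly with the same inputs and comparing register values against this model at each load is one way to narrow down where the two diverge.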

C11 and C++11 atomics: acquire-release semantics and memory barriers

I'm using C11* atomics to manage a state enum between a few threads. The code resembles the following:
static _Atomic State state;
void setToFoo(void)
{
atomic_store_explicit(&state, STATE_FOO, memory_order_release);
}
bool stateIsBar(void)
{
return atomic_load_explicit(&state, memory_order_acquire) == STATE_BAR;
}
This assembles (for an ARM Cortex-M4) to:
<setToFoo>:
ldr r3, [pc, #8]
dmb sy ; Memory barrier
movs r2, #0
strb r2, [r3, #0] ; store STATE_FOO
bx lr
.word 0x00000000
<stateIsBar>:
ldr r3, [pc, #16]
ldrb r0, [r3, #0] ; load state
dmb sy ; Memory barrier
sub.w r0, r0, #2 ; Comparison and return follows
clz r0, r0
lsrs r0, r0, #5
bx lr
.word 0x00000000
Why are the fences placed before the release and after the acquire? My mental model assumed that a barrier would be placed after a release (to "propagate" the variable being stored and all other stores to other threads) and before an acquire (to receive all previous stores from other threads).
*While this particular example is given in C11, the situation is identical in C++11, as the two share the same concepts (and even the same enums) when it comes to memory ordering. gcc and g++ emit the same machine code in this situation. See http://en.cppreference.com/w/c/atomic/memory_order and http://en.cppreference.com/w/cpp/atomic/memory_order
The memory fence before the store guarantees that the store is not reordered ahead of any earlier loads or stores in the same thread. Similarly, the memory fence after the load guarantees that no later loads or stores are reordered ahead of that load. When you combine the two, a release store that is observed by an acquire load creates a synchronizes-with relation between the writing and reading threads.
T1: on-deps(A) -> fence -> write(A)
T2: read(A) -> fence -> deps-on(A)
read(A) happens before deps-on(A)
write(A) happens after on-deps(A)
If you move either fence to the other side of its access, that chain of dependencies is broken, which can cause inconsistent results (e.g. race conditions).
Some more possible reading...
Acquire and Release Fences
Memory Barriers: a Hardware View for Software Hackers

LC-3 Code Segment

I am having a difficult time understanding this particular problem. I have the answers, but I really want to know why they are what they are! I understand how each opcode works, just not how to apply it to this problem.
An engineer is in the process of debugging a program she has written. She is looking at the following segment of the program, and decides to place a breakpoint in memory at location 0xA404. Starting with the PC = 0xA400, she initializes all the registers to zero and runs the program until the breakpoint is encountered.
Code Segment:
0xA400 THIS1 LEA R0, THIS1
0xA401 THIS2 LD R1, THIS2
0xA402 THIS3 LDI R2, THIS5
0xA403 THIS4 LDR R3, R0, #2
0xA404 THIS5 .FILL xA400
Show the contents of the register file (in hexadecimal) when the breakpoint is encountered.
Again, I'm not seeking a list of answers, but an explanation to help me understand what exactly is going on in the program. Thanks so much!
If the engineer put the breakpoint on line 0xa404 (stopping the program before 0xa404 is run), the code would do the following:
0xA400 THIS1 LEA R0, THIS1 ; LEA loads the address of THIS1 into R0.
; Since THIS1 is at memory location 0xA400,
; after this instruction R0 = 0xA400
0xA401 THIS2 LD R1, THIS2 ; LD loads the contents of the memory at
; THIS2 into R1. Since THIS2 is this very
; line its contents are this instruction,
; which is 0010001111111111 in binary or
; 0x23ff in hex, so after this line executes
; R1 holds 0x23ff
0xA402 THIS3 LDI R2, THIS5 ; LDI visits THIS5 and treats its value as a
; new memory location to visit. It visits
; that second location and stores its
; contents into R2. In this case, it would
; look at THIS5 and see its value is 0xA400.
; It would then visit 0xA400 and store its
; contents in R2. 0xA400 contains the first
; line of your program which translates to
; 1110000111111111 in binary, 0xe1ff in
; hex, so it stores 0xe1ff into R2.
0xA403 THIS4 LDR R3, R0, #2 ; LDR starts from the memory location of R0,
; adds 2 to that, then stores whatever it
; finds in that memory location into R3. In
; this case R0 = 0xA400. It adds 2, bringing
; it up to 0xA402, which is the instruction
; immediately above this one. In binary, that
; instruction is 1010 0100 0000 0001, which
; translates into 0xa401, so the program
; stores 0xa401 into R3.
0xA404 THIS5 .FILL xA400
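The register values in the walkthrough can be double-checked with a tiny simulator. The Python sketch below is a minimal model written for this example only: it encodes the four instructions using the standard LC-3 formats (4-bit opcode, 3-bit DR, 9-bit PC-relative offset or BaseR plus 6-bit offset), supports just LEA, LD, LDI and LDR, and ignores condition codes. The helper names sext, pc_offset9 and run are hypothetical.

```python
def sext(value, bits):
    """Sign-extend a bits-wide field to a Python int."""
    mask = 1 << (bits - 1)
    return (value ^ mask) - mask


def pc_offset9(target, addr):
    """9-bit offset from the incremented PC (addr + 1) to target."""
    return (target - (addr + 1)) & 0x1FF


mem = {
    0xA400: 0xE000 | (0 << 9) | pc_offset9(0xA400, 0xA400),  # LEA R0, THIS1
    0xA401: 0x2000 | (1 << 9) | pc_offset9(0xA401, 0xA401),  # LD  R1, THIS2
    0xA402: 0xA000 | (2 << 9) | pc_offset9(0xA404, 0xA402),  # LDI R2, THIS5
    0xA403: 0x6000 | (3 << 9) | (0 << 6) | 2,                # LDR R3, R0, #2
    0xA404: 0xA400,                                          # .FILL xA400
}


def run(mem, pc, breakpoint):
    """Execute until pc reaches breakpoint; return the eight registers."""
    regs = [0] * 8
    while pc != breakpoint:
        instr = mem[pc]
        pc += 1                                   # PC is incremented before use
        op, dr = instr >> 12, (instr >> 9) & 7
        rel = (pc + sext(instr & 0x1FF, 9)) & 0xFFFF
        if op == 0xE:                             # LEA: load the address itself
            regs[dr] = rel
        elif op == 0x2:                           # LD: load mem[address]
            regs[dr] = mem[rel]
        elif op == 0xA:                           # LDI: load mem[mem[address]]
            regs[dr] = mem[mem[rel]]
        elif op == 0x6:                           # LDR: load mem[BaseR + offset6]
            base = (instr >> 6) & 7
            regs[dr] = mem[(regs[base] + sext(instr & 0x3F, 6)) & 0xFFFF]
        else:
            raise ValueError(f"unsupported opcode {op:#x}")
    return regs


regs = run(mem, 0xA400, 0xA404)
print([f"R{i}=0x{r:04X}" for i, r in enumerate(regs[:4])])
```

Running it reproduces the values from the explanation: R0 = 0xA400, R1 = 0x23FF, R2 = 0xE1FF, R3 = 0xA401, with R4 through R7 still zero.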

ARM Integer Division Algorithm

I'm beginning in ARM assembly and I've been trying to write a simple integer division subroutine. So far, I have the following:
.text
start:
mov r0, #25
mov r1, #5
bl divide
b stop
divide:
cmp r0, r1
it lo
mov pc, lr
sub r0, r0, r1
add r2, r2, #1
b divide
stop:
b stop
I wrote it based on the pseudocode I came up with for the algorithm:
Is the Divisor (bottom) larger than the Dividend (top)?
Yes:
-Return the remainder and the counter(quotient)
No:
-Subtract the Divisor from the Dividend
-Increment the counter by 1
-Repeat the method
r0 contains the numerator and r1 contains the denominator.
When the algorithm finishes, r0 should contain the remainder and r2 should contain the quotient. However, upon running, r0 contains 19 and r2 contains 0.
Are there any fallacies in my logic that I'm just missing?
I removed it lo and changed mov to movlo and it worked fine.
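For comparison, the pseudocode from the question can be written as a short Python reference model. This is only a sketch of the repeated-subtraction algorithm for non-negative integers, with hypothetical names; it makes explicit two things the assembly leaves implicit: the quotient counter (r2 in the question) has to start at 0, and a zero divisor would loop forever.

```python
def divide(dividend, divisor):
    """Repeated-subtraction division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    quotient = 0                      # corresponds to r2, which must start at 0
    while dividend >= divisor:        # cmp r0, r1: return once r0 < r1
        dividend -= divisor           # sub r0, r0, r1
        quotient += 1                 # add r2, r2, #1
    return quotient, dividend         # dividend now holds the remainder (r0)


print(divide(25, 5))
print(divide(19, 5))
```

The loop runs once per unit of the quotient, so it is fine for small values but slow for large dividends with small divisors.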
