C11 and C++11 atomics: acquire-release semantics and memory barriers

I'm using C11* atomics to manage a state enum between a few threads. The code resembles the following:
static _Atomic State state;
void setToFoo(void)
{
atomic_store_explicit(&state, STATE_FOO, memory_order_release);
}
bool stateIsBar(void)
{
return atomic_load_explicit(&state, memory_order_acquire) == STATE_BAR;
}
This assembles (for an ARM Cortex-M4) to:
<setToFoo>:
ldr r3, [pc, #8]
dmb sy ; Memory barrier
movs r2, #0
strb r2, [r3, #0] ; store STATE_FOO
bx lr
.word 0x00000000
<stateIsBar>:
ldr r3, [pc, #16]
ldrb r0, [r3, #0] ; load state
dmb sy ; Memory barrier
sub.w r0, r0, #2 ; Comparison and return follows
clz r0, r0
lsrs r0, r0, #5
bx lr
.word 0x00000000
Why are the fences placed before the release store and after the acquire load? My mental model assumed that a barrier would be placed after a release (to "propagate" the stored variable, and all previous stores, to other threads) and before an acquire (to receive all previous stores from other threads).
*While this particular example is given in C11, the situation is identical in C++11, as the two share the same concepts (and even the same enums) when it comes to memory ordering. gcc and g++ emit the same machine code in this situation. See http://en.cppreference.com/w/c/atomic/memory_order and http://en.cppreference.com/w/cpp/atomic/memory_order

The memory fence before the store guarantees that the store isn't ordered before any prior memory operations; the memory fence after the load guarantees that the load isn't ordered after any following memory operations. When you combine the two, they create a synchronizes-with relation between the write and the read:
T1: stores-before(A) -> fence -> write(A)
T2: read(A) -> fence -> deps-on(A)
read(A) happens before deps-on(A)
stores-before(A) happen before write(A)
So once T2's read observes T1's write, everything T1 stored before its fence is visible to everything T2 does after its fence. If you move either fence to the other side of its atomic access, that chain of dependencies is broken, which will cause inconsistent results (e.g. race conditions).
Some more possible reading...
Acquire and Release Fences
Memory Barriers: a Hardware View for Software Hackers

ARM ldrb is setting Wt to -1

I made a function in ARM that removes a duplicate character if the same character exists in a previous location. If it finds one, it will remove the duplicate character and shift all values over.
The first argument (r0) is the index of the character to check for duplicates, and the second argument (r1) is the array to look through. For example, if r0 is 3 and r1 points to "hello", it will look at the element at index 3 of the string ('l'), and after running, the string r1 points to will become "helo" and the function will return 1 in r0.
The current problem I'm having is that the first ldrb r2, [r1, r0] is setting r2 to -1. I have checked the ARM manual for cases where ldrb could return a value like this, but I haven't been able to find anything. I was wondering if anybody knew a solution.
Thanks!
check_if_duplicate:
push {r4, r5, lr}
# Get character at value i in r4
ldrb r2, [r1, r0]
mov r3, #0
duplicate_loop:
# Check if the target index is greater than the current index
cmp r0, r3
ble duplicate_done
# Check if current index is a duplicate
ldrb r4, [r1, r3]
cmp r4, r2
# Increment iterator and repeat loop if they are not equal
addne r3, r3, #1
bne duplicate_loop
# If they are equal, then shift all values over
duplicate_shift_loop:
# Find next value
add r3, r3, #1
# Load next value
ldrb r2, [r1, r3]
# Check if new value is the end of the array
cmp r2, #0
beq duplicate_shift_done
# Use previous index to store value
sub r5, r3, #1
# Store vaue in previous address
strb r2, [r1, r5]
# Do process again
b duplicate_shift_loop
duplicate_shift_done:
# Store string ending byte
sub r3, r3, #1
mov r2, #0
strb r2, [r1, r3]
# Set 1 as the return value
mov r0, #1
b d_done
duplicate_done:
mov r0, #0
d_done:
pop {r4, r5, pc}

Assuming a direct-mapped cache, how is LDR handled in the following cases?

I've been tasked to make a "timing" table for the following code (using a 5-stage pipeline):
LDR R1, = 0x345678
LDR R2, [R1]
LDR R3, =0xFC167B
LDR R4, [R3]
LDR R5, =0xD8967A
LDR R6, [R5]
Assuming these are all "cache misses", I am wondering where delays would be introduced if we assume a 1-cycle penalty in the event of a miss. My thoughts are that the first line, LDR R1, =0x345678, wouldn't cause a penalty, since we are just loading some word into a register.
But for the second line, LDR R2, [R1], I am thinking that there will be a penalty, since we are trying to read from the cache, but it isn't in the cache. Hence, would there just be a delay after the "Memory" stage of the pipeline?
Regards

LC-3 Code Segment

I am having a difficult time understanding this particular problem. I have the answers, but I really want to know why they are what they are! I understand how each opcode works, just not how to apply it to this problem.
An engineer is in the process of debugging a program she has written. She is looking at the following segment of the program, and decides to place a breakpoint in memory at location 0xA404. Starting with the PC = 0xA400, she initializes all the registers to zero and runs the program until the breakpoint is encountered.
Code Segment:
0xA400 THIS1 LEA R0, THIS1
0xA401 THIS2 LD R1, THIS2
0xA402 THIS3 LDI R2, THIS5
0xA403 THIS4 LDR R3, R0, #2
0xA404 THIS5 .FILL xA400
Show the contents of the register file (in hexadecimal) when the breakpoint is encountered.
Again, I'm not seeking a list of answers, but an explanation to help me understand what exactly is going on in the program. Thanks so much!
If the engineer puts the breakpoint at location 0xA404 (stopping the program before the instruction at 0xA404 is run), the code does the following:
0xA400 THIS1 LEA R0, THIS1 ; LEA loads the address of THIS1 into R0.
; Since THIS1 is at memory location 0xA400,
; after this instruction R0 = 0xA400
0xA401 THIS2 LD R1, THIS2 ; LD loads the contents of the memory at
; THIS2 into R1. Since THIS2 is this very
; line its contents are this instruction,
; which is 0010001111111111 in binary or
; 0x23ff in hex, so after this line executes
; R1 holds 0x23ff
0xA402 THIS3 LDI R2, THIS5 ; LDI visits THIS5 and treats its value as a
; new memory location to visit. It visits
; that second location and stores its
; contents into R2. In this case, it would
; look at THIS5 and see its value is 0xA400.
; It would then visit 0xA400 and store its
; contents in R2. 0xA400 contains the first
; line of your program which translates to
; 1110000111111111 in binary, 0xe1ff in
; hex, so it stores 0xe1ff into R2.
0xA403 THIS4 LDR R3, R0, #2 ; LDR starts from the memory location of R0,
; adds 2 to that, then stores whatever it
; finds in that memory location into R3. In
; this case R0 = 0xA400. It adds 2, bringing
; it up to 0xA402, which is the instruction
; immediately above this one. In binary, that
; instruction is 1010 0100 0000 0001, which
; translates into 0xa401, so the program
; stores 0xa401 into R3.
0xA404 THIS5 .FILL xA400

ARM Integer Division Algorithm

I'm beginning in ARM assembly and I've been trying to write a simple integer division subroutine. So far, I have the following:
.text
start:
mov r0, #25
mov r1, #5
bl divide
b stop
divide:
cmp r0, r1
it lo
mov pc, lr
sub r0, r0, r1
add r2, r2, #1
b divide
stop:
b stop
I wrote it based on the pseudocode I came up with for the algorithm:
Is the Divisor (bottom) larger than the Dividend (top)?
Yes:
-Return the remainder and the counter(quotient)
No:
-Subtract the Divisor from the Dividend
-Increment the counter by 1
-Repeat the method
r0 contains the numerator and r1 contains the denominator.
When the algorithm finishes, r0 should contain the remainder and r2 should contain the quotient. However, upon running, r0 contains 19 and r2 contains 0.
Are there any fallacies in my logic that I'm just missing?
I removed the it lo and changed mov pc, lr to movlo pc, lr, and it worked fine.

Multiply using addition and a restricted set of instructions

I am building a CPU circuit with Logisim. My CPU has only 2 general purpose registers and a 16-byte RAM. I have encoded the following instruction set (Rxy means one of the two registers)
• ADD Rxy, Rxy (add Rxy and Rxy and store result inside the first register)
• SUB Rxy, Rxy (same but with sub)
• ST Rxy, Address (save value of Rxy into RAM address)
• LD Rxy, Address (load into Rxy value at RAM address)
• BZ Rxy, Address (branch to address if value of Rxy is zero)
I thought I could decrement the second operand until it reaches 0 and, at each step, add the first operand to a running total.
For example, 5*3 = 5+5+5 ; 3-1-1-1
But I'm not sure my instruction can permit this program... I only have a branch if Rxy is equal to 0, whereas I would like to branch if not equal to 0.
My program currently looks like this :
Assume R1 is preloaded with the second operand (iterations-left count)
(A) LD R0, Address # constant 1
SUB R1, R0 # decrement iteration left
ST R1, Address # save iteration count in memory
LD R0, Address # Load first addend
LD R1, Address # load current total
ADD R0, R1 # do addition
ST R0, Address # save new current total
BZ R1, (B) # if iteration is 0, then display and end, else add
(B)
STOP
Is there a way to loop with my instruction set?
You can change
BZ R1, (B)
(B)
to
BZ R1, (B)
LD R0, Address # constant 1
SUB R0, R0
BZ R0, (A)
(B)
