Is this a GCC bug or am I doing something wrong?

Is this a GCC bug or am I doing something wrong? - gcc

I am trying to get the final accumulate in the code below to use the ARM M7 SMLAL 32*32->64 bit accumulate function. If I include the T3 = T3 + 1 than it does use this, but if I comment it out it does a full 64*64 bit and accumulate using 3 multiply and 2 add instructions. I don't actually want to add 1 to T3 so it needs to go.
I've broken the code down so that I could analyse it in more detail and it definitely seems to be that the cast of T3 to int32_t and throwing away the bottom 32 bits from the multiply isn't being picked up by the compiler and it thinks T3 still has 64 bits. Bit when I add the simple increment of T3 it then gets it correct. I tried adding zero but then it goes back to the full 64*64 bit multiply.
I'm using the -O2 optimisation on STM's STM32CubeIDE which uses a version of GCC. Other optimations never use SMLAL or unroll everything.
int64_t T4 = 0;
osc = key * NumHarmonics;
harmonic = 0;
do
{
if (OscLevel[osc] > 1)
{
OscPhase[osc] = OscPhase[osc] + (uint32_t)(T2);
int32_t T5 = Sine[(OscPhase[osc] >> 16) & 0x0000FFFF];
int64_t T6 = (int64_t)T1 * Tremelo[harmonic];
int32_t T3 = (int32_t)(T6 >> 32); // grab the most significant register
// T3 = T3 + 1; // needs the +1 to force use of SMLAL in next instruction ! (+0 doesn't help)
T4 = T4 + (int64_t)T3 * (int64_t)T5; // should be SMLAL but does a full 64*64 mult if no +1 above
}
osc++;
harmonic++;
}
while (harmonic < NumHarmonics);
OscTotal = T4;
without the addition :
800054e: 4b13 ldr r3, [pc, #76] ; (800059c <main+0xd8>)
8000550: f853 1024 ldr.w r1, [r3, r4, lsl #2]
8000554: ea4f 79e1 mov.w r9, r1, asr #31
8000558: fba7 4501 umull r4, r5, r7, r1
800055c: fb07 f309 mul.w r3, r7, r9
8000560: fb01 3202 mla r2, r1, r2, r3
8000564: 4415 add r5, r2
8000566: e9dd 2300 ldrd r2, r3, [sp]
800056a: 1912 adds r2, r2, r4
800056c: 416b adcs r3, r5
800056e: e9cd 2300 strd r2, r3, [sp]
}
osc++;
8000572: 3001 adds r0, #1
harmonic++;
with the addition
8000542: 4b0b ldr r3, [pc, #44] ; (8000570 <main+0xac>)
8000544: f853 3020 ldr.w r3, [r3, r0, lsl #2]
8000548: fbc3 6701 smlal r6, r7, r3, r1
}
osc++;
800054c: 3201 adds r2, #1
harmonic++;

Related

gcc ARM produces incorrect code - how to correct

gcc ARM for STM32F407 micro
The following function is used as a sanity check in FreeRtosTCP
UBaseType_t bIsValidNetworkDescriptor( const NetworkBufferDescriptor_t * pxDesc )
{
uint32_t offset = ( uint32_t ) ( ((const char *)pxDesc) - ((const char *)xNetworkBuffers) );
if( ( offset >= (uint32_t)(sizeof( xNetworkBuffers )) ) || ( ( offset % sizeof( xNetworkBuffers[0] ) ) != 0 ) )
return pdFALSE;
return (UBaseType_t) (pxDesc - xNetworkBuffers) + 1;
}
The line in question is ---> offset >= (uint32_t)(sizeof( xNetworkBuffers ))
gcc produces a bhi instruction after the cmp instead of a bhs.
If tries casting both as shown in the code above but nothing seems to get the bhs instruction to be used.
Any help appreciated.
Thanks.
Joe

Well knowing the exact size of the xNetworkBuffers array compiler can simply optimize it. Being curious I gave it a try. Following is the code with little modifications and the asm output and the explanation:
#include <stdint.h>
typedef struct abc {
char data[10];
}NetworkBufferDescriptor_t;
NetworkBufferDescriptor_t xNetworkBuffers[5];
int bIsValidNetworkDescriptor( const NetworkBufferDescriptor_t * pxDesc )
{
uint32_t offset = ( uint32_t ) ( ((const char *)pxDesc) - ((const char *)xNetworkBuffers) );
if( ( offset >= (uint32_t)(sizeof( xNetworkBuffers )) ) || ( ( offset % sizeof( xNetworkBuffers[0] ) ) != 0 ) )
return 0;
return (int) (pxDesc - xNetworkBuffers) + 1;
}
and the asm output is:
bIsValidNetworkDescriptor:
# Function supports interworking.
# args = 0, pretend = 0, frame = 16
# frame_needed = 1, uses_anonymous_args = 0
# link register save eliminated.
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #20
str r0, [fp, #-16]
ldr r3, [fp, #-16]
ldr r2, .L5
sub r3, r3, r2
str r3, [fp, #-8]
ldr r3, [fp, #-8]
cmp r3, #49
bhi .L2
ldr r1, [fp, #-8]
ldr r3, .L5+4
umull r2, r3, r1, r3
lsr r2, r3, #3
mov r3, r2
lsl r3, r3, #2
add r3, r3, r2
lsl r3, r3, #1
sub r2, r1, r3
cmp r2, #0
beq .L3
.L2:
mov r3, #0
b .L4
.L3:
ldr r3, [fp, #-16]
ldr r2, .L5
sub r3, r3, r2
asr r2, r3, #1
mov r3, r2
lsl r3, r3, #1
add r3, r3, r2
lsl r1, r3, #4
add r3, r3, r1
lsl r1, r3, #8
add r3, r3, r1
lsl r1, r3, #16
add r3, r3, r1
lsl r3, r3, #2
add r3, r3, r2
add r3, r3, #1
.L4:
mov r0, r3
add sp, fp, #0
# sp needed
ldr fp, [sp], #4
bx lr
.L6:
.align 2
.L5:
In the block quoted asm code you can see that it is comparing with 49 not 50 (which is the actual size of xNetworkBuffers) so the conclusion I got is
offset >= (uint32_t)(sizeof( xNetworkBuffers ))
is also equal to
offset > (uint32_t)(sizeof( xNetworkBuffers ) - 1) )
and in that case compiler can use BHI producing the same results

I think the code generated by GCC is correct, technically speaking. offset cannot be larger than INT_MAX, because this is the maximum value representable in ptrdiff_t on this architecture.
You can compute the difference like this:
uintptr_t offset = (uintptr_t)pxDesc - (uintptr_t)xNetworkBuffers;
This is still implementation-defined, but it will avoid the overflow problem.

ARM assembly quicksort and recursion

I'm trying to change C quick sort code to ARM assembly code.
Firstly, I apologize for my long question.
just in case, I wrote whole codes.
but only part that makes problem is function recursion part.
I almost done this work except funcion recursion part.
I've also tried converting C codes to ARM assembly with gcc cross compiler to see how compiler do. But there are still few things that I can't understand well.
This is C code.
void quicksort_c(int array[8],int first,int last)
{
int i, j, pivot, temp;
if(first < last)
{
pivot = first;
i = first;
j = last;
while(i < j)
{
while(array[i] <= array[pivot] && i < last)
{
i++;
}
while(array[j] > array[pivot])
{
j--;
}
if(i < j)
{
temp = array[i];
array[i] = array[j];
array[j] = temp;
}
}
temp = array[pivot];
array[pivot] = array[j];
array[j] = temp;
quicksort_c(array, first, j-1);
quicksort_c(array, j+1, last);
}
}
and this is arm assembly code that I wrote from scratch.
quicksort
STMFD sp!, {r0-r8,lr}
quicksort_recursion
CMP r1, r2 ;// if(first < last)
BGE out_of_if
outer_if
MOV r5, r1 ;// pivot = first
MOV r3, r1 ;// i = first;
MOV r4, r2 ;// j = last;
outer_while_loop
CMP r3, r4 ;// while(i < j)
BGE end_of_outer_while
outer_while
inner_while_1_loop
;//while(array[i] <= array[pivot] && i < last)
LDR r7, [r0, r3, lsl #2] ;//r7 = array[i]
LDR r8, [r0, r5, lsl #2] ;//r8 = array[pivot]
CMP r7, r8 ;//array[i] <= array[pivot]
BGT inner_while_2_loop
CMP r3, r2 ;// i < last
BGE inner_while_2_loop
inner_while_1
ADD r3, #1 ;//i++
B inner_while_1_loop
inner_while_2_loop
;//whlie(array[j] > array[pivot])
LDR r7, [r0, r4, lsl #2] ;// r7 = array[j]
LDR r8, [r0, r5, lsl #2] ;//r8 = array[pivot]
CMP r7, r8 ;// (array[j] > array[pivot])
BLE inner_if_cmp
inner_while_2
SUB r4, #1
B inner_while_2_loop
inner_if_cmp
CMP r3, r4;// if(i < j)
BGE outer_while_loop
inner_if
LDR r7, [r0, r3, lsl #2] ;// r7 = array[i]
MOV r6, r7 ;// temp = array[i]
LDR r8, [r0, r4, lsl #2] ;// r8 = array[j]
STR r8, [r0, r3, lsl #2] ;// array[i] = array[j]
STR r6, [r0, r4, lsl #2] ;// array[j] = temp;
B outer_while_loop
end_of_outer_while
LDR r7, [r0, r5, lsl #2] ;// r7 = array[pivot]
MOV r6, r7 ;// temp = array[pivot]
LDR r8, [r0, r4, lsl #2] ;// r8 = array[j]
STR r8, [r0, r5, lsl #2] ;// array[pivot] = array[j]
STR r6, [r0, r4, lsl #2] ;// array[j] = temp
;;//quicksort(array,first,j-1)
STMFD sp!, {r0-r8,lr}
SUBS r2, r4, #1
BL quicksort_recursion
;//LDMFD sp!, {r0-r8,pc}
;//quicksort(array, j+1, last)
STMFD sp!, {r0-r8,lr}
ADDS r1, r4, #1
BL quicksort_recursion
;//LDMFD sp!, {r0-r8,pc}
out_of_if
end_function
LDMFD sp!, {r0-r8,pc}
END
and this is how I use register
r0: array pointer
r1: first index
r2: last index
r3: i
r4: j
r5: pivot
r6: temp
r7: array[?] value
r8: array[?] value
I confirmed every parts except recursion of my code are work well.
At first, I did recursion in this way.
;;//quicksort(array,first,j-1)
SUBS r2, r4, #1
BL quicksort
;//quicksort(array, j+1, last)
ADDS r1, r4, #1
BL quicksort
but this code only do first function recursion.
ex)
Array before quicksorting : 5 1 4 7 8 3 6
Array after quicksorting : 1 2 4 3 5 8 7 6
I think it's because after function call, it doesn't go back to where it branched. so I added push and pop before and after function call like this.
;;//quicksort(array,first,j-1)
STMFD sp!, {r0-r8,lr}
SUBS r2, r4, #1
BL quicksort_recursion
;//quicksort(array, j+1, last)
STMFD sp!, {r0-r8,lr}
ADDS r1, r4, #1
BL quicksort_recursion
But this code falls in to infinite loop.
according to gcc compiled code,
quicksort_linux
cmp r1, r2
blt L16
bx lr
L16
push {r4, r5, r6, r7, r8, r9, r10, lr}
mov r10, r1
add ip, r0, r1, lsl #2
mov r5, r2
mov r4, r1
L3
lsls r3, r4, #2
add lr, r0, r3
ldr r7, [r0, r4, lsl #2]
ldr r6, [ip]
cmp r2, r4
ite le
movle r8, #0
movgt r8, #1
cmp r7, r6
it gt
movgt r8, #0
adds r3, r3, #4
add r3, r3, r0
cmp r8, #0
beq L9
L4
adds r4, r4, #1
mov lr, r3
ldr r7, [r3], #4
cmp r2, r4
ite le
movle r1, #0
movgt r1, #1
cmp r7, r6
it gt
movgt r1, #0
cmp r1, #0
bne L4
L9
lsls r3, r5, #2
add r9, r0, r3
ldr r1, [r0, r5, lsl #2]
cmp r1, r6
ble L5
subs r3, r3, #4
add r3, r3, r0
L6
subs r5, r5, #1
mov r9, r3
ldr r1, [r3], #-4
cmp r1, r6
bgt L6
L5
cmp r4, r5
bge L7
str r1, [lr]
str r7, [r9]
b L3
L7
mov r7, r2
mov r1, r10
mov r4, r0
ldr r3, [r0, r5, lsl #2]
str r3, [r0, r10, lsl #2]
str r6, [r0, r5, lsl #2]
subs r2, r5, #1
bl quicksort_linux
mov r2, r7
adds r1, r5, #1
mov r0, r4
bl quicksort_linux
pop {r4, r5, r6, r7, r8, r9, r10, pc}
END
It seems like they don't do any push or pop before or after branch but it works.
So, question is
1.what is problem of my function recursion code?
2.How can I fix that?
3.How can gcc compiled code works without any push and pop before and after function recursion?

ARM Headers to Get Proper Call Stacks

I am currently carrying out optimizations on a linux-based software itself on an ARM processor. Those optimizations are mostly in the form of ARM and ARM NEON functions.
In order to profile the software I use perf record and flame-graphs, however, once I introduce the assembler functions, they do not stack on top of the functions that call them but rather seemingly random places.
My question therefore was, what should I include in my functions for them to appear properly in the call stacks.
There was a slightly related topic but no good answer was given How to get call graph profiling working with gcc compiled code and ARM Cortex A8 target?. I use the same flags plus mapcs-frame.
Below, I give an example of a C function translated to ARM by GCC. This ARM function seems to produces decent stacks but I would like to understand why.
int half(int in);
int sum(int in1, int in2);
int mean(int in1, int in2);
int half(int i)
{
return i / 2;
}
int sum(int i, int j)
{
return i + j;
}
int mean(int i, int j)
{
int s = sum(i, j);
int m = half(s);
return m;
}
int main()
{
int a = 1;
int b = 5;
int i;
int result;
for (i = 0; i<10000000; i++) {
result = mean(a, b);
}
return 0;
}
.cpu cortex-a9
.eabi_attribute 27, 3
.eabi_attribute 28, 1
.fpu neon
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 6
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file "a.c"
.text
.align 2
.global half
.type half, %function
half:
# args = 0, pretend = 0, frame = 8
# frame_needed = 1, uses_anonymous_args = 0
mov ip, sp
stmfd sp!, {fp, ip, lr, pc}
sub fp, ip, #4
sub sp, sp, #8
str r0, [fp, #-16]
ldr r3, [fp, #-16]
mov r2, r3, lsr #31
add r3, r2, r3
mov r3, r3, asr #1
mov r0, r3
sub sp, fp, #12
ldmfd sp, {fp, sp, pc}
.size half, .-half
.align 2
.global sum
.type sum, %function
sum:
# args = 0, pretend = 0, frame = 8
# frame_needed = 1, uses_anonymous_args = 0
mov ip, sp
stmfd sp!, {fp, ip, lr, pc}
sub fp, ip, #4
sub sp, sp, #8
str r0, [fp, #-16]
str r1, [fp, #-20]
ldr r2, [fp, #-16]
ldr r3, [fp, #-20]
add r3, r2, r3
mov r0, r3
sub sp, fp, #12
ldmfd sp, {fp, sp, pc}
.size sum, .-sum
.align 2
.global mean
.type mean, %function
mean:
# args = 0, pretend = 0, frame = 16
# frame_needed = 1, uses_anonymous_args = 0
mov ip, sp
stmfd sp!, {fp, ip, lr, pc}
sub fp, ip, #4
sub sp, sp, #16
str r0, [fp, #-24]
str r1, [fp, #-28]
ldr r1, [fp, #-28]
ldr r0, [fp, #-24]
bl sum
str r0, [fp, #-16]
ldr r0, [fp, #-16]
bl half
str r0, [fp, #-20]
ldr r3, [fp, #-20]
mov r0, r3
sub sp, fp, #12
ldmfd sp, {fp, sp, pc}
.size mean, .-mean
.align 2
.global main
.type main, %function
main:
# args = 0, pretend = 0, frame = 16
# frame_needed = 1, uses_anonymous_args = 0
mov ip, sp
stmfd sp!, {fp, ip, lr, pc}
sub fp, ip, #4
sub sp, sp, #16
mov r3, #1
str r3, [fp, #-20]
mov r3, #5
str r3, [fp, #-24]
mov r3, #0
str r3, [fp, #-16]
b .L8
.L9:
ldr r1, [fp, #-24]
ldr r0, [fp, #-20]
bl mean
str r0, [fp, #-28]
ldr r3, [fp, #-16]
add r3, r3, #1
str r3, [fp, #-16]
.L8:
ldr r2, [fp, #-16]
movw r3, #38527
movt r3, 152
cmp r2, r3
ble .L9
mov r3, #0
mov r0, r3
sub sp, fp, #12
ldmfd sp, {fp, sp, pc}
.size main, .-main
.ident "GCC: (crosstool-NG linaro-1.13.1-4.9-2014.09 - Linaro GCC 4.9-2014.09) 4.9.2 20140904 (prerelease)"
.section .note.GNU-stack,"",%progbits
-------------------EDIT-------------------
Here is the example of the kind of function I am trying to integrate. In terms of linkage, all it does is save the stack and link register at the beginning and set them a the end. What should I add to it?
.section .text
.global ARM_smoothing
ARM_smoothing:
STMFD sp!, {r4-r12,lr} //move used registers on stack (avoid segmentation fault)
MOV r5, r0
ADD r0, r0, r2
ADD r0, r0, r2
MOV r8, r0
ADD r8, r8, r2
ADD r8, r8, r2 //the 6 instructions create 3 pointers to the row above and below as well as the current one
ADD r1, r1, r2
ADD r1, r1, r2
ADD r1, r1, #2 //move destination pointer to first element (1 row down, 1 element left)
SUB r2, r2, #2
SUB r3, r3, #2 //counters decremented because smoothing function works with a margin of 1 on every side
LDR r9, =0x1C71C71D //(1/9)*2^32 pour effectuer la division par 9
LDR r10, =0x2
LDR r11, =0xC //shifts for pointers to data
VLDR.U64 d20, =0x1C71C71D //(1/9)*2^32 pour effectuer la division par 9
VLDR.U64 d22, =0x0 //initialization of zeros to be used (not ncessarily needed)
VLDR.U64 d23, =0x0
VDUP.32 d20, d20[0] //initialize vector for multiplication
height_loop:
MOV r4, r2 //reset width counter
CMP r4, #8
BLGE width_loop_eight_smoothing //use neon while more than 8 elements in row need smoothing
CMP r4, #1
BLGE width_loop_rest //use normal ARM for remaining elements, can't do in NEON because of margin
ADD r0, r0, #4 //skip margin
ADD r1, r1, #4
ADD r5, r5, #4
ADD r8, r8, #4
SUBS r3, r3, #1 //decrement row counter
BNE height_loop //loop while there still are rows
LDMFD sp!, {r4-r12,pc} //restore stack and return to calling function
width_loop_eight_smoothing:
SUB r4, r4, #8 //decrement width counter
VLD1.16 {d0, d1}, [r5], r10 //load upper left elements
VLD1.16 {d2, d3}, [r5], r10 //load upper middle elements
VADDL.S16 q2, d0, d2 //long addition of elements to be sure to not lose any data
VADDL.S16 q3, d1, d3
VLD1.16 {d0, d1}, [r5], r11 //load upper right elements
VLD1.16 {d2, d3}, [r0], r10 //load middle left elements
VADDL.S16 q4, d0, d2
VADDL.S16 q5, d1, d3
VADD.S32 q2, q4 //add to grand total
VADD.S32 q3, q5
VLD1.16 {d0, d1}, [r0], r10 //load current elements
VLD1.16 {d2, d3}, [r0], r11 //load middle right elements
VADDL.S16 q4, d0, d2
VADDL.S16 q5, d1, d3
VADD.S32 q2, q4
VADD.S32 q3, q5
VLD1.16 {d0, d1}, [r8], r10 //load lower left elements
VLD1.16 {d2, d3}, [r8], r10 //load lower middle elements
VADDL.S16 q4, d0, d2
VADDL.S16 q5, d1, d3
VADD.S32 q2, q4
VADD.S32 q3, q5
VLD1.16 {d0, d1}, [r8], r11 //load lower right elements
VADDL.S16 q4, d0, d22
VADDL.S16 q5, d1, d23
VADD.S32 q2, q4
VADD.S32 q3, q5
VMULL.S32 q6, d4, d20 //divide by 9 (upper element is total divided by 9)
VMULL.S32 q7, d5, d20
VMULL.S32 q8, d6, d20
VMULL.S32 q9, d7, d20
VUZP.32 q6, q7 //pack results into less registers and smaller elements
VUZP.32 q8, q9
VUZP.16 q7, q9
VSHR.U16 q8, q7, #15 //when multiplied element is negative, result is always one under
VADD.S16 q7, q8 //rectifying by adding sign bit to total
VST1.16 {d14, d15}, [r1]! //store results
CMP r4, #8 //check if theres enough elements to do 8 more in NEON
BCS width_loop_eight_smoothing //if yes, loop neon code
MOV PC, LR //return to ARM_smoothing if not
width_loop_rest: //works similaarly to NEON but one element at a time
LDRSH r6, [r0], #2 //converts loaded half words to signed full words
LDRSH r7, [r0] //main difference is with the way increments are done since there is an overlap
ADD r6, r7, r6
LDRSH r7, [r0, #2]
ADD r6, r7, r6
LDRSH r7, [r5], #2
ADD r6, r7, r6
LDRSH r7, [r5]
ADD r6, r7, r6
LDRSH r7, [r5, #2]
ADD r6, r7, r6
LDRSH r7, [r8], #2
ADD r6, r7, r6
LDRSH r7, [r8]
ADD r6, r7, r6
LDRSH r7, [r8, #2]
ADD r6, r7, r6
SMULLS r6, r7, r6, r9
ADDMI r7, #1
STRH r7, [r1], #2
SUBS r4, #1 //decrement width counter and check if there's any left
BNE width_loop_rest
MOV PC, LR

You can clearly see how the compiler is annotating the assembler with some pseudo-ops...
.global mean
.type mean, %function
...
.size mean, .-mean
These are put in COFF sections and need to make it to a build so that the call graph tools can know what PC range is for your assembler function.
.global ARM_smoothing
+ .type ARM_smoothing, %function
...
+ .size ARM_smoothing, .-ARM_smoothing
Other pseudo-ops depend on the debug information needed.
.func
.endfunc
.size
ARM CFI question
.cantunwind
Others are .fnend, .fnstart, .movsp, .save, .setfp, etc.
It depends on the debug/object format expected by the tool. There are also two types of data;
code extent information
stack and frame use
Both are typically needed for unwinding (or a stack back trace) but a sampling performance tool might only get away with the first. Exception handling code that does object clean up requires the most information.
Related: ARM Link and frame register

Why is a write to a memory-mapped peripheral register not actioned (LPC43xx)?

I'm building an application for NXP LPC4330 (Arm Cortex M0/M4 dual core). I'm compiling using arm-none-eabi-gcc 4.9.3. At one point in my code, I am performing a write to a (32-bit) memory location. Immediately afterwards, if I read back from that memory location, around one time in ten the result indicates that the write did not occur. Subsequent reads at later times indicate the same thing, so it is not a transient condition. Interrupts are disabled at the global level, and the assembler generated by the compiler is clearly attempting the write, so how is it possible that the write is not being actioned?
Specifically, I am writing to SLICE_MUX_CFG0 which is a memory-mapped register in the SGPIO peripheral. When the write works, the peripheral functions correctly. When the read-back indicates that the write has not worked, the peripheral does not function correctly. So, it seems that the register in question is not being set correctly, as indicated by the read-back.
Looking into the .asm (listed below), the write is clear. When I read back the value afterwards, it reads as zero, which - given the listing below - seems to me to be impossible. If I perform a read immediately before the write (see the .c listing, below), the problem goes away, which is perhaps a clue.
So the above indicates, what? Does this break some rule for use of the memory bus? I've looked at the GCC bugs list and can't see anything that relates to this.
The function follows, both source and ASM, with some annotation. What could be happening, here? Why does the write at "store value" apparently not have any effect?
20000f7c <camera_SGPIO_init_sub>:
; disable interrupts globally
20000f7c: b672 cpsid i
20000f7e: 2346 movs r3, #70 ; 0x46
20000f80: 4a16 ldr r2, [pc, #88] ; (20000fdc <camera_SGPIO_init_sub+0x60>)
20000f82: 6013 str r3, [r2, #0]
20000f84: 4a16 ldr r2, [pc, #88] ; (20000fe0 <camera_SGPIO_init_sub+0x64>)
20000f86: 6013 str r3, [r2, #0]
20000f88: 4a16 ldr r2, [pc, #88] ; (20000fe4 <camera_SGPIO_init_sub+0x68>)
20000f8a: 6013 str r3, [r2, #0]
20000f8c: 4a16 ldr r2, [pc, #88] ; (20000fe8 <camera_SGPIO_init_sub+0x6c>)
20000f8e: 6013 str r3, [r2, #0]
20000f90: 4a16 ldr r2, [pc, #88] ; (20000fec <camera_SGPIO_init_sub+0x70>)
20000f92: 3301 adds r3, #1
20000f94: 6013 str r3, [r2, #0]
20000f96: 4a16 ldr r2, [pc, #88] ; (20000ff0 <camera_SGPIO_init_sub+0x74>)
20000f98: 6013 str r3, [r2, #0]
20000f9a: 4a16 ldr r2, [pc, #88] ; (20000ff4 <camera_SGPIO_init_sub+0x78>)
20000f9c: 6013 str r3, [r2, #0]
20000f9e: 4a16 ldr r2, [pc, #88] ; (20000ff8 <camera_SGPIO_init_sub+0x7c>)
20000fa0: 6013 str r3, [r2, #0]
20000fa2: 4a16 ldr r2, [pc, #88] ; (20000ffc <camera_SGPIO_init_sub+0x80>)
20000fa4: 6013 str r3, [r2, #0]
20000fa6: 2240 movs r2, #64 ; 0x40
20000fa8: 4b15 ldr r3, [pc, #84] ; (20001000 <camera_SGPIO_init_sub+0x84>)
20000faa: 601a str r2, [r3, #0]
20000fac: 2290 movs r2, #144 ; 0x90
20000fae: 4b15 ldr r3, [pc, #84] ; (20001004 <camera_SGPIO_init_sub+0x88>)
20000fb0: 0512 lsls r2, r2, #20
20000fb2: 601a str r2, [r3, #0]
; load value
20000fb4: 23c6 movs r3, #198 ; 0xc6
; load destination address
20000fb6: 4a14 ldr r2, [pc, #80] ; (20001008 <camera_SGPIO_init_sub+0x8c>)
; store value
20000fb8: 6013 str r3, [r2, #0]
; read value back
20000fba: 6810 ldr r0, [r2, #0]
20000fbc: 4a13 ldr r2, [pc, #76] ; (2000100c <camera_SGPIO_init_sub+0x90>)
20000fbe: 6013 str r3, [r2, #0]
20000fc0: 4a13 ldr r2, [pc, #76] ; (20001010 <camera_SGPIO_init_sub+0x94>)
20000fc2: 6013 str r3, [r2, #0]
20000fc4: 4a13 ldr r2, [pc, #76] ; (20001014 <camera_SGPIO_init_sub+0x98>)
20000fc6: 6013 str r3, [r2, #0]
20000fc8: 4a13 ldr r2, [pc, #76] ; (20001018 <camera_SGPIO_init_sub+0x9c>)
20000fca: 6013 str r3, [r2, #0]
20000fcc: 4a13 ldr r2, [pc, #76] ; (2000101c <camera_SGPIO_init_sub+0xa0>)
20000fce: 6013 str r3, [r2, #0]
20000fd0: 4a13 ldr r2, [pc, #76] ; (20001020 <camera_SGPIO_init_sub+0xa4>)
20000fd2: 6013 str r3, [r2, #0]
20000fd4: 4a13 ldr r2, [pc, #76] ; (20001024 <camera_SGPIO_init_sub+0xa8>)
20000fd6: 6013 str r3, [r2, #0]
; enable interrupts globally
20000fd8: b662 cpsie i
20000fda: 4770 bx lr
20000fdc: 40086480 .word 0x40086480
20000fe0: 40086484 .word 0x40086484
20000fe4: 40086488 .word 0x40086488
20000fe8: 40086494 .word 0x40086494
20000fec: 40086380 .word 0x40086380
20000ff0: 40086384 .word 0x40086384
20000ff4: 40086388 .word 0x40086388
20000ff8: 4008639c .word 0x4008639c
20000ffc: 40086208 .word 0x40086208
20001000: 40086204 .word 0x40086204
20001004: 40050064 .word 0x40050064
20001008: 40101080 .word 0x40101080
2000100c: 401010a0 .word 0x401010a0
20001010: 40101090 .word 0x40101090
20001014: 401010a4 .word 0x401010a4
20001018: 40101088 .word 0x40101088
2000101c: 401010a8 .word 0x401010a8
20001020: 40101094 .word 0x40101094
20001024: 401010ac .word 0x401010ac
The C-code which compiled to the above follows.
volatile uint32_t vol_dummy_for_read;
#define __SFS(addr, value) *((volatile uint32_t*)addr) = value;
#define SGPIO_SLICE_MUX_CFG0 (*((volatile uint32_t*) ... some address ... ))
uint32_t camera_SGPIO_init_sub()
{
__asm volatile ("cpsid i" : : : "memory");
// configure pins to SGPIO
__SFS(P9_0, SCU_SFS_INPUT | 6); // D0, SGPIO0
__SFS(P9_1, SCU_SFS_INPUT | 6);
__SFS(P9_2, SCU_SFS_INPUT | 6);
__SFS(P9_5, SCU_SFS_INPUT | 6);
__SFS(P7_0, SCU_SFS_INPUT | 7);
__SFS(P7_1, SCU_SFS_INPUT | 7);
__SFS(P7_2, SCU_SFS_INPUT | 7);
__SFS(P7_7, SCU_SFS_INPUT | 7); // D7, SGPIO7
// SGPIO8
__SFS(P4_2, SCU_SFS_INPUT | 7); // PCLK, SGPIO8
// configure pins to GPIO
__SFS(P4_1, SCU_SFS_INPUT | 0); // HSYNC, GPIO2[1]
// bring SGPIO clock up to full speed (same as PLL1, M4)
CGU_BASE_PERIPH_CLK = (0 << 1) | (0 << 11) | (9 << 24);
// SLICE_MUX_CFG
uint32_t SLICE_MUX_CFG_VALUE =
(1 << 1) /* clock on falling edge */
| (1 << 2) /* clock from external pin */
| (3 << 6) /* shift 8 bytes per clock */
;
//// see note above (this fixes it)
//vol_dummy_for_read = SGPIO_SLICE_MUX_CFG0 ;
SGPIO_SLICE_MUX_CFG0 = SLICE_MUX_CFG_VALUE; // A
uint32_t ret = SGPIO_SLICE_MUX_CFG0;
SGPIO_SLICE_MUX_CFG8 = SLICE_MUX_CFG_VALUE; // I
SGPIO_SLICE_MUX_CFG4 = SLICE_MUX_CFG_VALUE; // E
SGPIO_SLICE_MUX_CFG9 = SLICE_MUX_CFG_VALUE; // J
SGPIO_SLICE_MUX_CFG2 = SLICE_MUX_CFG_VALUE; // C
SGPIO_SLICE_MUX_CFG10 = SLICE_MUX_CFG_VALUE; // K
SGPIO_SLICE_MUX_CFG5 = SLICE_MUX_CFG_VALUE; // F
SGPIO_SLICE_MUX_CFG11 = SLICE_MUX_CFG_VALUE; // L
__asm volatile ("cpsie i" : : : "memory");
return ret;
}

(I am answering my own question; this answer was reached based on clues offered in the comments, above).
Short Answer
The peripheral is not actioning the register update reliably because the peripheral clock that is driving it (CGU_BASE_PERIPH_CLK) has only just had its speed changed at the time of the write operation. Setting the AUTOBLOCK bit when updating the clock's speed eliminates the problem.
Discussion
Presumably, the clock to the peripheral is transiently invalid during the frequency change, depending on conditions. Perhaps, if the timing of edges happens to be just-so, very short clock pulses find their way through during the change. Or something similarly unpleasant finds its way down the clock line to the peripheral. In any case, in these unpredictable conditions, the write may not occur, causing the reported failure.
Waiting for a period of time between the clock speed change and the subsequent assignment also eliminates the problem, understandably. As reported in the question, performing a read of the register prior to the write also eliminates the problem; whether this is because it takes time, or the read operation blocks (for reasons unclear) until the peripheral clock has settled, is unclear.
AUTOBLOCK is documented only as far as the statement of function: "Block clock automatically during frequency change". The User Manual gives no indication of the conditions under which the bit should be set, or left clear, during a clock speed change. However, given the evidence reported here, a policy of always setting AUTOBLOCK when updating the speed of a clock in one of these devices, unless there is a known reason to leave it clear, seems wise.
Reference: NXP User Manual for LPC43xx, UM10503 Rev 1.9, Chapter 13.

ARM Assembly: Absolute Value Function: Are two or three lines faster?

In my embedded systems class, we were asked to re-code the given C-function AbsVal into ARM Assembly.
We were told that the best we could do was 3-lines. I was determined to find a 2-line solution and eventually did, but the question I have now is whether I actually decreased performance or increased it.
The C-code:
unsigned long absval(signed long x){
unsigned long int signext;
signext = (x >= 0) ? 0 : -1; //This can be done with an ASR instruction
return (x + signet) ^ signext;
}
The TA/Professor's 3-line solution
ASR R1, R0, #31 ; R1 <- (x >= 0) ? 0 : -1
ADD R0, R0, R1 ; R0 <- R0 + R1
EOR R0, R0, R1 ; R0 <- R0 ^ R1
My 2-line solution
ADD R1, R0, R0, ASR #31 ; R1 <- x + (x >= 0) ? 0 : -1
EOR R0, R1, R0, ASR #31 ; R0 <- R1 ^ (x >= 0) ? 0 : -1
There are a couple of places I can see potential performance differences:
The addition of one extra Arithmetic Shift Right call
The removal of one memory fetch
So, which one is actually faster? Does it depend upon the processor or memory access speed?

Here is a nother two instruction version:
cmp r0, #0
rsblt r0, r0, #0
Which translate to the simple code:
if (r0 < 0)
{
r0 = 0-r0;
}
That code should be pretty fast, even on modern ARM-CPU cores like the Cortex-A8 and A9.

Dive over to ARM.com and grab the Cortex-M3 datasheet. Section 3.3.1 on page 3-4 has the instruction timings. Fortunately they're quite straightforward on the Cortex-M3.
We can see from those timings that in a perfect 'no wait state' system your professor's example takes 3 cycles:
ASR R1, R0, #31 ; 1 cycle
ADD R0, R0, R1 ; 1 cycle
EOR R0, R0, R1 ; 1 cycle
; total: 3 cycles
and your version takes two cycles:
ADD R1, R0, R0, ASR #31 ; 1 cycle
EOR R0, R1, R0, ASR #31 ; 1 cycle
; total: 2 cycles
So yours is, theoretically, faster.
You mention "The removal of one memory fetch", but is that true? How big are the respective routines? Since we're dealing with Thumb-2 we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:
Their version (adjusted for UAL syntax):
.syntax unified
.text
.thumb
abs:
asrs r1, r0, #31
adds r0, r0, r1
eors r0, r0, r1
Assembles to:
00000000 17c1 asrs r1, r0, #31
00000002 1840 adds r0, r0, r1
00000004 4048 eors r0, r1
That's 3x2 = 6 bytes.
Your version (again, adjusted for UAL syntax):
.syntax unified
.text
.thumb
abs:
add.w r1, r0, r0, asr #31
eor.w r0, r1, r0, asr #31
Assembles to:
00000000 eb0071e0 add.w r1, r0, r0, asr #31
00000004 ea8170e0 eor.w r0, r1, r0, asr #31
That's 2x4 = 8 bytes.
So instead of removing a memory fetch you've actually increased the size of the code.
But does this affect performance? My advice would be to benchmark.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio