How to create an array of booleans in arm assembly? - gcc

I need to specify each boolean manually, as in a fixed table. Using
Array: .skip 400
I declare an array of 400 bytes, so how can I set the boolean values?

ARM registers are 32 bits each, and you only need one bit to represent a boolean, so each 32-bit word can hold 32 of them. You can use the following C code to access such an array:
#include <stdint.h>

extern uint32_t bool_array[]; /* 32 booleans per word */

uint32_t load_bool(uint32_t index)
{
    /* word index is index / 32, bit position is index % 32 */
    return bool_array[index >> 5] & (1u << (index & 31));
}
void store_bool(uint32_t index, int value)
{
    uint32_t target = bool_array[index >> 5];
    if (value)
        target |= (1u << (index & 31));
    else
        target &= ~(1u << (index & 31));
    bool_array[index >> 5] = target;
}
Use a compiler to target your CPU; for instance, godbolt output tuned for a Cortex-A5 gives:
load_bool(unsigned int):
ldr r3, =bool_array
mov r2, r0, lsr #5
ldr r3, [r3, r2, asl #2]
and r0, r0, #31
mov r2, #1
and r0, r3, r2, asl r0
bx lr
store_bool(unsigned int, int):
ldr r3, =bool_array
mov r2, r0, lsr #5
cmp r1, #0
ldr r1, [r3, r2, asl #2]
and r0, r0, #31
mov ip, #1
orrne r0, r1, ip, asl r0
biceq r0, r1, ip, asl r0
str r0, [r3, r2, asl #2]
bx lr
The instructions tst, bic, etc. might be useful if you use a macro instead of a function call (bit index known at compile/assemble time). Also, ldrb or byte access might be better on older platforms/CPUs. Most ARM CPUs have a 32-bit bus, so the cycles for ldrb and ldr are equal.
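A minimal, self-contained sketch of the packed layout, assuming 32 booleans per 32-bit word and a 400-entry table as in the question. The names BOOL_COUNT, WORD_COUNT, bool_words, bool_load and bool_store are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BOOL_COUNT 400u                       /* table size from the question */
#define WORD_COUNT ((BOOL_COUNT + 31u) / 32u) /* 13 words hold 400 bits */

static uint32_t bool_words[WORD_COUNT];       /* zero-initialized: all false */

/* Return 1 if the boolean at index is set, 0 otherwise. */
static int bool_load(uint32_t index)
{
    assert(index < BOOL_COUNT);
    return (int)((bool_words[index >> 5] >> (index & 31u)) & 1u);
}

/* Set or clear the boolean at index. */
static void bool_store(uint32_t index, int value)
{
    assert(index < BOOL_COUNT);
    uint32_t mask = UINT32_C(1) << (index & 31u);
    if (value)
        bool_words[index >> 5] |= mask;
    else
        bool_words[index >> 5] &= ~mask;
}
```

Unlike the version above, this load normalizes the result to 0 or 1, which costs one extra shift but makes the return value directly comparable.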

Boolean values in C and C++ are stored as integers holding 1 for true and 0 for false. Code that represents them as int uses a 32-bit integer on ARM, so an array shared with such code must be accessed as 32-bit words aligned on a 4-byte boundary (the C99 _Bool and C++ bool types are a single byte under the ARM procedure call standard, so an array of those is accessed byte by byte). However, if you only need to access it from other assembly code, you can use each byte as its own boolean variable and simply manipulate the array at the byte level.
In ARM assembly, this would be the difference between accessing the array with LDR vs with LDRB.
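For the hand-specified fixed table the question asks about, a byte-per-boolean layout can be sketched in C as below. The names FLAG_COUNT, flag_table and flag_at are hypothetical, and which entries are true is made up for illustration. In GNU as, the equivalent would be to list the values with .byte directives instead of reserving space with .skip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_COUNT 400

/* One byte per boolean; entries not listed default to 0 (false). */
static const uint8_t flag_table[FLAG_COUNT] = {
    [0]   = 1,
    [3]   = 1,
    [399] = 1,
};

/* Return 1 if the entry at index is true, 0 otherwise. */
static int flag_at(size_t index)
{
    assert(index < FLAG_COUNT);
    return flag_table[index] != 0;
}
```

Each element here is one byte, so compiled code reads it with a byte load (LDRB) rather than a word load.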

Related

ARM GCC hardfault when using -O2

When using ARM GCC g++ compiler with optimization level -O2 (and up) this code:
void foo(void)
{
DBB("#0x%08X: 0x%08X", 1, *((uint32_t *)1));
DBB("#0x%08X: 0x%08X", 0, *((uint32_t *)0));
}
Compiles to:
0800abb0 <_Z3foov>:
800abb0: b508 push {r3, lr}
800abb2: 2301 movs r3, #1
800abb4: 4619 mov r1, r3
800abb6: 681a ldr r2, [r3, #0]
800abb8: 4802 ldr r0, [pc, #8] ; (800abc4 <_Z3foov+0x14>)
800abba: f007 fa83 bl 80120c4 <debug_print_blocking>
800abbe: 2300 movs r3, #0
800abc0: 681b ldr r3, [r3, #0]
800abc2: deff udf #255 ; 0xff
800abc4: 08022704 stmdaeq r2, {r2, r8, r9, sl, sp}
This gives me a hard fault at the undefined instruction at 0x0800abc2.
Also, if there is more code after that, it is not compiled into the final binary.
The question is: why does the compiler generate code like that, and why the undefined instruction?
By the way, it works fine for stuff like this:
...
uint32_t num = 2;
num -= 2;
DBB("#0x%08X: 0x%08X", 0, *((uint32_t *)num));
...
Compiler version:
arm-none-eabi-g++.exe (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]
You can disable this (and verify this answer) by using -fno-delete-null-pointer-checks.
The pointer you are dereferencing has a value that matches the null pointer, and the compiler can see that from static analysis. Dereferencing a null pointer is undefined behaviour, so GCC emits a trap (the udf instruction) at that point and drops the code after it as unreachable.
In your second example, the static analysis doesn't identify a NULL.

How to align to cache line GCC ldr pc-relative

On ARM, GCC usually uses a PC-relative load to load constants into registers. The idea is that you store the constant at a fixed offset from the instruction that loads it. E.g. the following instruction can be used to load a constant from the address PC+8+offset, where PC is the address of the instruction itself:
ldr r0, [pc, #offset]
As a result, the .text segment interleaves instructions and data. The latter is usually stored at the end of the function's code. E.g.
00010860 <call_weak_fn>:
10860: e59f3014 ldr r3, [pc, #20] ; 1087c <call_weak_fn+0x1c>
10864: e59f2014 ldr r2, [pc, #20] ; 10880 <call_weak_fn+0x20>
10868: e08f3003 add r3, pc, r3
1086c: e7932002 ldr r2, [r3, r2]
10870: e3520000 cmp r2, #0
10874: 012fff1e bxeq lr
10878: e1a00000 nop ; (mov r0, r0)
1087c: 00089790 muleq r8, r0, r7
10880: 00000074 andeq r0, r0, r4, ror r0
For a research project, I would like to ensure that code and constants never reside on the same cache line (i.e. the same 64-byte-aligned block).
Is it possible to align the constants generated by GCC?

How to force gcc generate thumb 32 bit instructions?

Is it possible to force GCC to generate 32-bit Thumb instructions wherever possible?
For example I have:
int main(void) {
8000280: b480 push {r7}
8000282: b085 sub sp, #20
8000284: af00 add r7, sp, #0
uint32_t a, b, c;
a = 1;
8000286: 2301 movs r3, #1
8000288: 60fb str r3, [r7, #12]
b = 1;
800028a: 2301 movs r3, #1
800028c: 60bb str r3, [r7, #8]
c = a+b;
800028e: 68fa ldr r2, [r7, #12]
8000290: 68bb ldr r3, [r7, #8]
8000292: 4413 add r3, r2
8000294: 607b str r3, [r7, #4]
while (1) ;
8000296: e7fe b.n 8000296 <main+0x16>
But they are all 16-bit Thumb instructions. For testing reasons I want 32-bit Thumb instructions.
Well, I don't think you can do that directly, but there is a strange, very indirect way of achieving it. Just a few days ago I saw a project where the compilation process was not just a simple arm-none-eabi-gcc ... -c file.c -o file.o. The project called arm-none-eabi-gcc to generate an extended assembly listing, which was then assembled separately with arm-none-eabi-as into an object file. If you did it like that, then between these two steps you could modify the assembly listing to contain wide instructions only. In most (all?) cases you could just use sed to add the .w suffix to the instructions (change add r3, r2 into add.w r3, r2 and so on). Whether such a level of build complication is worth it is up to you.

Why is a write to a memory-mapped peripheral register not actioned (LPC43xx)?

I'm building an application for NXP LPC4330 (Arm Cortex M0/M4 dual core). I'm compiling using arm-none-eabi-gcc 4.9.3. At one point in my code, I am performing a write to a (32-bit) memory location. Immediately afterwards, if I read back from that memory location, around one time in ten the result indicates that the write did not occur. Subsequent reads at later times indicate the same thing, so it is not a transient condition. Interrupts are disabled at the global level, and the assembler generated by the compiler is clearly attempting the write, so how is it possible that the write is not being actioned?
Specifically, I am writing to SLICE_MUX_CFG0 which is a memory-mapped register in the SGPIO peripheral. When the write works, the peripheral functions correctly. When the read-back indicates that the write has not worked, the peripheral does not function correctly. So, it seems that the register in question is not being set correctly, as indicated by the read-back.
Looking into the .asm (listed below), the write is clear. When I read back the value afterwards, it reads as zero, which - given the listing below - seems to me to be impossible. If I perform a read immediately before the write (see the .c listing, below), the problem goes away, which is perhaps a clue.
So the above indicates, what? Does this break some rule for use of the memory bus? I've looked at the GCC bugs list and can't see anything that relates to this.
The function follows, both source and ASM, with some annotation. What could be happening, here? Why does the write at "store value" apparently not have any effect?
20000f7c <camera_SGPIO_init_sub>:
; disable interrupts globally
20000f7c: b672 cpsid i
20000f7e: 2346 movs r3, #70 ; 0x46
20000f80: 4a16 ldr r2, [pc, #88] ; (20000fdc <camera_SGPIO_init_sub+0x60>)
20000f82: 6013 str r3, [r2, #0]
20000f84: 4a16 ldr r2, [pc, #88] ; (20000fe0 <camera_SGPIO_init_sub+0x64>)
20000f86: 6013 str r3, [r2, #0]
20000f88: 4a16 ldr r2, [pc, #88] ; (20000fe4 <camera_SGPIO_init_sub+0x68>)
20000f8a: 6013 str r3, [r2, #0]
20000f8c: 4a16 ldr r2, [pc, #88] ; (20000fe8 <camera_SGPIO_init_sub+0x6c>)
20000f8e: 6013 str r3, [r2, #0]
20000f90: 4a16 ldr r2, [pc, #88] ; (20000fec <camera_SGPIO_init_sub+0x70>)
20000f92: 3301 adds r3, #1
20000f94: 6013 str r3, [r2, #0]
20000f96: 4a16 ldr r2, [pc, #88] ; (20000ff0 <camera_SGPIO_init_sub+0x74>)
20000f98: 6013 str r3, [r2, #0]
20000f9a: 4a16 ldr r2, [pc, #88] ; (20000ff4 <camera_SGPIO_init_sub+0x78>)
20000f9c: 6013 str r3, [r2, #0]
20000f9e: 4a16 ldr r2, [pc, #88] ; (20000ff8 <camera_SGPIO_init_sub+0x7c>)
20000fa0: 6013 str r3, [r2, #0]
20000fa2: 4a16 ldr r2, [pc, #88] ; (20000ffc <camera_SGPIO_init_sub+0x80>)
20000fa4: 6013 str r3, [r2, #0]
20000fa6: 2240 movs r2, #64 ; 0x40
20000fa8: 4b15 ldr r3, [pc, #84] ; (20001000 <camera_SGPIO_init_sub+0x84>)
20000faa: 601a str r2, [r3, #0]
20000fac: 2290 movs r2, #144 ; 0x90
20000fae: 4b15 ldr r3, [pc, #84] ; (20001004 <camera_SGPIO_init_sub+0x88>)
20000fb0: 0512 lsls r2, r2, #20
20000fb2: 601a str r2, [r3, #0]
; load value
20000fb4: 23c6 movs r3, #198 ; 0xc6
; load destination address
20000fb6: 4a14 ldr r2, [pc, #80] ; (20001008 <camera_SGPIO_init_sub+0x8c>)
; store value
20000fb8: 6013 str r3, [r2, #0]
; read value back
20000fba: 6810 ldr r0, [r2, #0]
20000fbc: 4a13 ldr r2, [pc, #76] ; (2000100c <camera_SGPIO_init_sub+0x90>)
20000fbe: 6013 str r3, [r2, #0]
20000fc0: 4a13 ldr r2, [pc, #76] ; (20001010 <camera_SGPIO_init_sub+0x94>)
20000fc2: 6013 str r3, [r2, #0]
20000fc4: 4a13 ldr r2, [pc, #76] ; (20001014 <camera_SGPIO_init_sub+0x98>)
20000fc6: 6013 str r3, [r2, #0]
20000fc8: 4a13 ldr r2, [pc, #76] ; (20001018 <camera_SGPIO_init_sub+0x9c>)
20000fca: 6013 str r3, [r2, #0]
20000fcc: 4a13 ldr r2, [pc, #76] ; (2000101c <camera_SGPIO_init_sub+0xa0>)
20000fce: 6013 str r3, [r2, #0]
20000fd0: 4a13 ldr r2, [pc, #76] ; (20001020 <camera_SGPIO_init_sub+0xa4>)
20000fd2: 6013 str r3, [r2, #0]
20000fd4: 4a13 ldr r2, [pc, #76] ; (20001024 <camera_SGPIO_init_sub+0xa8>)
20000fd6: 6013 str r3, [r2, #0]
; enable interrupts globally
20000fd8: b662 cpsie i
20000fda: 4770 bx lr
20000fdc: 40086480 .word 0x40086480
20000fe0: 40086484 .word 0x40086484
20000fe4: 40086488 .word 0x40086488
20000fe8: 40086494 .word 0x40086494
20000fec: 40086380 .word 0x40086380
20000ff0: 40086384 .word 0x40086384
20000ff4: 40086388 .word 0x40086388
20000ff8: 4008639c .word 0x4008639c
20000ffc: 40086208 .word 0x40086208
20001000: 40086204 .word 0x40086204
20001004: 40050064 .word 0x40050064
20001008: 40101080 .word 0x40101080
2000100c: 401010a0 .word 0x401010a0
20001010: 40101090 .word 0x40101090
20001014: 401010a4 .word 0x401010a4
20001018: 40101088 .word 0x40101088
2000101c: 401010a8 .word 0x401010a8
20001020: 40101094 .word 0x40101094
20001024: 401010ac .word 0x401010ac
The C-code which compiled to the above follows.
volatile uint32_t vol_dummy_for_read;
#define __SFS(addr, value) *((volatile uint32_t*)addr) = value;
#define SGPIO_SLICE_MUX_CFG0 (*((volatile uint32_t*) ... some address ... ))
uint32_t camera_SGPIO_init_sub()
{
__asm volatile ("cpsid i" : : : "memory");
// configure pins to SGPIO
__SFS(P9_0, SCU_SFS_INPUT | 6); // D0, SGPIO0
__SFS(P9_1, SCU_SFS_INPUT | 6);
__SFS(P9_2, SCU_SFS_INPUT | 6);
__SFS(P9_5, SCU_SFS_INPUT | 6);
__SFS(P7_0, SCU_SFS_INPUT | 7);
__SFS(P7_1, SCU_SFS_INPUT | 7);
__SFS(P7_2, SCU_SFS_INPUT | 7);
__SFS(P7_7, SCU_SFS_INPUT | 7); // D7, SGPIO7
// SGPIO8
__SFS(P4_2, SCU_SFS_INPUT | 7); // PCLK, SGPIO8
// configure pins to GPIO
__SFS(P4_1, SCU_SFS_INPUT | 0); // HSYNC, GPIO2[1]
// bring SGPIO clock up to full speed (same as PLL1, M4)
CGU_BASE_PERIPH_CLK = (0 << 1) | (0 << 11) | (9 << 24);
// SLICE_MUX_CFG
uint32_t SLICE_MUX_CFG_VALUE =
(1 << 1) /* clock on falling edge */
| (1 << 2) /* clock from external pin */
| (3 << 6) /* shift 8 bytes per clock */
;
//// see note above (this fixes it)
//vol_dummy_for_read = SGPIO_SLICE_MUX_CFG0 ;
SGPIO_SLICE_MUX_CFG0 = SLICE_MUX_CFG_VALUE; // A
uint32_t ret = SGPIO_SLICE_MUX_CFG0;
SGPIO_SLICE_MUX_CFG8 = SLICE_MUX_CFG_VALUE; // I
SGPIO_SLICE_MUX_CFG4 = SLICE_MUX_CFG_VALUE; // E
SGPIO_SLICE_MUX_CFG9 = SLICE_MUX_CFG_VALUE; // J
SGPIO_SLICE_MUX_CFG2 = SLICE_MUX_CFG_VALUE; // C
SGPIO_SLICE_MUX_CFG10 = SLICE_MUX_CFG_VALUE; // K
SGPIO_SLICE_MUX_CFG5 = SLICE_MUX_CFG_VALUE; // F
SGPIO_SLICE_MUX_CFG11 = SLICE_MUX_CFG_VALUE; // L
__asm volatile ("cpsie i" : : : "memory");
return ret;
}
(I am answering my own question; this answer was reached based on clues offered in the comments, above).
Short Answer
The peripheral is not actioning the register update reliably because the peripheral clock that is driving it (CGU_BASE_PERIPH_CLK) has only just had its speed changed at the time of the write operation. Setting the AUTOBLOCK bit when updating the clock's speed eliminates the problem.
Discussion
Presumably, the clock to the peripheral is transiently invalid during the frequency change, depending on conditions. Perhaps, if the timing of edges happens to be just-so, very short clock pulses find their way through during the change. Or something similarly unpleasant finds its way down the clock line to the peripheral. In any case, in these unpredictable conditions, the write may not occur, causing the reported failure.
Waiting for a period of time between the clock speed change and the subsequent assignment also eliminates the problem, understandably. As reported in the question, performing a read of the register prior to the write also eliminates the problem; whether this is because it takes time, or the read operation blocks (for reasons unclear) until the peripheral clock has settled, is unclear.
AUTOBLOCK is documented only as far as the statement of function: "Block clock automatically during frequency change". The User Manual gives no indication of the conditions under which the bit should be set, or left clear, during a clock speed change. However, given the evidence reported here, a policy of always setting AUTOBLOCK when updating the speed of a clock in one of these devices, unless there is a known reason to leave it clear, seems wise.
Reference: NXP User Manual for LPC43xx, UM10503 Rev 1.9, Chapter 13.
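A minimal sketch of building the clock register value with AUTOBLOCK set. The field positions are read off the question's own expression (0 << 11) | (9 << 24); treating bit 11 as AUTOBLOCK and the clock-select field as 5 bits wide are assumptions to check against UM10503. The names BASE_CLK_AUTOBLOCK, BASE_CLK_SEL_SHIFT, BASE_CLK_SEL_MASK, CLK_SEL_PLL1 and base_clk_value are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_CLK_AUTOBLOCK  (UINT32_C(1) << 11) /* assumed AUTOBLOCK bit */
#define BASE_CLK_SEL_SHIFT  24u                 /* clock source field, as in the question */
#define BASE_CLK_SEL_MASK   UINT32_C(0x1F)      /* assumed 5-bit field */
#define CLK_SEL_PLL1        UINT32_C(9)         /* value used in the question for PLL1 */

/* Build the register value; the caller writes it to the clock register. */
static uint32_t base_clk_value(uint32_t clk_sel, int autoblock)
{
    assert(clk_sel <= BASE_CLK_SEL_MASK);
    uint32_t value = clk_sel << BASE_CLK_SEL_SHIFT;
    if (autoblock)
        value |= BASE_CLK_AUTOBLOCK;
    return value;
}
```

On the target, the assignment in the question would then become something like CGU_BASE_PERIPH_CLK = base_clk_value(CLK_SEL_PLL1, 1);.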

ARM Assembly: Absolute Value Function: Are two or three lines faster?

In my embedded systems class, we were asked to re-code the given C-function AbsVal into ARM Assembly.
We were told that the best we could do was 3-lines. I was determined to find a 2-line solution and eventually did, but the question I have now is whether I actually decreased performance or increased it.
The C-code:
unsigned long absval(signed long x){
unsigned long int signext;
signext = (x >= 0) ? 0 : -1; //This can be done with an ASR instruction
return (x + signext) ^ signext;
}
The TA/Professor's 3-line solution
ASR R1, R0, #31 ; R1 <- (x >= 0) ? 0 : -1
ADD R0, R0, R1 ; R0 <- R0 + R1
EOR R0, R0, R1 ; R0 <- R0 ^ R1
My 2-line solution
ADD R1, R0, R0, ASR #31 ; R1 <- x + (x >= 0) ? 0 : -1
EOR R0, R1, R0, ASR #31 ; R0 <- R1 ^ (x >= 0) ? 0 : -1
There are a couple of places I can see potential performance differences:
The addition of one extra Arithmetic Shift Right call
The removal of one memory fetch
So, which one is actually faster? Does it depend upon the processor or memory access speed?
Here is another two-instruction version:
cmp r0, #0
rsblt r0, r0, #0
Which translates to the simple code:
if (r0 < 0)
{
r0 = 0-r0;
}
That code should be pretty fast, even on modern ARM-CPU cores like the Cortex-A8 and A9.
Dive over to ARM.com and grab the Cortex-M3 datasheet. Section 3.3.1 on page 3-4 has the instruction timings. Fortunately they're quite straightforward on the Cortex-M3.
We can see from those timings that in a perfect 'no wait state' system your professor's example takes 3 cycles:
ASR R1, R0, #31 ; 1 cycle
ADD R0, R0, R1 ; 1 cycle
EOR R0, R0, R1 ; 1 cycle
; total: 3 cycles
and your version takes two cycles:
ADD R1, R0, R0, ASR #31 ; 1 cycle
EOR R0, R1, R0, ASR #31 ; 1 cycle
; total: 2 cycles
So yours is, theoretically, faster.
You mention "The removal of one memory fetch", but is that true? How big are the respective routines? Since we're dealing with Thumb-2 we have a mix of 16-bit and 32-bit instructions available. Let's see how they assemble:
Their version (adjusted for UAL syntax):
.syntax unified
.text
.thumb
abs:
asrs r1, r0, #31
adds r0, r0, r1
eors r0, r0, r1
Assembles to:
00000000 17c1 asrs r1, r0, #31
00000002 1840 adds r0, r0, r1
00000004 4048 eors r0, r1
That's 3x2 = 6 bytes.
Your version (again, adjusted for UAL syntax):
.syntax unified
.text
.thumb
abs:
add.w r1, r0, r0, asr #31
eor.w r0, r1, r0, asr #31
Assembles to:
00000000 eb0071e0 add.w r1, r0, r0, asr #31
00000004 ea8170e0 eor.w r0, r1, r0, asr #31
That's 2x4 = 8 bytes.
So instead of removing a memory fetch you've actually increased the size of the code.
But does this affect performance? My advice would be to benchmark.
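Before benchmarking, it may help to confirm that the variants compute the same result. The sketch below expresses the add/eor form and the compare-and-negate form in C, using unsigned arithmetic so that the most negative input does not overflow. The function names are hypothetical, and 32-bit values are assumed to match the register width.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless form: mask is 0 for x >= 0 and all ones for x < 0. */
static uint32_t absval_branchless(int32_t x)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(x < 0);
    return ((uint32_t)x + mask) ^ mask;
}

/* Compare-and-negate form, mirroring cmp / rsblt. */
static uint32_t absval_conditional(int32_t x)
{
    uint32_t ux = (uint32_t)x;
    return (x < 0) ? (uint32_t)0 - ux : ux;
}

/* Check that both forms agree on a few representative inputs. */
static void absval_selfcheck(void)
{
    const int32_t samples[] = { 0, 1, -1, 12345, -12345, INT32_MAX, INT32_MIN };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(absval_branchless(samples[i]) == absval_conditional(samples[i]));
}
```

Timing either form on the actual target, with the code placed in the memory it will run from, is what settles the performance question.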
