64-bit Write Order on a 32-bit RISC-V - gcc

Hi all,
I found that my RISC-V gcc sometimes reorders my 64-bit writes, and I want to avoid this.
...
// A 64-bit write to addr and addr+4, which map to IO
*((volatile uint32_t *)(addr)) = (uint32_t)data;
*((volatile uint32_t *)(addr + 4)) = (uint32_t)(data >> 32);
*(uint64_t *)buf = 0x0; // a nearby non-volatile 64-bit store
Then I get
sw a0, 0(a3)   // low 32 bits of data -> addr
sw zero, 0(a5) // low 32 bits of buf
sw a2, 4(a3)   // high 32 bits of data -> addr+4
sw zero, 4(a5) // high 32 bits of buf
This confuses some of my IO devices. Is there any way to stop gcc from doing this optimisation?
Thanks

Related

64-bit Read Order on a 32-bit RISC-V

I am trying to read a 64-bit device register with a 32-bit RISC-V MCU.
Here is my code:
uint64_t rd64(uint32_t addr){
return *((volatile uint64_t *)(addr));
}
int main(){
....
uint64_t dataA = rd64(addrA);
uint64_t dataB = rd64(addrB);
}
and I got this asm code from gcc:
lw t5, 1234(a0) // low 32 bits first
lw t6, 1238(a0)
lw a1, 1308(a0) // high 32 bits first
lw a0, 1304(a0)
It seems gcc splits the 64-bit read into two 32-bit reads, but there is no guarantee which 32-bit word comes out first. This makes my device unhappy.
Is there any way to constrain gcc's output so that the lower 32-bit word is read first?
Thanks
Cast your uint64_t* to a volatile uint32_t* (or, equivalently, treat it as a volatile uint32_t[2]) and read the two words in the order you want.

STM32H7 + external SDRAM - memcpy with length 3 crashes - word boundaries, cache settings?

I have a project running on an STM32H753I-EVAL board with the heap in external memory, with FreeRTOS, modelled on the STM32 Cube demos.
At the moment the MPU and cache are not enabled (as far as I can tell, the calls that enable them are commented out).
This works in the main() function, where a and b are in internal RAM:
int* aptr;
int* bptr;
int main()
{
// MPU_Config();
// CPU_CACHE_Enable();
int a[100]; int b[100];
memcpy(a, b, 3);
aptr = a;
bptr = b;
...
However, when a FreeRTOS thread creates variables on the heap, memcpy doesn't work with some length values.
static void mymemcpy(char* dst, char* src, int len)
{
for (int i = 0; i < len; i++)
{
dst[i] = src[i];
}
}
void StartThread(void* arg)
{
int a[100]; int b[100];
for (int i = 0; i < 10; i++)
{
memcpy(aptr, bptr, i); // works, using the same memory as main
}
for (int i = 0; i < 10; i++)
{
mymemcpy(a, b, i); // works, using external-RAM memory, but with mymemcpy
}
memcpy(a, b, 4); // works, so it seems not to be an overrun issue
for (int i = 0; i < 10; i++)
{
memcpy(a, b, i); // jumps to random memory when i == 3, probably via an undefined exception handler
}
while(1);
}
This is the first time I've dealt with a caching micro and external RAM.
Is this a cache issue, a RAM issue, or a library issue? How do I fix it?
Note: I don't care that the arrays are uninitialised. I'm happy copying garbage.
This issue has given me plenty of grief too. It has nothing to do with uninitialised buffers or ECC bits. It's caused by data alignment errors when accessing external SDRAM. A load instruction can access any group of bytes within a 4-byte boundary, but it can't perform a single read across a 4-byte boundary. Examples:
Load Register R0 (4-bytes) # 0xc0000000; // good juju
Load Register R0 (2-bytes) # 0xc0000002; // good juju
Load Register R0 (1-byte) # 0xc0000003; // good juju
Load Register R0 (4-bytes) # 0xc0000004; // good juju
Load Register R0 (4-bytes) # 0xc0000002; // bad juju
Load Register R0 (2-bytes) # 0xc0000003; // bad juju
Performing a read across a 4-byte boundary causes a bus exception (mine gets caught by the hardfault handler).
Your a and b buffers are declared on the stack, so you won't know their starting addresses. Your mymemcpy() is safe because it copies 1 byte at a time. However, the newlib memcpy copies 4 bytes at a time, so it will potentially read across a 4-byte boundary and throw an exception. Even if the start address is on a 4-byte boundary, the end address might not be.
The same issue applies to memset, memcmp, etc. But it also happens under the hood. Example:
std::array<uint8_t, 10> a;
std::array<uint8_t, 10> b;
a = b; // potentially bad juju because = operator calls memcpy
To get around this problem I've enabled the data cache and set up an MPU region. The micro then doesn't access the external SDRAM directly; instead the data gets loaded into the cache, which doesn't have the 4-byte boundary restriction. It seems to work OK, but it doesn't fill me with confidence.
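For reference, a sketch of that kind of MPU setup, assuming the STM32Cube HAL; the base address (0xD0000000) and 32 MB size are placeholders for whatever your SDRAM mapping actually is, and the region number is arbitrary:

```c
/* Configuration fragment: mark the external SDRAM as normal, cacheable
 * memory so that unaligned accesses become legal, then enable the D-cache. */
void MPU_Config_SDRAM(void)
{
  MPU_Region_InitTypeDef region = {0};

  HAL_MPU_Disable();

  region.Enable           = MPU_REGION_ENABLE;
  region.Number           = MPU_REGION_NUMBER0;      /* placeholder */
  region.BaseAddress      = 0xD0000000;              /* placeholder */
  region.Size             = MPU_REGION_SIZE_32MB;    /* placeholder */
  region.AccessPermission = MPU_REGION_FULL_ACCESS;
  region.TypeExtField     = MPU_TEX_LEVEL0;          /* normal memory */
  region.IsCacheable      = MPU_ACCESS_CACHEABLE;
  region.IsBufferable     = MPU_ACCESS_BUFFERABLE;
  region.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
  region.DisableExec      = MPU_INSTRUCTION_ACCESS_ENABLE;
  region.SubRegionDisable = 0x00;

  HAL_MPU_ConfigRegion(&region);
  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);

  SCB_EnableDCache(); /* CMSIS: enable the data cache */
}
```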
The device likely crashes because reads from uninitialised memory may not have correct ECC bits set; when the processor discovers this during the read operation, it faults on a double-bit error.
Write to the memory first and then read it, or configure your linker to zero-initialise your heap area. The latter may require assembly code to get the sequencing right (enable the RAM first), otherwise the zero-init may fail.

ARM Cortex-M4: tuning a function that rearranges an unsigned int variable

I'm just starting to learn the ARM Cortex-M4, which has advanced features like DSP instructions, ...
uint32_t my_rearrange(uint32_t value){
uint32_t value_high = (value & 0xffff0000)>>16;
uint32_t value_low = (value & 0x0000ffff);
return (value_low<<16)|value_high;
}
This is a simple function for rearranging an unsigned int variable.
Is there any way to tune this function for best performance or fastest execution on a Cortex-M4? Is there a way for me to use the DSP instructions in this function?
You don't need any of the DSP instructions for this. All you appear to be doing is a 16-bit rotate. The compiler should recognise this and generate a ROR. GCC 5.4.1 with flags -O3 -std=c++11 -march=armv7-m -mtune=cortex-m4 -mthumb does, and for your code generates
my_rearrange(unsigned long):
ror r0, r0, #16
bx lr
The only way you could speed it up would be if it was inlined.

Fastest way to swap alternate bytes on ARM Cortex M4 using gcc

I need to swap alternate bytes in a buffer as quickly as possible in an embedded system using an ARM Cortex-M4 processor. I use gcc. The amount of data is variable, but the maximum is a little over 2K. It doesn't matter if a few extra bytes are converted, because I can use an over-sized buffer.
I know that the ARM has the REV16 instruction, which I can use to swap alternate bytes in a 32-bit word. What I don't know is:
Is there a way of getting at this instruction in gcc without resorting to assembler? The __builtin_bswap16 intrinsic appears to operate on 16-bit values only, and converting 4 bytes at a time will surely be faster than converting 2 bytes at a time.
Does the Cortex-M4 have a reorder buffer and/or do register renaming? If not, what do I need to do to minimise pipeline stalls when I convert the dwords of the buffer in a partially-unrolled loop?
For example, is this code efficient, where REV16 is appropriately defined to resolve (1):
uint32_t *buf = ... ;
size_t n = ... ; // (number of bytes to convert + 15)/16
for (size_t i = 0; i < n; ++i)
{
uint32_t a = buf[0];
uint32_t b = buf[1];
uint32_t c = buf[2];
uint32_t d = buf[3];
REV16(a, a);
REV16(b, b);
REV16(c, c);
REV16(d, d);
buf[0] = a;
buf[1] = b;
buf[2] = c;
buf[3] = d;
buf += 4;
}
You can't use the __builtin_bswap16 function, for the reason you stated: it works on 16-bit values, so it will zero the other halfword. I guess the reason for this is to keep the intrinsic behaving the same on processors which don't have an instruction like ARM's REV16.
The function
uint32_t swap(uint32_t in)
{
in = __builtin_bswap32(in);
in = (in >> 16) | (in << 16);
return in;
}
compiles to (ARM GCC 5.4.1 -O3 -std=c++11 -march=armv7-m -mtune=cortex-m4 -mthumb)
rev r0, r0
ror r0, r0, #16
bx lr
And you could probably ask the compiler to inline it, which would give you 2 instructions per 32-bit word. I can't think of a way to get GCC to generate REV16 with a 32-bit operand without declaring your own function with inline assembly.
EDIT
As a follow-up, and based on artless noise's comment about the non-portability of the __builtin_bswap functions, the compiler also recognises
uint32_t swap(uint32_t in)
{
in = ((in & 0xff000000) >> 24) | ((in & 0x00FF0000) >> 8) | ((in & 0x0000FF00) << 8) | ((in & 0xFF) << 24);
in = (in >> 16) | (in << 16);
return in;
}
and creates the same 3-instruction function as above, so that is a more portable way to achieve it. Whether different compilers would produce the same output, though, is another matter.
EDIT EDIT
If inline assembler is allowed, the following function
inline uint32_t Rev16(uint32_t a)
{
    asm ("rev16 %0, %1"   // destination operand first in ARM syntax
         : "=r" (a)
         : "r" (a));
    return a;
}
gets inlined and acts as a single instruction.

Windows' AllocationPreference-like setting in macOS to test 32->64-bit code?

As has been mentioned in previous answers here, Windows has a registry key to force memory allocations from top-down, to help catch pointer-truncation issues when moving from 32-bit code to 64-bit:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\Memory Management\AllocationPreference
However, I don't see anything similar in macOS. I wrote a quick program to check, and it seems that in 64-bit mode, all allocations need at least 33 bits to represent their addresses.
I'm wondering if a 64-bit macOS program will ever allocate memory at an address representable in 32 bits or less? If I compile and run the same program as 32-bit, it obviously uses addresses in the lower 32-bit space.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, const char * argv[])
{
void * ptr = NULL;
for (int i = 0; i < 20; i++)
{
ptr = malloc(10);
if ((uint64_t)(uintptr_t)ptr > INT32_MAX)
printf("Pointer = %p > 32 bits required!\n", ptr);
else
printf("Pointer = %p <= 32 bits ok\n", ptr);
}
return 0;
}
Thanks for any help!
Matt