Dynamic memory allocation in Eigen - c++11

I am using the Eigen C++ library and I would like to see if my code uses dynamic memory allocation. According to the documentation (https://eigen.tuxfamily.org/dox/TopicPreprocessorDirectives.html), #defining the preprocessor macro "EIGEN_NO_MALLOC" should lead to an assertion failure if memory is allocated from the heap.
Thus I would expect that the following leads to a failure (because of using MatrixXcf):
#define EIGEN_NO_MALLOC 1
MatrixXcf A = MatrixXcf::Random(5,5);
But it doesn't. Why?
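For reference, a minimal complete program where I would expect the assertion to fire. Note that (if I read the docs right) the macro must be defined before any Eigen header is included, and Eigen's assertions must be enabled, i.e. the build must not define NDEBUG or EIGEN_NO_DEBUG:

// EIGEN_NO_MALLOC only has an effect if it is seen before Eigen is included,
// and eigen_assert must be active (no -DNDEBUG, no -DEIGEN_NO_DEBUG).
#define EIGEN_NO_MALLOC
#include <Eigen/Dense>

int main() {
    // MatrixXcf is dynamically sized, so constructing it allocates from the
    // heap, which should trip Eigen's internal check under EIGEN_NO_MALLOC.
    Eigen::MatrixXcf A = Eigen::MatrixXcf::Random(5, 5);
    return static_cast<int>(A.rows());
}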

Related

How does "--print-memory-usage" work in GCC?

How does GCC provide a breakdown of the memory used in each memory region defined in the linker file using --print-memory-usage?
GCC just forwards --print-memory-usage to the linker, usually ld:
https://sourceware.org/binutils/docs-2.40/ld.html#index-memory-usage
gcc (or g++ for that matter) has no idea about memory usage, and the linker can only report usage of static storage memory, which is usually:
.text: The "program" or code to be executed. This might be located in RAM or in ROM (e.g. Flash) depending on options and architecture.
.rodata: Data in static storage that is read-only and does not need initialization at run-time. This is usually located in non-volatile memory like ROM or Flash; but there are exceptions, one of which is avr-gcc.
.data, .bss and COMMON: Data in RAM that's initialized during start-up by the CRT (C Runtime).
Apart from these common sections, there might be other sections like .init*, .boot, .jumptables, etc., which again depend on the application and architecture.
By its very nature, the linker (or assembler or compiler) cannot determine memory usage that unfolds at run-time, which is:
Stack usage: Non-static local variables that cannot be held in registers, alloca, ...
Heap usage: malloc, new and friends.
What the compiler can do for you is -fstack-usage and similar, which generates a text file *.su for each translation unit. The compiler reports stack usage that's known at compile time (static) and unknown stack usage that arises at run-time (dynamic). The functions marked as static use the specified amount of stack space, without counting the usages of non-inlined callees.
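As a rough illustration, compiling a file like the following with g++ -fstack-usage -c stackdemo.cpp (file name and numbers below are made up for the example):

// stackdemo.cpp
#include <cstring>

void fill(char *dst) {
    char buf[64];                      // fixed-size local: "static" stack usage
    std::memset(buf, 'x', sizeof buf);
    std::memcpy(dst, buf, sizeof buf);
}

produces a stackdemo.su file next to the object file, with one line per function along the lines of "stackdemo.cpp:4:6:void fill(char*)  80  static" (exact format and byte counts depend on compiler version, target, and options).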
In order to know the complete stack usage (or a reliable upper bound), the dynamic call graph must be known. Even if it's known, GCC won't do the analysis for you. You will need other, more elaborate tools to work out these metrics, e.g. by abstract interpretation or other means of static analysis.
Notice that data collected at run-time, like dynamic stack usage measurements, only provides a lower bound on memory usage (or execution time, for that matter). However, for sound analysis, as in safety-critical applications, what you need are upper bounds.

Why are ARM atomic_[read/write] operations implemented as volatile pointer casts?

Here is an example of an atomic_read implementation:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
Also, should we explicitly use memory barriers for atomic operations on arm?
Here is an example of an atomic_read implementation:
A problematic one, actually: it assumes that the cast is not a no-op, which isn't guaranteed.
Also, should we explicitly use memory barriers for atomic operations on arm?
Probably. It depends on what you are doing and what you are expecting.
Yes, the cast to volatile is there to prevent the compiler from assuming the value of v cannot change. As for memory barriers, the GCC builtins already allow you to specify the memory ordering you desire, so there is no need to do it manually: https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins
The default behavior on GCC is to use __ATOMIC_SEQ_CST which will emit the barriers necessary on Arm to make sure your atomics execute in the order you place them in the code. To optimize performance on Arm, you will want to consider using weaker semantics to allow the compiler to elide barriers and let the hardware execute faster. For more information on the types of memory barriers the Arm architecture has, see https://developer.arm.com/docs/den0024/latest/memory-ordering/barriers.
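A short sketch of those builtins with explicit orderings (the variable and function names here are made up for illustration):

#include <cstdint>

int32_t counter;

int32_t read_relaxed() {
    // Relaxed: atomic, but no ordering guarantees, so no barrier needed on Arm.
    return __atomic_load_n(&counter, __ATOMIC_RELAXED);
}

void add_release(int32_t x) {
    // Release: writes before this become visible to an acquiring reader.
    __atomic_fetch_add(&counter, x, __ATOMIC_RELEASE);
}

int32_t read_seq_cst() {
    // Sequentially consistent (the default strength): on Arm this is where
    // the compiler emits the required barrier instructions for you.
    return __atomic_load_n(&counter, __ATOMIC_SEQ_CST);
}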

C++ std::atomic and shared memory?

I have 4096 bytes of shared memory allocated. How do I treat it as an array of std::atomic<uint64_t> objects?
The final goal is to place an array of 64-bit variables in shared memory and perform __sync_fetch_and_add (a GCC built-in) on these variables. But I would prefer using native C++11 code instead of GCC built-ins. So how do I use the allocated memory as std::atomic objects? Should I invoke placement new on 512 counters? What if std::atomic's constructor requires additional memory allocation in some implementations? Should I consider the alignment of std::atomic objects in shared memory?
With C++20, you can use std::atomic_ref if you can make sure that your objects are suitably aligned. For C++17 and anything older, boost::atomic_ref might work as well, though I haven't tested it.
If you don't want to use Boost, then compiler builtins are the only solution left. In that case, you should prefer the __atomic builtins over the old __sync functions, as stated on the GCC documentation page for atomics.
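A minimal sketch of the std::atomic_ref route (C++20), assuming shm points at the 4096 bytes of already-mapped shared memory and that the mapping is suitably aligned for uint64_t (page-aligned mappings are):

#include <atomic>
#include <cstddef>
#include <cstdint>

void increment_counter(void *shm, std::size_t index) {
    auto *counters = static_cast<std::uint64_t *>(shm);  // 512 counters in 4096 bytes
    // Lock-free is essential here: a per-object lock would live in one
    // process only and would not synchronise across processes.
    static_assert(std::atomic_ref<std::uint64_t>::is_always_lock_free);
    std::atomic_ref<std::uint64_t> ref(counters[index]);
    ref.fetch_add(1, std::memory_order_relaxed);  // atomic add, in place
}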

CUDA __syncthreads() usage within a warp

If it is absolutely required for all the threads in a block to be at the same point in the code, do we still need __syncthreads() if the number of threads being launched equals the number of threads in a warp?
Note: No extra threads or blocks, just a single warp for the kernel.
Example code:
__shared__ volatile int sdata[16];
int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];
Updated with more information about using volatile
Presumably you want all threads to be at the same point since they are reading data written by other threads into shared memory. If you are launching a single warp (in each block), then you know that all threads are executing together. On the face of it this means you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for.
Remember that the compiler will assume it can optimise as long as the intra-thread semantics remain correct, including delaying stores to memory where the data can be kept in registers. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read it. Using volatile causes the compiler to perform the memory write rather than keep the data in registers; however, this has some risks and is more of a hack (meaning I don't know how this will be affected in the future).
Technically, you should always use __syncthreads() to conform with the CUDA Programming Model.
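For concreteness, a sketch of that conservative pattern applied to code like the question's (the kernel name and the 15 - tid indexing are made up; the point is only the placement of the barrier between the shared-memory write and the read):

__global__ void xor_kernel(int *output, const int *input)
{
    __shared__ int sdata[16];     // no volatile needed once the barrier is there
    int tid = threadIdx.x;

    sdata[tid] = input[tid];      // every thread writes one element
    __syncthreads();              // guarantees those writes are visible...

    int index = 15 - tid;         // ...before any thread reads another thread's slot
    output[tid] = sdata[index];
}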
The warp size is and always has been 32, but you can:
At compile time use the special variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version)
At run time use the warpSize field of the cudaDeviceProp struct (documented in the CUDA Reference Manual)
Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.
You still need __syncthreads() even if warps are being executed in parallel. The actual execution in hardware may not be parallel because the number of cores within an SM (Streaming Multiprocessor) can be less than 32. For example, the GT200 architecture has 8 cores in each SM, so you can never be sure all threads are at the same point in the code.

How can I visualise the memory (SRAM) usage of an AVR program?

I have encountered a problem in a C program running on an AVR microcontroller (ATMega328P). I believe it is due to a stack/heap collision but I'd like to be able to confirm this.
Is there any way I can visualise SRAM usage by the stack and the heap?
Note: the program is compiled with avr-gcc and uses avr-libc.
Update: The actual problem I am having is that the malloc implementation is failing (returning NULL). All mallocing happens on startup and all freeing happens at the end of the application (which in practice is never since the main part of the application is in an infinite loop). So I'm sure fragmentation is not the issue.
You can check static RAM usage using the avr-size utility, as described in
http://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=62968,
http://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=82536,
http://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=95638,
and http://letsmakerobots.com/node/27115
avr-size -C -x Filename.elf
(avr-size documentation: http://ccrma.stanford.edu/planetccrma/man/man1/avr-size.1.html )
Here is an example of how to set this up in an IDE:
On Code::Blocks, Project -> Build options -> Pre/post build steps -> Post-build steps, include:
avr-size -C $(TARGET_OUTPUT_FILE)
or
avr-size -C --mcu=atmega328p $(TARGET_OUTPUT_FILE)
Example output at the end of build:
AVR Memory Usage
----------------
Device: atmega16
Program: 7376 bytes (45.0% Full)
(.text + .data + .bootloader)
Data: 81 bytes (7.9% Full)
(.data + .bss + .noinit)
EEPROM: 63 bytes (12.3% Full)
(.eeprom)
Data is your SRAM usage, and it is only the amount that the compiler knows at compile time. You also need room for things created at runtime (particularly stack usage).
To check stack usage (dynamic RAM), here's a small utility function (from http://jeelabs.org/2011/05/22/atmega-memory-use/) which determines how much RAM is currently unused:
int freeRam () {
    extern int __heap_start, *__brkval;   // symbols provided by avr-libc
    int v;                                // a local, so &v is (roughly) the current stack top
    // Gap between the top of the stack and the top of the heap;
    // __brkval is 0 until the first malloc, so fall back to __heap_start.
    return (int) &v - (__brkval == 0 ? (int) &__heap_start : (int) __brkval);
}
And here’s a sketch using that code:
void setup () {
    Serial.begin(57600);
    Serial.println("\n[memCheck]");
    Serial.println(freeRam());
}
The freeRam() function returns how many bytes exist between the end of the heap and the last allocated memory on the stack, so it is effectively how much the stack and heap can grow before they collide.
You could check the return value of this function around code you suspect may be causing a stack/heap collision.
You say malloc is failing and returning NULL:
The obvious cause, which you should look at first, is that your heap is "full" - i.e., the memory you've asked malloc for cannot be allocated because it's not available.
There are two scenarios to bear in mind:
a: You have a 16 K heap, you've already malloced 10 K, and you try to malloc a further 10 K. Your heap is simply too small.
b: More commonly, you have a 16 K heap, you've been doing a bunch of malloc/free/realloc calls, and your heap is less than 50% 'full'. You call malloc for 1 K and it FAILS. What's up? Answer: the free space in the heap is fragmented; there isn't a contiguous 1 K of free memory that can be returned. C heap managers cannot compact the heap when this happens, so you're generally in a bad way. There are techniques to avoid fragmentation, but it's difficult to know if this is really the problem. You'd need to add logging shims to malloc and free so that you can get an idea of what dynamic memory operations are being performed (see the sketch below).
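For example (log_malloc/log_free are hypothetical wrapper names; route your allocations through something like them and dump the log):

#include <stdio.h>
#include <stdlib.h>

void *log_malloc(size_t n)
{
    void *p = malloc(n);
    printf("malloc(%u) -> %p\n", (unsigned)n, p);  // record size and resulting address
    return p;
}

void log_free(void *p)
{
    printf("free(%p)\n", p);
    free(p);
}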
EDIT:
You say all mallocs happen at startup, so fragmentation isn't the issue.
In which case, it should be easy to replace the dynamic allocation with static.
old code example:
char *buffer;

void init()
{
    buffer = malloc(BUFFSIZE);
}
new code:
char buffer[BUFFSIZE];
Once you've done this everywhere, your linker should warn you if everything cannot fit into the available memory. Don't forget to reduce the heap size, but beware that some runtime I/O functions may still use the heap, so you may not be able to remove it entirely.
Don't use the heap / dynamic allocation on smaller embedded targets, especially on a processor with such limited resources. Rather, redesign your application, because the problem will recur as your program grows.
The usual approach would be to fill the memory with a known pattern and then to check which areas are overwritten.
If you're using both stack and heap, then it can be a little more tricky. I'll explain what I've done when no heap is used. As a general rule, all the companies I've worked for (in the domain of embedded C software) have avoided using heap for small embedded projects—to avoid the uncertainty of heap memory availability. We use statically declared variables instead.
One method is to fill most of the stack area with a known pattern (e.g. 0x55) at start-up. This is usually done by a small bit of code early in the software execution, either right at the start of main(), or perhaps even before main() begins, in the start-up code. Take care not to overwrite the small amount of stack in use at that point of course. Then, after running the software for a while, inspect the contents of stack space and see where the 0x55 is still intact. How you "inspect" depends on your target hardware. Assuming you have a debugger connected, then you can simply stop the micro running and read the memory.
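A sketch of that technique as it's commonly done with avr-gcc, adapted from the widely used stack-painting trick. Here _end and __stack are the avr-gcc/avr-libc symbols for the end of static data and the initial stack pointer, and code in .init1 runs before main():

#include <stdint.h>

#define STACK_CANARY 0x55

extern uint8_t _end;     /* first byte after .data/.bss      */
extern uint8_t __stack;  /* initial stack pointer (RAMEND)   */

/* Paint the free region very early, before the stack has grown. */
void stack_paint(void) __attribute__((naked, used, section(".init1")));
void stack_paint(void)
{
    uint8_t *p = &_end;
    while (p <= &__stack)
        *p++ = STACK_CANARY;
}

/* Worst-case headroom so far: leading canary bytes never overwritten. */
uint16_t stack_unused(void)
{
    const uint8_t *p = &_end;
    uint16_t count = 0;
    while (p <= &__stack && *p == STACK_CANARY) {
        ++p;
        ++count;
    }
    return count;
}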
If you have a debugger that can do a memory-access breakpoint (a bit more fancy than the usual execution breakpoint), then you can set a breakpoint in a particular stack location—such as the farthest limit of your stack space. That can be extremely useful, because it also shows you exactly what bit of code is running when it reaches that extent of stack usage. But it requires your debugger to support the memory-access breakpoint feature and it's often not found in the "low-end" debuggers.
If you're also using heap, then it can be a bit more complicated because it may be impossible to predict where stack and heap will collide.
Assume you're using just one stack (so no RTOS or anything) and that the stack is at the end of the memory, growing down, while the heap starts after the BSS/DATA region, growing up. I've seen implementations of malloc that actually check the stack pointer and fail on a collision; you could try to do that.
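In fact, avr-libc's allocator can be tuned along these lines: if I read its documentation correctly, when __malloc_heap_end is 0 (the default), malloc() refuses to grow the heap past the stack pointer minus __malloc_margin and returns NULL instead. A sketch (the tunables are declared in avr-libc's <stdlib.h>; the address below is made up for illustration):

#include <stdlib.h>

void heap_setup(void)
{
    __malloc_margin = 64;  /* keep at least 64 bytes of headroom for the stack */
    /* Alternatively, pin the heap to a fixed region so it can never reach
       the stack at all (0x0700 is a hypothetical address):
       __malloc_heap_end = (char *)0x0700;  */
}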
If you're not able to adapt the malloc code, you could choose to put your stack at the start of the memory (using the linker file). In general it's always a good idea to know/define the maximum size of the stack. If you put it at the start, you'll get an error on reading beyond the beginning of the RAM. The heap will be at the end and, in a decent implementation, cannot grow beyond the end (it will return NULL instead). The good thing is you now have two separate error cases for two separate issues.
To find out the maximum stack size, you could fill your memory with a pattern, run the application, and see how far it got; see also the reply from Craig.
If you can edit the code for your heap, you could pad it with a couple of extra bytes (tricky on such low resources) on each block of memory. These bytes could contain a known pattern different from the stack. This might give you a clue if it collides with the stack by seeing it appear inside the stack or vice versa.
On Unix-like operating systems, the library function sbrk() with a parameter of 0 returns the topmost address of dynamically allocated heap memory. The return value is a void * pointer and can be compared with the address of an arbitrary stack-allocated variable.
The result of this comparison should be used with care. Depending on the CPU and system architecture, the stack may grow down from an arbitrarily high address while the allocated heap moves up from low memory.
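A tiny illustration of that comparison on a Unix-like host:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int stack_var;                 /* lives on the stack */
    void *heap_top = sbrk(0);      /* current program break = top of the heap */
    printf("heap top: %p  stack: %p\n", heap_top, (void *)&stack_var);
    return 0;
}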
Sometimes the operating system has other concepts for memory management (e.g. OS-9) which place heap and stack in different memory segments in free memory. On these operating systems, especially for embedded systems, you need to define the maximum memory requirements of your applications in advance so that the system can allocate memory segments of matching sizes.
