OpenMP Location of Private Variables? - openmp

Where do openmp private variables get allocated? On each thread stack, dynamically or through some shared array or something?

The OpenMP specification doesn't specify if those variables are to be allocated on the stack or on the heap (and if they are on the heap if it is in a shared array or if there is one object allocated for each thread). Generally I would assume that private variables are allocated on the stack (there is no reason not to and it's generally more efficient). According to the manual that is the behaviour used in libgomp (the implemention used by gcc) at least, no clue about other implemantations though (although I see little reason why those shouldn't do the same thing).

OpenMP does not specify anything about the allocation of private variables.
There are two options : heap and stack.
If we think about each thread executing less number of instructions, it makes sense for the master thread to allocate private variables like below.
Code :
1: set_threads(n)
2: #pragma omp parallel private(var)
3: {
4: var = ...
5:}
Machine code :
line2 : var_ptr = new variables[n]
line4: var_ptr[get_thread_id()] = ...
But the above code will induce a lot of false-sharing among the private variables in different threads. So I think it would make sense for the compiler to allocate them on the stack of each thread.

Related

can memory for a golang variable be used outside its declaration scope

I am trying to understand memory management in go. can i safely use the memory allocated inside a scope.
type BigConfigurationData struct {
subject1config *Subject1Config
subject2config *Subject2Config
...
}
var p BigConfigurationData
if aFlag {
var subject1config = Subject1Config {
foo: bar
}
p.subject1config = &subject1config
}
// can i use p.subject1config here and expect the memory has not been cleaned up?
Go looks simple, but in fact it does a lot to help the programmer.
In Go all variables are effectively references. The compiler tracks if any memory "escapes" a scope, and then automatically allocates is on the heap (shared memory pool). If the memory does not escape the scope, compiler allocates it on the stack which is much faster.
In your case, when you assign the p.subject1config = &subject1config, this tells the compiler that the value or the memory will leave the scope. Compiler will allocte the memory from heap and thus via the reference p.subject1config you can access the memory in the outer scope.
Go garbage collector regularly checks if the memory blocks still have any refences pointing to it, and frees up the memory when no references remain.
In short, you are confusing "variable" and "memory", probably because Go automatically manages memory for you.
EDIT: This answer might be too simplisic. Please read comments below for more info. See these links for more information:
http://www.tapirgames.com/blog/golang-memory-management
https://dougrichardson.org/2016/01/23/go-memory-allocations.html

STM32F4 running FreeRTOS in external RAM

We have a thesis project at work were the guys are trying to get external RAM to work for the STM32F417 MCU.
The project is trying out some stuff that is really resource hungry and the internal RAM just isn't enough.
The question is how to best do this.
The current approach has been to just replace the RAM address in the link script (gnu ld) with the address for external RAM.
The problem with that approach is that during initialisation, the chip has to run on internal RAM since the FSMC has not been initialized.
It seems to work but as soon as pvPortMalloc is run we get a hard fault and it is probably due to dereferencing bogus addresses, we can see that variables are not initialized correctly at system init (which makes sense I guess since the internal RAM is not used at all, when it probably should be).
I realize that this is a vague question, but what is the general approach when running code in external RAM on a Cortex M4 MCU, more specifically the STM32F4?
Thanks
FreeRTOS defines and uses a single big memory area for stack and heap management; this is simply an array of bytes, the size of which is specified by the configTOTAL_HEAP_SIZE symbol in FreeRTOSConfig.h. FreeRTOS allocates tasks stack in this memory area using its pvPortMalloc function, therefore the main goal here is to place the FreeRTOS heap area into external SRAM.
The FreeRTOS heap memory area is defined in heap_*.c (with the exception of heap_3.c that uses the standard library malloc and it doesn't define any custom heap area), the variable is called ucHeap. You can use your compiler extensions to set its section. For GCC, that would be something like:
static uint8_t ucHeap[ configTOTAL_HEAP_SIZE ] __attribute__ ((section (".sram_data")));
Now we need to configure the linker script to place this custom section into external SRAM. There are several ways to do this and it depends again on the toolchain you're using. With GCC one way to do this would be to define a memory region for the SRAM and a section for ".sram_data" to append to the SRAM region, something like:
MEMORY
{
...
/* Define SRAM region */
sram : ORIGIN = <SRAM_START_ADDR>, LENGTH = <SRAM_SIZE>
}
SECTIONS
{
...
/* Define .sram_data section and place it in sram region */
.sram_data :
{
*(.sram_data)
} >sram
...
}
This will place the ucHeap area in external SRAM, while all the other text and data sections will be placed in the default memory regions (internal flash and ram).
A few notes:
make sure you initialize the SRAM controller/FSMC prior to calling any FreeRTOS function (like xTaskCreate)
once you start the tasks, all stack allocated variables will be placed in ucHeap (i.e. ext RAM), but global variables are still allocated in internal RAM. If you still have internal RAM size issues, you can configure other global variables to be placed in the ".sram_section" using compiler extensions (as shown for ucHeap)
if your code uses dynamic memory allocation, make sure you use pvPortMalloc/vPortFree, instead of the stdlib malloc/free. This is because only pvPortMalloc/vPortFree will use the ucHeap area in ext RAM (and they are thread-safe, which is a plus)
if you're doing a lot of dynamic task creation/deletion and memory allocation with pvPortMalloc/vPortFree with different memory block sizes, consider using heap_4.c instead of heap_2.c. heap_2.c has memory fragmentation problems when using several different block sizes, whereas heap_4.c is able to combine adjacent free memory blocks into a single large block
Another (and possibly simpler) solution would be to define the ucHeap variable as a pointer instead of an array, like this:
static uint8_t * const ucHeap = <SRAM_START_ADDR>;
This wouldn't require any special linker script editing, everything can be placed in the default sections. Note that with this solution the linker won't explicitly reserve any memory for the heap and you will loose some potentially useful information/errors (like heap area not fitting in ext RAM). But as long as you only have ucHeap in external RAM and you have configTOTAL_HEAP_SIZE smaller than external RAM size, that might work just fine.
When the application starts up it will try to initialise data by either clearing it to zero, or initialising it to a non-zero value, depending on the section the variable is placed in. Using a normal run time model, that will happen before main() is called. So you have something like:
1) Reset vector calls init code
2) C run time init code initialises variables
3) C run time init code calls main()
If you use the linker to place variables in external RAM then you need to ensure the RAM is accessible before that initialisation takes place, otherwise you will get a hard fault. Therefore you need to either have a boot loader that sets up the system for you, then starts your application....or more simply just edit the start up code to do the following:
1) Reset vector calls init code
2) >>>C run time init code configures external RAM<<<
3) C run time init code initialised variables
4) C run time init code calls main().
That way the RAM is available before you try to access it.
However, if all you want to do is have the FreeRTOS heap in external RAM, then you can leave the init code untouched, and just use an appropriate heap implementation - basically one that does not just declare a large static array. For example, if you use heap_5 then all you need to do is ensure the heap init function is called before any allocation is performed, because the heap init just describes which RAM to use as the heap, rather than statically declaring the heap.

explict flush directive with OpenMP: when is it necessary and when is it helpful

One OpenMP directive I have never used and don't know when to use is flush(with and without a list).
I have two questions:
1.) When is an explicit `omp flush` or `omp flush(var1, ...) necessary?
2.) Is it sometimes not necessary but helpful (i.e. can it make the code fast)?
The main reason I can't understand when to use an explicit flush is that flushes are done implicitly after many directives (e.g. as barrier, single, ...) which synchronize the threads. I can't, for example, see way using flush and not synchronizing (e.g. with nowait) would be helpful.
I understand that different compilers may implement omp flush in different ways. Some may interpret a flush with a list as as one without (i.e. flush all shared objects) OpenMP flush vs flush(list). But I only care about what the specification requires. In other words, I want to know where an explicit flush in principle may be necessary or helpful.
Edit: I think I need to clarify my second question. Let me give an example. I would like to know if there are cases where removing an implicit flush (e.g. with nowait) and instead using an explicit flush instead but only on certain shared variables would be faster (and still give the correct result). Something like the following:
float a,b;
#pragma omp parallel
{
#pragma omp for nowait // No barrier. Do not flush on exit.
//code which uses only shared variable a
#pragma omp flush(a) // Flush only variable a rather than all shared variables.
#pragma omp for
//Code which uses both shared variables a and b.
}
I think that code still needs a barrier after the the first for loop but all barriers have an implicit flush so that defeats the purpose. Is it possible to have a barrier which does not do a flush?
The flush directive tells the OpenMP compiler to generate code to make the thread's private view on the shared memory consistent again. OpenMP usually handles this pretty well and does the right thing for typical programs. Hence, there's no need for flush.
However, there are cases where the OpenMP compiler needs some help. One of these cases is when you try to implement your own spin lock. In these cases, you would need a combination of flushes to make things work, since otherwise the spin variables will not be updated. Getting the sequence of flushes correct will be tough and will be very, very error prone.
The general recommendation is that flushes should not be used. If at all, programmers should avoid flush with a list (flush(var,...)) at all means. Some folks are actually talking about deprecating it in future OpenMP.
Performance-wise the impact of flush should be more negative than positive. Since it causes the compiler to generate memory fences and additional load/store operations, I would expect it to slow down things.
EDIT: For your second question, the answer is no. OpenMP makes sure that each thread has a consistent view on the shared memory when it needs to. If threads do not synchronize, they do not need to update their view on the shared memory, because they do not see any "interesting" change there. That means that any read a thread makes does not read any data that has been changed by some other thread. If that would be the case, then you'd have a race condition and a potential bug in your program. To avoid the race, you need to synchronize (which then implies a flush to make each participating thread's view consistent again). A similar argument applies to barriers. You use barriers to start a new epoch in the computation of a parallel region. Since you're keeping the threads in lock-step, you will very likely also have some shared state between the threads that has been computed in the previous epoch.
BTW, OpenMP may keep private data for a thread, but it does not have to. So, it is likely that the OpenMP compiler will keep variables in registers for a while, which causes them to be out of sync with the shared memory. However, updates to array elements are typically reflected pretty soon in the shared memory, since the amount of private storage for a thread is usually small (register sets, caches, scratch memory, etc.). OpenMP only gives you some weak restrictions on what you can expect. An actual OpenMP implementation (or the hardware) may be as strict as it wishes to be (e.g., write back any change immediately and to flushes all the time).
Not exactly an answer, but Michael Klemm's question is closed for comments. I think an excellent example of why flushes are so hard to understand and use properly is the following one copied (and shortened a bit) from the OpenMP Examples:
//http://www.openmp.org/wp-content/uploads/openmp-examples-4.0.2.pdf
//Example mem_model.2c, from Chapter 2 (The OpenMP Memory Model)
int main() {
int data, flag = 0;
#pragma omp parallel num_threads(2)
{
if (omp_get_thread_num()==0) {
/* Write to the data buffer that will be read by thread */
data = 42;
/* Flush data to thread 1 and strictly order the write to data
relative to the write to the flag */
#pragma omp flush(flag, data)
/* Set flag to release thread 1 */
flag = 1;
/* Flush flag to ensure that thread 1 sees S-21 the change */
#pragma omp flush(flag)
}
else if (omp_get_thread_num()==1) {
/* Loop until we see the update to the flag */
#pragma omp flush(flag, data)
while (flag < 1) {
#pragma omp flush(flag, data)
}
/* Values of flag and data are undefined */
printf("flag=%d data=%d\n", flag, data);
#pragma omp flush(flag, data)
/* Values data will be 42, value of flag still undefined */
printf("flag=%d data=%d\n", flag, data);
}
}
return 0;
}
Read the comments and try to understand.

D1's auto and scope difference in memory allocation

D's docs saying that when you use scope for local variables, then they will be allocated on stack (even if you're allocating class instance). But what about auto keyword? Does it guarantee that the instance will be allocated on stack?
void foo() { auto instance = new MyClass();}
void foo() { scope instance = new MyClass();}
So can I suggest that this two statements are equal (in terms of allocation)?
No, auto only infers the type.
There's no point in using auto if you want it to be allocated on the stack; that's what scope is (was) for.
They've brilliantly (read: not so much) decided to remove scope, delete, etc. from the language, so it will probably allocate on the heap anyway. Your best bet is to use the function called scoped in one of the modules, to allocate on the stack.
To answer the second question: in D1 those two statements are not equal. First one allocates on the heap, second one is (supposed) to allocate on the stack.

Call to _freea really necessary?

I am developping on Windows with DevStudio, in C/C++ unmanaged.
I want to allocate some memory on the stack instead of the heap because I don't want to have to deal with releasing that memory manually (I know about smart pointers and all those things. I have a very specific case of memory allocation I need to deal with), similar to the use of A2W() and W2A() macros.
_alloca does that, but it is deprecated. It is suggested to use malloca instead. But _malloca documentation says that a call to ___freea is mandatory for each call to _malloca. It then defeats my purpose to use _malloca, I will use malloc or new instead.
Anybody knows if I can get away with not calling _freea without leaking and what the impacts are internally?
Otherwise, I will end-up just using deprecated _alloca function.
It is always important to call _freea after every call to _malloca.
_malloca is like _alloca, but adds some extra security checks and enhancements for your protection. As a result, it's possible for _malloca to allocate on the heap instead of the stack. If this happens, and you do not call _freea, you will get a memory leak.
In debug mode, _malloca ALWAYS allocates on the heap, so also should be freed.
Search for _ALLOCA_S_THRESHOLD for details on how the thresholds work, and why _malloca exists instead of _alloca, and it should make sense.
Edit:
There have been comments suggesting that the person just allocate on the heap, and use smart pointers, etc.
There are advantages to stack allocations, which _malloca will provide you, so there are reasons for wanting to do this. _alloca will work the same way, but is much more likely to cause a stack overflow or other problem, and unfortunately does not provide nice exceptions, but rather tends to just tear down your process. _malloca is much safer in this regard, and protects you, but the cost is that you still need to free your memory with _freea since it's possible (but unlikely in release mode) that _malloca will choose to allocate on the heap instead of the stack.
If your only goal is to avoid having to free memory, I would recommend using a smart pointer that will handle the freeing of memory for you as the member goes out of scope. This would assign memory on the heap, but be safe, and prevent you from having to free the memory. This will only work in C++, though - if you're using plain ol' C, this approach will not work.
If you are trying to allocate on the stack for other reasons (typically performance, since stack allocations are very, very fast), I would recommend using _malloca and living with the fact that you'll need to call _freea on your values.
Another thing to consider is using an RAII class to manage the allocation - of course that's only useful if your macro (or whatever) can be restricted to C++.
If you want to avoid hitting the heap for performance reasons, take a look at the techniques used by Matthew Wilson's auto_buffer<> template class (http://www.stlsoft.org/doc-1.9/classstlsoft_1_1auto__buffer.html). This will allocate on the stack unless your runtime size request exceeds a size specified at compiler time - so you get the speed of no heap allocation for the majority of allocations (if you size the template right), but everything still works correctly if your exceed that size.
Since STLsoft has a whole lot of cruft to deal with portability issues, you may want to look at a simpler version of auto_buffer<> which is described in Wilson's book, "Imperfect C++".
I found it quite handy in an embedded project.
To allocate memory on the stack, simply declare a variable of the appropriate type and size.
I answered this before, but I'd missed something fundamental that meant that it only worked in debug mode. I moved the call to _malloca into the constructor of a class that would auto-free.
In debug this is fine, as it always allocates on the heap. However, in release, it allocates on the stack, and upon returning from the constructor, the stack pointer has been reset, and the memory lost.
I went back and took a different approach, resulting in a combination of using a macro (eurgh) to allocate the memory and instantiate an object that will automatically call _freea on that memory. As it's a macro, it's allocated in the same stack frame, and so will actually work in release mode. It's just as convenient as my class, but slightly less nice to use.
I did the following:
class EXPORT_LIB_CLASS CAutoMallocAFree
{
public:
CAutoMallocAFree( void *pMem ) : m_pMem( pMem ) {}
~CAutoMallocAFree() { _freea( m_pMem ); }
private:
void *m_pMem;
CAutoMallocAFree();
CAutoMallocAFree( const CAutoMallocAFree &rhs );
CAutoMallocAFree &operator=( const CAutoMallocAFree &rhs );
};
#define AUTO_MALLOCA( Var, Type, Length ) \
Type* Var = (Type *)( _malloca( ( Length ) * sizeof ( Type ) ) ); \
CAutoMallocAFree __MALLOCA_##Var( (void *) Var );
This way I can allocate using the following macro call, and it's released when the instantiated class goes out of scope:
AUTO_MALLOCA( pBuffer, BYTE, Len );
Ar.LoadRaw( pBuffer, Len );
My apologies for posting something that was plainly wrong!
If you're using _malloca() then you must call _freea() to prevent memory leak because _malloca() can do the allocation either on stack or heap. It resorts to allocate on heap if the given size exceeds_ALLOCA_S_THRESHOLD value. Thus, it's safer to call _freea() which won't do anything if allocation happened on stack.
If you're using _alloca() which seems to be deprecated as of today; there is no need to call _freea() as the allocation happens on stack.
If your concern is having to free temp memory, and you know all about things like smart-pointers then why not use a similar pattern where memory is freed when it goes out of scope?
template <class T>
class TempMem
{
TempMem(size_t size)
{
mAddress = new T[size];
}
~TempMem
{
delete [] mAddress;
}
T* mAddress;
}
void foo( void )
{
TempMem<int> buffer(1024);
// alternatively you could override the T* operator..
some_memory_stuff(buffer.mAddress);
// temp-mem auto-freed
}

Resources