Dynamic allocation algorithm for NOR Flash memories - memory-management

I need to implement a dynamic allocation algorithm for a NOR flash memory.
The idea is to implement a memory management algorithm and an abstraction mechanism that allows you to dynamically allocate rewritable blocks that are smaller than the Erasable Block Size.
I would like to know whether there are existing case studies or known algorithms from which to take inspiration.
Thanks

Dynamic allocation is related to how you write your program: you want to allocate memory at runtime depending on the execution context, not once and for all when your program boots. It requires purely software mechanisms: essentially, reserve a dedicated area of memory and implement allocation/free functions.
It is not really related to the kind of memory you are using.
I suppose you could implement malloc/free functions for Flash memory. However, Flash is not designed to endure many erase/program cycles.
If you really need that, you should have a look at the concept of a Flash Translation Layer. It is a kind of library providing a virtual memory space on the Flash. It aims to reduce Flash wear and to improve write time.
But again, you should really question whether you actually need dynamic allocation in Flash.
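For a concrete starting point, here is a rough sketch (not a production implementation) of the log-structured trick commonly used by FTLs and NOR-oriented file systems to emulate rewritable sub-blocks: records are appended into erased space inside an erase block, and "freeing" only clears a status bit, which NOR allows without an erase. The flash_read/flash_program functions, the sizes and the status values are hypothetical placeholders for your flash HAL; alignment, wear leveling and garbage collection into a spare block are left out.

/* Sketch: log-style sub-block allocation inside one NOR erase block.
 * flash_read/flash_program are hypothetical HAL functions. NOR lets you
 * program individual bits from 1 to 0 without erasing, which is what the
 * status field relies on. */
#include <stdint.h>
#include <stddef.h>

#define ERASE_BLOCK_SIZE  4096u
#define STATUS_FREE       0xFFFFu   /* erased state                  */
#define STATUS_VALID      0x7FFFu   /* one bit cleared: record live  */
#define STATUS_DELETED    0x3FFFu   /* second bit cleared: obsolete  */

typedef struct {
    uint16_t status;   /* FREE -> VALID -> DELETED, each step only clears bits */
    uint16_t size;     /* payload size in bytes */
} record_hdr_t;

extern void flash_read(uint32_t addr, void *buf, size_t len);
extern void flash_program(uint32_t addr, const void *buf, size_t len);

/* Append a new record; returns its address, or 0 if the block is full. */
uint32_t flash_alloc(uint32_t block_base, const void *data, uint16_t size)
{
    uint32_t addr = block_base;
    record_hdr_t hdr;

    while (addr + sizeof hdr + size <= block_base + ERASE_BLOCK_SIZE) {
        flash_read(addr, &hdr, sizeof hdr);
        if (hdr.status == STATUS_FREE) {            /* found erased space */
            record_hdr_t new_hdr = { STATUS_VALID, size };
            flash_program(addr, &new_hdr, sizeof new_hdr);
            flash_program(addr + sizeof new_hdr, data, size);
            return addr;
        }
        addr += (uint32_t)(sizeof hdr + hdr.size);  /* skip existing record */
    }
    return 0;  /* full: reclaim by copying live records to a spare block */
}

/* "Freeing" only clears a status bit; space is reclaimed later by copying
 * all VALID records into a freshly erased block and erasing this one. */
void flash_free(uint32_t record_addr)
{
    uint16_t status = STATUS_DELETED;
    flash_program(record_addr, &status, sizeof status);
}

Roughly the same pattern underlies most FTL and NOR file-system designs; what they add on top is wear leveling across blocks and crash-safe garbage collection.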

Related

Memory Pool in FreeRTOS like in uC/OS II

Recently I wrote a C application for a MicroBlaze and I used uC/OS-II. uC/OS-II offers memory pools to allocate and deallocate blocks of memory with fixed size. I'm now writing a C application for an STM32, this time using FreeRTOS. It seems like FreeRTOS doesn't offer the same mechanism, or did I miss something? I think the five heap implementations are not what I am looking for.
If there are actually no memory pools, is there any specific reason why?
The original version of FreeRTOS used memory pools. However it was found that users struggled to dimension the pools, which led to a constant stream of support requests. Also, as the original versions of FreeRTOS were intended for very RAM constrained systems, it was found that the RAM wasted by the use of oversized pools was not acceptable. It was therefore decided to move memory allocation to the portable layer, on the understanding that no one scheme is suitable for more than a subset of applications, and allowing users to provide their own scheme. As you mention, there are five example implementations provided, which cover nearly all applications, but if you absolutely must use a memory pool implementation, then you can easily add this by providing your own pvPortMalloc() and vPortFree() implementations (memory pools being one of the easier ones to implement).
Also note that, in FreeRTOS V9, you need not have any memory allocation scheme as everything can be statically allocated.
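If you do want uC/OS-II-style fixed-size pools in FreeRTOS, a minimal sketch of a pool exposed through the portable-layer hooks could look like the following. The block size and count are illustrative placeholders and would have to be dimensioned for the tasks and queues your application actually creates; a real implementation would also honour portBYTE_ALIGNMENT and the allocation-failure hook.

/* Sketch: fixed-size block pool behind FreeRTOS's portable-layer hooks. */
#include "FreeRTOS.h"
#include "task.h"
#include <stddef.h>

#define POOL_BLOCK_SIZE   128      /* bytes per block (illustrative)  */
#define POOL_NUM_BLOCKS   64       /* number of blocks (illustrative) */

typedef union pool_block {
    union pool_block *next;                 /* free-list link when unused */
    unsigned char payload[POOL_BLOCK_SIZE];
} pool_block_t;

static pool_block_t xPool[POOL_NUM_BLOCKS];
static pool_block_t *pxFreeList = NULL;
static int xPoolInitialised = 0;

static void prvPoolInit(void)
{
    for (int i = 0; i < POOL_NUM_BLOCKS; i++) {
        xPool[i].next = pxFreeList;         /* thread every block onto the free list */
        pxFreeList = &xPool[i];
    }
    xPoolInitialised = 1;
}

void *pvPortMalloc(size_t xWantedSize)
{
    void *pvReturn = NULL;

    vTaskSuspendAll();                      /* same mutual exclusion heap_4.c uses */
    {
        if (!xPoolInitialised) {
            prvPoolInit();
        }
        if (xWantedSize <= POOL_BLOCK_SIZE && pxFreeList != NULL) {
            pvReturn = pxFreeList;
            pxFreeList = pxFreeList->next;
        }
    }
    (void)xTaskResumeAll();

    return pvReturn;                        /* NULL if too large or pool exhausted */
}

void vPortFree(void *pv)
{
    if (pv == NULL) {
        return;
    }
    vTaskSuspendAll();
    {
        pool_block_t *pxBlock = (pool_block_t *)pv;
        pxBlock->next = pxFreeList;         /* push the block back onto the free list */
        pxFreeList = pxBlock;
    }
    (void)xTaskResumeAll();
}

Because the kernel itself allocates through pvPortMalloc(), the pool has to be large enough for every TCB, stack and queue created at run time, which is exactly the dimensioning problem described above.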

GPU access to system RAM

I am currently involved in developing a large scientific computing project, and I am exploring the possibility of hardware acceleration with GPUs as an alternative to the MPI/cluster approach. We are in a mainly memory-bound situation, with more data than can fit in GPU memory. To this end, I have two questions:
1) The books I have read say that it is illegal to access memory on the host with a pointer on the device (for obvious reasons). Instead, one must copy the memory from the host's memory to the device memory, then do the computations, and then copy back. My question is whether there is a work-around for this -- is there any way to read a value in system RAM from the GPU?
2) More generally, what algorithms/solutions exist for optimizing the data transfer between the CPU and the GPU during memory-bound computations such as these?
Thanks for your help in this! I am enthusiastic about making the switch to CUDA, simply because the parallelization is much more intuitive!
1) Yes, you can do this with most GPGPU packages.
The one I'm most familiar with, the AMD Stream SDK, lets you allocate a buffer in "system" memory and use that as a texture that is read or written by your kernel. CUDA and OpenCL have the same ability; the key is to set the correct flags on the buffer allocation.
BUT...
You might not want to do that because the data is being read/written across the PCIe bus, which has a lot of overhead.
The implementation is free to interpret your requests liberally. You can tell it to locate the buffer in system memory, but the software stack is free to do things like relocate it into GPU memory on the fly, as long as the computed results are the same.
2) All of the major GPGPU software environments (CUDA, OpenCL, the Stream SDK) support DMA transfers, which is what you probably want.
Even if you could do this, you probably wouldn't want to, since transfers over PCI-whatever will tend to be a bottleneck, whereas bandwidth between the GPU and its own memory is typically very high.
Having said that, if you have relatively little computation to perform per element on a large data set then GPGPU is probably not going to work well for you anyway.
I suggest the CUDA Programming Guide; you will find many answers there.
Look at streams, unified addressing, and cudaHostRegister.
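To make the zero-copy option concrete, here is a hedged host-side sketch using the CUDA runtime's mapped pinned memory (the classic approach before unified addressing made it largely automatic); the buffer size and the kernel that would consume d_data are placeholders. Every device-side access to such a buffer still crosses PCIe, so it mainly pays off for sparse access or when overlapped with computation.

/* Sketch: letting the GPU read system RAM via mapped ("zero-copy") pinned
 * host memory with the CUDA runtime API. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *h_data = NULL;   /* host pointer                     */
    float *d_data = NULL;   /* device alias of the same memory  */

    /* Must be set before any mapped buffers are allocated. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Pinned, mapped host allocation: the GPU can dereference it directly. */
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);

    for (size_t i = 0; i < n; i++) {
        h_data[i] = (float)i;               /* fill it on the CPU side */
    }

    /* Device-side pointer aliasing the same physical host memory; a kernel
       launched with d_data as an argument reads system RAM over PCIe. */
    cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

    /* ... launch kernels that take d_data as an argument ... */
    (void)d_data;

    cudaFreeHost(h_data);
    return 0;
}

cudaHostRegister() does the same for memory you have already allocated with malloc(), and streams plus cudaMemcpyAsync() are the usual way to overlap explicit transfers with kernel execution.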

Windows memory allocation questions

I am currently looking into malloc() implementation under Windows. But in my research I have stumbled upon things that puzzled me:
First, I know that at the API level, Windows mostly uses the HeapAlloc() and VirtualAlloc() calls to allocate memory. I gather from here that the Microsoft implementation of malloc() (the one included in the CRT, the C runtime) basically calls HeapAlloc() for blocks > 480 bytes and otherwise manages a special area allocated with VirtualAlloc() for small allocations, in order to prevent fragmentation.
Well, that is all well and good. But then there are other implementations of malloc(), for instance nedmalloc, which claim to be up to 125% faster than Microsoft's malloc.
All this makes me wonder a few things:
Why can't we just call HeapAlloc() for small blocks? Does it perform poorly with regard to fragmentation (for example by doing "first-fit" instead of "best-fit")?
Actually, is there any way to know what is going under the hood of the various API allocation calls? That would be quite helpful.
What makes nedmalloc so much faster than Microsoft's malloc?
From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then manage the allocated memory itself. Is that assumption true? Or is the malloc() "wrapper" just needed because of fragmentation? One would think that system calls like this would be quick, or at least that some thought would have been put into making them efficient.
If it is true, why is it so?
On average, how many memory reads/writes (as an order of magnitude) are performed by a typical malloc call (probably a function of the number of already allocated segments)? I would intuitively say it's in the tens for an average program; am I right?
Calling HeapAlloc() directly doesn't sound cross-platform, and MS is free to change its implementation whenever it wishes; I'd advise staying away from it. :)
It is probably using memory pools more effectively, much like the Loki library does with its "small object allocator".
Heap allocations, which are general purpose by nature, are always slow via any implementation. The more "specialized" the allocator, the faster it will be. This returns us to point #2, which deals with memory pools (and the allocation sizes used that are specific to your application).
Don't know.
From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then to manage the allocated memory itself. Is that assumption true?
The OS-level system calls are designed and optimized for managing the entire memory space of processes. Using them to allocate 4 bytes for an integer is indeed suboptimal - you get overall better performance and memory usage by managing tiny allocations in library code, and letting the OS optimize for larger allocations. At least as far as I understand it.
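As a toy illustration of that last point (and emphatically not what the CRT actually does), here is a sketch of a bump allocator that asks the OS for one large region via VirtualAlloc() and then serves small requests from it with no further system calls; freeing, growth and thread safety are deliberately omitted.

/* Sketch: serve small allocations from one big VirtualAlloc'd arena so the
 * hot path never enters the kernel. */
#include <windows.h>
#include <stddef.h>

#define ARENA_SIZE (1024 * 1024)            /* one OS call covers ~1 MB of requests */

static unsigned char *g_arena = NULL;
static size_t g_used = 0;

void *small_alloc(size_t size)
{
    if (g_arena == NULL) {
        /* The only system call: reserve and commit the whole arena once. */
        g_arena = (unsigned char *)VirtualAlloc(NULL, ARENA_SIZE,
                                                MEM_RESERVE | MEM_COMMIT,
                                                PAGE_READWRITE);
        if (g_arena == NULL) {
            return NULL;
        }
    }

    size = (size + 15) & ~(size_t)15;       /* keep 16-byte alignment */
    if (g_used + size > ARENA_SIZE) {
        return NULL;                        /* a real allocator would grow or recycle freed space */
    }

    void *p = g_arena + g_used;             /* bump the pointer: a handful of reads/writes */
    g_used += size;
    return p;
}

Real allocators add free lists, size classes and locking on top of this, which is where the extra bookkeeping work per call comes from.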

Where are memory management algorithms used?

There is a set of memory management algorithms used in operating system construction, like paging, segmentation, paged segmentation, segmented paging, and others.
Do you know if they are used outside that area, in software that is not so low-level? Are they used in business applications?
These algorithms are for translating program memory addresses into physical memory addresses. You will very rarely have to think about them in an application. In some extreme cases of applications working on very large datasets, you may have to create a driver-like module to tune memory translation, but everything else is still up to the operating system.
You might never write an OS yourself, but if you ever find yourself having to write a device driver, it will be imperative that you understand these issues. So it is still quite useful to understand how these algorithms work.
Now you might be in school thinking, "Yuck, I'll just avoid that stuff." But you really have no idea where a 40-year career in the industry might take you.

Memory Allocation/Deallocation Bottleneck?

How much of a bottleneck is memory allocation/deallocation in typical real-world programs? Answers from any type of program where performance typically matters are welcome. Are decent implementations of malloc/free/garbage collection fast enough that it's only a bottleneck in a few corner cases, or would most performance-critical software benefit significantly from trying to keep the amount of memory allocations down or having a faster malloc/free/garbage collection implementation?
Note: I'm not talking about real-time stuff here. By performance-critical, I mean stuff where throughput matters, but latency doesn't necessarily.
Edit: Although I mention malloc, this question is not intended to be C/C++ specific.
It's significant, especially as fragmentation grows and the allocator has to hunt harder across larger heaps for the contiguous regions you request. Most performance-sensitive applications typically write their own fixed-size block allocators (e.g., they ask the OS for memory 16 MB at a time and then parcel it out in fixed blocks of 4 KB, 16 KB, etc.) to avoid this issue.
In games I've seen calls to malloc()/free() consume as much as 15% of the CPU (in poorly written products), or with carefully written and optimized block allocators, as little as 5%. Given that a game has to have a consistent throughput of sixty hertz, having it stall for 500ms while a garbage collector runs occasionally isn't practical.
Nearly every high performance application now has to use threads to exploit parallel computation. This is where the real memory allocation speed killer comes in when writing C/C++ applications.
In a C or C++ application, malloc/new must take a lock on the global heap for every operation. Even without contention, locks are far from free and should be avoided as much as possible.
Java and C# are better at this because threading was designed in from the start and the memory allocators work from per-thread pools. This can be done in C/C++ as well, but it isn't automatic.
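A minimal sketch of that per-thread-pool idea in C (assuming a C11 compiler for _Thread_local): each thread recycles fixed-size nodes through its own thread-local free list, so the common path never touches the global heap lock. The node size and names are made up, and memory parked on a thread's list is only handed back to the system when the process exits.

/* Sketch: a lock-avoiding fast path via per-thread free lists. */
#include <stdlib.h>

#define NODE_SIZE 64                        /* illustrative fixed allocation size */

typedef struct node { struct node *next; } node_t;

/* One free list per thread: no lock needed on the fast path. */
static _Thread_local node_t *tls_free_list = NULL;

void *node_alloc(void)
{
    node_t *n = tls_free_list;
    if (n != NULL) {                        /* fast path: reuse a recycled node */
        tls_free_list = n->next;
        return n;
    }
    return malloc(NODE_SIZE);               /* slow path: global heap, takes its lock */
}

void node_free(void *p)
{
    node_t *n = (node_t *)p;                /* push onto the calling thread's list */
    n->next = tls_free_list;
    tls_free_list = n;
}

Production allocators such as tcmalloc and jemalloc generalize this with per-thread caches over multiple size classes.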
First off, since you said malloc, I assume you're talking about C or C++.
Memory allocation and deallocation tend to be a significant bottleneck for real-world programs. A lot goes on "under the hood" when you allocate or deallocate memory, and all of it is system-specific; memory may actually be moved or defragmented, pages may be reorganized--there's no platform-independent way to know what the impact will be. Some systems (like a lot of game consoles) also don't do memory defragmentation, so on those systems, you'll start to get out-of-memory errors as memory becomes fragmented.
A typical workaround is to allocate as much memory up front as possible, and hang on to it until your program exits. You can either use that memory to store big monolithic sets of data, or use a memory pool implementation to dole it out in chunks. Many C/C++ standard library implementations do a certain amount of memory pooling themselves for just this reason.
No two ways about it, though--if you have a time-sensitive C/C++ program, doing a lot of memory allocation/deallocation will kill performance.
In general the cost of memory allocation is probably dwarfed by lock contention, algorithmic complexity, or other performance issues in most applications. In general, I'd say this is probably not in the top-10 of performance issues I'd worry about.
Now, grabbing very large chunks of memory might be an issue. And grabbing but not properly getting rid of memory is something I'd worry about.
In Java and JVM-based languages, new'ing objects is now very, very, very fast.
Here's one decent article by a guy who knows his stuff with some references at the bottom to more related links:
http://www.ibm.com/developerworks/java/library/j-jtp09275.html
A Java VM will claim and release memory from the operating system pretty much independently of what the application code is doing. This allows it to grab and release memory in large chunks, which is hugely more efficient than doing it in tiny individual operations, as you get with manual memory management.
This article was written in 2005, and JVM-style memory management was already streets ahead. The situation has only improved since then.
"Which language boasts faster raw allocation performance, the Java language, or C/C++? The answer may surprise you -- allocation in modern JVMs is far faster than the best performing malloc implementations. The common code path for new Object() in HotSpot 1.4.2 and later is approximately 10 machine instructions (data provided by Sun; see Resources), whereas the best performing malloc implementations in C require on average between 60 and 100 instructions per call (Detlefs, et. al.; see Resources). And allocation performance is not a trivial component of overall performance -- benchmarks show that many real-world C and C++ programs, such as Perl and Ghostscript, spend 20 to 30 percent of their total execution time in malloc and free -- far more than the allocation and garbage collection overhead of a healthy Java application."
In Java (and potentially other languages with a decent GC implementation), allocating an object is very cheap. In the Sun JVM it only needs 10 CPU cycles. A malloc in C/C++ is much more expensive, simply because it has to do more work.
Still, even though allocating objects in Java is very cheap, doing so for a lot of users of a web application in parallel can still lead to performance problems, because more garbage collector runs will be triggered.
So there are indirect costs of an allocation in Java, caused by the deallocation done by the GC. These costs are difficult to quantify because they depend very much on your setup (how much memory you have) and your application.
In terms of performance, allocating and releasing memory are relatively costly operations. The calls in modern operating systems have to go all the way down to the kernel so that the operating system can deal with virtual memory, paging/mapping, execution protection, etc.
On the other hand, almost all modern programming languages hide these operations behind "allocators" that work with pre-allocated buffers.
This concept is also used by most applications that have a focus on throughput.
I know I answered earlier; however, that was a reply to the other answers, not to your question.
To speak to you directly: if I understand correctly, your performance criterion is throughput.
To me, that means you should be looking almost exclusively at NUMA-aware allocators.
None of the earlier references (the IBM JVM paper, MicroQuill C, the Sun JVM) covers this point, so I am highly suspicious of their applicability today, where, at least on the AMD ABI, NUMA is the pre-eminent arbiter between memory and CPU.
Hands down: real world, fake world, whatever world, NUMA-aware memory request/use techniques are faster. Unfortunately, I'm running Windows at the moment, and I have not found an equivalent of the "numastat" tool that is available on Linux.
A friend of mine has written about this in depth in his implementation for the FreeBSD kernel.
Although I can show, ad hoc, the typically very large number of local-node memory requests compared with remote-node ones (underscoring the obvious throughput advantage), you can certainly benchmark this yourself, and that is likely what you will need to do, since your performance characteristics are going to be highly specific.
I do know that, in a lot of ways, at least the earlier 5.x versions of VMware fared rather poorly, at that time at least, by not taking advantage of NUMA and frequently demanding pages from the remote node. However, VMs are a very unique beast when it comes to memory compartmentalization or containerization.
One of the references I cited is to Microsoft's API implementation for the AMD ABI, which has NUMA-specialized allocation interfaces for user-land application developers to exploit (see the sketch at the end of this answer) ;)
Here's a fairly recent analysis, visuals and all, from some browser add-on developers who compare four different heap implementations. Naturally the one they developed comes out on top (odd how the people who do the testing often end up with the highest scores).
They do cover, in some ways quantifiably, at least for their use case, what the exact space/time trade-off is: generally they found that the LFH (which, by the way, is apparently simply a mode of the standard heap) or a similarly designed approach consumes significantly more memory up front but over time may wind up using less. The graphics are neat too.
I would think, however, that selecting a heap implementation based on your typical workload, after you understand it well, is a good idea ;) but first make sure your basic operations are correct before you optimize these odds and ends ;)
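For what it's worth, here is a hedged sketch of the NUMA-specialized Windows interface mentioned above, using the documented VirtualAllocExNuma() call (available since Windows Vista). The node number and the 64 MB size are arbitrary, and the preferred node is a hint, not a guarantee.

/* Sketch: committing memory with a preferred NUMA node on Windows. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest_node = 0;
    GetNumaHighestNodeNumber(&highest_node);
    printf("NUMA nodes: %lu\n", highest_node + 1);

    /* Commit 64 MB with node 0 as the preferred node; the OS treats the
       node as a preference rather than a hard placement. */
    void *p = VirtualAllocExNuma(GetCurrentProcess(), NULL, 64u << 20,
                                 MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                 0 /* preferred node */);
    if (p == NULL) {
        printf("allocation failed: %lu\n", GetLastError());
        return 1;
    }

    /* ... pin the threads that touch this buffer to node 0's processors ... */

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}

The point of the answer above stands either way: keeping requests on the local node is a throughput question you can only settle by benchmarking your own workload.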
This is where C/C++'s memory allocation system works best. The default allocation strategy is OK for most cases, but it can be changed to suit whatever is needed. In GC'd systems there's not a lot you can do to change allocation strategies. Of course, there is a price to pay, and that's the need to track allocations and free them correctly. C++ takes this further: the allocation strategy can be specified per class using operator new:
class AClass
{
public:
    void *operator new (size_t size);        // called whenever there's a new AClass
    void *operator new [] (size_t size);     // called whenever there's a new AClass []
    void operator delete (void *memory);     // if you define new, you really need to define delete as well
    void operator delete [] (void *memory);  // likewise for the array form
};
Many of the STL templates allow you to define custom allocators as well.
As with all things to do with optimisation, you must first determine, through run time analysis, if memory allocation really is the bottleneck before writing your own allocators.
According to the MicroQuill SmartHeap Technical Specification, "a typical application [...] spends 40% of its total execution time on managing memory". You can take this figure as an upper bound; I personally feel that a typical application spends more like 10-15% of execution time allocating/deallocating memory. It rarely is a bottleneck in single-threaded applications.
In multithreaded C/C++ applications standard allocators become an issue due to lock contention. This is where you start to look for more scalable solutions. But keep in mind Amdahl's Law.
Pretty much all of you are off base if you are talking about the Microsoft heap. Synchronization is handled effortlessly, as is fragmentation.
The currently preferred heap is the LFH (LOW FRAGMENTATION HEAP). It is the default in Vista and later operating systems and can be enabled on XP via gflags without much trouble.
It is easy to avoid locking/blocking/contention/bus-bandwidth issues and the lot with the HEAP_NO_SERIALIZE option during HeapAlloc or HeapCreate. This allows you to create and use a heap without entering an interlocked wait.
I would recommend creating several heaps with HeapCreate and defining a macro, perhaps mallocx(enum my_heaps_set, size_t); of course, you also need realloc and free to be set up as appropriate. If you want to get fancy, make free/realloc auto-detect which heap handle to use by evaluating the address of the pointer, or even add some logic to let malloc identify which heap to use based on its thread ID, building a hierarchy of per-thread heaps and shared global heaps/pools. (A sketch of this multi-heap approach follows at the end of this answer.)
The Heap* APIs are called internally by malloc/new.
Here's a nice article on some dynamic memory management issues, with some even nicer references, for instrumenting and analyzing heap activity.
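A small sketch of the multi-heap approach described above, with the enum values, heap names and mallocx written as plain functions purely for illustration; HEAP_NO_SERIALIZE is only safe if each heap is touched by a single thread (or you serialize access yourself).

/* Sketch: several private Win32 heaps addressed through a tiny wrapper. */
#include <windows.h>
#include <stddef.h>

enum my_heaps_set { MY_HEAP_NETWORK, MY_HEAP_RENDER, MY_HEAP_COUNT };  /* illustrative names */

static HANDLE g_heaps[MY_HEAP_COUNT];

void heaps_init(void)
{
    for (int i = 0; i < MY_HEAP_COUNT; i++) {
        /* No serialization: each heap must only ever be used by one thread. */
        g_heaps[i] = HeapCreate(HEAP_NO_SERIALIZE,
                                0 /* initial size */, 0 /* growable */);
    }
}

void *mallocx(enum my_heaps_set which, size_t size)
{
    return HeapAlloc(g_heaps[which], 0, size);
}

void freex(enum my_heaps_set which, void *p)
{
    HeapFree(g_heaps[which], 0, p);
}

Auto-detecting the owning heap in freex(), as suggested above, would mean tracking which address ranges belong to which heap handle.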
Others have covered C/C++ so I'll just add a little information on .NET.
In .NET, heap allocation is generally really fast, as it is just a matter of grabbing memory in the generation-zero part of the heap. Obviously this cannot go on forever, which is where garbage collection comes in. Garbage collection may affect the performance of your application significantly, since user threads must be suspended during compaction of memory. The fewer full collections, the better.
There are various things you can do to affect the workload of the garbage collector in .NET. Generally, if you have a lot of memory references, the garbage collector will have to do more work. E.g., by implementing a graph using an adjacency matrix instead of references between nodes, the garbage collector will have to analyze fewer references.
Whether that is actually significant in your application or not depends on several factors and you should profile the application with actual data before turning to such optimizations.
