Windows memory allocation questions

Windows memory allocation questions - windows

I am currently looking into malloc() implementation under Windows. But in my research I have stumbled upon things that puzzled me:
First, I know that at the API level, windows uses mostly the HeapAlloc() and VirtualAlloc() calls to allocate memory. I gather from here that the Microsoft implementation of malloc() (that which is included in the CRT - the C runtime) basically calls HeapAlloc() for blocks > 480 bytes and otherwise manage a special area allocated with VirtualAlloc() for small allocations, in order to prevent fragmentation.
Well that is all good and well. But then there are other implementation of malloc(), for instance nedmalloc, which claim to be up to 125% faster than Microsoft's malloc.
All this makes me wonder a few things:
Why can't we just call HeapAlloc() for small blocks? Does is perform poorly in regard to fragmentation (for example by doing "first-fit" instead of "best-fit")?
Actually, is there any way to know what is going under the hood of the various API allocation calls? That would be quite helpful.
What makes nedmalloc so much faster than Microsoft's malloc?
From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then to manage the allocated memory itself. Is that assumption true? Or is the malloc() "wrapper" just needed because of fragmentation? One would think that system calls like this would be quick - or at least that some thoughts would have been put into them to make them efficient.
If it is true, why is it so?
On average, how many (an order of magnitude) memory reads/write are performed by a typical malloc call (probably a function of the number of already allocated segments)? I would intuitively says it's in the tens for an average program, am I right?

Calling HeapAlloc doesn't sound cross-platform. MS is free to change their implementation when they wish; advise to stay away. :)
It is probably using memory pools more effectively, much like the Loki library does with its "small object allocator"
Heap allocations, which are general purpose by nature, are always slow via any implementation. The more "specialized" the allocator, the faster it will be. This returns us to point #2, which deals with memory pools (and the allocation sizes used that are specific to your application).
Don't know.

From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then to manage the allocated memory itself. Is that assumption true?
The OS-level system calls are designed and optimized for managing the entire memory space of processes. Using them to allocate 4 bytes for an integer is indeed suboptimal - you get overall better performance and memory usage by managing tiny allocations in library code, and letting the OS optimize for larger allocations. At least as far as I understand it.

Related

How to deal with external fragmentation, how paging helps with external fragmentation?

I know that there is a lot of questions regarding the issue I'm pointing here, but I couldn't find any complex answer (neither on StackOverflow nor in other sources).
I would like to ask about heap (RAM) fragmentation problem.
As I understood there are two kind of fragmentation:
internal - related with difference between allocation unit size (AU) and the size of the allocated memory AM (waste memory is equal to AM % AU),
external - related with noncontinuous areas of a free memory, so even if the sum of the free memory areas can handle the new allocation request, it fails if there is no continues area that can handle it.
This is quite clear. The problems start when the "paging" appears.
Sometimes I can find an information that paging solves the external fragmentation issue.
Indeed I agree that thanks to paging the OS is able to create the virtually continues areas of the memory, assigned to the process, even if physically the parts of the memory are scattered.
But how exactly does it help with the external fragmentation?
I mean, assuming that the size of a page has 4kB, and we want to allocate 16 kB, then of course we just need to find four empty pages frames, even if physically the frames are not a part of a continues area.
But what in case of the smaller allocation ?
I believe the page itself can still be fragmented and (in worst case) the OS still needs to provide a new frame if the old one cannot be used to allocate the requested memory.
So is it that (assuming the worst case) sooner or later, with paging or without, the long working application that allocates and releases the heap memory (different sizes) will fall into low-memory condition, because of external fragmentation ?
So the question is how to deal with the external fragmentation?
Own implementation of allocation algorithm ? Paging (as I wrote, not sure it helps) ? What else ? Does OS (Windows, Linux) provides some defragmentation methods ?
The most radical solution is to forbid using of the heap, but is it really necessary for the platforms with paging, virtual address spaces, virtual memory etc ... and the only issue is that the applications need to run unstoppable for a years ?
One more issue.. is internal fragmentation an ambiguous term ?
Somewhere I have spotted the definition that internal fragmentation points to the part of page frame, that is wasted because the process does not need more memory, but the single frame cannot be owned by more than a single processes.
I have bolded the questions, so the people who are in hurry could find the question without reading everything.
Regards!

"Fragmentation" is indeed not a very precise term. But we can say for sure that when a running application needs a block of n bytes and there are n or more bytes not in use, yet we can't get the required block, then "memory is too fragmented."
But how exactly does it [paging] help with the external allocation [I assume you mean fragmentation] ?
There's really nothing complicated here. External fragmentation is free memory between allocated blocks that's "too small" to satisfy any application requirement. This is a general concept. The definition of "too small" is application-dependent. Nonetheless, if allocated blocks can fall on any boundary, then it's easy, after many allocations and deallocations, for lots of such fragments to occur. Paging helps with external fragmentation in two ways.
First, it subdivides memory into fixed-size adjacent chunks - the pages - that are "large enough" so they're never useless. Again the definition of "large enough" is not precise. But most applications will have lots of requirements satisfiable by a single 4k page. Since no external fragmentation problem can occur for allocations of a page or less, the problem has been mitigated.
Second, the paging hardware provides a level of indirection between application pages and physical memory pages. Therefore any free physical memory page can be used to help satisfy any application request, no matter how large. For example, suppose you have 100 physical pages with every other physical page (50 of them) allocated. Without page-mapping hardware, the biggest request for contiguous memory that can be satisfied is 1 page. With mapping, it's 50 pages. (I'm disregarding virtual pages allocated initially with no mapped physical page. That's another discussion.)
But what in case of the smaller allocation ?
Again it's pretty simple. If the unit of allocation is a page, then any allocation smaller than a page yields an unused portion. This is internal fragmentation: unusable memory within an allocated block. The bigger you make allocation units (they don't have to be a single page), the more memory will be unusable due to internal fragmentation. On average, this will tend toward half of an allocation unit. Consequently, though OS's tend to allocate in units of pages, most application-side memory allocators request a very small number (often one) of big blocks (of pages) from the OS. They use much smaller allocation units internally: 4-16 bytes is pretty common.
So the question is how to deal with the external allocation [I assume you mean fragmentation] ? So is it that (assuming the worst case) sooner or later, with paging or without, the long working application that allocates and releases the heap memory (different sizes) will fall into low-memory condition, because of external fragmentation ?
If I understand you correctly, you're asking if fragmentation is inevitable. Except under very special conditions (e.g. the application only needs blocks of one size), the answer is yes. But that doesn't mean it's necessarily a problem.
Memory allocators use smart algorithms that limit fragmentation pretty effectively. For example, they may maintain "pools" with different block sizes, using the pool with block size most closely matching a given request. This tends to limit both internal and external fragmentation. A real world example that's very well documented is dlmalloc. The source code is also very clear.
Of course any general purpose allocator can fail under specific conditions. For this reason, modern languages (C++ and Ada are two I know) let you supply special-purpose allocators for objects of a given type. Typically -
for a fixed-size object - these might simply maintain a pre-allcoated free list, so fragmentation for that particular case is zero, and allocation/deallocation are very fast.
One more note: It's possible to totally eliminate fragmentation with copying/compacting garbage collection. Of course this requires underlying language support, and there's a performance bill to pay. A copying garbage collector compacts the heap by moving objects to eliminate unused space completely whenever it runs to reclaim storage. To do this it must update every pointer in the running program to the corresponding object's new location. While this may sound complex, I've implemented a copying garbage collector, and it's not so bad. The algorithms are extremely cool. Unfortunately, the semantics of many languages (e.g. C and C++) don't allow finding every pointer in the running program, which is required.
The most radical solution is to forbid using of the heap, but is it really necessary for the platforms with paging, virtual address spaces, virtual memory etc ... and the only issue is that the applications need to run unstoppable for a years ?
Though general purpose allocators are good, they're not guaranteed. It's not unusual for safety-critical or hard real time constrained systems to avoid heap use completely. On the other hand, when no absolute guarantee is needed, a general purpose allocator is often fine. There are many systems that run perfectly with tough loads for extended periods using general purpose allocators: fragmentation reaches an acceptable steady state and doesn't cause a problem.
One more issue.. is internal fragmentation an ambiguous term ?
The term isn't ambiguous, but is used in different contexts. The invariant is that it's referring to unused memory inside allocated blocks.
OS literature tends to assume the allocation unit is pages. For example, Linux sbrk lets you request the end of the data segment be set anywhere, but Linux allocates pages, not bytes, so the unused part of the last page is internal fragmentation from the OS's point of view.
Application-oriented discussions tend to assume allocation is in "blocks" or "chunks" of arbitrary size. dlmalloc uses about 128 discrete chunk sizes, each maintained in its own free list. Plus, it will custom allocate very large blocks using OS memory mapping system calls, so there's at most a page size (minus 1 byte) of mismatch between request and actual allocation. Clearly it's going to a lot of trouble to minimize internal fragmentation. The fragmentation caused a given allocation is the difference between the request and the chunk actually allocated. Since there are so many chunk sizes, that difference is strictly limited. On the other hand, the many chunk sizes increase chances of external fragmentation problems: free memory may consist entirely of chunks that are well-managed by dlmalloc, yet too small to honor an application requirement.

What data structure is used to implement the dynamic memory allocation heap?

I always assumed a heap (data structure) is used to implement a heap (dynamic memory allocation), but I've been told I'm wrong.
How are heaps (for example, the one implemented by typical malloc routines, or by Windows's HeapCreate, etc.) implemented, typically? What data structures do they use?
What I'm not asking:
While searching online, I've seen tons of descriptions of how to implement heaps with severe restrictions.
To name a few, I've seen lots of descriptions of how to implement:
Heaps that never release memory back to the OS (!)
Heaps that only give reasonable performance on small, similarly-sized blocks
Heaps that only give reasonable performance for large, contiguous blocks
etc.
And it's funny, they all avoid the harder question:
How are "normal", general-purpose heaps (like the one behind malloc, HeapCreate) implemented?
What data structures (and perhaps algorithms) do they use?

Allocators tend to be quite complex and often differ significantly in how they're implemented.
You can't really describe them in terms of one common data structure or algorithm, but there are some common themes:
Memory is taken from the system in large chunks -- often megabytes at a time.
These chunks are then split up into various smaller chunks as you perform allocations. Not exactly the same size as you allocate, but usually in certain ranges (200-250 bytes, 251-500 bytes, etc.). Sometimes this is multi-tiered, where you'd have an additional layer of "medium chunks" which come before your actual requests.
Controlling which "large chunk" to break a piece off of is a very difficult and important thing to do -- this greatly affects memory fragmentation.
One or more free pools (aka "free list", "memory pool", "lookaside list") are maintained for each of these ranges. Sometimes even thread-local pools. This can greatly speed up a pattern of allocating/deallocating many objects of similar size.
Large allocations are treated a bit differently so as to not waste a lot of RAM and not be pooled quite so much if at all.
If you wanted to check out some source code, jemalloc is a modern high-performance allocator and should be representative in complexity of other common ones. TCMalloc is another common general-purpose allocator, and their website goes into all the gory implementation details. Intel's Thread Building Blocks has an allocator built specifically for high concurrency.
One interesting difference can be seen between Windows and *nix. In *nix, the allocator has very low-level control over the address space an app uses. In Windows, you basically have a course-grained, slow allocator VirtualAlloc to base your own allocator off of.
This results in *nix-compatible allocators typically directly giving you an malloc/free implementation where it's assumed you'll only use one allocator for everything (otherwise they'd trample each-other), while Windows-specific allocators provide additional functions, leaving malloc/free alone, and can be used in harmony (for instance, you can use HeapCreate to make private heaps which can work alongside others).
In practice, this trade in flexibility gives *nix allocators a small leg up performance-wise. It's very rare to see an app intentionally use multiple heaps on Windows -- mostly it's by accident due to different DLLs using different runtimes which each have their own malloc/free, and can cause a lot of headaches if you're not diligent in tracking which heap some memory came from.

Note: The following answer assumes you're using a typical, modern system with virtual memory. The C and C++ standards do not require virtual memory; therefore of course you can't rely on such assumptions on hardware without this feature (e.g. GPUs typically don't have this feature; nor do extremely small hardware like the PIC).
This depends on the platform you're using. Heaps can be very complicated beasts; they don't use only a single data structure; and there is no "standard" data structure. Even where the heap code is located is different depending on the platform. For instance, the heap code is typically provided by the C Runtime on Unix boxes; but is typically provided by the operating system on Windows.
Yes, this is common on Unix machines; due to the way *nix's underlying APIs and memory model operate. Basically, the standard API to return memory to the operating system on these systems only allows returning memory on the "frontier" between where user memory is allocated and the "hole" in between user memory and system facilities like the stack. (The API in question is brk or sbrk). Instead of returning memory to the operating system, many heaps only try to reuse memory no longer in use by the program proper, and don't try to return memory to the system. This is less common on Windows, because its equivalent to sbrk (VirtualAlloc) doesn't have this limitation. (But like sbrk, it is very expensive and has caveats like only allocating page-sized and page-aligned chunks. So heaps try to call either as rarely as possible)
This sounds like a "block allocator", which divides the memory into fixed size chunks, and then just return one of the free chunks. To my (albeit limited) understanding, Windows' RtlHeap maintains a number of data structures like this for different known block sizes. (E.g. it'll have one for blocks of size 16, for instance) RtlHeap calls these "lookaside lists".
I don't really know of a specific structure that handles this case well. Large blocks are problematic for most allocation systems because they cause fragmentation of the address space.
The best reference I've found discussing the common allocation strategies employed on major platforms is the book Secure Coding in C and C++, by Robert Seacord. All of chapter 4 is dedicated to heap data structures (and problems caused when users use said heap systems incorrectly).

A "killer adversary" for memory allocators?

After reading this question about seemingly degenerate behavior for the Windows memory allocator, and remembering back to this paper about constructing worst-case inputs to quicksort implementations, I started wondering: would it be possible to build a program that, given a black-box memory allocator, forces that allocator to fail an allocation request even when sufficient memory is still available in the system? That is, is it possible to take a black-box memory allocator and force it to fail?
I know that this can probably be done by allocating and freeing memory in a checkerboard pattern to force massive fragmentation, so in my mind an ideal solution would cause a failure to occur with the fewest total bytes allocated at the time of failure. With respect to the original post that inspired this, it could in theory be possible to cause a failure with zero bytes allocated if the memory allocator has an internal bug.
Any ideas/thoughts on how to do this?

Depends what you mean by "sufficient memory available". For a simple fragmentation "attack":
Make a squillion small allocations until one fails[*].
Now, sort them in order of address[**].
Free 100 alternate allocations.
Attempt to allocate 100*small bytes.
Chances are the allocator will fail to find contiguous memory to satisfy that. If it has a small page size, and plenty of virtual address space compared with physical memory, then it might be able to rearrange things to do it - but that requires capabilities of the MMU on top of any anti-fragmentation strategy by the allocator.
If by "sufficient available memory" you mean a large block of memory that formerly was a contiguous block, has been split up into several allocations all of which have since been freed, and now the allocator treats it as separate blocks and so fails to allocate large bytes then no, I don't think you can force an arbitrary block-box allocator to fail to coalesce blocks. Some allocator or other might do much more work than Windows appears to be doing in that other question, to guarantee that adjacent free blocks are always coalesced.
[*] possible problem - over-committing memory allocators might not fail, you just get a segfault or your process is killed. On such systems you might need to track how much memory is available.
[**] possible problem - in C and C++, operator< isn't guaranteed to work. But on almost all systems it does, and in C++ there's std::less too.

Alternative for Garbage Collector

I'd like to know the best alternative for a garbage collector, with its pros and cons. My priority is speed, memory is less important. If there is garbage collector which doesn't make any pause, let me know.
I'm working on a safe language (i.e. a language with no dangling pointers, checking bounds, etc), and garbage collection or its alternative has to be used.

I suspect you will be best sticking with garbage collection (as per the JVM) unless you have a very good reason otherwise. Modern GCs are extremely fast, general purpose and safe. Unless you can design your language to take advantage of a very specific special case (as in one of the above allocators) then you are unlikely to beat the JVM.
The only really compelling reason I see nowadays as an argument against modern GC is latency issues caused by GC pauses. These are small, rare and not really an issue for most purposes (e.g. I've successfully written 3D engines in Java), but they still can cause problems in very tight realtime situations.
Having said that, there may still be some special cases where a different memory allocation scheme may make sense so I've listed a few interesting options below:
An example of a very fast, specialised memory management approach is the "per frame" allocator used in many games. This works by incrementing a single pointer to allocate memory, and at the end of a time period (typically a visual "frame") all objects are discarded at once by simply setting the pointer back to the base address and overwriting them in the next allocation. This can be "safe", however the constraints of object lifetime would be very strict. Might be a winner if you can guarantee that all memory allocation is bounded in size and only valid for the scope of handling e.g. a single server request.
Another very fast approach is to have dedicated object pools for different classes of object. Released objects can just be recycled in the pool, using something like a linked list of free object slots. Operating systems often used this kind of approach for common data structures. Again however you need to watch object lifetime and explicitly handle disposals by returning objects to the pool.
Reference counting looks superficially good but usually doesn't make sense because you frequently have to dereference and update the count on two objects whenever you change a pointer value. This cost is usually worse than the advantage of having simple and fast memory management, and it also doesn't work in the presence of cyclic references.
Stack allocation is extremely fast and can run safely. Depending on your language, it is possible to make do without a heap and run entirely on a stack based system. However I suspect this will somewhat constrain your language design so that might be a non-starter. Still might be worth considering for certain DSLs.
Classic malloc/free is pretty fast and can be made safe if you have sufficient constraints on object creation and lifetime which you may be able to enforce in your language. An example would be if e.g. you placed significant constraints on the use of pointers.
Anyway - hope this is useful food for thought!

If speed matters but memory does not, then the fastest and simplest allocation strategy is to never free. Allocation is simply a matter of bumping a pointer up. You cannot get faster than that.
Of course, never releasing anything has a huge potential for overflowing available memory. It is very rare that memory is truly "unimportant". Usually there is a large but finite amount of available memory. One strategy is called "region based allocation". Namely you allocate memory in a few big blocks called "regions", with the pointer-bumping strategy. Release occurs only by whole regions. This strategy can be applied with some success if the problem at hand can be structured into successive "tasks", each having its own region.
For more generic solutions, if you want real-time allocation (i.e. guaranteed limits on the response time from allocation requests) then garbage collection is the way to go. A real-time GC may look like this: objects are allocated with a pointer-bumping strategy. Also, on every allocation, the allocator performs a little bit of garbage collection, in which "live" objects are copied somewhere else. In a way the GC runs "at the same time" than the application. This implies a bit of extra work for accessing objects, because you cannot move an object and update all pointers to point to the new object location while keeping the "real-time" promise. Solutions may imply barriers, e.g. an extra indirection. Generational GC allow for barrier-free access to most objects while keeping pause times under strict bounds.
This article is a must-read for whoever wants to study memory allocation, in particular garbage collection.

With C++ it's possible to make a heap allocation ONCE for your objects, then reuse that memory for subsequent objects, I've seen it work and it was blindingly fast.
It's only applicable to a certian set of problems, and it's difficult to do it right, but it is possible.
One of the joys of C++ is you have complete control over memory management, you can decide to use classic new/delete, or implement your own reference counting or Garbage Collection.
However - here be dragons - you really, really need to know what you're doing.

If memory doesn't matter, then what #Thomas says applies. Considering the gargantuan memory spaces of modern hardware, this may very well be a viable option -- it really depends on the process.
Manual memory management doesn't necessarily solve your problems directly, but it does give you complete control over WHEN memory events happen. Generic malloc, for example, is not an O(1) operation. It does all sorts of potentially horrible things in there, both within the heap managed by malloc itself as well as the operating system. For example, ya never know when "malloc(10)" may cause the VM to page something out, now your 10 bytes of RAM have an unknown disk I/O component -- oops! Even worse, that page out could be YOUR memory, which you'll need to immediately page back in! Now c = *p is a disk hit. YAY!
But if you are aware of these, then you can safely set up your code so that all of the time critical parts effectively do NO memory management, instead they work off of pre-allocated structures for the task.
With a GC system, you may have a similar option -- it depends on the collector. I don't think the Sun JVM, for example, has the ability to be "turned off" for short periods of time. But if you work with pre-allocated structures, and call all of your own code (or know exactly what's going on in the library routine you call), you probably have a good chance of not hitting the memory manager.
Because, the crux of the matter is that memory management is a lot of work. If you want to get rid of memory management, the write old school FORTRAN with ARRAYs and COMMON blocks (one of the reasons FORTRAN can be so fast). Of course, you can write "FORTRAN" in most any language.
With modern languages, modern GCs, etc., memory management has been pushed aside and become a "10%" problem. We are now pretty sloppy with creating garbage, copying memory, etc. etc., because the GCs et al make it easy for us to be sloppy. And for 90% of the programs, this is not an issue, so we don't worry about. Nowadays, it's a tuning issue, late in the process.
So, your best bet is set it all up at once, use it, then toss it all away. The "use it" part is where you will get consistent, reliable results (assuming enough memory on the system of course).

As an "alternative" to garbage collection, C++ specifically has smart pointers. boost::shared_ptr<> (or std::tr1::shared_ptr<>) works exactly like Python's reference counted garbage collection. In my eyes, shared_ptr IS garbage collection. (although you may need to do a few weak_ptr<> stuff to make sure that circular references don't happen)
I would argue that auto_ptr<> (or in C++0x, the unique_ptr<>...) is a viable alternative, with its own set of benefits and tradeoffs. Auto_ptr has a clunky syntax and can't be used in STL containers... but it gets the job done. During compile-time, you "move" the ownership of the pointer from variable to variable. If a variable owns the pointer when it goes out of scope, it will call its destructor and free the memory. Only one auto_ptr<> (or unique_ptr<>) is allowed to own the real pointer. (at least, if you use it correctly).
As another alternative, you can store everything on the stack and just pass references around to all the functions you need.
These alternatives don't really solve the general memory management problem that garbage collection solves. Nonetheless, they are efficient and well tested. An auto_ptr doesn't use any more space than the pointer did originally... and there is no overhead on dereferencing an auto_ptr. "Movement" (or assignment in Auto_ptr) has a tiny amount of overhead to keep track of the owner. I haven't done any benchmarks, but I'm pretty sure they're faster than garbage collection / shared_ptr.

If you truly want no pauses at all, disallow all memory allocation except for stack allocation, region-based buffers, and static allocation. Despite what you may have been told, malloc() can actually cause severe pauses if the free list becomes fragmented, and if you often find yourself building massive object graphs, naive manual free can and will lose to stop-and-copy; the only way to really avoid this is to amortize over preallocated pages, such as the stack or a bump-allocated pool that's freed all at once. I don't know how useful this is, but I know that the proprietary graphical programming language LabVIEW by default allocates a static region of memory for each subroutine-equivalent, requiring programmers to manually enable stack allocation; this is the kind of thing that's useful in a hard-real-time environment where you need absolute guarantees on memory usage.
If what you want is to make it easy to reason about pauses and give your developers control over allocation and placement, then there is already a language called Rust that has the same stated goals as your language; while not a completely safe language, it does have a safe subset, allowing you to create safe abstractions for raw bit-twiddling. It uses pointer type annotations to eliminate use-after-free bugs. It also doesn't have null pointers in safe code, because null pointers cost a billion dollars at least.
If bounded pauses are enough, though, there are a wide variety of algorithms that will work. If you really have a small working set compared to available memory, then I would recommend the MOS collector (aka the Train Algorithm), which collects incrementally and provably always makes progress toward freeing unreferenced objects.

It's a common fallacy that managed languages are not suitable for high performance low latency scenarios. Yes, with limited resources (such as an embedded platform) and sloppy programming you can shoot yourself in the foot just as spectacularly as with C++ (and that can be VERY VERY spectacular).
This problem has come whilst developing games in Java/C# and the solution was to utilise a memory pool and not let object die, hence not needing garbage collector to run when you don't expect it. This is really the same approach as with low latency unmanaged systems - TO TRY REALLY REALLY HARD NOT TO ALLOCATE MEMORY.
So, considering the fact that implementing such system in Java/C# is very similar to C++, the advantage of doing it the girly man way(managed), you have the "niceness" of other language features that free up your mental clock cycles to concentrate on important things.

Memory Allocation/Deallocation Bottleneck?

How much of a bottleneck is memory allocation/deallocation in typical real-world programs? Answers from any type of program where performance typically matters are welcome. Are decent implementations of malloc/free/garbage collection fast enough that it's only a bottleneck in a few corner cases, or would most performance-critical software benefit significantly from trying to keep the amount of memory allocations down or having a faster malloc/free/garbage collection implementation?
Note: I'm not talking about real-time stuff here. By performance-critical, I mean stuff where throughput matters, but latency doesn't necessarily.
Edit: Although I mention malloc, this question is not intended to be C/C++ specific.

It's significant, especially as fragmentation grows and the allocator has to hunt harder across larger heaps for the contiguous regions you request. Most performance-sensitive applications typically write their own fixed-size block allocators (eg, they ask the OS for memory 16MB at a time and then parcel it out in fixed blocks of 4kb, 16kb, etc) to avoid this issue.
In games I've seen calls to malloc()/free() consume as much as 15% of the CPU (in poorly written products), or with carefully written and optimized block allocators, as little as 5%. Given that a game has to have a consistent throughput of sixty hertz, having it stall for 500ms while a garbage collector runs occasionally isn't practical.

Nearly every high performance application now has to use threads to exploit parallel computation. This is where the real memory allocation speed killer comes in when writing C/C++ applications.
In a C or C++ application, malloc/new must take a lock on the global heap for every operation. Even without contention locks are far from free and should be avoided as much as possible.
Java and C# are better at this because threading was designed in from the start and the memory allocators work from per-thread pools. This can be done in C/C++ as well, but it isn't automatic.

First off, since you said malloc, I assume you're talking about C or C++.
Memory allocation and deallocation tend to be a significant bottleneck for real-world programs. A lot goes on "under the hood" when you allocate or deallocate memory, and all of it is system-specific; memory may actually be moved or defragmented, pages may be reorganized--there's no platform-independent way way to know what the impact will be. Some systems (like a lot of game consoles) also don't do memory defragmentation, so on those systems, you'll start to get out-of-memory errors as memory becomes fragmented.
A typical workaround is to allocate as much memory up front as possible, and hang on to it until your program exits. You can either use that memory to store big monolithic sets of data, or use a memory pool implementation to dole it out in chunks. Many C/C++ standard library implementations do a certain amount of memory pooling themselves for just this reason.
No two ways about it, though--if you have a time-sensitive C/C++ program, doing a lot of memory allocation/deallocation will kill performance.

In general the cost of memory allocation is probably dwarfed by lock contention, algorithmic complexity, or other performance issues in most applications. In general, I'd say this is probably not in the top-10 of performance issues I'd worry about.
Now, grabbing very large chunks of memory might be an issue. And grabbing but not properly getting rid of memory is something I'd worry about.
In Java and JVM-based languages, new'ing objects is now very, very, very fast.
Here's one decent article by a guy who knows his stuff with some references at the bottom to more related links:
http://www.ibm.com/developerworks/java/library/j-jtp09275.html

A Java VM will claim and release memory from the operating system pretty much indepdently of what the application code is doing. This allows it to grab and release memory in large chunks, which is hugely more efficient than doing it in tiny individual operations, as you get with manual memory management.
This article was written in 2005, and JVM-style memory management was already streets ahead. The situation has only improved since then.
Which language boasts faster raw
allocation performance, the Java
language, or C/C++? The answer may
surprise you -- allocation in modern
JVMs is far faster than the best
performing malloc implementations. The
common code path for new Object() in
HotSpot 1.4.2 and later is
approximately 10 machine instructions
(data provided by Sun; see Resources),
whereas the best performing malloc
implementations in C require on
average between 60 and 100
instructions per call (Detlefs, et.
al.; see Resources). And allocation
performance is not a trivial component
of overall performance -- benchmarks
show that many real-world C and C++
programs, such as Perl and
Ghostscript, spend 20 to 30 percent of
their total execution time in malloc
and free -- far more than the
allocation and garbage collection
overhead of a healthy Java
application.

In Java (and potentially other languages with a decent GC implementation) allocating an object is very cheap. In the SUN JVM it only needs 10 CPU Cycles. A malloc in C/c++ is much more expensive, just because it has to do more work.
Still even allocation objects in Java is very cheap, doing so for a lot of users of a web application in parallel can still lead to performance problems, because more Garbage Collector runs will be triggered.
Therefore there are those indirect costs of an allocation in Java caused by the deallocation done by the GC. These costs are difficult to quantify because they depend very much on your setup (how much memory do you have) and your application.

Allocating and releasing memory in terms of performance are relatively costly operations. The calls in modern operating systems have to go all the way down to the kernel so that the operating system is able to deal with virtual memory, paging/mapping, execution protection etc.
On the other side, almost all modern programming languages hide these operations behind "allocators" which work with pre-allocated buffers.
This concept is also used by most applications which have a focus on throughput.

I know I answered earlier, however, that was ananswer to the other answer's, not to your question.
To speak to you directly, if I understand correctly, your performance use case criteria is throughput.
This to me, means's that you should be looking almost exclusivly at NUMA aware allocators.
None of the earlier references; IBM JVM paper, Microquill C, SUN JVM. Cover this point so I am highly suspect of their application today, where, at least on the AMD ABI, NUMA is the pre-eminent memory-cpu governer.
Hands down; real world, fake world, whatever world... NUMA aware memory request/use technologies are faster. Unfortunately, I'm running Windows currently, and I have not found the "numastat" which is available in linux.
A friend of mine has written about this in depth in his implmentation for the FreeBSD kernel.
Dispite me being able to show at-hoc, the typically VERY large amount of local node memory requests on top of the remote node (underscoring the obvious performance throughput advantage), you can surly benchmark yourself, and that would likely be what you need todo as your performance charicterisitc is going to be highly specific.
I do know that in a lot of ways, at least earlier 5.x VMWARE faired rather poorly, at that time at least, for not taking advantage of NUMA, frequently demanding pages from the remote node. However, VM's are a very unique beast when it comes to memory compartmentailization or containerization.
One of the references I cited is to Microsoft's API implmentation for the AMD ABI, which has NUMA allocation specialized interfaces for user land application developers to exploit ;)
Here's a fairly recent analysis, visual and all, from some browser add-on developers who compare 4 different heap implmentations. Naturally the one they developed turns out on top (odd how the people who do the testing often exhibit the highest score's).
They do cover in some ways quantifiably, at least for their use case, what the exact trade off is between space/time, generally they had identified the LFH (oh ya and by the way LFH is simply a mode apparently of the standard heap) or similarly designed approach essentially consumes signifcantly more memory off the bat however over time, may wind up using less memory... the grafix are neat too...
I would think however that selecting a HEAP implmentation based on your typical workload after you well understand it ;) is a good idea, but to well understand your needs, first make sure your basic operations are correct before you optimize these odds and ends ;)

This is where c/c++'s memory allocation system works the best. The default allocation strategy is OK for most cases but it can be changed to suit whatever is needed. In GC systems there's not a lot you can do to change allocation strategies. Of course, there is a price to pay, and that's the need to track allocations and free them correctly. C++ takes this further and the allocation strategy can be specified per class using the new operator:
class AClass
{
public:
void *operator new (size_t size); // this will be called whenever there's a new AClass
void *operator new [] (size_t size); // this will be called whenever there's a new AClass []
void operator delete (void *memory); // if you define new, you really need to define delete as well
void operator delete [] (void *memory);define delete as well
};
Many of the STL templates allow you to define custom allocators as well.
As with all things to do with optimisation, you must first determine, through run time analysis, if memory allocation really is the bottleneck before writing your own allocators.

According to MicroQuill SmartHeap Technical Specification, "a typical application [...] spends 40% of its total execution time on managing memory". You can take this figure as an upper bound, i personally feel that a typical application spends more like 10-15% of execution time allocating/deallocating memory. It rarely is a bottleneck in single-threaded application.
In multithreaded C/C++ applications standard allocators become an issue due to lock contention. This is where you start to look for more scalable solutions. But keep in mind Amdahl's Law.

Pretty much all of you are off base if you are talking about the Microsoft heap. Syncronization is effortlessly handled as is fragmentation.
The current perferrred heap is the LFH, (LOW FRAGMENTATION HEAP), it is default in vista+ OS's and can be configured on XP, via gflag, with out much trouble
It is easy to avoid any locking/blocking/contention/bus-bandwitth issues and the lot with the
HEAP_NO_SERIALIZE
option during HeapAlloc or HeapCreate. This will allow you to create/use a heap without entering into an interlocked wait.
I would reccomend creating several heaps, with HeapCreate, and defining a macro, perhaps, mallocx(enum my_heaps_set, size_t);
would be fine, of course, you need realloc, free also to be setup as appropiate. If you want to get fancy, make free/realloc auto-detect which heap handle on it's own by evaluating the address of the pointer, or even adding some logic to allow malloc to identify which heap to use based on it's thread id, and building a heierarchy of per-thread heaps and shared global heap's/pools.
The Heap* api's are called internally by malloc/new.
Here's a nice article on some dynamic memory management issues, with some even nicer references. To instrument and analyze heap activity.

Others have covered C/C++ so I'll just add a little information on .NET.
In .NET heap allocation is generally really fast, as it it just a matter of just grabbing the memory in the generation zero part of the heap. Obviously this cannot go on forever, which is where garbage collection comes in. Garbage collection may affect the performance of your application significantly since user threads must be suspended during compaction of memory. The fewer full collects, the better.
There are various things you can do to affect the workload of the garbage collector in .NET. Generally if you have a lot of memory reference the garbage collector will have to do more work. E.g. by implementing a graph using an adjacency matrix instead of references between nodes the garbage collector will have to analyze fewer references.
Whether that is actually significant in your application or not depends on several factors and you should profile the application with actual data before turning to such optimizations.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio