What are other commonly used memory allocators besides the slab allocator?

Hi, I recently learned that the slab allocator (using an embedded pointer to build a singly linked free list) can hand out equal-sized blocks to callers without extra per-allocation waste (for example, the size cookies malloc stores).
What other memory allocators are commonly used in production? I mean, is the slab allocator good enough to handle all types of workloads?
Thanks.

Related

How to deal with external fragmentation, and how does paging help with external fragmentation?

I know there are a lot of questions about the issue I'm raising here, but I couldn't find a comprehensive answer (neither on StackOverflow nor elsewhere).
I would like to ask about the heap (RAM) fragmentation problem.
As I understand it, there are two kinds of fragmentation:
internal - related to the difference between the allocation unit size (AU) and the size of the allocated memory AM (the wasted memory is AU - (AM mod AU) whenever AM is not a multiple of AU),
external - related to noncontiguous areas of free memory, so even if the sum of the free areas could satisfy a new allocation request, the request fails if there is no single contiguous area large enough to handle it.
This is quite clear. The problems start when "paging" enters the picture.
Sometimes I find claims that paging solves the external fragmentation issue.
Indeed, I agree that thanks to paging the OS is able to give a process virtually contiguous areas of memory even if the underlying physical frames are scattered.
But how exactly does it help with the external fragmentation?
I mean, assuming a page size of 4 kB, if we want to allocate 16 kB then of course we just need to find four empty page frames, even if the frames are not physically contiguous.
But what about smaller allocations?
I believe the page itself can still be fragmented and (in the worst case) the OS still needs to provide a new frame if the old one cannot satisfy the requested allocation.
So is it the case that (assuming the worst case), sooner or later, with or without paging, a long-running application that allocates and releases heap memory of varying sizes will fall into a low-memory condition because of external fragmentation?
So the question is: how do we deal with external fragmentation?
Our own implementation of an allocation algorithm? Paging (as I wrote, I'm not sure it helps)? What else? Do operating systems (Windows, Linux) provide any defragmentation methods?
The most radical solution is to forbid use of the heap, but is that really necessary on platforms with paging, virtual address spaces, virtual memory, etc., when the only issue is that the application needs to run non-stop for years?
One more issue: is internal fragmentation an ambiguous term?
Somewhere I spotted a definition saying that internal fragmentation refers to the part of a page frame that is wasted because the process does not need more memory, yet a single frame cannot be owned by more than one process.
I have bolded the questions, so people who are in a hurry can find them without reading everything.
Regards!
"Fragmentation" is indeed not a very precise term. But we can say for sure that when a running application needs a block of n bytes and there are n or more bytes not in use, yet we can't get the required block, then "memory is too fragmented."
But how exactly does it [paging] help with external fragmentation?
There's really nothing complicated here. External fragmentation is free memory between allocated blocks that's "too small" to satisfy any application requirement. This is a general concept. The definition of "too small" is application-dependent. Nonetheless, if allocated blocks can fall on any boundary, then it's easy, after many allocations and deallocations, for lots of such fragments to occur. Paging helps with external fragmentation in two ways.
First, it subdivides memory into fixed-size adjacent chunks - the pages - that are "large enough" so they're never useless. Again the definition of "large enough" is not precise. But most applications will have lots of requirements satisfiable by a single 4k page. Since no external fragmentation problem can occur for allocations of a page or less, the problem has been mitigated.
Second, the paging hardware provides a level of indirection between application pages and physical memory pages. Therefore any free physical memory page can be used to help satisfy any application request, no matter how large. For example, suppose you have 100 physical pages with every other physical page (50 of them) allocated. Without page-mapping hardware, the biggest request for contiguous memory that can be satisfied is 1 page. With mapping, it's 50 pages. (I'm disregarding virtual pages allocated initially with no mapped physical page. That's another discussion.)
But what about smaller allocations?
Again it's pretty simple. If the unit of allocation is a page, then any allocation smaller than a page yields an unused portion. This is internal fragmentation: unusable memory within an allocated block. The bigger you make allocation units (they don't have to be a single page), the more memory will be unusable due to internal fragmentation. On average, this will tend toward half of an allocation unit. Consequently, though OS's tend to allocate in units of pages, most application-side memory allocators request a very small number (often one) of big blocks (of pages) from the OS. They use much smaller allocation units internally: 4-16 bytes is pretty common.
So the question is: how do we deal with external fragmentation? So is it the case that, sooner or later, with or without paging, a long-running application that allocates and releases heap memory of varying sizes will fall into a low-memory condition because of external fragmentation?
If I understand you correctly, you're asking if fragmentation is inevitable. Except under very special conditions (e.g. the application only needs blocks of one size), the answer is yes. But that doesn't mean it's necessarily a problem.
Memory allocators use smart algorithms that limit fragmentation pretty effectively. For example, they may maintain "pools" with different block sizes, using the pool with block size most closely matching a given request. This tends to limit both internal and external fragmentation. A real world example that's very well documented is dlmalloc. The source code is also very clear.
Of course any general purpose allocator can fail under specific conditions. For this reason, modern languages (C++ and Ada are two I know) let you supply special-purpose allocators for objects of a given type. Typically -
for a fixed-size object - these might simply maintain a pre-allocated free list (as sketched below), so fragmentation for that particular case is zero, and allocation/deallocation are very fast.
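As an illustration of that fixed-size case (and of the embedded-pointer free list the first question mentions), here is a minimal sketch in C; the block size, pool size, and names are made up for the example:

#include <stdio.h>

/* Hypothetical fixed-size pool: free blocks are linked through a pointer
 * embedded in the block itself, so free blocks need no extra bookkeeping
 * memory (no malloc-style cookie). */
#define BLOCK_COUNT 128

union block {
    union block *next;      /* valid while the block sits on the free list */
    unsigned char data[64]; /* payload area when the block is in use */
};

static union block pool[BLOCK_COUNT];
static union block *free_list;

static void pool_init(void)
{
    free_list = NULL;
    for (int i = 0; i < BLOCK_COUNT; i++) {
        pool[i].next = free_list;   /* push every block onto the free list */
        free_list = &pool[i];
    }
}

static void *pool_alloc(void)
{
    union block *b = free_list;
    if (b)
        free_list = b->next;        /* pop the head: O(1) */
    return b;                       /* NULL when the pool is exhausted */
}

static void pool_free(void *p)
{
    union block *b = p;
    b->next = free_list;            /* push back: O(1), no coalescing needed */
    free_list = b;
}

int main(void)
{
    pool_init();
    void *a = pool_alloc();
    void *b = pool_alloc();
    pool_free(a);
    pool_free(b);
    printf("served and recycled two %zu-byte blocks\n", sizeof(union block));
    return 0;
}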
One more note: It's possible to totally eliminate fragmentation with copying/compacting garbage collection. Of course this requires underlying language support, and there's a performance bill to pay. A copying garbage collector compacts the heap by moving objects to eliminate unused space completely whenever it runs to reclaim storage. To do this it must update every pointer in the running program to the corresponding object's new location. While this may sound complex, I've implemented a copying garbage collector, and it's not so bad. The algorithms are extremely cool. Unfortunately, the semantics of many languages (e.g. C and C++) don't allow finding every pointer in the running program, which is required.
The most radical solution is to forbid use of the heap, but is that really necessary on platforms with paging, virtual address spaces, virtual memory, etc., when the only issue is that the application needs to run non-stop for years?
Though general purpose allocators are good, they're not guaranteed. It's not unusual for safety-critical or hard real time constrained systems to avoid heap use completely. On the other hand, when no absolute guarantee is needed, a general purpose allocator is often fine. There are many systems that run perfectly with tough loads for extended periods using general purpose allocators: fragmentation reaches an acceptable steady state and doesn't cause a problem.
One more issue: is internal fragmentation an ambiguous term?
The term isn't ambiguous, but is used in different contexts. The invariant is that it's referring to unused memory inside allocated blocks.
OS literature tends to assume the allocation unit is pages. For example, Linux sbrk lets you request the end of the data segment be set anywhere, but Linux allocates pages, not bytes, so the unused part of the last page is internal fragmentation from the OS's point of view.
Application-oriented discussions tend to assume allocation is in "blocks" or "chunks" of arbitrary size. dlmalloc uses about 128 discrete chunk sizes, each maintained in its own free list. Plus, it will custom allocate very large blocks using OS memory mapping system calls, so there's at most a page size (minus 1 byte) of mismatch between request and actual allocation. Clearly it's going to a lot of trouble to minimize internal fragmentation. The internal fragmentation caused by a given allocation is the difference between the request and the chunk actually allocated. Since there are so many chunk sizes, that difference is strictly limited. On the other hand, the many chunk sizes increase the chances of external fragmentation problems: free memory may consist entirely of chunks that are well-managed by dlmalloc, yet too small to honor an application requirement.
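To make the internal-fragmentation bound concrete, here is a toy sketch of size-class rounding; the uniform 16-byte spacing is a simplification for the example, not dlmalloc's actual chunk layout:

#include <stddef.h>
#include <stdio.h>

/* Toy size classes: requests are rounded up to the next multiple of 16,
 * so internal fragmentation is bounded by 15 bytes per allocation. */
static size_t size_class(size_t request)
{
    return (request + 15) & ~(size_t)15;
}

int main(void)
{
    size_t requests[] = { 1, 17, 100, 250 };
    for (int i = 0; i < 4; i++) {
        size_t r = requests[i];
        size_t c = size_class(r);
        printf("request %3zu -> chunk %3zu (waste %zu)\n", r, c, c - r);
    }
    return 0;
}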

Could someone help me understand VkPhysicalDeviceMemoryProperties?

I'm trying to figure it out, but I'm getting a little stuck.
The way the types and heaps are related is simple, if a bit strange. (why not just give VkMemoryHeap a VkMemoryType member?)
I think I understand what all the VkMemoryPropertyFlags mean, they seem fairly straightforward.
But what's with the VkMemoryHeap.flags member? It apparently has only one valid non-zero value, VkMemoryHeapFlagBits.VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, and though that wouldn't be too odd on its own, there's also a VkMemoryPropertyFlagBits.VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT that could be present on the memory type of the heap.
What does the VkMemoryHeap.flags member mean and how does it relate to the VkMemoryType.flags member?
Vulkan recognizes two distinct concepts when it comes to memory. There are the actual physical pieces of RAM that the device can talk to. Then there are ways to allocate memory from one of those pools of RAM.
A heap represents a specific piece of RAM. VkMemoryHeap is the object that describes one of the available heaps of RAM that the device can talk to. There really aren't that many things that define a particular heap. Just two: the number of bytes of that RAM's storage and the storage's location relative to the Vulkan device (local vs. non-local).
A memory type is a particular means of allocating memory from a specific heap. VkMemoryType is the object that describes a particular way of allocating memory. And there are a lot more descriptive flags for how you can allocate memory from a heap.
For a more concrete example, consider a standard PC setup with a discrete GPU. The device has its own local RAM, but the discrete GPU can also access CPU memory. So a Vulkan device will have two heaps: one of them will be local, the other non-local.
However, there will usually be more than two memory types. You usually have one memory type that represents local memory, which does not have the VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT set. That means you can't map the memory; you can only access it via transfer operations from some other memory type (or from rendering operations or whatever).
But you will often have two memory types that both use the same non-local heap. Both will have VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT set, thus allowing mapping. However, one of them will likely have the VK_MEMORY_PROPERTY_HOST_CACHED_BIT flag set, while the other will have VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. This allows you to choose whether you want cached CPU access (thus requiring an explicit flush of ranges of modified memory) or uncached, coherent CPU access.
But while they are two separate memory types, they both allocate from the same heap. Which is why VkMemoryType has an index that refers to the heap whose memory it is allocating from.
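A common pattern that follows from this layout is scanning memoryTypes for an index whose propertyFlags contain what you need and that is allowed by a resource's memoryTypeBits mask. A minimal sketch (the helper name and parameters are mine, not part of the Vulkan API):

#include <vulkan/vulkan.h>
#include <stdint.h>

/* Find a memory type index compatible with 'typeBits' (taken from
 * VkMemoryRequirements::memoryTypeBits) that has all 'requiredFlags' set.
 * Returns UINT32_MAX if no suitable type exists. */
static uint32_t find_memory_type(VkPhysicalDevice phys,
                                 uint32_t typeBits,
                                 VkMemoryPropertyFlags requiredFlags)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
        int allowed = (typeBits & (1u << i)) != 0;
        int matches = (props.memoryTypes[i].propertyFlags & requiredFlags)
                      == requiredFlags;
        if (allowed && matches)
            return i;   /* props.memoryTypes[i].heapIndex names the heap it draws from */
    }
    return UINT32_MAX;
}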
The only thing I'm not getting is how the two DEVICE_LOCAL flags interact.
Did you look at the specification? It's not exactly hiding how this works:
if propertyFlags has the VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT bit set, memory allocated with this type is the most efficient for device access. This property will only be set for memory types belonging to heaps with the VK_MEMORY_HEAP_DEVICE_LOCAL_BIT set.
Is it saying that if the memory is local then all types corresponding to that memory are local, or that they can be local?
You seem to be trying to impose the wrong meaning on these things. Just look at what the specification says and take it at face value.
PROPERTY_DEVICE_LOCAL denotes a memory type which will achieve the best device access performance. The only connection between this and HEAP_DEVICE_LOCAL is that memory types with PROPERTY_DEVICE_LOCAL will only be associated with memory heaps that have HEAP_DEVICE_LOCAL set.
That's the only relevant meaning here.
If you want an example of when a memory heap would be device local, yet have memory types that aren't, consider a GPU that has no memory of its own. There's only one heap, which is therefore HEAP_DEVICE_LOCAL.
However, allocating memory from that pool in a way that makes it host-visible may decrease the performance of device access to that memory. Therefore, for such hardware, the host-visible memory types for the same heap will not use PROPERTY_DEVICE_LOCAL.
Then again, other hardware doesn't lose performance from making memory host-visible. So they only have one memory type, which has all of the available properties. For Intel, their on-chip GPUs apparently have access to some level of the CPU's caches.

Windows memory management: Difference between VirtualAlloc() and heap functions [duplicate]

There are lots of methods to allocate memory in a Windows environment, such as VirtualAlloc, HeapAlloc, malloc, and new.
So what's the difference among them?
Each API is for different uses. Each one also requires that you use the correct deallocation/freeing function when you're done with the memory.
VirtualAlloc
A low-level Windows API that provides lots of options, but is mainly useful for people in fairly specific situations. It can only allocate memory in large, coarse-grained chunks (see the discussion of allocation granularity further down). There are situations where you need it, but you'll know when you're in one of them. One of the most common is if you have to share memory directly with another process. Don't use it for general-purpose memory allocation. Use VirtualFree to deallocate.
HeapAlloc
Allocates whatever size of memory you ask for, rather than the big chunks VirtualAlloc deals in. HeapAlloc knows when it needs to call VirtualAlloc and does so for you automatically. Like malloc, but it is Windows-only and provides a couple more options. Suitable for allocating general chunks of memory. Some Windows APIs may require that you use it to allocate memory that you pass to them, or use its companion HeapFree to free memory that they return to you.
malloc
The C way of allocating memory. Prefer this if you are writing in C rather than C++, and you want your code to work on e.g. Unix computers too, or someone specifically says that you need to use it. Doesn't initialise the memory. Suitable for allocating general chunks of memory, like HeapAlloc. A simple API. Use free to deallocate. Visual C++'s malloc calls HeapAlloc.
new
The C++ way of allocating memory. Prefer this if you are writing in C++. It puts an object or objects into the allocated memory, too. Use delete to deallocate (or delete[] for arrays). Visual Studio's new calls HeapAlloc, and then maybe initialises the objects, depending on how you call it.
In recent C++ standards (C++11 and above), if you have to manually use delete, you're doing it wrong and should use a smart pointer like unique_ptr instead. From C++14 onwards, the same can be said of new (replaced with functions such as make_unique()).
There are also a couple of other similar functions like SysAllocString that you may be told you have to use in specific circumstances.
It is very important to understand the distinction between memory allocation APIs (in Windows) if you plan on using a language that requires memory management (like C or C++.) And the best way to illustrate it IMHO is with a diagram:
Note that this is a very simplified, Windows-specific view.
The way to understand this diagram is that the higher on the diagram a memory allocation method is, the higher level implementation it uses. But let's start from the bottom.
Kernel-Mode Memory Manager
It provides all memory reservations & allocations for the operating system, as well as support for memory-mapped files, shared memory, copy-on-write operations, etc. It's not directly accessible from the user-mode code, so I'll skip it here.
VirtualAlloc / VirtualFree
These are the lowest level APIs available from user mode. The VirtualAlloc function basically invokes ZwAllocateVirtualMemory, which in turn does a quick syscall into ring 0 to delegate further processing to the kernel memory manager. It is also the fastest method available from user mode to reserve/allocate a block of new memory.
But it comes with two main conditions:
It only allocates memory blocks aligned on the system granularity boundary.
It only allocates memory blocks of a size that is a multiple of the system granularity.
So what is this system granularity? You can get it by calling GetSystemInfo. It is returned in the dwAllocationGranularity field of the SYSTEM_INFO structure. Its value is implementation (and possibly hardware) specific, but on many 64-bit Windows systems it is set at 0x10000 bytes, or 64K.
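If you want to check these values on a given machine, here is a small sketch using GetSystemInfo (the numbers vary by system):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    /* dwAllocationGranularity: the boundary VirtualAlloc reservations snap to.
       dwPageSize: the unit in which pages are committed. */
    printf("allocation granularity: %lu bytes\n", si.dwAllocationGranularity);
    printf("page size:              %lu bytes\n", si.dwPageSize);
    return 0;
}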
So what all this means is that if you try to allocate, say, just an 8-byte memory block with VirtualAlloc:
void* pAddress = VirtualAlloc(NULL, 8, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
If successful, pAddress will be aligned on a 0x10000-byte boundary. And even though you requested only 8 bytes, the actual memory block that you will get will be an entire page (something like 4K bytes; the exact page size is returned in the dwPageSize member). On top of that, the entire memory block spanning 0x10000 bytes (or 64K in most cases) from pAddress will not be available for any further allocations. So in a sense, by allocating 8 bytes you might as well have asked for 65,536.
So the moral of the story here is not to substitute VirtualAlloc for generic memory allocations in your application. It must be used for very specific cases, as is done with the heap below. (Usually for reserving/allocating large blocks of memory.)
Using VirtualAlloc incorrectly can lead to severe memory fragmentation.
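For completeness, a sketch of the matching release for a reservation like the one above; with MEM_RELEASE the size argument must be 0 and the pointer must be the base address VirtualAlloc returned:

#include <windows.h>

int main(void)
{
    /* Reserve and commit one region (rounded up to a page, reserved on an
       allocation-granularity boundary, as described above). */
    void *pAddress = VirtualAlloc(NULL, 8, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (pAddress) {
        /* ... use the memory ... */
        VirtualFree(pAddress, 0, MEM_RELEASE);   /* size must be 0 with MEM_RELEASE */
    }
    return 0;
}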
HeapCreate / HeapAlloc / HeapFree / HeapDestroy
In a nutshell, the heap functions are basically a wrapper around the VirtualAlloc function. Other answers here provide a pretty good concept of it. I'll add that, in a very simplistic view, the way the heap works is this:
HeapCreate reserves a large block of virtual memory by calling VirtualAlloc internally (or ZwAllocateVirtualMemory to be specific). It also sets up an internal data structure that can track further smaller size allocations within the reserved block of virtual memory.
Any calls to HeapAlloc and HeapFree do not actually allocate/free any new memory (unless, of course, the request exceeds what has already been reserved in HeapCreate) but instead they meter out (or commit) the previously reserved large chunk, dissecting it into the smaller memory blocks that a user requests.
HeapDestroy in turn calls VirtualFree that actually frees the virtual memory.
So all this makes heap functions perfect candidates for generic memory allocations in your application. It is great for arbitrary size memory allocations. But a small price to pay for the convenience of the heap functions is that they introduce a slight overhead over VirtualAlloc when reserving larger blocks of memory.
Another good thing about the heap is that you don't really need to create one. It is generally created for you when your process starts, so you can access it by calling the GetProcessHeap function.
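A minimal sketch of typical usage against the default process heap (a private heap created with HeapCreate works the same way, just with a different handle):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE heap = GetProcessHeap();                   /* default per-process heap */
    char *buf = (char *)HeapAlloc(heap, HEAP_ZERO_MEMORY, 256);
    if (buf) {
        printf("allocated 256 bytes at %p\n", (void *)buf);
        HeapFree(heap, 0, buf);                       /* must free on the same heap */
    }
    return 0;
}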
malloc / free
Is a language-specific wrapper for the heap functions. Unlike HeapAlloc, HeapFree, etc. these functions will work not only if your code is compiled for Windows, but also for other operating systems (such as Linux, etc.)
This is the recommended way to allocate/free memory if you program in C. (Unless you're writing a specific kernel-mode device driver.)
new / delete
These are the high-level (well, for C++) memory management operators. They are specific to the C++ language and, like malloc for C, are also wrappers for the heap functions. They also carry a whole bunch of their own code that deals with C++-specific concerns: running constructors on allocation, destructors on deallocation, throwing exceptions, etc.
These functions are a recommended way to allocate/free memory and objects if you program in C++.
Lastly, one comment I want to make about what has been said in other responses about using VirtualAlloc to share memory between processes. VirtualAlloc by itself does not allow sharing of its reserved/allocated memory with other processes. For that, one needs to use the CreateFileMapping API, which can create a named virtual memory block that can be shared with other processes. It can also map a file on disk into virtual memory for read/write access. But that is another topic.
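A sketch of that shared-memory route, with a made-up mapping name and size; a second process would call OpenFileMappingA with the same name and then MapViewOfFile:

#include <windows.h>
#include <string.h>

int main(void)
{
    /* Backed by the page file (INVALID_HANDLE_VALUE), 64 KB, named so
       another process can open it with OpenFileMappingA. */
    HANDLE mapping = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE, 0, 64 * 1024,
                                        "Local\\DemoSharedBlock");
    if (!mapping)
        return 1;

    void *view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    if (view) {
        strcpy((char *)view, "hello from process A");
        UnmapViewOfFile(view);
    }
    CloseHandle(mapping);
    return 0;
}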
VirtualAlloc is a specialized allocation from the OS virtual memory (VM) system. Allocations in the VM system must be made at an allocation granularity that is architecture dependent. Allocation in the VM system is one of the most basic forms of memory allocation. VM allocations can take several forms; memory is not necessarily dedicated or physically backed in RAM (though it can be). VM allocation is typically a special-purpose type of allocation, either because the allocation has to
be very large,
needs to be shared,
must be aligned on a particular value (performance reasons) or
the caller need not use all of this memory at once...
etc...
HeapAlloc is essentially what malloc and new both eventually call. It is designed to be very fast and usable under many different types of scenarios as a general-purpose allocator. It is the "heap" in a classic sense. Heaps are actually set up with VirtualAlloc, which is what is used to initially reserve allocation space from the OS. After the space is initialized by VirtualAlloc, various tables, lists and other data structures are configured to maintain and control the operation of the heap. Some of that operation is in the form of dynamically sizing (growing and shrinking) the heap, adapting the heap to particular usages (frequent allocations of some size), etc.
new and malloc are somewhat the same: malloc is essentially an exact call into HeapAlloc( heap-id-default ); new, however, can [additionally] construct C++ objects in the allocated memory. For a polymorphic object, the constructor stores a pointer to the class's vtable (the vtable itself is static, per-class data rather than a per-caller heap allocation). Vtables redirect execution for virtual calls and form part of what gives C++ its OO characteristics such as inheritance and virtual function dispatch.
Some other common allocation methods, like _alloca() and _malloca(), are stack based; file mappings ultimately come out of the same virtual memory machinery, with particular flags designating those mappings as file-backed.
Most of the time, you should allocate memory in a way which is consistent with the use of that memory ;). new in C++, malloc for C, VirtualAlloc for massive or IPC cases.
*** Note, large memory allocations done by HeapAlloc are actually shipped off to VirtualAlloc after some size (couple hundred k or 16 MB or something I forget, but fairly big :) ).
*** EDIT
I briefly remarked about IPC and VirtualAlloc; there is also something very neat about a related function that none of the responders to this question have discussed.
VirtualAllocEx is what one process can use to allocate memory in the address space of a different process. Most typically, this is used in combination with CreateRemoteThread to get remote execution in the context of another process (similar to CreateThread, except that the thread runs in the other process).
In outline:
VirtualAlloc, HeapAlloc etc. are Windows APIs that allocate memory of various types from the OS directly. VirtualAlloc manages pages in the Windows virtual memory system, while HeapAlloc allocates from a specific OS heap. Frankly, you are unlikely to ever need to use either of them.
malloc is a Standard C (and C++) library function that allocates memory to your process. Implementations of malloc will typically use one of the OS APIs to create a pool of memory when your app starts and then allocate from it as you make malloc requests.
new is a Standard C++ operator which allocates memory and then calls constructors appropriately on that memory. It may be implemented in terms of malloc or in terms of the OS APIs, in which case it too will typically create a memory pool on application startup.
VirtualAlloc ===> sbrk() under UNIX
HeapAlloc ====> malloc() under UNIX
VirtualAlloc => Allocates straight from virtual memory; you reserve/commit in blocks. This is great for large allocations, for example large arrays.
HeapAlloc / new => allocates the memory on the default heap (or any other heap that you may create). This allocates per object and is great for smaller objects. The default heap is serialized, so allocations from it are thread-safe (this can cause contention in high-performance scenarios, which is why you can create your own heaps).
malloc => uses the C runtime heap; similar to HeapAlloc, but use it for compatibility/portability scenarios.
In a nutshell, the heap is just a chunk of virtual memory that is governed by a heap manager (rather than raw virtual memory)
The last model in the memory world is memory-mapped files; this scenario is great for large chunks of data (like large files). It is used internally when you launch an EXE (Windows does not read the EXE into memory; it just creates a memory-mapped file of it).

Difference between kmalloc and kmem_cache_alloc

What is the difference between kmem_cache_alloc() and kmalloc() in kernel memory allocation? Which one is used when?
kmalloc - allocates a contiguous region of physical memory. But keep in mind that allocating and freeing memory is a lot of work.
kmem_cache_alloc - here, the kernel keeps a pool of objects of some predefined size pre-allocated. Say you have a struct that you know you will need very frequently; instead of allocating it from general memory (kmalloc) every time, you keep multiple copies pre-allocated, and when you want one, the allocator returns the address of a block that has already been set aside (which saves a lot of time). Similarly, when you free it, it isn't actually released; it goes back to the pool so that the next request for the same kind of object can be served from the already-allocated storage.
kmalloc: it uses the generic slab caches available to any kernel code, so your module will share those caches with other components in the kernel.
kmem_cache_alloc: it allocates objects from a dedicated slab cache created by kmem_cache_create. If you specifically want slab cache management dedicated to your module only, for one specific type of object, use kmem_cache_create followed by kmem_cache_alloc. The USB and SCSI drivers use this. kmem_cache_create takes the size of the object you want a slab of, a name that appears in /proc/slabinfo, and flags that govern the behavior of your slab cache.
Ref: https://www.mail-archive.com/kernelnewbies#nl.linux.org/msg13191.html & LDD
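For illustration, a minimal kernel-module sketch of the dedicated-cache approach (the struct, cache name, and module boilerplate are made up; the kmem_cache_* calls are the standard API described above):

#include <linux/module.h>
#include <linux/slab.h>

/* Hypothetical object we allocate very frequently. */
struct my_obj {
    int id;
    char payload[120];
};

static struct kmem_cache *my_cache;

static int __init my_init(void)
{
    /* The cache appears in /proc/slabinfo under the name "my_obj_cache". */
    my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!my_cache)
        return -ENOMEM;

    struct my_obj *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
    if (obj) {
        obj->id = 1;
        kmem_cache_free(my_cache, obj);   /* returns to the cache, not the page allocator */
    }
    return 0;
}

static void __exit my_exit(void)
{
    kmem_cache_destroy(my_cache);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");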

What data structure is used to implement the dynamic memory allocation heap?

I always assumed a heap (data structure) is used to implement a heap (dynamic memory allocation), but I've been told I'm wrong.
How are heaps (for example, the one implemented by typical malloc routines, or by Windows's HeapCreate, etc.) implemented, typically? What data structures do they use?
What I'm not asking:
While searching online, I've seen tons of descriptions of how to implement heaps with severe restrictions.
To name a few, I've seen lots of descriptions of how to implement:
Heaps that never release memory back to the OS (!)
Heaps that only give reasonable performance on small, similarly-sized blocks
Heaps that only give reasonable performance for large, contiguous blocks
etc.
And it's funny, they all avoid the harder question:
How are "normal", general-purpose heaps (like the one behind malloc, HeapCreate) implemented?
What data structures (and perhaps algorithms) do they use?
Allocators tend to be quite complex and often differ significantly in how they're implemented.
You can't really describe them in terms of one common data structure or algorithm, but there are some common themes:
Memory is taken from the system in large chunks -- often megabytes at a time.
These chunks are then split up into various smaller chunks as you perform allocations. Not exactly the same size as you allocate, but usually in certain ranges (200-250 bytes, 251-500 bytes, etc.). Sometimes this is multi-tiered, where you'd have an additional layer of "medium chunks" which come before your actual requests.
Controlling which "large chunk" to break a piece off of is a very difficult and important thing to do -- this greatly affects memory fragmentation.
One or more free pools (aka "free list", "memory pool", "lookaside list") are maintained for each of these ranges. Sometimes even thread-local pools. This can greatly speed up a pattern of allocating/deallocating many objects of similar size.
Large allocations are treated a bit differently so as to not waste a lot of RAM and not be pooled quite so much if at all.
If you wanted to check out some source code, jemalloc is a modern high-performance allocator and should be representative in complexity of other common ones. TCMalloc is another common general-purpose allocator, and their website goes into all the gory implementation details. Intel's Threading Building Blocks has an allocator built specifically for high concurrency.
One interesting difference can be seen between Windows and *nix. In *nix, the allocator has very low-level control over the address space an app uses. In Windows, you basically have a coarse-grained, slow allocator, VirtualAlloc, to base your own allocator on.
This results in *nix-compatible allocators typically giving you a malloc/free implementation directly, where it's assumed you'll only use one allocator for everything (otherwise they'd trample each other), while Windows-specific allocators provide additional functions, leave malloc/free alone, and can be used in harmony (for instance, you can use HeapCreate to make private heaps which can work alongside others).
In practice, this trade in flexibility gives *nix allocators a small leg up performance-wise. It's very rare to see an app intentionally use multiple heaps on Windows -- mostly it's by accident due to different DLLs using different runtimes which each have their own malloc/free, and can cause a lot of headaches if you're not diligent in tracking which heap some memory came from.
Note: The following answer assumes you're using a typical, modern system with virtual memory. The C and C++ standards do not require virtual memory; therefore you can't rely on such assumptions on hardware without this feature (e.g. GPUs typically don't have it, nor does extremely small hardware like a PIC microcontroller).
This depends on the platform you're using. Heaps can be very complicated beasts; they don't use only a single data structure; and there is no "standard" data structure. Even where the heap code is located is different depending on the platform. For instance, the heap code is typically provided by the C Runtime on Unix boxes; but is typically provided by the operating system on Windows.
Yes, this is common on Unix machines, due to the way *nix's underlying APIs and memory model operate. Basically, the standard API for returning memory to the operating system on these systems only allows returning memory at the "frontier" between where user memory is allocated and the "hole" between user memory and system facilities like the stack. (The API in question is brk or sbrk.) Instead of returning memory to the operating system, many heaps only try to reuse memory no longer in use by the program proper, and don't try to return memory to the system. This is less common on Windows, because its equivalent of sbrk (VirtualAlloc) doesn't have this limitation. (But like sbrk, it is very expensive and has caveats like only allocating page-sized and page-aligned chunks, so heaps try to call either as rarely as possible.)
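To make the "frontier" point concrete, here is a small sketch using the (legacy, but still available on Linux/glibc) sbrk call; the 4096-byte size is arbitrary:

#define _DEFAULT_SOURCE       /* expose sbrk in glibc */
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    void *before = sbrk(0);              /* current program break (the "frontier") */
    if (sbrk(4096) == (void *)-1) {      /* grow the data segment by 4096 bytes */
        perror("sbrk");
        return 1;
    }
    void *after = sbrk(0);
    printf("break moved from %p to %p\n", before, after);

    /* Memory can only be handed back at this frontier: */
    sbrk(-4096);
    return 0;
}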
This sounds like a "block allocator", which divides memory into fixed-size chunks and then just returns one of the free chunks. To my (albeit limited) understanding, Windows' RtlHeap maintains a number of data structures like this for different known block sizes (e.g. one for blocks of size 16). RtlHeap calls these "lookaside lists".
I don't really know of a specific structure that handles this case well. Large blocks are problematic for most allocation systems because they cause fragmentation of the address space.
The best reference I've found discussing the common allocation strategies employed on major platforms is the book Secure Coding in C and C++, by Robert Seacord. All of chapter 4 is dedicated to heap data structures (and problems caused when users use said heap systems incorrectly).
