Memory allocation under uCOS-III

Memory allocation under uCOS-III - memory-management

I'm developing a C-library to be used under uCOS-III. The CPU is an ARM Cortex M4 SAM4C. Within the library I want to use a third party product X, whose particular name is not relevant here. The source code for X is completely available and compiles without problems.
Inside X a lot of memory allocations are executed, using calloc() and free().
The problem is, that plain usage of malloc is not advisable for embedded systems, because of memory fragmentation. The documentation for uCOS-III explicitly advises against using malloc - instead OSMemCreate/OSMemGet/OSMemPut shall be used to allocate and free chunks of memory out of a statically allocated memory block.
Question-1:
What is the general advice to get around the "standard implementation" of malloc? I would prefer a kind of malloc, where I have access to a fixed memory pool (e.g. dedicated for a special task)
Question-2:
How should OSMemCreate() be used correctly? I have first to initialize a memory partition with a certain block size. The amount of requested memory is between 4 Bytes and about 800 bytes. I can get blocks on request, but with fixed size. If block-size=4 I cannot allocate 16 Bytes, since blocks are not contiguous in memory. If block-size=800 and I need only 4 bytes, most of the block is left unused and I will very soon run out of blocks.
So I don't know, how to solve my original problem by use of OSMemCreate...
Can anybody give me an advice how I could proceed?
Many thanks,
Michael

1) Don't link with the standard library version of malloc/free. Instead create your own implementation of malloc/free that serves as a wrapper to OSMemGet/OSMemPut.
2) You can create more than one memory partition with OSMemCreate. Create small, medium, and large partitions that hold block sizes which are tuned for your application to reduce waste.
If you want malloc to get an appropriately sized block from your various memory partitions then you'll have to invent some magic so that free returns the block to the appropriate memory partition. (Perhaps malloc allocates an extra word, stores the pointer to the memory partition in the first word, and then returns the address after the word where the pointer is stored. Then free knows to get the memory partition pointer from the preceding word.)
An alternative to using malloc/free is to rewrite that code to use statically allocated variables or call OSMemGet/OSMemPut directly.

Related

windows memory management: Difference between VirtualAlloc() and heap functions [duplicate]

There are lots of method to allocate memory in Windows environment, such as VirtualAlloc, HeapAlloc, malloc, new.
Thus, what's the difference among them?

Each API is for different uses. Each one also requires that you use the correct deallocation/freeing function when you're done with the memory.
VirtualAlloc
A low-level, Windows API that provides lots of options, but is mainly useful for people in fairly specific situations. Can only allocate memory in (edit: not 4KB) larger chunks. There are situations where you need it, but you'll know when you're in one of these situations. One of the most common is if you have to share memory directly with another process. Don't use it for general-purpose memory allocation. Use VirtualFree to deallocate.
HeapAlloc
Allocates whatever size of memory you ask for, not in big chunks than VirtualAlloc. HeapAlloc knows when it needs to call VirtualAlloc and does so for you automatically. Like malloc, but is Windows-only, and provides a couple more options. Suitable for allocating general chunks of memory. Some Windows APIs may require that you use this to allocate memory that you pass to them, or use its companion HeapFree to free memory that they return to you.
malloc
The C way of allocating memory. Prefer this if you are writing in C rather than C++, and you want your code to work on e.g. Unix computers too, or someone specifically says that you need to use it. Doesn't initialise the memory. Suitable for allocating general chunks of memory, like HeapAlloc. A simple API. Use free to deallocate. Visual C++'s malloc calls HeapAlloc.
new
The C++ way of allocating memory. Prefer this if you are writing in C++. It puts an object or objects into the allocated memory, too. Use delete to deallocate (or delete[] for arrays). Visual studio's new calls HeapAlloc, and then maybe initialises the objects, depending on how you call it.
In recent C++ standards (C++11 and above), if you have to manually use delete, you're doing it wrong and should use a smart pointer like unique_ptr instead. From C++14 onwards, the same can be said of new (replaced with functions such as make_unique()).
There are also a couple of other similar functions like SysAllocString that you may be told you have to use in specific circumstances.

It is very important to understand the distinction between memory allocation APIs (in Windows) if you plan on using a language that requires memory management (like C or C++.) And the best way to illustrate it IMHO is with a diagram:
Note that this is a very simplified, Windows-specific view.
The way to understand this diagram is that the higher on the diagram a memory allocation method is, the higher level implementation it uses. But let's start from the bottom.
Kernel-Mode Memory Manager
It provides all memory reservations & allocations for the operating system, as well as support for memory-mapped files, shared memory, copy-on-write operations, etc. It's not directly accessible from the user-mode code, so I'll skip it here.
VirtualAlloc / VirtualFree
These are the lowest level APIs available from the user mode. The VirtualAlloc function basically invokes ZwAllocateVirtualMemory that in turn does a quick syscall to ring0 to relegate further processing to the kernel memory manager. It is also the fastest method to reserve/allocate block of new memory from all available in the user mode.
But it comes with two main conditions:
It only allocates memory blocks aligned on the system granularity boundary.
It only allocates memory blocks of the size that is the multiple of the system granularity.
So what is this system granularity? You can get it by calling GetSystemInfo. It is returned as the dwAllocationGranularity parameter. Its value is implementation (and possibly hardware) specific, but on many 64-bit Windows systems it is set at 0x10000 bytes, or 64K.
So what all this means, is that if you try to allocate, say just an 8 byte memory block with VirtualAlloc:
void* pAddress = VirtualAlloc(NULL, 8, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
If successful, pAddress will be aligned on the 0x10000 byte boundary. And even though you requested only 8 bytes, the actual memory block that you will get will be the entire page (or, something like 4K bytes. The exact page size is returned in the dwPageSize parameter.) But, on top of that, the entire memory block spanning 0x10000 bytes (or 64K in most cases) from pAddress will not be available for any further allocations. So in a sense, by allocating 8 bytes you could as well be asking for 65536.
So the moral of the story here is not to substitute VirtualAlloc for generic memory allocations in your application. It must be used for very specific cases, as is done with the heap below. (Usually for reserving/allocating large blocks of memory.)
Using VirtualAlloc incorrectly can lead to severe memory fragmentation.
HeapCreate / HeapAlloc / HeapFree / HeapDestroy
In a nutshell, the heap functions are basically a wrapper for VirtualAlloc function. Other answers here provide a pretty good concept of it. I'll add that, in a very simplistic view, the way heap works is this:
HeapCreate reserves a large block of virtual memory by calling VirtualAlloc internally (or ZwAllocateVirtualMemory to be specific). It also sets up an internal data structure that can track further smaller size allocations within the reserved block of virtual memory.
Any calls to HeapAlloc and HeapFree do not actually allocate/free any new memory (unless, of course the request exceeds what has been already reserved in HeapCreate) but instead they meter out (or commit) a previously reserved large chunk, by dissecting it into smaller memory blocks that a user requests.
HeapDestroy in turn calls VirtualFree that actually frees the virtual memory.
So all this makes heap functions perfect candidates for generic memory allocations in your application. It is great for arbitrary size memory allocations. But a small price to pay for the convenience of the heap functions is that they introduce a slight overhead over VirtualAlloc when reserving larger blocks of memory.
Another good thing about heap is that you don't really need to create one. It is generally created for you when your process starts. So one can access it by calling GetProcessHeap function.
malloc / free
Is a language-specific wrapper for the heap functions. Unlike HeapAlloc, HeapFree, etc. these functions will work not only if your code is compiled for Windows, but also for other operating systems (such as Linux, etc.)
This is a recommended way to allocate/free memory if you program in C. (Unless, you're coding a specific kernel mode device driver.)
new / delete
Come as a high level (well, for C++) memory management operators. They are specific for the C++ language, and like malloc for C, are also the wrappers for the heap functions. They also have a whole bunch of their own code that deals C++-specific initialization of constructors, deallocation in destructors, raising an exception, etc.
These functions are a recommended way to allocate/free memory and objects if you program in C++.
Lastly, one comment I want to make about what has been said in other responses about using VirtualAlloc to share memory between processes. VirtualAlloc by itself does not allow sharing of its reserved/allocated memory with other processes. For that one needs to use CreateFileMapping API that can create a named virtual memory block that can be shared with other processes. It can also map a file on disk into virtual memory for read/write access. But that is another topic.

VirtualAlloc is a specialized allocation of the OS virtual memory (VM) system. Allocations in the VM system must be made at an allocation granularity which (the allocation granularity) is architecture dependent. Allocation in the VM system is one of the most basic forms of memory allocation. VM allocations can take several forms, memory is not necessarily dedicated or physically backed in RAM (though it can be). VM allocation is typically a special purpose type of allocation, either because of the allocation has to
be very large,
needs to be shared,
must be aligned on a particular value (performance reasons) or
the caller need not use all of this memory at once...
etc...
HeapAlloc is essentially what malloc and new both eventually call. It is designed to be very fast and usable under many different types of scenarios of a general purpose allocation. It is the "Heap" in a classic sense. Heaps are actually setup by a VirtualAlloc, which is what is used to initially reserve allocation space from the OS. After the space is initialized by VirtualAlloc, various tables, lists and other data structures are configured to maintain and control the operation of the HEAP. Some of that operation is in the form of dynamically sizing (growing and shrinking) the heap, adapting the heap to particular usages (frequent allocations of some size), etc..
new and malloc are somewhat the same, malloc is essentially an exact call into HeapAlloc( heap-id-default ); new however, can [additionally] configure the allocated memory for C++ objects. For a given object, C++ will store vtables on the heap for each caller. These vtables are redirects for execution and form part of what gives C++ it's OO characteristics like inheritance, function overloading, etc...
Some other common allocation methods like _alloca() and _malloca() are stack based; FileMappings are really allocated with VirtualAlloc and set with particular bit flags which designate those mappings to be of type FILE.
Most of the time, you should allocate memory in a way which is consistent with the use of that memory ;). new in C++, malloc for C, VirtualAlloc for massive or IPC cases.
*** Note, large memory allocations done by HeapAlloc are actually shipped off to VirtualAlloc after some size (couple hundred k or 16 MB or something I forget, but fairly big :) ).
*** EDIT
I briefly remarked about IPC and VirtualAlloc, there is also something very neat about a related VirtualAlloc which none of the responders to this question have discussed.
VirtualAllocEx is what one process can use to allocate memory in an address space of a different process. Most typically, this is used in combination to get remote execution in the context of another process via CreateRemoteThread (similar to CreateThread, the thread is just run in the other process).

In outline:
VirtualAlloc, HeapAlloc etc. are Windows APIs that allocate memory of various types from the OS directly. VirtualAlloc manages pages in the Windows virtual memory system, while HeapAlloc allocates from a specific OS heap. Frankly, you are unlikely to ever need to use eiither of them.
malloc is a Standard C (and C++) library function that allocates memory to your process. Implementations of malloc will typically use one of the OS APIs to create a pool of memory when your app starts and then allocate from it as you make malloc requests
new is a Standard C++ operator which allocates memory and then calls constructors appropriately on that memory. It may be implemented in terms of malloc or in terms of the OS APIs, in which case it too will typically create a memory pool on application startup.

VirtualAlloc ===> sbrk() under UNIX
HeapAlloc ====> malloc() under UNIX

VirtualAlloc => Allocates straight into virtual memory, you reserve/commit in blocks. This is great for large allocations, for example large arrays.
HeapAlloc / new => allocates the memory on the default heap (or any other heap that you may create). This allocates per object and is great for smaller objects. The default heap is serializable therefore it has guarantee thread allocation (this can cause some issues on high performance scenarios and that's why you can create your own heaps).
malloc => uses the C runtime heap, similar to HeapAlloc but it is common for compatibility scenarios.
In a nutshell, the heap is just a chunk of virtual memory that is governed by a heap manager (rather than raw virtual memory)
The last model on the memory world is memory mapped files, this scenario is great for large chunk of data (like large files). This is used internally when you open an EXE (it does not load the EXE in memory, just creates a memory mapped file).

Java Card memory leak in for loop?

I know that Java Card VM's doesn't have have a garbage collector, but what happens with a for loop:
for(short x=0;x<10;x++)
{}
Does the x variable get utilized after the for loop, or it turns into garbage?
Just in case I have a transient byte array called index from size of 2 (instead of i in for loop) and I use the array in for loops:
for(index[0]=0;index[0]<10;index[0]++)
{}
But it is a little slower than the first version. If I use a normal variable for the index in a for loop then it gets really slow.
So, what happens with the x variable in the first for loop? Is it safe to use for loops like that, or not?

The x variable does not really exist in byte code. There are operations on a location in the Java stack that represents x (be it Java byte code or the code after conversion by the Java Card converter). Now the Java stack is a virtual stack. This stack is implemented on a CPU that has registers and a non-virtual stack. In general, if there are enough registers available, then the variable x is simply put in a register until it is out of scope. The register may of course be reused. The CPU stack itself is a LIFO (last in first out) queue in transient memory (RAM). The stack continuously grows and shrinks during the execution of the byte code that makes up your Applet. Like registers, the stack memory is reused over and over again. All the local variables (those defined inside code blocks as well as method arguments) are treated this way.
If you put your variable in a transient array then you are putting the variable on a RAM based heap. The Java Card RAM heap will never go out of scope. That means that if you update the value that the change needs to be written to transient memory. That is of course slower than a localized update of a CPU register, as you found by experimentation. Usually the memory in the transient memory is never freed. That said, you can of course reuse the memory for other purposes, as long as you have a reference to the array. Note that the references themselves (the index in index[0]) may be either in persistent memory (EEPROM or flash) or transient memory.
It's unclear what you mean with "normal variable". If this is something that has been created with new or if it is a field in an object instance then it persists in the heap in persistent memory (EEPROM or flash). EEPROM and flash have limited amount of write cycles and writing to EEPROM or flash is much much slower than writing to RAM.
Java Card contains two kinds of transient memory: CLEAR_ON_RESET and CLEAR_ON_DESELECT. The difference between the two is that the CLEAR_ON_RESET allows memory to be shared between Applet instances while the CLEAR_ON_DESELECT allows memory to be reused by different Applets.
Java Card classic doesn't contain a garbage collector that runs during Applet execution, you can usually only request garbage collection during startup using JCSystem.requestObjectDeletion() which will clean up the memory that is not referenced anymore, both on the heap in transient memory as well as in persistent memory. Cleaning up the memory means scanning all the memory, marking all the unreferenced blocks and then compacting the memory. This is similar to defragmenting a hard disk; it can take a uncomfortably long time.
ROM is filled during the manufacturing phase. It may contain the operating system, the Java Card API implementation, byte code (including constants) of pre-loaded applets etc. It only be read in the field, so it isn't of any consequence to the question asked.

Let us have a short introduction about the Memory. In brief, there is 3 types of memories in the Smart cards as below:
ROM (And sometimes FLASH)
EEPROM
RAM
ROM:
The card's OS and Java Card APIs and some factory proprietary packages stored here. The contents of this memory is fixed and you can't modify it. Writing in this memory happens only once in the chip production and the process is named Masking.
EEPROM:
This is modifiable memory that your applets load into and it is consist of 4 sections named as below:
Text : also known as code segment contains the machine instructions of the program. The code can be thought of like the text of a novel: It tells the story of what the program does
Data : contains the static data of the program, i.e. the variables that exist throughout program execution. Global variables in a C or C++ program are static, as are variables declared as static in C, C++, or Java.
Heap : is a pool of memory used for dynamically allocated memory, such as with malloc() in C or new in C++ and Java.
Stack : contains the system stack, which is used as temporary storage.
A power-less (Card tearing for example) doesn't have any effect on the contents of this memory.
RAM:
This is a modifiable type of memory also. There is three main difference between RAM and EEPROM:
RAM is really faster than EEPROM. (1000 times faster)
Contents of RAM will destroyed in the power-loss.
The number of writing in EEPROM is limited (Typically 100.000 times.) while RAM has a really higher number.
And What now?
When you write for(short x=0; x<10; x++), you define x as a local variable. Local variables stores in Stack. The stack pointer will reset on the power loss and the used stack part will reclaim. So the main problem of the absence of Garbage Collector is about Heap.
i.e when you define a local variable using new keyword, you specify that part of Heap to a local variable for ever. When the Runtime Environment finished that method, the object will destroy and become unavailable, while that section of Heap doesn't reclaimed. So you will lose that part of Heap. The case that you used for yourfor loop, seems OK and doesn't make any problem because you didn't use new keyword.
Note that in newer versions of Java Card (2.2.2 and higher) there is a manual Garbage Collector (look JCSystem.requestObjectDeletion documentation). But consider that it is really slow and even dangerous in some situations(Look Java Card power less during garbage collection question).

Linux page allocation

In linux if malloc can't allocate data chunk from preallocated page. it uses mmap() or brk() to assign new pages. I wanted to clarify a few things :
I don't understand the following statment , I thought that when I use mmap() or brk() the kernel maps me a whole page. but the allocator alocates me only what I asked for from that new page? In this case trying to access unallocated space (In the newly mapped page) will not cause a page fault? (I understand that its not recommended )
The libc allocator manages each page: slicing them into smaller blocks, assigning them to processes, freeing them, and so on. For example, if your program uses 4097 bytes total, you need to use two pages, even though in reality the allocator gives you somewhere between 4105 to 4109 bytes
How does the allocator knows the VMA bounders ?(I assume no system call used) because the VMA struct that hold that information can only get accessed from kernel mode?

The system-level memory allocation (via mmap and brk) is all page-aligned and page-sized. This means if you use malloc (or other libc API that allocates memory) some small amount of memory (say 10 bytes), you are guaranteed that all the other bytes on that page of memory are readable without triggering a page fault.
Malloc and family do their own bookkeeping within the pages returned from the OS, so the mmap'd pages used by libc also contain a bunch of malloc metadata, in addition to whatever space you've allocated.
The libc allocator knows where everything is because it invokes the brk() and mmap() calls. If it invokes mmap() it passes in a size, and the kernel returns a start address. Then the libc allocator just stores those values in its metadata.
Doug Lea's malloc implementation is a very, very well documented memory allocator implementation and its comments will shed a lot of light on how allocators work in general:
http://g.oswego.edu/dl/html/malloc.html

Howto create 100M Byte Buffer

I am testing the throughput of an interface on Linux. I am using the DMA todo the data transfer. DMA needs contiguous memory location. But the kmalloc is unable to allocate more then 1MB. Is there any other way to create big buffer location upto 100M Bytes?

I thought kmalloc couldn't allocate over 128kB, how did you get it to allocate 1MB ?
Anyway, assuming you're working on a freshly booted system, you can reserve up to 2048 contiguous pages. Pages are generally 4k, so this makes 8MB.
_get_free_pages(_GFP_DMA, get_order(2048));
However, if you need more memory, you should do the allocation at boot-time.
If you are writing a driver, this can be achieved with the alloc_bootmem_* functions.
If you are writing a module, you have to pass mem= argument to your kernel and later use ioremap.
For example, if you have 2GB, you can pass mem=1GB to forbid the kernel from using the upper 1GB, and later call ioremap(0x40000000, 0x40000000) to get access to the upper 1GB, just for you.
But you know, you should just split your huge buffer into many small ones, it'll be much easier and much more like real-life applications.

Does calling free or delete ever release memory back to the "system"

Here's my question: Does calling free or delete ever release memory back to the "system". By system I mean, does it ever reduce the data segment of the process?
Let's consider the memory allocator on Linux, i.e ptmalloc.
From what I know (please correct me if I am wrong), ptmalloc maintains a free list of memory blocks and when a request for memory allocation comes, it tries to allocate a memory block from this free list (I know, the allocator is much more complex than that but I am just putting it in simple words). If, however, it fails, it gets the memory from the system using say sbrk or brk system calls. When a memory is free'd, that block is placed in the free list.
Now consider this scenario, on peak load, a lot of objects have been allocated on heap. Now when the load decreases, the objects are free'd. So my question is: Once the object is free'd will the allocator do some calculations to find whether it should just keep this object in the free list or depending upon the current size of the free list it may decide to give that memory back to the system i.e decrease the data segment of the process using sbrk or brk?
Documentation of glibc tells me that if the allocation request is much larger than page size, it will be allocated using mmap and will be directly released back to the system once free'd. Cool. But let's say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what?
From what I know (correct me please), a memory allocated with malloc will never be released back to the system ever until the process ends i.e. the allocator will simply keep it in the free list if I free it. But the question that is troubling me is then, if I use a tool to see the memory usage of my process (I am using pmap on Linux, what do you guys use?), it should always show the memory used at peak load (as the memory is never given back to the system, except when allocated using mmap)? That is memory used by the process should never ever decrease(except the stack memory)? Is it?
I know I am missing something, so please shed some light on all this.
Experts, please clear my concepts regarding this. I will be grateful. I hope I was able to explain my question.

There isn't much overhead for malloc, so you are unlikely to achieve any run-time savings. There is, however, a good reason to implement an allocator on top of malloc, and that is to be able to trace memory leaks. For example, you can free all memory allocated by the program when it exits, and then check to see if your memory allocator calls balance (i.e. same number of calls to allocate/deallocate).
For your specific implementation, there is no reason to free() since the malloc won't release to system memory and so it will only release memory back to your own allocator.
Another reason for using a custom allocator is that you may be allocating many objects of the same size (i.e you have some data structure that you are allocating a lot). You may want to maintain a separate free list for this type of object, and free/allocate only from this special list. The advantage of this is that it will avoid memory fragmentation.

No.
It's actually a bad strategy for a number of reasons, so it doesn't happen --except-- as you note, there can be an exception for large allocations that can be directly made in pages.
It increases internal fragmentation and therefore can actually waste memory. (You can only return aligned pages to the OS, so pulling aligned pages out of a block will usually create two guaranteed-to-be-small blocks --smaller than a page, anyway-- to either side of the block. If this happens a lot you end up with the same total amount of usefully-allocated memory plus lots of useless small blocks.)
A kernel call is required, and kernel calls are slow, so it would slow down the program. It's much faster to just throw the block back into the heap.
Almost every program will either converge on a steady-state memory footprint or it will have an increasing footprint until exit. (Or, until near-exit.) Therefore, all the extra processing needed by a page-return mechanism would be completely wasted.

It is entirely implementation dependent. On Windows VC++ programs can return memory back to the system if the corresponding memory pages contain only free'd blocks.

I think that you have all the information you need to answer your own question. pmap shows the memory that is currenly being used by the process. So, if you call pmap before the process achieves peak memory, then no it will not show peak memory. if you call pmap just before the process exits, then it will show peak memory for a process that does not use mmap. If the process uses mmap, then if you call pmap at the point where maximum memory is being used, it will show peak memory usage, but this point may not be at the end of the process (it could occur anywhere).
This applies only to your current system (i.e. based on the documentation you have provided for free and mmap and malloc) but as the previous poster has stated, behavior of these is implmentation dependent.

This varies a bit from implementation to implementation.
Think of your memory as a massive long block, when you allocate to it you take a bit out of your memory (labeled '1' below):
111
If I allocate more more memory with malloc it gets some from the system:
1112222
If I now free '1':
___2222
It won't be returned to the system, because two is in front of it (and memory is given as a continous block). However if the end of the memory is freed, then that memory is returned to the system. If I freed '2' instead of '1'. I would get:
111
the bit where '2' was would be returned to the system.
The main benefit of freeing memory is that that bit can then be reallocated, as opposed to getting more memory from the system. e.g:
33_2222

I believe that the memory allocator in glibc can return memory back to the system, but whether it will or not depends on your memory allocation patterns.
Let's say you do something like this:
void *pointers[10000];
for(i = 0; i < 10000; i++)
pointers[i] = malloc(1024);
for(i = 0; i < 9999; i++)
free(pointers[i]);
The only part of the heap that can be safely returned to the system is the "wilderness chunk", which is at the end of the heap. This can be returned to the system using another sbrk system call, and the glibc memory allocator will do that when the size of this last chunk exceeds some threshold.
The above program would make 10000 small allocations, but only free the first 9999 of them. The last one should (assuming nothing else has called malloc, which is unlikely) be sitting right at the end of the heap. This would prevent the allocator from returning any memory to the system at all.
If you were to free the remaining allocation, glibc's malloc implementation should be able to return most of the pages allocated back to the system.
If you're allocating and freeing small chunks of memory, a few of which are long-lived, you could end up in a situation where you have a large chunk of memory allocated from the system, but you're only using a tiny fraction of it.

Here are some "advantages" to never releasing memory back to the system:
Having already used a lot of memory makes it very likely you will do so again, and
when you release memory the OS has to do quite a bit of paperwork
when you need it again, your memory allocator has to re-initialise all its data structures in the region it just received
Freed memory that isn't needed gets paged out to disk where it doesn't actually make that much difference
Often, even if you free 90% of your memory, fragmentation means that very few pages can actually be released, so the effort required to look for empty pages isn't terribly well spent

Many memory managers can perform TRIM operations where they return entirely unused blocks of memory to the OS. However, as several posts here have mentioned, it's entirely implementation dependent.
But lets say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what ?
This depends on your allocation pattern. Do you free ALL of the small allocations? If so and if the memory manager has handling for a small block allocations, then this may be possible. However, if you allocate many small items and then only free all but a few scattered items, you may fragment memory and make it impossible to TRIM blocks since each block will have only a few straggling allocations. In this case, you may want to use a different allocation scheme for the temporary allocations and the persistant ones so you can return the temporary allocations back to the OS.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio