Is MAP_HUGETLB a synonym for coherent memory (when successful)? - linux-kernel

Am I correct to assume that memory mmap'd with MAP_HUGETLB|MAP_ANONYMOUS is actually 100% physically coherent (i.e. contiguous), at least at the huge page size, 2MB or 1GB?
Otherwise I don't see how it could work or be performant, since the TLB would need more entries...

Yes, they are. Indeed, as you point out, if they weren't, multiple page table entries would be needed for a single huge page, which would defeat the entire purpose of having a huge page.
Here's an excerpt from Documentation/admin-guide/mm/hugetlbpage.rst:
The default for the allowed nodes--when the task has default memory
policy--is all on-line nodes with memory. Allowed nodes with
insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the
discussion below <mem_policy_and_hp_alloc> of the interaction
of task memory policy, cpusets and per node attributes with the
allocation and freeing of persistent huge pages.
The success or failure of huge page allocation depends on the amount of physically contiguous memory that is present in system at the time
of the allocation attempt. If the kernel is unable to allocate huge
pages from some nodes in a NUMA system, it will attempt to make up the
difference by allocating extra pages on other nodes with sufficient
available contiguous memory, if any.
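For illustration, here is a minimal user-space sketch of the mapping the question describes. It assumes 2MB huge pages have already been reserved (e.g. via /proc/sys/vm/nr_hugepages); otherwise the call simply fails:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* assumes 2MB huge pages are configured */

int main(void)
{
    /* Request one anonymous huge page; on success the kernel backs it with a
     * single physically contiguous 2MB frame, covered by one TLB entry. */
    void *buf = mmap(NULL, HUGE_PAGE_SIZE,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* e.g. no huge pages reserved */
        return EXIT_FAILURE;
    }

    memset(buf, 0, HUGE_PAGE_SIZE);    /* touch the mapping to fault it in */
    munmap(buf, HUGE_PAGE_SIZE);
    return EXIT_SUCCESS;
}
```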
See also: How do I allocate a DMA buffer backed by 1GB HugePages in a linux kernel module?

Related

Variable allocation and tracking

I recently started searching and reading about ALDS (algorithms and data structures) and memory management after a question about memory allocation occurred to me, and after a couple of days of study I have learnt a lot about memory management, but the actual question remains unanswered.
The question is: while allocating memory to a variable, how exactly does the system know which block of memory is in use and which is free? Similarly, when we destruct an object, set a variable to null, or the GC frees up some memory, what exactly happens to that block of memory? As far as I know, the actual data is never erased on deletion; the block just gets marked as free somewhere in some table. But does that table keep track of each and every bit of memory? If so, wouldn't that become a lot of data in itself to store?
For example, if I declare a linked list, a block will be allocated on the heap with its next pointer set to null, as there is no other node to reference. As I keep adding nodes, the system keeps allocating more blocks, each containing a reference to the next one. These blocks can end up at random locations depending on the availability of memory at allocation time, and can only be reached through their preceding nodes.
So, for any given block of memory, how does the system know whether it is free and just holds garbage, or whether it is actually a node of some linked list?
On a modern operating system the process has a logical, linear address space. Part of that address space is reserved for the system and is common to all processes. Some of the address space may be reserved but most of the remainder is available to the process.
The address space is defined by PAGE TABLES. The structure of the page table is defined by the processor, but the operating system maintains a table for each process. Memory is allocated to a process in PAGES. The smallest page size I am aware of is 512 bytes, but the size can go up to a megabyte or even larger on some processors and in some processor configurations. The size is always a power of 2.
The page table defines:
Whether a page has actually been mapped to the process
Whether the page has a corresponding physical memory location
If so, the mapping to that physical location.
The operating system only knows about pages.
At the next level down there are memory managers. These are not part of the operating system. Memory managers manage heaps that consist of pages allocated by the operating system. The memory manager has to keep track of the heap size and of what memory has been allocated within it.
Memory managers operate in a huge number of different ways. There are malloc/free implementations galore that you can link into your code to get different behaviors.
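To make the bookkeeping concrete, here is a hedged, minimal free-list sketch (illustrative only, not how any particular malloc works): each block carries a small header, so the allocator tracks block boundaries and a free flag rather than every byte, and "freeing" just flips the flag without erasing the data.

```c
#include <stddef.h>
#include <stdio.h>

/* Each heap block starts with a header describing it. */
struct block {
    size_t size;          /* payload size in bytes */
    int free;             /* 1 if the block is available */
    struct block *next;   /* next block in the heap, in address order */
};

static struct block *heap_head;   /* assumed to be set up from pages given by the OS */

/* First-fit search: walk the list and reuse the first free block that fits. */
static struct block *find_free(size_t size)
{
    for (struct block *b = heap_head; b != NULL; b = b->next)
        if (b->free && b->size >= size)
            return b;
    return NULL;          /* a real allocator would grow the heap here */
}

/* "Freeing" only flips the flag; the old bytes are left in place. */
static void release(struct block *b)
{
    b->free = 1;
}

int main(void)
{
    /* Two fake blocks standing in for a real heap, for demonstration. */
    static struct block b2 = { 64, 1, NULL };
    static struct block b1 = { 32, 0, &b2 };
    heap_head = &b1;

    struct block *hit = find_free(48);
    printf("found a free block of %zu bytes\n", hit ? hit->size : 0);
    release(&b1);
    return 0;
}
```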

Whole memory cycle in executing a program

I have been thinking about how information (data) flows while executing any program or query.
I used the points below to lay out my assumptions:
All data is stored in disk storage.
The whole platter of the disk is divided into many sectors, and sectors are divided into blocks. Blocks are divided into pages, and pages are tracked in a page table with a sequence ID.
The most frequently used data are stored in cache for faster access.
If the data is not found in the cache, the program checks main memory, and if a page fault occurs, it goes to disk storage.
Virtual Memory is used as an address mapping from RAM to disk storage.
Do you think I am missing anything here? Is my assumption correct regarding how memory management works? I would appreciate any helpful comments. Thank you.
I think you are mixing too many things up together.
All data is stored in disk storage.
In most disk based operating systems, all user data (and sometimes kernel data) is stored on disk (somewhere) and mapped to memory.
The whole platter of the disk is divided into many sectors, and sectors are divided into blocks. Blocks are divided into pages, and pages are tracked in a page table with a sequence ID.
No.
Most disks these days use logical I/O so that the software only sees blocks, not tracks, sectors, and platters (as in ye olde days).
Blocks exist only on disk. Pages exist only in memory. Blocks are not divided into pages.
The most frequently used data are stored in cache for faster access.
There are two common caches. I cannot tell which you are referring to. One is the CPU cache (hardware) and the other is software caches maintained by the operating system.
If the data is not found in the cache, the program checks main memory, and if a page fault occurs, it goes to disk storage.
No.
This sounds like you are referring to the CPU cache. Page faults are triggered when the page table is consulted and the page's entry is not valid; they are a separate mechanism from CPU cache misses.
Virtual Memory is used as an address mapping from RAM to disk storage.
Logical memory mapping is used to map logical pages to physical page frames. Virtual memory is used to map logical pages to disk storage.
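As a small illustration of that split, the sketch below uses the standard Linux mincore() call to ask, for each logical page of a mapping, whether a physical page frame currently backs it; pages that are not resident would be filled in from backing store on the next access:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = 8 * page;

    /* Eight logical pages of anonymous memory; no physical frames yet. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    buf[0] = 1;   /* touch only the first page: it gets a physical frame */

    unsigned char vec[8];
    if (mincore(buf, len, vec) == 0)
        for (int i = 0; i < 8; i++)
            printf("page %d: %s\n", i, (vec[i] & 1) ? "resident in RAM" : "not resident");

    munmap(buf, len);
    return EXIT_SUCCESS;
}
```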

Large memory block allocation and 4K blocks

Consider this quote from Mark Russinovich's book on Windows internals. It is about the large-page allocation mechanism, intended for allocating large non-paged memory blocks in physical memory:
http://books.google.com/books?id=CdxMRjJksScC&pg=PA194&lpg=PA194#v=onepage
Attempts to allocate large pages may fail after the operating system
has been running for an extended period, because the physical memory
for each large page must occupy a significant number (see Table 10-1)
of physically contiguous small pages, and this extent of physical
pages must furthermore begin on a large page boundary. (For example,
physical pages 0 through 511 could be used as a large page on an x64
system, as could physical pages 512 through 1,023, but pages 10
through 521 could not.) Free physical memory does become fragmented as
the system runs. This is not a problem for allocations using small
pages but can cause large page allocations to fail.
If I understand this correctly, he's saying that fragmentation produced by scattered 4K pages can prevent successful allocation of large 2M pages in physical memory. But why? Ordinary 4K physical pages are easily relocatable and can also be easily swapped out. In other words, if we have a physical memory region not occupied by other 2M pages, we can always "clean it up": make it available by relocating any interfering 4K pages from that physical memory region to some other location. I.e. from the "naive" point of view, 2M allocations should "always succeed", as long as we have enough free physical RAM.
What is wrong with my logic? What exactly is Mark talking about when he says that physical memory fragmentation caused by 4K pages can prevent successful allocation of large pages?
It actually worked this way in Windows XP. But the cost was too prohibitive and a design change in Vista disabled this approach. Explained well in this blog post, I'll quote the essential part:
In Windows Vista, the memory manager folks recognized that these long delays made very large pages less attractive for applications, so they changed the behavior so requests for very large pages from applications went through the "easy parts" of looking for contiguous physical memory, but gave up before the memory manager went into desperation mode, preferring instead just to fail.
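To see that behavior from an application's point of view, here is a hedged sketch using the Windows VirtualAlloc/GetLargePageMinimum APIs. It assumes the account running it has the SeLockMemoryPrivilege ("Lock pages in memory") right; without it, or on a long-running fragmented system, the call simply fails:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Minimum large-page size supported by the hardware (2MB on typical x64). */
    SIZE_T large = GetLargePageMinimum();
    if (large == 0) {
        printf("large pages are not supported\n");
        return 1;
    }

    /* Ask for one large page. On Vista and later the memory manager only does
     * the "easy" search for contiguous physical memory and otherwise fails. */
    void *buf = VirtualAlloc(NULL, large,
                             MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                             PAGE_READWRITE);
    if (buf == NULL) {
        printf("large-page allocation failed: error %lu\n", GetLastError());
        return 1;
    }

    printf("got a %llu-byte large page at %p\n", (unsigned long long)large, buf);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
```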
He's talking about a specific problem that exists with allocating and freeing contiguous memory blocks over time, and you're describing a solution. Nothing is wrong with your logic and that's roughly what the .NET Garbage Collector does to reduce memory fragmentation. You're spot on.
If you have 10 seats per row at a baseball game and seats 2, 4, 6, and 8 are taken (fragmented), you will never be able to get 3 seats in that row for you and your friends unless you ask someone to move (compacted).
There's nothing special about the 4k blocks he's describing.

Difference between ZRAM and ZSWAP

Does anyone know what the difference is between the ZRAM and ZSWAP features in the Linux kernel? They seem very similar -- both store compressed pages in RAM.
zram
Status: Available in mainline kernel as of version 3.14 (March 2014)
Implementation: compressed block device, memory is dynamically
allocated as data is stored
Usage: Configure a zram block device as a swap device to eliminate the need
for a physical swap device or swap file
Benefits:
Eliminates the need for a physical swap device. This became popular when
netbooks first showed up: zram (then compcache) allowed users to avoid
having swap shorten the lifespan of the SSDs in these memory-constrained
systems.
A zram block device can be used for applications other than swap;
conceivably anything you might use a block device for.
Drawbacks:
Once a page is stored in zram it will remain there until it is paged in or
invalidated. The first pages to be paged out will be the oldest pages (the
LRU list); these are 'cold' pages that are infrequently accessed. As the
system continues to swap, it moves on to warmer (more frequently accessed)
pages, and these may not be able to be stored because of the swap slots
consumed by the cold pages. What zram cannot do (compcache had the option
to configure a backing block device) is evict pages out to a physical disk.
Ideally you want to age data out of the in-kernel compressed swap space to
disk, so that kernel memory can be used for caching warm swap pages or
freed for more productive use.
zswap
Status: Available in mainline kernel as of version 3.11 (September 2013)
Implementation: compressed in-kernel cache for swap pages. The cache is
compressed, the compression algorithm is pluggable via the CryptoAPI, and
the storage for pages is dynamically allocated. Older pages can be evicted
to disk, making this a sort of write-behind cache.
Usage: Cache swap pages destined for regular swap devices (or swap
files).
Benefits:
Integration with the swap code (using the Frontswap API) allows zswap to
choose to store only pages that compress well and to handle memory
allocation failures; in those cases pages are sent to the backing swap
device.
The oldest pages in the cache are pushed out to the backing swap device to
make room for newer pages; this solves the LRU inversion problem that a
lack of page eviction would otherwise present.
Drawbacks:
Needs a physical swap device (or swapfile).
ZRAM is a module of the Linux kernel, previously called "compcache". ZRAM increases performance by avoiding paging to disk, instead using a compressed block device in RAM in which paging takes place until it becomes necessary to use the swap space on the hard disk drive. Since RAM is faster than disk, zram allows Linux to make more use of RAM when swapping/paging is required, especially on older computers with less RAM installed.
ZSWAP is a lightweight compressed cache for swap pages. It takes pages that are
in the process of being swapped out and attempts to compress them into a
dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
for potentially reduced swap I/O. This trade-off can also result in a
significant performance improvement if reads from the compressed cache are
faster than reads from a swap device.

Why is the kernel concerned about issuing PHYSICALLY contiguous pages?

When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.
Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research and, across a few sources, uncovered the following reasons:
1) makes better use of the cache & achieves lower avg memory access times (GigaQuantum: I don’t understand: how?)
2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)
3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)
Per the comments I inserted, I don't really understand these 3 reasons. Nor did any of my research sources adequately explain/justify these 3 reasons. Can anyone explain these in a little more detail?
Thanks! Will help me to better understand the kernel...
The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.
The large, linear mapping of physical memory is efficient: both because large pages can be used for it (which take up less space for page table entries and less TLB entries), and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).
Allocations that are only logically linear can be requested, using the vmalloc() interface rather than kmalloc().
On 64-bit systems the kernel's mapping can encompass the entirety of physical memory; on 32-bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.
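A hedged, minimal kernel-module sketch contrasting the two interfaces mentioned above (the module scaffolding and names like contig_demo_init are made up for illustration): kmalloc() hands back physically contiguous memory already covered by the kernel's linear mapping, while vmalloc() only guarantees virtual contiguity and has to edit kernel page tables for each allocation.

```c
#include <linux/module.h>
#include <linux/slab.h>      /* kmalloc()/kfree() */
#include <linux/vmalloc.h>   /* vmalloc()/vfree() */

static void *contig_buf;  /* physically contiguous, from the linear mapping */
static void *virt_buf;    /* only virtually contiguous */

static int __init contig_demo_init(void)
{
    contig_buf = kmalloc(64 * 1024, GFP_KERNEL);   /* physically contiguous */
    virt_buf   = vmalloc(16 * 1024 * 1024);        /* kernel page tables edited here */
    if (!contig_buf || !virt_buf) {
        kfree(contig_buf);   /* both calls are safe with NULL */
        vfree(virt_buf);
        return -ENOMEM;
    }
    return 0;
}

static void __exit contig_demo_exit(void)
{
    kfree(contig_buf);
    vfree(virt_buf);
}

module_init(contig_demo_init);
module_exit(contig_demo_exit);
MODULE_LICENSE("GPL");
```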
Actually, the memory allocation behavior you describe is common to many OS kernels, and the main reason is the kernel's physical page allocator. Typically, the kernel has one physical page allocator that is used to allocate pages for both kernel space (including pages for DMA) and user space. In kernel space you need contiguous memory, because it is expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, such remapping would be pointless, because the kernel can already see the whole address space (on 32-bit systems there is a 4G limit on virtual address space, so typically the top 1G is dedicated to the kernel and the bottom 3G to user space).
The Linux kernel uses the buddy algorithm for page allocation, so allocating a bigger chunk takes fewer iterations than allocating a smaller chunk (well, smaller chunks are obtained by splitting bigger chunks). Moreover, using one allocator for both kernel space and user space allows the kernel to reduce fragmentation. Imagine that you allocated pages for user space one page per iteration: if user space needs N pages, you make N iterations. What happens if the kernel then wants some contiguous memory? How could it build a big enough contiguous chunk if you stole one page from each big chunk and gave them to user space?
[update]
Actually, the kernel allocates contiguous blocks of memory for user space less frequently than you might think. Sure, it allocates them when it builds the ELF image of a file, when it does readahead as a user process reads a file, for IPC operations (pipes, socket buffers), or when the user passes the MAP_POPULATE flag to the mmap syscall. But typically the kernel uses a "lazy" page loading scheme. It gives a contiguous range of virtual memory to user space (when the user first calls malloc or calls mmap), but it doesn't fill that range with physical pages; it allocates pages only when a page fault occurs. The same is true when a user process forks. In this case the child process initially has a "read-only" (copy-on-write) view of the address space. When the child modifies some data, a page fault occurs and the kernel replaces the page in the child's address space with a new one (so that parent and child now have different pages). Typically the kernel allocates only one page in these cases.
Of course there is the big question of memory fragmentation. Kernel space always needs contiguous memory. If the kernel allocated pages for user space from "random" physical locations, it would be much harder to get a big chunk of contiguous memory in the kernel after some time (for example, after a week of system uptime). Memory would be too fragmented in that case.
To mitigate this, the kernel uses a "readahead" scheme. When a page fault occurs in the address space of some process, the kernel allocates and maps more than one page (because there is a good chance the process will read/write data on the next page). And of course it uses a physically contiguous block of memory (if possible) in this case, just to reduce potential fragmentation.
A couple of reasons I can think of:
DMA hardware often accesses memory in terms of physical addresses. If you have multiple pages' worth of data to transfer from hardware, you're going to need a contiguous chunk of physical memory to do so (see the sketch after this list). Some older DMA controllers even require that memory to be located at low physical addresses.
It allows the OS to leverage large pages. Some memory management units allow you to use a larger page size in your page table entries. This allows you to use fewer page table entries (and TLB slots) to access the same quantity of virtual memory. This reduces the likelihood of a TLB miss. Of course, if you want to allocate a 4MB page, you're going to need 4MB of contiguous physical memory to back it.
Memory-mapped I/O. Some devices could be mapped to I/O ranges that require a contiguous range of memory that spans multiple frames.
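For the DMA point above, here is a hedged kernel-side sketch. dma_alloc_coherent() is the real kernel API for this; the function names here (demo_setup_dma, demo_teardown_dma) and the assumption that the struct device comes from a driver's probe routine are illustrative only.

```c
#include <linux/module.h>
#include <linux/dma-mapping.h>

static void *cpu_buf;        /* kernel virtual address of the buffer */
static dma_addr_t bus_addr;  /* address the device will be programmed with */

/* Would be called from a real driver's probe routine with its struct device. */
static int demo_setup_dma(struct device *dev)
{
    /* One physically contiguous 64KB buffer; the device only gets a start address. */
    cpu_buf = dma_alloc_coherent(dev, 64 * 1024, &bus_addr, GFP_KERNEL);
    if (!cpu_buf)
        return -ENOMEM;
    /* ... program the device with bus_addr here ... */
    return 0;
}

static void demo_teardown_dma(struct device *dev)
{
    dma_free_coherent(dev, 64 * 1024, cpu_buf, bus_addr);
}

MODULE_LICENSE("GPL");
```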
Whether you request contiguous or non-contiguous memory from the kernel depends on your application.
Example of contiguous memory allocation: if you need to perform a DMA operation, you will request contiguous memory through a kmalloc() call, because a DMA operation requires memory that is also physically contiguous; with DMA you provide only the starting address of the memory chunk and the other device reads from or writes to that location.
Some operations do not require contiguous memory, so you can request a memory chunk through vmalloc(), which returns a pointer to memory that is not physically contiguous.
So it entirely depends on the application that is requesting the memory.
Please remember that it is good practice to request contiguous memory only when it is actually needed, since the kernel has to try hard to allocate memory that is physically contiguous. Of course, kmalloc() and vmalloc() have their limits as well.
Placing things we are going to read a lot physically close together takes advantage of spatial locality; the things we need are more likely to be cached.
Not sure about this one
I believe this means that if pages are not contiguous, the TLB has to do more work to find out where they all are. If they are contiguous, we can express all the pages of a process as PAGES_START + PAGE_OFFSET. If they aren't, we need to store a separate index for each page of a given process. Because the TLB has a finite size and we need to access more data, this means we will be swapping entries in and out a lot more.
The kernel does not strictly need physically contiguous pages; it needs them for efficiency and stability.
A monolithic kernel tends to have one page table for kernel space, shared among processes, and it does not want page faults in kernel space, because that would make kernel design too complex. That is why the usual implementation on 32-bit architectures is a 3G/1G split of the 4G address space.
For the 1G kernel space, normal mappings of code and data should not generate recursive page faults, which would be too complex to manage: you would need to find empty page frames, create the mapping in the MMU, and handle TLB flushes for new mappings on every kernel-side page fault, while the kernel is already busy handling user-side page faults.
Furthermore, a 1:1 linear mapping can use far fewer page table entries because it can use a bigger page size (>4KB), and fewer entries lead to fewer TLB misses.
So the buddy allocator on the kernel's linear address space always provides physically contiguous page frames, even though most code doesn't need contiguous frames. Moreover, many device drivers which do need contiguous page frames already assume that buffers allocated through the general kernel allocator are physically contiguous.
