Difference between ZRAM and ZSWAP - linux-kernel

Does anyone know what is the difference between feature ZRAM and ZSWAP in linux kernel? Seems they are very similar-- store compressed pages in ram.

zram
Status: Available in mainline kernel as of version 3.14 (March 2014)
Implementation: compressed block device, memory is dynamically
allocated as data is stored
Usage: Configure zram block device as a swap device to eliminate need
for physical swap defice or swap file
Benefits:
Eliminates need for physical swap device. This beame popular when
netbooks first showed up. Zram (then compcache) allowed users to
avoid swap shortening the lifespan of SSDs in these memory
constrained systems.
A zram block device can be used for other applications other than
swap, anything you might use a block device for conceivably.
Drawbacks:
Once a page is stored in zram it will remain there until paged in or
invalidated. The first pages to be paged out will be the oldest
pages (LRU list), these are 'cold' pages that are infrequently
access. As the system continues to swap it will move on to pages
that are warmer (more frequently accessed), these may not be able to
be stored because of the swap slots consumed by the cold pages. What
zram can not do (compcache had the option to configure a block
backing device) is to evict pages out to physical disk. Ideally you
want to age data out of the in-kernel compressed swap space out to
disk so that you can use kernel memory for caching warm swap pages
or free it for more productive use.
zswap
Status: Available in mainline kernel as of version 3.11 (September 2013)
Implementation: compressed in-kernel cache for swap pages. In-kernel
cache is compressed, the compression algorithm is pluggable using the
CryptoAPI and the storage for pages is dynamically allocated. Older
pages can be evicted to disk making this a sort of write-behind
cache.
Usage: Cache swap pages destined for regular swap devices (or swap
files).
Benefits:
Integration with swap code (using Frontswap API) allows zswap to
choose to store only pages that compress well and handle memory
allocation failures, in those cases pages are sent to the backing
swap device.
Oldest pages in the cache are pushed out to backing swap device to
make room for newer pages, this solves the LRU inversion problem
that a lack of page eviction would present.
Drawbacks:
Needs a physical swap device (or swapfile).

ZRAM is a module of the Linux kernel, previously called "compcache". ZRAM increases performance by avoiding paging on disk and instead uses a compressed block device in RAM in which paging takes place until it is necessary to use the swap space on the hard disk drive. Since using RAM is faster than using disks, zram allows Linux to make more use of RAM when swapping/paging is required, especially on older computers with less RAM installed.
ZSWAP is a lightweight compressed cache for swap pages. It takes pages that are
in the process of being swapped out and attempts to compress them into a
dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
for potentially reduced swap I/O. This trade-off can also result in a
significant performance improvement if reads from the compressed cache are
faster than reads from a swap device.

Related

Is MAP_HUGETLB synonym of coherent memory? (when successful)

Am I correct to assume the mmap'd memory using MAP_HUGETLB|MAP_ANONYMOUS is actually 100% physically coherent? at least on the huge page size, 2MB or 1GB.
Otherwise I don't know how it could work/be performant since the TLB would need more entries...
Yes, they are. Indeed, as you point out, in case they weren't, multiple page table entries would be needed for a single huge page, which would defeat the entire purpose of having a huge page.
Here's an excerpt from Documentation/admin-guide/mm/hugetlbpage.rst:
The default for the allowed nodes--when the task has default memory
policy--is all on-line nodes with memory. Allowed nodes with
insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the
discussion below <mem_policy_and_hp_alloc> of the interaction
of task memory policy, cpusets and per node attributes with the
allocation and freeing of persistent huge pages.
The success or failure of huge page allocation depends on the amount of physically contiguous memory that is present in system at the time
of the allocation attempt. If the kernel is unable to allocate huge
pages from some nodes in a NUMA system, it will attempt to make up the
difference by allocating extra pages on other nodes with sufficient
available contiguous memory, if any.
See also: How do I allocate a DMA buffer backed by 1GB HugePages in a linux kernel module?

What is the differences between swap space and cpu cache?

When I googled about TLB vs CPU cache, I found this graph.
According to the virtual memory mechanism, if we cannot find things in physical memory or disk cache, CPU will look for things in the disk(swap place). That's to say, if we extend this picture, a disk cache should be drawn inside main memory and a disk should be drawn after the main memory. Do I understand them correctly?
Reference: whats-difference-between-cpu-cache-and-tlb
everything about CPU cache is different / opposite from swap space. Hardware managed vs. SW managed, on die vs. even "farther" away than DRAM.
If you consider an old way of doing virtual memory (back in the days when it was recommended to have swap space = 2x DRAM), and you have an OS that truly allocates swap space as backing for all virtual memory allocations:
I guess you could sort of look at main memory as a backing store for CPU cache, similar to how swap space is the backing store for anonymous memory pages. (i.e. that isn't memory-mapped to a file on disk.)
The manual software-managed and software-visible nature of page faults does mean there are major differences, though.
One of the most important is that CPU cache (normally) caches based on physical address, while swap space is purely about virtual address space. You can never have swap space in physical address space (except with memory mapped non-volatile storage like an NV-DIMM...)

Will Windows be still able to allocate memory when free space in physical memory is very low?

On Windows 32-bit system the application is being developed using Visual Studio:
Lets say lots of other application running on my machine and they have occupied almost all of physical memory and only 1 MB memory is left free. If my application (which has not yet allocated any memory) tries to allocate, say 2 MB, will the call be successful?
My guess: In theory, each Windows application has 2GB of virtual memory available.
So I believe this call should be successful (regardless how much physical memory is available). But I am not sure on this. That's why asking here.
Windows gives a rock-hard guarantee that this will always work. A process can only allocate virtual memory when Windows can commit space in the paging file for the allocation. If necessary, it will grow the paging file to make the space available. If that fails, for example when the paging file grows beyond the preset limit, then the allocation fails as well. Windows doesn't have the equivalent of the Linux "OOM killer", it does not support over-committing that may require an operating system to start randomly killing processes to find RAM.
Do note that the "always works" clause does have a sting. There is no guarantee on how long this will take. In very extreme circumstances the machine can start thrashing where just about every memory access in the running processes causes a page fault. Code execution slows down to a crawl, you can lose control with the mouse pointer frozen when Explorer or the mouse or video driver start thrashing as well. You are well past the point of shopping for RAM when that happens. Windows applies quotas to processes to prevent them from hogging the machine, but if you have enough processes running then that doesn't necessarily avoid the problem.
Of course. It would be lousy design if memory had to be wasted now in order to be used later. Operating systems constantly re-purpose memory to its most advantageous use at any moment. They don't have to waste memory by keeping it free just so that it can be used later.
This is one of the benefits of virtual memory with a page file. Because the memory is virtual, the system can allocate more virtual memory than physical memory. Virtual memory that cannot fit in physical memory, is pushed out to the page file.
So the fact that your system may be using all of the physical memory does not mean that your program will not be able to allocate memory. In the scenario that you describe, your 2MB memory allocation will succeed. If you then access that memory, the virtual memory will be paged in to physical memory and very likely some other pages (maybe in your process, maybe in another process) will be pushed out to the page file.
Well, it will succeed as long as there's some memory for it - apart from physical memory, there's also the page file.
However, once you reach the limit of both RAM and the page file, you're done for and that's when the out of memory situation really starts being fun.
Now, systems like Windows Vista will try to use all of your available RAM, pretty much for caching. That's a good thing, and when there's a request for memory from an application, the cache will be thrown away as needed.
As for virtual memory, you can request much more than you have available, regardless of your RAM or page file size. Only when you commit the memory does it actually need some backing - either RAM or the page file. On 64-bit, you can easily request terabytes of virtual memory - that doesn't mean you'll get it when you try to commit it, though :P
If your application is unable to allocate a physical memory (RAM) block to store information, the operating system takes over and 'pages' or stores sections that are in RAM on disk to free up physical memory so that your program is able to perform the allocation. This is done automatically and is completely invisible to your applications.
So, in your example, on a system that has 1MB RAM free, if your application tries to allocate memory, the operating system will page certain contents of physical memory to disk and free up RAM for your application. Your application will not crash in this case.
This, obviously is much more complicated than that.
There are several ways to configure a page file on Windows (fixed size, variable size and on which disk). If you run out of physical memory, and out of hard drive space (because your page file has grown very large due to excessive 'paging') or reach the limit of your paging file (if it is a static limit) then your applications will fail due out an out-of-memory exception. With today's systems with large local storage however, this is a rare event.
Be sure to read about paging for the full picture. Check out:
http://en.wikipedia.org/wiki/Paging
In certain cases, you will notice that you have sufficient free physical memory. Say 100MB and your program tries to allocate a 10MB block to store a large object but fails. This is caused by physical memory fragmentation. Although the total free memory is 100MB, there is no single contiguous block of 10MB that can be used to store your object. This will result in an exception that needs to be handled in your code. If you allocate large objects in your code you may want to separate the allocation into smaller blocks to facilitate allocation, and then aggregate them back in your code logic. For example, instead of having a single 10m vector, you can declare 10 x 1m vectors in an array and allocate memory for each individual one.

Is it true, that modern OS may skip copy when realloc is called

While reading the https://stackoverflow.com/a/3190489/196561 I have a question. What the Qt authors says in the Inside the Qt 4 Containers:
... QVector uses realloc() to grow by increments of 4096 bytes. This makes sense because modern operating systems don't copy the entire data when reallocating a buffer; the physical memory pages are simply reordered, and only the data on the first and last pages need to be copied.
My questions are:
1) Is it true that modern OS (Linux - the most interesting for me; FreeBSD, OSX, Windows) and their realloc implementations really capable to realloc pages of data using reordering of virtual-to-physical mapping and without byte-by-byte copy?
2) What is the system call used to achieve this memory move? (I think it can be splice with SPLICE_F_MOVE, but it was flawed and no-op now (?))
3) Is it profitable to use such page shuffling instead of byte-by-byte copy, especially in multicore multithreaded world, where every change of virtual-to-physical mapping need to flush (invalidate) changed page table entries from TLBs in all tens of CPU cores with IPI? (In linux this is smth like flush_tlb_range or flush_tlb_page)
upd for q3: some tests of mremap vs memcpy
Is it true that modern OS (Linux - the most interesting for me; FreeBSD, OSX, Windows) and their realloc implementations really capable to realloc pages of data using reordering of virtual-to-physical mapping and without byte-by-byte copy?
What is the system call used to achieve this memory move? (I think it can be splice with SPLICE_F_MOVE, but it was flawed and no-op now (?))
See thejh's answer.
Who are the actors?
You have at least three actors with your Qt example.
Qt Vector class
glibc's realloc()
Linux's mremap
QVector::capacity() shows that Qt allocates more elements than required. This means that a typical addition of an element will not realloc() anything. The glibc allocator is based on Doug Lea's allocator. This is a binning allocator which supports the use of Linux's mremap. A binning allocator groups similar sized allocations in bins, so a typical random sized allocation will still have some room to grow without needing to call the system. Ie, the free pool or slack is located a the end of the allocated memory.
An answer
Is it profitable to use such page shuffling instead of byte-by-byte copy, especially in multicore multithreaded world, where every change of virtual-to-physical mapping need to flush (invalidate) changed page table entries from TLBs in all tens of CPU cores with IPI? (In linux this is smth like flush_tlb_range or flush_tlb_page)
First off, faster ... than mremap is mis-using mremap() as R notes there.
There are several things that make mremap() valuable as a primitive for realloc().
Reduced memory consumption.
Preserving page mappings.
Avoiding moving data.
Everything in this answer is based upon Linux's implementation, but the semantics can be transferred to other OS's.
Reduce memory consumption
Consider a naive realloc().
void *realloc(void *ptr, size_t size)
{
size_t old_size = get_sz(ptr); /* From bin, address, map table, etc */
if(size <= old_size) {
resize(ptr);
return ptr;
}
void * new_p = malloc(size);
if(new_p) {
memcpy(new_p, ptr, old_size); /* fully committed old_size + new size */
free(ptr);
}
return new_p;
}
In order to support this, you may need double the memory of the realloc() before you go to swap or simply fail to reallocate.
Preserve page mappings
Linux will by default map new allocations to a zero page; a 4k page full of zero data. This is useful for sparsely mapped data structures. If no one writes to the data page, then no physical memory is allocated besides a possible PTE table. These are copy on write or COW. By using the naive realloc(), these mappings will not be preserved and full physical memory is allocated for all zero pages.
If the task is involved in a fork(), the initial realloc() data maybe shared between parent and child. Again, COW will cause physical allocation of pages. The naive implementation will discount this and require separate physical memory per process.
If the system is under memory pressure, the existing realloc() pages may not be in physical memory but in swap. The naive realloc will cause disk reads of the swap page into memory, copy to the updated location, and then likely write the data out to the disk.
Avoid moving data
The issue you consider of updating TLBs is minimal compared to the data. A single TLB is typically 4 bytes and represents a page (4K) of physical data. If you flush the entire TLB for a 4GB system, that is 4MB of data that needs to be restored. Copying large amounts of data will blow the L1 and L2 caches. TLB fetches naturally pipeline better than d-cache and i-cache. It is rare that code will get two TLB misses in a row as most code is sequential.
CPUs are of two variants, VIVT (non-x86) and VIPT as per x86. The VIVT versions usually have mechanisms to invalidate single TLB entries. For a VIPT system, the caches should not need to be invalidated as they are physically tagged.
On a multi-core systems, it is atypical to run one process on all cores. Only cores with the process performing mremap() need to have page table updates. As a process is migrated to a core (typical context switch), it needs to have the page table migrated anyways.
Conclusion
You can construct some pathological cases where a naive copy will work better. As Linux (and most OS's) are for multi-tasking, more than one process will be running. Also, the worst case will be when swapping and the naive implementation will always be worse here (unless you have a disk faster than memory). For minimal realloc() sizes, the dlmalloc or QVector should have fallow space to avoid system level mremap(). A typical mremap() may just expand a virtual address range by growing the region with a random page from the free pool. It is only when the virtual address range must move that mremap() may need a tlb flush, with all the following being true,
The realloc() memory should not be shared with a parent or child process.
The memory should not be sparse (mostly zero or untouched).
The system should not be under memory pressure using swap.
The tlb flush and IPI needs to take place only if the same process is current on other cores. L1-cache loading in not needed for the mremap(), but is for the naive version. The L2 is usually shared between cores and will be current in all cases. The naive version will force the L2 to reload. The mremap() might leave some unused data out of the L2-cache; this is normally a good thing, but could be a disadvantage under some work loads. There are probably better ways to do this such as pre-fetching data.
For Linux, see man 2 mremap:
void *mremap(void *old_address, size_t old_size,
size_t new_size, int flags, ... /* void *new_address */);
mremap() expands (or shrinks) an existing memory mapping, potentially moving it at the same
time (controlled by the flags argument and the available virtual address space).

Why is the kernel concerned about issuing PHYSICALLY contiguous pages?

When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.
Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research &, across a few sources, uncovered the following reasons:
1) makes better use of the cache & achieves lower avg memory access times (GigaQuantum: I don’t understand: how?)
2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)
3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)
Per the comments I inserted, I don't really understand these 3 reasons. Nor did any of my research sources adequately explain/justify these 3 reasons. Can anyone explain these in a little more detail?
Thanks! Will help me to better understand the kernel...
The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.
The large, linear mapping of physical memory is efficient: both because large pages can be used for it (which take up less space for page table entries and less TLB entries), and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).
Allocations that are only logically linear can be requested, using the vmalloc() interface rather than kmalloc().
On 64 bit systems the kernel's mapping can encompass the entireity of physical memory - on 32 bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.
Actually the behavior of memory allocation you describe is common for many OS kernels and the main reason is kernel physical pages allocator. Typically, kernel has one physical pages allocator that is used for allocation of pages for both kernel space (including pages for DMA) and user space. In kernel space you need continuos memory, because it's expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, it's completely worthless because kernel can see the whole address space (on 32bit systems there's 4G limitation of virtual address space, so typically top 1G are dedicated to kernel and bottom 3G to user-space).
Linux kernel uses buddy algorithm for page allocation, so that allocation of bigger chunk takes fewer iterations than allocation of smaller chunk (well, smaller chunks are obtained by splitting bigger chunks). Moreover, using of one allocator for both kernel space and user space allows the kernel to reduce fragmentation. Imagine that you allocate pages for user space by 1 page per iteration. If user space needs N pages, you make N iterations. What happens if kernel wants some continuos memory then? How can it build big enough continuos chunk if you stole 1 page from each big chunk and gave them to user space?
[update]
Actually, kernel allocates continuos blocks of memory for user space not as frequently as you might think. Sure, it allocates them when it builds ELF image of a file, when it creates readahead when user process reads a file, it creates them for IPC operations (pipe, socket buffers) or when user passes MAP_POPULATE flag to mmap syscall. But typically kernel uses "lazy" page loading scheme. It gives continuos space of virtual memory to user-space (when user does malloc first time or does mmap), but it doesn't fill the space with physical pages. It allocates pages only when page fault occurs. The same is true when user process does fork. In this case child process will have "read-only" address space. When child modifies some data, page fault occurs and kernel replaces the page in child address space with a new one (so that parent and child have different pages now). Typically kernel allocates only one page in these cases.
Of course there's a big question of memory fragmentation. Kernel space always needs continuos memory. If kernel would allocate pages for user-space from "random" physical locations, it'd be much more hard to get big chunk of continuos memory in kernel after some time (for example after a week of system uptime). Memory would be too fragmented in this case.
To solve this problem kernel uses "readahead" scheme. When page fault occurs in an address space of some process, kernel allocates and maps more than one page (because there's possibility that process will read/write data from the next page). And of course it uses physically continuos block of memory (if possible) in this case. Just to reduce potential fragmentation.
A couple of that I can think of:
DMA hardware often accesses memory in terms of physical addresses. If you have multiple pages worth of data to transfer from hardware, you're going to need a contiguous chunk of physical memory to do so. Some older DMA controllers even require that memory to be located at low physical addresses.
It allows the OS to leverage large pages. Some memory management units allow you to use a larger page size in your page table entries. This allows you to use fewer page table entries (and TLB slots) to access the same quantity of virtual memory. This reduces the likelihood of a TLB miss. Of course, if you want to allocate a 4MB page, you're going to need 4MB of contiguous physical memory to back it.
Memory-mapped I/O. Some devices could be mapped to I/O ranges that require a contiguous range of memory that spans multiple frames.
Contiguous or Non-Contiguous Memory Allocation request from the kernel depends on your application.
E.g. of Contiguous memory allocation: If you require a DMA operation to be performed then you will be requesting the contiguous memory through kmalloc() call as DMA operation requires a memory which is also physically contiguous , as in DMA you will provide only the starting address of the memory chunk and the other device will read or write from that location.
Some of the operation do not require the contiguous memory so you can request a memory chunk through vmalloc() which gives the pointer to non contagious physical memory.
So it is entirely dependent on the application which is requesting the memory.
Please remember that it is a good practice that if you are requesting the contiguous memory than it should be need based only as kernel is trying best to allocation the memory which is physically contiguous.Well kmalloc() and vmalloc() has their limits also.
Placing things we are going to be reading a lot physically close together takes advantage of spacial locality, things we need are more likely to be cached.
Not sure about this one
I believe this means if pages are not contiguous, the TLB has to do more work to find out where they all are. If they are contigous, we can express all the pages for a processes as PAGES_START + PAGE_OFFSET. If they aren't, we need to store a seperate index for all of the pages of a given processes. Because the TLB has a finite size and we need to access more data, this means we will be swapping in and out a lot more.
kernel does not need physically contiguous pages actually it just needs efficencies ans stabilities.
monolithic kernel tends to have one page table for kernel space shared among processes
and does not want page faults on kernel space that makes kernel designs too complex
so usual implementations on 32 bit architecture is always 3g/1g split for 4g address space
for 1g kernel space, normal mappings of code and data should not generate recursive page faults that is too complex to manage:
you need to find empty page frames, create mapping on mmu, and handle tlb flush for new mappings on every kernel side page fault
kernel is already busy of doing user side page faults
furthermore, 1:1 linear mapping could have much less page table entries because it can utilize bigger size of page unit (>4kb)
less entries leads to less tlb misses.
so buddy allocator on kernel linear address space always provides physically contiguous page frames
even most codes doesn't need contiguous frames
but many device drivers which need contiguous page frames already believe that allocated buffers through general kernel allocator are physically contiguous

Resources