page size for architecture x86-64 - linux-kernel

I'm a bit confused about why a 2 KB page size would be too small for the x86-64 architecture. I know the usual page size is 4 KB, but I don't know how to explain why 2 KB is not ideal. Thanks in advance.
I said it would lead to larger page tables.

The ideal page size is a compromise between advantages and disadvantages. "Pages too small" creates too much overhead (in the form of more levels of tables, more memory consumed by page tables and higher TLB miss costs; for the same virtual address size) when converting virtual addresses into physical addresses. "Pages too big" creates too much memory wastage (e.g. if you need 123 bytes you have to round up to the page size so larger pages mean more wasted RAM).
For 80x86 this decision was made in the 1980s (for 80386), and 4 KiB pages were probably optimal at that time. Since then the benefits of keeping the page size the same (to avoid breaking existing software) have outweighed the benefits of changing it. If there were no existing software (and no larger page sizes) and the decision could be made again today, a page size closer to 16 KiB would probably be optimal.
However; due to the way "hierarchy of tables" works it's easy to introduce much larger page sizes. 80x86 has been doing this since the 1990s. For (modern) 80x86 these page sizes are 2 MiB and 1 GiB (and conceptually 512 GiB but it's not supported yet).
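As a rough illustration of how those larger sizes fall out of the paging hierarchy (a sketch assuming 4 KiB base pages and 512 entries, i.e. 9 address bits, per table level):

    #include <stdio.h>

    /* Sketch: with 4 KiB base pages and 512 entries (9 bits) per paging level,
     * ending the page walk one level earlier multiplies the page size by 512:
     * 4 KiB -> 2 MiB -> 1 GiB -> 512 GiB. */
    int main(void)
    {
        const char *level[] = { "PTE", "PDE", "PDPTE", "PML4E" };
        unsigned long long page = 4ULL * 1024;          /* 4 KiB base page */

        for (int i = 0; i < 4; i++) {
            printf("mapping at %-5s level: %llu KiB\n", level[i], page / 1024);
            page *= 512;                                /* one fewer level of tables */
        }
        return 0;
    }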
In addition; it's possible for a CPU to coalesce adjacent page table entries that happen to be compatible; to get the benefits of a larger page size for TLB efficiency (but without getting the benefits of lower memory consumption for page tables). AMD Zen started doing this a few years ago to create a "16 KiB coalesced page size". This also opens up the opportunity for an operating system to emulate a larger page size (e.g. tell user-space software that the page size is 16 KiB even though the real page size is 4 KiB) and get some of the benefits. This coalescing idea could also be applied to larger page sizes in future (e.g. if 4 adjacent 2 MiB pages happen to be compatible coalesce to get an "8 MiB coalesced page size").
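A minimal sketch of the kind of compatibility check that coalescing implies, whether done by the CPU or by an OS emulating a 16 KiB page size (the PTE layout and masks here are simplified assumptions, not a real CPU's format):

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_FLAGS_MASK 0xFFFULL            /* low bits: present, writable, ... (simplified) */
    #define PTE_ADDR_MASK  (~PTE_FLAGS_MASK)

    /* Hypothetical check: four adjacent 4 KiB entries can act as one 16 KiB
     * translation if they map physically contiguous, 16 KiB-aligned frames
     * with identical attributes. */
    static bool can_coalesce_16k(const uint64_t pte[4])
    {
        uint64_t base  = pte[0] & PTE_ADDR_MASK;
        uint64_t flags = pte[0] & PTE_FLAGS_MASK;

        if (base & (16 * 1024 - 1))                     /* must start on a 16 KiB boundary */
            return false;

        for (int i = 1; i < 4; i++) {
            if ((pte[i] & PTE_ADDR_MASK) != base + i * 4096ULL)
                return false;                           /* frames not contiguous */
            if ((pte[i] & PTE_FLAGS_MASK) != flags)
                return false;                           /* attributes differ */
        }
        return true;
    }

    int main(void)
    {
        /* Example: 4 contiguous frames starting at physical 0x40000, same flags. */
        uint64_t ptes[4] = { 0x40003, 0x41003, 0x42003, 0x43003 };
        return can_coalesce_16k(ptes) ? 0 : 1;
    }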
Of course the existence of multiple larger (real and coalesced) page sizes diminishes the benefits of increasing the size of the smallest page size (software that could benefit the most from a switch from 4 KiB pages to 16 KiB pages is probably already using 2 MiB pages anyway).


How to view paging system (demand paging) as another layer of cache?

I tried solving the following question
Consider a machine with 128 MiB (i.e. 2^27 bytes) of main memory and an MMU which has a page size of 8 KiB (i.e. 2^13 bytes). The operating system provides demand paging to a large spinning disk.
Viewing this paging system as another layer of caching below the processor’s last-level cache (LLC), answer following questions regarding the characteristics of this “cache”:
Line size in bytes? 2^13 (every page has 2^13 bytes)
Associativity? Fully associative
Number of lines in cache? 2^14 (Memory size / page size)
Tag size in bits? 14 (Number of lines in cache is 2^14 which gives us 14 bits for tag)
Replacement policy? I am not sure at all (maybe clock algorithm which approximates LRU)
Writeback or write-through? write back (It is not consistent with Disk at all times)
Write-allocate? yes, because after page fault we bring the page to memory for both writing and reading
Exclusivity/Inclusivity? I think non-inclusive and non-exclusive (NINE), maybe because memory-mapped files are partially in memory and partially in the swap file or ELF file (program text). For example, the stack of a process is only in memory, except when we run out of memory and send it to a swap file. Am I right?
I would be glad if someone checked my answers and helped me solve this correctly. Thanks! Sorry if this is not the place to ask this kind of question.
To start; your answers for line size, associativity and number of lines are right.
Tag size in bits? 14 (Number of lines in cache is 2^14 which gives us 14 bits for tag)
Tag size would be "location on disk / line size", plus some other bits (e.g. for managing the replacement policy). We don't know how big the disk is (other than "large").
However; it's possibly not unreasonable to work this out backwards - start from the assumption that the OS was designed to support a wide range of different hard disk sizes and that the tag is a nice "power of 2"; then assume that 4 bits are used for other purposes (e.g. "age" for LRU). If the tag is 32 bits (a very common "power of 2") then it would imply the OS could support a maximum disk size of 2 TiB ("1 << (32-4) * 8 KiB"), which is (keeping "future proofing" in mind) a little too small for an OS designed in the last 10 years or so. The next larger "power of 2" is 64 bits, which is very likely for modern hardware, but less likely for older hardware (e.g. 32-bit CPUs, smaller disks). Based on "128 MiB of RAM" I'd suspect that the hardware is very old (e.g. normal desktop/server systems started having more than 128 MiB in the late 1990s), so I'd go with "32 bit tag".
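Spelling that backwards calculation out (using the 8 KiB line size from the question and the assumed 4 bits of other state):

    32-bit tag - 4 bits for other purposes = 28 bits of "location on disk / line size"
    2^28 lines * 8 KiB per line = 2^28 * 2^13 bytes = 2^41 bytes = 2 TiB maximum disk size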
Replacement policy? I am not sure at all (maybe clock algorithm which approximates LRU)
Writeback or write-through? write back (It is not consistent with Disk at all times)
Write-allocate? yes, because after page fault we bring the page to memory for both writing and reading
There isn't enough information to be sure.
A literal write-through policy would be a performance disaster (imagine having to write 8 KiB to a relatively slow disk every time anything pushed data on the stack). A write back policy would be very bad for fault tolerance (e.g. if there's a power failure you'd lose far too much data).
This alone is enough to imply that it's some kind of custom design (neither strictly write-back nor strictly write-through).
To complicate things more; an OS could take into account "eviction costs". E.g. if the data in memory is already on the disk then the page can be re-used immediately, but if the data in memory has been modified then that page would have to be saved to disk before the memory can be re-used; so if/when the OS needs to evict data from cache to make room (e.g. for more recently used data) it'd be reasonable to prefer evicting an unmodified page (which is cheaper to evict). In addition; for spinning disks, it's common for an OS to optimize disk access to minimize head movement (where the goal is to reduce seek times and improve disk IO performance).
The OS might combine all of these factors when deciding when modified data is written to disk.
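As a sketch of the "prefer cheap evictions" idea described above (illustrative only, not any particular OS's actual policy; the structure and helper are made up for the example):

    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative victim selection: skip recently referenced pages, prefer a
     * clean page (re-usable immediately), fall back to a dirty page (must be
     * written back to disk before its memory can be re-used). */
    struct page {
        bool dirty;        /* modified since last written to disk */
        bool referenced;   /* touched recently (e.g. clock-algorithm bit) */
    };

    static struct page *pick_victim(struct page *pages, size_t n)
    {
        struct page *dirty_fallback = NULL;

        for (size_t i = 0; i < n; i++) {
            if (pages[i].referenced)
                continue;                       /* keep recently used pages */
            if (!pages[i].dirty)
                return &pages[i];               /* clean: cheapest to evict */
            if (!dirty_fallback)
                dirty_fallback = &pages[i];     /* remember a dirty candidate */
        }
        return dirty_fallback;                  /* NULL if everything was referenced */
    }

    int main(void)
    {
        struct page pages[3] = { {true, true}, {true, false}, {false, false} };
        return pick_victim(pages, 3) == &pages[2] ? 0 : 1;   /* picks the clean page */
    }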
Exclusivity/Inclusivity? I think non-inclusive and non-exclusive (NINE), maybe because memory-mapped files are partially in memory and partially in the swap file or ELF file (program text). For example, the stack of a process is only in memory, except when we run out of memory and send it to a swap file. Am I right?
If RAM is treated as a cache of disk, then the system is an example of single-level store (see https://en.wikipedia.org/wiki/Single-level_store ) and isn't a normal OS (with normal virtual memory - e.g. swap space and file systems). Typically systems that use single-level store are built around the idea of having "persistent objects" and do not have files at all. With this in mind; I don't think it's reasonable to make assumptions that would make sense for a normal operating system (e.g. assume that there are executable files, or that memory mapped files are supported, or that some part of the disk is "swap space" and another part of the disk is not).
However; I would assume that you're right about "non-inclusive and non exclusive (NINE)" - inclusive would be bad for performance (for the same reason write-through would be bad for performance) and exclusive would be very bad for fault tolerance (for the same reason that write-back is bad for fault tolerance).

Slowdown when accessing data at page boundaries?

(My question is related to computer architecture and performance understanding. Did not find a relevant forum, so post it here as a general question.)
I have a C program which accesses memory words that are located X bytes apart in virtual address space. For instance, for (int i=0;<some stop condition>;i+=X){array[i]=4;}.
I measure the execution time with varying values of X. Interestingly, when X is a power of 2 and is around the page size, e.g. X=1024, 2048, 4096, 8192..., I get a huge performance slowdown. But for all other values of X, like 1023 and 1025, there is no slowdown. The performance results are attached in the figure below.
I test my program on several personal machines, all are running Linux with x86_64 on Intel CPU.
What could be the cause of this slowdown? We have considered the DRAM row buffer, the L3 cache, etc., but those explanations do not seem to fit...
Update (July 11)
We did a little test by adding NOP instructions to the original code, and the slowdown is still there. This more or less rules out 4K aliasing; conflict cache misses look like the more likely cause here.
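For reference, a self-contained sketch of the kind of benchmark the question describes (the array size, stop condition and timing method are assumptions, not the asker's exact code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Touch one byte every 'stride' bytes across a buffer much larger than any
     * cache, and time it. Array size and stop condition are assumed values. */
    #define ARRAY_BYTES (256UL * 1024 * 1024)      /* 256 MiB */

    int main(int argc, char **argv)
    {
        size_t stride = (argc > 1) ? (size_t)atoll(argv[1]) : 4096;   /* X in bytes */
        volatile char *array = malloc(ARRAY_BYTES);  /* volatile: keep the stores */
        if (!array)
            return 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < ARRAY_BYTES; i += stride)
            array[i] = 4;                           /* one store per stride */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("stride=%zu bytes: %.6f s\n", stride, secs);
        free((void *)array);
        return 0;
    }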
There are two things here:
Set-associative cache aliasing creating conflict misses if you only touch multiple-of-4096 addresses. The inner, fast caches (L1 and L2) are normally indexed by a small range of bits from the physical address. So striding by 4096 bytes means those address bits are the same for all accesses, so you're only using one of the sets in L1d cache, and only a small number of sets in L2.
Striding by 1024 means you'd only be using 4 sets in L1d, with smaller powers of 2 using progressively more sets, but non-power-of-2 strides distributing over all the sets. (Intel CPUs have used 32 KiB 8-way associative L1d caches for a long time; 32K/8 = 4K per way. Ice Lake bumped it up to 48K 12-way, keeping the same indexing where the set depends only on bits below the page number. This is not a coincidence for VIPT caches that want to index in parallel with the TLB lookup.)
But with a non-power-of-2 stride, your accesses will be distributed over more sets in the cache. Performance advantages of powers-of-2 sized data? (answer describes this disadvantage)
Which cache mapping technique is used in intel core i7 processor? - shared L3 cache is resistant to aliasing from big power-of-2 offsets because it uses a more complex indexing function.
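To make the set-indexing arithmetic concrete, here is a small sketch using the cache parameters mentioned above (32 KiB, 8-way, 64-byte lines, so 64 sets); the simulation itself is just illustrative:

    #include <stdio.h>

    /* Simple model of set selection in a 32 KiB, 8-way, 64 B/line L1d (64 sets).
     * The set index comes from address bits [11:6], all below the 4 KiB page offset. */
    #define LINE_SIZE 64
    #define NUM_SETS  (32 * 1024 / (8 * LINE_SIZE))     /* = 64 */

    static unsigned set_index(unsigned long addr)
    {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    int main(void)
    {
        for (unsigned long stride = 1024; stride <= 4096; stride *= 2) {
            int used[NUM_SETS] = {0}, distinct = 0;
            /* Count how many distinct sets a given stride can ever touch. */
            for (unsigned long addr = 0; addr < 1024 * 1024; addr += stride)
                if (!used[set_index(addr)]++)
                    distinct++;
            printf("stride %4lu: touches %2d of %d sets\n", stride, distinct, NUM_SETS);
        }
        return 0;
    }

This prints 4 sets for a 1024-byte stride, 2 for 2048 and 1 for 4096, matching the description above.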
4k aliasing (e.g. in some Intel CPUs). Although with only stores this probably doesn't matter. It's mainly a factor for memory disambiguation, when the CPU has to quickly figure out if a load might be reloading recently-stored data, and it does so in the first pass by looking just at the page-offset bits.
This is probably not what's going on for you, but for more details see:
L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes and
Why are elementwise additions much faster in separate loops than in a combined loop?
Either or both of these effects could be a factor in Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?
Another possible factor is that HW prefetching stops at physical page boundaries. Why does the speed of memcpy() drop dramatically every 4KB? But changing a stride from 1024 to 1023 wouldn't help that by a big factor. "Next-page" prefetching in IvyBridge and later is only TLB prefetching, not data from the next page.
I kind of assumed x86 for most of this answer, but the cache aliasing / conflict-miss stuff applies generally. Set-associative caches with simple indexing are universally used for L1d caches. (Or on older CPUs, direct-mapped where each "set" only has 1 member). The 4k aliasing stuff might be mostly Intel-specific.
Prefetching across virtual page boundaries is likely also a general problem.

Is there a performance gain on x86 Windows/Linux apps in minimizing page count?

I'm writing and running software on Windows and Linux versions from the last five years, say, on machines from the last five years.
I believe the running processes will never get close to the amount of memory we have (we're typically using like 3GB out of 32GB) but I'm certain we'll be filling up the cache to overflowing.
I have a data structure modification which makes the data structure run a bit slower (that is, do quite a bit more pointer math) but reduces both the cache line count and page count (working set) of the application.
My understanding is that if processes' memory is being paged out to disk, reducing the page count and thus disk I/O is a huge win in terms of wall clock time and latency (e.g. to respond to network events, user events, timers, etc.) However with memory so cheap nowadays, I don't think processes using my data structure will ever page out of memory. So is there still an advantage in improving this? For example if I'm using 10,000 cache lines, does it matter whether they're spread over 157 4k pages (=643k) or 10,000 4k pages (=40MB)? Subquestion: even if you're nowhere near filling up RAM, are there cases a modern OS will page a running process out anyways? (I think Linux of the 90s might have done so to increase file caching, for instance.)
In contrast, I can also reduce cache lines, and I suspect this will probably reduce wall clock time quite a bit. Just to be clear on what I'm counting: I'm counting the number of 64-byte-aligned areas of memory in which I'm touching at least one byte as the total cache lines used by this structure. Example: if hypothetically I have 40,000 16-byte structures, a computer whose processes are constantly referring to more memory than it has L1 cache will probably run everything faster if those structures are packed into 10,000 64-byte cache lines instead of each straddling two cache lines apiece, for 80,000 cache lines?
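For reference, a small sketch of that accounting (the struct layout and sizes are illustrative, keyed to the 40,000 x 16-byte example above):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: a 16-byte element never straddles a 64-byte line as long as the
     * array base is 64-byte aligned, so 40,000 elements touch 10,000 lines
     * instead of up to 80,000. */
    struct item {
        double key;        /* 8 bytes */
        double value;      /* 8 bytes => 16 bytes total */
    };

    int main(void)
    {
        size_t n = 40000;
        /* aligned_alloc (C11): size must be a multiple of the alignment. */
        struct item *items = aligned_alloc(64, n * sizeof *items);
        if (!items)
            return 1;

        size_t lines = (n * sizeof *items + 63) / 64;   /* 64-byte lines touched */
        printf("%zu items of %zu bytes touch %zu cache lines\n",
               n, sizeof *items, lines);
        free(items);
        return 0;
    }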
The general recommendation is to minimize the number of cache lines and virtual pages that exhibit high temporal locality on the same core during the same time interval of the lifetime of a given application.
When the total number of needed physical pages is about to exceed the capacity of main memory, the OS will attempt to free up some physical memory by paging out some of the resident pages to secondary storage. This situation may result in major page faults, which can significantly impact performance. Even if you know with certainty that the system will not reach that point, there are other performance issues that may occur when unnecessarily using more virtual pages (i.e., there is wasted space in the used pages).
On Intel and AMD processors, page table entries and other paging structures are cached in hardware caches in the memory management unit to efficiently translate virtual addresses to physical addresses. These include the L1 and L2 TLBs. On an L2 TLB miss, a hardware component called the page walker is engaged to perform the required address translation. More pages means more misses. On pre-Broadwell (1) and pre-Zen microarchitectures, there can only be one outstanding page walk at any time. On later microarchitectures, there can only be two. In addition, on Intel Ivy Bridge and later, it may be more difficult for the TLB prefetcher to keep up with the misses.
On Intel and AMD processors, the L1D and L2 caches are designed so that all cache lines within the same 4K page are guaranteed to be mapped to different cache sets. So if all of the cache lines of a page are used, in contrast to, for example, spreading the same cache lines out over 10 different pages, conflict misses in these cache levels may be reduced. That said, on all AMD processors and on Intel pre-Haswell microarchitectures, bank conflicts are more likely to occur when accesses are more spread out across cache sets.
On Intel and AMD processors, hardware data prefetchers don't work across 4K boundaries. An access pattern that could be detected by one or more prefetchers but has the accesses spread out across many pages would benefit less from hardware prefetching.
On Windows/Intel, the accessed bits of the page table entries of all present pages are reset every second by the working set manager. So when accesses are unnecessarily spread out in the virtual address space, the number of page walks that require microcode assists (to set the accessed bit) per memory access may become larger.
The same applies to minor page faults as well (on Intel and AMD). Both Windows and Linux use demand paging, i.e., an allocated virtual page is only mapped to a physical page on demand. When a virtual page that is not yet mapped to a physical page is accessed, a minor page fault occurs. Just like with the accessed bit, the number of minor page faults per access may be larger.
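As an illustration of demand paging (a Linux-specific sketch; the page count and buffer size are arbitrary example values), first touches of freshly allocated anonymous pages show up as minor page faults in getrusage():

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    /* Sketch (Linux): touching each freshly allocated anonymous page for the
     * first time triggers a minor page fault; ru_minflt grows roughly by one
     * per page touched. */
    int main(void)
    {
        const size_t page = 4096, npages = 1000;
        volatile char *buf = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        struct rusage before, after;
        getrusage(RUSAGE_SELF, &before);
        for (size_t i = 0; i < npages; i++)
            buf[i * page] = 1;                 /* first touch of each page */
        getrusage(RUSAGE_SELF, &after);

        printf("minor faults during first touches: %ld\n",
               after.ru_minflt - before.ru_minflt);
        return 0;
    }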
On Intel processors, with more pages accessed, it becomes more likely for nearby accesses on the same logical core to 4K-alias. For more information on 4K aliasing, see: L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes.
There are probably other potential issues.
Subquestion: even if you're nowhere near filling up RAM, are there cases a modern OS will page a running process out anyways? (I think Linux of the 90s might have done so to increase file caching, for instance.)
On both Windows and Linux, there is a maximum working set size limit on each process. On Linux, this is called RLIMIT_RSS and is not enforced. On Windows, this limit can be either soft (the default) or hard. The limit is only enforced if it is hard (which can be specified by calling the SetProcessWorkingSetSizeEx function and passing the QUOTA_LIMITS_HARDWS_MIN_ENABLE flag). When a process reaches its hard working set limit, further memory allocation requests will be satisfied by paging out some of its pages to the page file, even if free physical memory is available.
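A minimal sketch of how a Windows process might make its working-set limits hard, per the description above (the 16 MiB / 64 MiB values are arbitrary examples, and the MAX flag is included here alongside the MIN flag mentioned above):

    #include <windows.h>
    #include <stdio.h>

    /* Sketch (Windows): request hard working-set limits for the current process.
     * The size values are arbitrary examples, not recommendations. */
    int main(void)
    {
        SIZE_T min_ws = 16 * 1024 * 1024;
        SIZE_T max_ws = 64 * 1024 * 1024;

        if (!SetProcessWorkingSetSizeEx(GetCurrentProcess(), min_ws, max_ws,
                                        QUOTA_LIMITS_HARDWS_MIN_ENABLE |
                                        QUOTA_LIMITS_HARDWS_MAX_ENABLE))
            printf("SetProcessWorkingSetSizeEx failed: %lu\n", GetLastError());
        return 0;
    }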
I don't know about Linux of the 90s.
Footnotes:
(1) The Intel optimization manual mentions in Section 2.2.3 that Skylake can do two page walks in parallel compared to one in Haswell and earlier microarchitectures. To my knowledge, the manual does not clearly mention whether Broadwell can do one or two page walks in parallel. However, according to slide 10 of these Intel slides (entitled "Intel Xeon Processor D: The First Xeon processor optimized for dense solutions"), Xeon Broadwell supports two page walks. I think this also applies to all Broadwell models.

Cache effects and importance of locality

I have read this blog and I am still unsure about the importance of locality. Why is locality important for cache performance? Is it because it leads to fewer cache misses? Furthermore, how is a program written in order to achieve good locality and hence good cache performance?
Caches are smaller, and usually much smaller, than the main memories they are associated with. For example, on x86 chips, the L1 cache is typically 32 KiB, while memory sizes of 32 GiB or larger are common, which is more than a million times larger.
Without spatial locality, memory requests would be uniformly distributed in the memory of the application, and then given the very large ratios between memory size and cache size, the chance of hitting in the cache would be microscopic (about one in one million for the example above). So the cache hit rate would be microscopic and the cache would be useless.
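As a concrete example of writing for locality (an illustrative sketch; the matrix size is an arbitrary choice), compare summing a large matrix row by row against summing it column by column:

    #include <stdio.h>
    #include <stdlib.h>

    /* Two ways to sum the same matrix. The row-by-row version walks memory
     * sequentially (good spatial locality); the column-by-column version jumps
     * N*sizeof(int) bytes between consecutive accesses (poor locality), which
     * typically causes far more cache misses for large N. */
    #define N 4096

    long sum_row_major(int (*m)[N])
    {
        long s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];          /* consecutive addresses */
        return s;
    }

    long sum_col_major(int (*m)[N])
    {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];          /* stride of N ints between accesses */
        return s;
    }

    int main(void)
    {
        int (*m)[N] = calloc(N, sizeof *m);   /* 64 MiB matrix */
        if (!m)
            return 1;
        printf("%ld %ld\n", sum_row_major(m), sum_col_major(m));
        free(m);
        return 0;
    }

Both functions compute the same sum, but the row-by-row version uses every byte of each fetched cache line before moving on, while the column-by-column version typically uses only 4 bytes of each line it pulls in before jumping away.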

Large memory block allocation and 4K blocks

Consider this quote from Mark Russinovich's books on Windows internals. It is about the large-page allocation mechanism, intended for allocating large non-paged memory blocks in physical memory:
http://books.google.com/books?id=CdxMRjJksScC&pg=PA194&lpg=PA194#v=onepage
Attempts to allocate large pages may fail after the operating system has been running for an extended period, because the physical memory for each large page must occupy a significant number (see Table 10-1) of physically contiguous small pages, and this extent of physical pages must furthermore begin on a large page boundary. (For example, physical pages 0 through 511 could be used as a large page on an x64 system, as could physical pages 512 through 1,023, but pages 10 through 521 could not.) Free physical memory does become fragmented as the system runs. This is not a problem for allocations using small pages but can cause large page allocations to fail.
If I understand this correctly, he's saying that fragmentation produced by scattered 4K pages can prevent successful allocation of large 2M pages in physical memory. But why? Ordinary 4K physical pages are easily relocatable and can also be easily swapped out. In other words, if we have a physical memory region not occupied by other 2M pages, we can always "clean it up": make it available by relocating any interfering 4K pages from that physical memory region to some other location. I.e. from the "naive" point of view, 2M allocations should "always succeed", as long as we have enough free physical RAM.
What is wrong with my logic? What exactly is Mark talking about when he says that physical memory fragmentation caused by 4K pages can prevent successful allocation of large pages?
It actually worked this way in Windows XP, but the cost was prohibitive and a design change in Vista disabled this approach. It is explained well in this blog post; I'll quote the essential part:
In Windows Vista, the memory manager folks recognized that these long delays made very large pages less attractive for applications, so they changed the behavior so requests for very large pages from applications went through the "easy parts" of looking for contiguous physical memory, but gave up before the memory manager went into desperation mode, preferring instead just to fail.
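For illustration, this is roughly what an application-level large-page request looks like on Windows (an API-usage sketch, not code from the book or blog; the allocation size is an arbitrary example, and the caller needs the SeLockMemoryPrivilege / "Lock pages in memory" right). On Vista and later the call either finds suitably aligned contiguous physical memory quickly or simply fails:

    #include <windows.h>
    #include <stdio.h>

    /* Sketch: request memory backed by large pages. The size (16 large pages)
     * is an arbitrary example. Requires the "Lock pages in memory" privilege. */
    int main(void)
    {
        SIZE_T large = GetLargePageMinimum();          /* typically 2 MiB on x64 */
        if (large == 0)
            return 1;                                  /* large pages not supported */

        void *p = VirtualAlloc(NULL, 16 * large,       /* size: multiple of 'large' */
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        if (p == NULL) {
            printf("large-page allocation failed: error %lu\n", GetLastError());
            return 1;
        }
        printf("allocated %lu bytes backed by large pages\n",
               (unsigned long)(16 * large));
        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }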
He's talking about a specific problem that exists with allocating and freeing contiguous memory blocks over time, and you're describing a solution. Nothing is wrong with your logic and that's roughly what the .NET Garbage Collector does to reduce memory fragmentation. You're spot on.
If you have 10 seats per row at a baseball game and seats 2, 4, 6, and 8 are taken (fragmented), you will never be able to get 3 seats in that row for you and your friends unless you ask someone to move (compacted).
There's nothing special about the 4k blocks he's describing.
