In what circumstances can large pages produce a speedup? - performance

Modern x86 CPUs support page sizes larger than the legacy 4K (i.e. 2MB or 4MB), and there are OS facilities (Linux, Windows) to access this functionality.
Microsoft's documentation states that large pages "increase the efficiency of the translation buffer, which can increase performance for frequently accessed memory", which isn't very helpful in predicting whether large pages will improve any given situation. I'm interested in concrete, preferably quantified, examples of where moving some program logic (or a whole application) to use huge pages has resulted in some performance improvement. Anyone got any success stories?
There's one particular case I know of myself: using huge pages can dramatically reduce the time needed to fork a large process (presumably because the number of page table entries needing copying is reduced by a factor on the order of 1000). I'm interested in whether huge pages can also be a benefit in less exotic scenarios.

The biggest difference in performance will come when you are doing widely spaced random accesses to a large region of memory -- where "large" means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).
To make things more complex, the number of TLB entries for 4kB pages is often larger than the number of entries for 2MB pages, but this varies a lot by processor. There is also a lot of variation in how many "large page" entries are available in the Level 2 TLB.
For example, on an AMD Opteron Family 10h Revision D ("Istanbul") system, cpuid reports:
L1 DTLB: 4kB pages: 48 entries; 2MB pages: 48 entries; 1GB pages: 48 entries
L2 TLB: 4kB pages: 512 entries; 2MB pages: 128 entries; 1GB pages: 16 entries
While on an Intel Xeon 56xx ("Westmere") system, cpuid reports:
L1 DTLB: 4kB pages: 64 entries; 2MB pages: 32 entries
L2 TLB: 4kB pages: 512 entries; 2MB pages: none
Both can map 2MB (512*4kB) using small pages before suffering level 2 TLB misses, while the Westmere system can map 64MB using its 32 2MB TLB entries and the AMD system can map 352MB using the 176 2MB TLB entries in its L1 and L2 TLBs. Either system will get a significant speedup by using large pages for random accesses over memory ranges that are much larger than 2MB and less than 64MB. The AMD system should continue to show good performance using large pages for much larger memory ranges.
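The reach figures above are just entry count times page size. A minimal sketch of that arithmetic, using the cpuid entry counts quoted above (the helper name tlb_reach_bytes is made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;

// Address range that `entries` TLB entries can map when each entry
// covers one page of `page_bytes` bytes.
inline std::uint64_t tlb_reach_bytes(std::uint64_t entries, std::uint64_t page_bytes) {
    // Page sizes are powers of two (4kB, 2MB, 1GB on x86-64).
    assert(page_bytes != 0 && (page_bytes & (page_bytes - 1)) == 0);
    return entries * page_bytes;
}

// Entry counts taken from the cpuid output quoted above.
inline std::uint64_t small_page_reach()   { return tlb_reach_bytes(512, 4 * KiB); }      // L2 TLB, both systems
inline std::uint64_t westmere_2mb_reach() { return tlb_reach_bytes(32, 2 * MiB); }       // L1 DTLB only
inline std::uint64_t istanbul_2mb_reach() { return tlb_reach_bytes(48 + 128, 2 * MiB); } // L1 + L2
```

A working set that fits inside the 2MB-page reach but not inside the 4kB-page reach is where the gap between the two page sizes should be largest.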
What you are trying to avoid in all these cases is the worst case (note 1) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (note 2) work, it requires:
5 trips to memory to load data mapped on a 4kB page,
4 trips to memory to load data mapped on a 2MB page, and
3 trips to memory to load data mapped on a 1GB page.
In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information.
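To make the trip counts concrete, here is a rough sketch of how a 48-bit x86-64 virtual address splits into the four table indices, and how many memory accesses result if nothing is cached. The type and function names are invented for this example; the bit positions are those of standard 4-level paging.

```cpp
#include <cassert>
#include <cstdint>

// Indices into the four page-table levels, plus the offset within a 4kB page.
struct WalkIndices {
    unsigned pml4;    // bits 47..39
    unsigned pdpt;    // bits 38..30 (a 1GB mapping ends the walk here)
    unsigned pd;      // bits 29..21 (a 2MB mapping ends the walk here)
    unsigned pt;      // bits 20..12 (a 4kB mapping ends the walk here)
    unsigned offset;  // bits 11..0
};

inline WalkIndices split_virtual_address(std::uint64_t va) {
    return {
        static_cast<unsigned>((va >> 39) & 0x1ff),
        static_cast<unsigned>((va >> 30) & 0x1ff),
        static_cast<unsigned>((va >> 21) & 0x1ff),
        static_cast<unsigned>((va >> 12) & 0x1ff),
        static_cast<unsigned>(va & 0xfff),
    };
}

enum class PageSize { Small4K, Large2M, Huge1G };

// Memory accesses with no translation caching at all:
// one per table level walked, plus one for the data itself.
inline int uncached_memory_trips(PageSize size) {
    int levels_walked = 0;
    switch (size) {
        case PageSize::Small4K: levels_walked = 4; break;
        case PageSize::Large2M: levels_walked = 3; break;
        case PageSize::Huge1G:  levels_walked = 2; break;
    }
    assert(levels_walked != 0);
    return levels_walked + 1;
}
```

Each larger page size simply stops the walk one level earlier, which is where the 5 / 4 / 3 figures come from.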
The best description I have seen is in Section 5.3 of AMD's "AMD64 Architecture Programmer’s Manual Volume 2: System Programming" (publication 24593) http://support.amd.com/us/Embedded_TechDocs/24593.pdf
Note 1: The figures above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.
Note 2: Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.

I tried to contrive some code which would maximise thrashing of the TLB with 4k pages in order to examine the gains possible from large pages. The code below runs 2.6 times faster (than with 4K pages) when 2MB pages are provided by libhugetlbfs's malloc (Intel i7, 64-bit Debian Lenny); hopefully it's obvious what scoped_timer and random0n do.
    volatile char force_result;
    const size_t mb=512;
    const size_t stride=4096;
    std::vector<char> src(mb<<20,0xff);
    std::vector<size_t> idx;
    for (size_t i=0;i<src.size();i+=stride) idx.push_back(i);
    random0n r0n(/*seed=*/23);
    std::random_shuffle(idx.begin(),idx.end(),r0n);
    {
        scoped_timer t
            ("TLB thrash random",mb/static_cast<float>(stride),"MegaAccess");
        char hash=0;
        for (size_t i=0;i<idx.size();++i)
            hash=(hash^src[idx[i]]);
        force_result=hash;
    }
A simpler "straight line" version with just hash=hash^src[i] only gained 16% from large pages, but (wild speculation) Intel's fancy prefetching hardware may be helping the 4K case when accesses are predictable (I suppose I could disable prefetching to investigate whether that's true).
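Since scoped_timer and random0n aren't shown, here is a self-contained sketch of the same random-stride test for anyone who wants to reproduce it. It is not the original code: it swaps in std::mt19937 with std::shuffle and std::chrono::steady_clock, and the names tlb_thrash and ThrashResult are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

struct ThrashResult {
    char hash;             // XOR of every byte touched; use it so the loop isn't optimised away
    double seconds;        // wall-clock time of the access loop only
    std::size_t accesses;  // number of distinct pages (strides) touched
};

// Touch one byte per `stride` across `bytes` of memory, in shuffled order.
inline ThrashResult tlb_thrash(std::size_t bytes, std::size_t stride, unsigned seed) {
    assert(stride > 0 && bytes >= stride);
    std::vector<char> src(bytes, static_cast<char>(0xff));

    // One index per stride, visited in random order to defeat prefetching.
    std::vector<std::size_t> idx;
    idx.reserve(bytes / stride);
    for (std::size_t i = 0; i < bytes; i += stride) idx.push_back(i);
    std::mt19937 rng(seed);
    std::shuffle(idx.begin(), idx.end(), rng);

    const auto start = std::chrono::steady_clock::now();
    char hash = 0;
    for (std::size_t i : idx) hash = static_cast<char>(hash ^ src[i]);
    const auto stop = std::chrono::steady_clock::now();

    return {hash, std::chrono::duration<double>(stop - start).count(), idx.size()};
}
```

Calling tlb_thrash(512u << 20, 4096, 23) mirrors the parameters above. To compare page sizes, run the same binary with and without a huge-page-backed allocator (libhugetlbfs was what I used); the timings will of course depend on the machine.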

I've seen improvement in some HPC/Grid scenarios - specifically physics packages with very, very large models on machines with lots and lots of RAM. Also the process running the model was the only thing active on the machine. I suspect, though have not measured, that certain DB functions (e.g. bulk imports) would benefit as well.
Personally, I think that unless you have a very well profiled/understood memory access profile and it does a lot of large memory access, it is unlikely that you will see any significant improvement.

This is getting esoteric, but Huge TLB pages make a significant difference on the Intel Xeon Phi (MIC) architecture when doing DMA memory transfers (from Host to Phi via PCIe). This Intel link describes how to enable huge pages. I found increasing DMA transfer sizes beyond 8 MB with normal TLB page size (4K) started to decrease performance, from about 3 GB/s to under 1 GB/s once the transfer size hit 512 MB.
After enabling huge TLB pages (2MB), the data rate continued to increase to over 5 GB/s for DMA transfers of 512 MB.

I get a ~5% speedup on servers with a lot of memory (>=64GB) running big processes.
e.g. for a 16GB Java process, that's 4M x 4kB pages but only 4k x 4MB pages.
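The page counts can be checked with a one-line helper (pages_needed is just an illustrative name). I've included 2MB as well, since that is the usual large page size on x86-64:

```cpp
#include <cassert>
#include <cstdint>

// Number of pages of `page_bytes` needed to map `bytes`, rounding up.
inline std::uint64_t pages_needed(std::uint64_t bytes, std::uint64_t page_bytes) {
    assert(page_bytes > 0);
    return (bytes + page_bytes - 1) / page_bytes;
}

constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kMiB = 1024 * kKiB;
constexpr std::uint64_t kGiB = 1024 * kMiB;

inline std::uint64_t java_heap_4k_pages() { return pages_needed(16 * kGiB, 4 * kKiB); }
inline std::uint64_t java_heap_2m_pages() { return pages_needed(16 * kGiB, 2 * kMiB); }
inline std::uint64_t java_heap_4m_pages() { return pages_needed(16 * kGiB, 4 * kMiB); }
```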

Related

Can all of L2/L3 cache be used by data? If so, why does the Graviton 3 bandwidth plot drop off after half the L2/L3 size, but only gradually?

Consider Graviton3, for example. It's a 64-core CPU with per-core caches 64KiB L1d and 1MiB L2. And a shared L3 of 64MiB across all cores. The RAM bandwidth per socket is 307GB/s (source).
In this plot (source), we see that all-cores bandwidth drops to roughly half when the data exceeds 4MB. This makes sense: 64 x 64KiB = 4 MiB is the total size of the L1 data caches.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there? The private L2 caches of 64 cores is a total of 64 MiB, same as the shared L3 size.
It looks from the plot like they may not have tested any sizes between 32M and 64M. Looks like a straight line between those points on all 3 CPUs.
Since 64M is the total size of both L2 and L3, I'd expect a test like this to have slowed most of the way down at 64M. As Brendan says, page tables and a bit of code will take space, competing with the actual intended test data. If the benchmark loop is tight, stack won't come into play, except for interrupt handling.
Once you're evicting anything from a working set slightly larger than cache, you often evict almost everything before getting back to it, depending on pseudo-LRU luck. I'd expect a test size of 48 or even 56 MiB to be a lot closer to the 32 MiB data point than the 64 MiB data point.
Can all of L2/L3 cache be used by data?
In theory, yes; but only if there's no "non-data" (code) in the cache, only if you count "all data" (and don't just count a process' data and ignore things like stack and page tables), and only if there isn't any aliasing problems.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there?
For a fully associative cache I'd expect a sudden drop off at/near 32 MiB. However, large caches are almost never fully associative, as it costs way too much to find anything in the cache.
As associativity decreases the chance of conflicts increases. For example, for an 8-way associative 64 MiB cache the pathological case is that everything conflicts and you're only able to effectively use 8 MiB of it.
More specifically, for a 64 MiB cache (with unknown associativity), and an "assumed Linux" environment that lacks support for cache coloring, it's reasonable to expect a smooth drop off that ends at 64 MiB.
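To illustrate why lower associativity means more conflicts, here is a sketch of how a simple set-associative cache picks a set from a physical address. The geometry used in the example (64-byte lines, 16 ways) is an assumption for illustration only, since the real associativity is unknown here, and real L3 caches may hash the index rather than use plain modulo.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cache geometry; none of these values are confirmed for Graviton 3.
struct CacheGeometry {
    std::uint64_t size_bytes;
    std::uint64_t line_bytes;
    std::uint64_t ways;
};

inline std::uint64_t set_count(const CacheGeometry& c) {
    assert(c.line_bytes > 0 && c.ways > 0);
    assert(c.size_bytes % (c.line_bytes * c.ways) == 0);
    return c.size_bytes / (c.line_bytes * c.ways);
}

// Plain modulo indexing: consecutive lines go to consecutive sets.
inline std::uint64_t set_index(const CacheGeometry& c, std::uint64_t phys_addr) {
    return (phys_addr / c.line_bytes) % set_count(c);
}

// Addresses this far apart land in the same set and compete for its `ways` slots.
inline std::uint64_t conflict_stride(const CacheGeometry& c) {
    return set_count(c) * c.line_bytes;
}
```

With a 64 MiB, 16-way cache, physical addresses 4 MiB apart share a set, so without cache coloring the OS's choice of physical pages decides how evenly the sets fill up.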
Just to be clear, on a running Graviton 3 in AWS, an lscpu gives me 32MiB for L3 and not 64 MiB.
Caches (sum of all):
L1d: 4 MiB (64 instances)
L1i: 4 MiB (64 instances)
L2: 64 MiB (64 instances)
L3: 32 MiB (1 instance)
The original question assumes an L3 of 64 MiB across all cores:
"But why does the next cliff begin at 32MB? And why is the drop-off so gradual there? The private L2 caches of 64 cores is a total of 64 MiB, same as the shared L3 size."

Relation between computer architecture and cache block size

Suppose memory is byte addressable and the cache block size is 4 bytes, so one cache access reads one block. Does that mean the computer architecture is 32-bit? My question is: what can you infer about the computer architecture if you are given the cache block size?
No, usually cache block size is larger than the register width, to take advantage of spatial locality between nearby full-register-width loads / stores, which is typical. Making cache as fine-grained as 4-byte chunks costs a large amount of overhead (tags and so on) compared to the amount of storage needed for the actual data. e.g. 20 tag bits, plus "dirty" and other MESI state per 32-bit cache line, might mean that a 32 KiB (usable space) cache needs more like 56 KiB of raw SRAM storage, and that's without considering ECC or parity.
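The 56 KiB figure can be reproduced with a quick calculation. In this sketch the 4 bits of per-line state are my assumption (chosen because 32 + 20 + 4 = 56 bits per line matches the number above), and raw_sram_bytes is a name invented for the example:

```cpp
#include <cassert>
#include <cstdint>

// Raw storage needed for a cache with `usable_bytes` of data, where each
// line carries `tag_bits` of tag and `state_bits` of dirty/MESI-style state.
// ECC and parity are ignored, as in the estimate above.
inline std::uint64_t raw_sram_bytes(std::uint64_t usable_bytes, std::uint64_t line_bytes,
                                    std::uint64_t tag_bits, std::uint64_t state_bits) {
    assert(line_bytes > 0 && usable_bytes % line_bytes == 0);
    const std::uint64_t lines = usable_bytes / line_bytes;
    const std::uint64_t bits_per_line = line_bytes * 8 + tag_bits + state_bits;
    return lines * bits_per_line / 8;
}

// 32 KiB of data in 4-byte lines: 8192 lines of 56 bits each.
inline std::uint64_t tiny_line_cache()  { return raw_sram_bytes(32 * 1024, 4, 20, 4); }
// Same data in 64-byte lines, keeping the same tag/state widths for comparison.
inline std::uint64_t usual_line_cache() { return raw_sram_bytes(32 * 1024, 64, 20, 4); }
```

With 4-byte lines the metadata nearly doubles the storage; with 64-byte lines the same tag and state bits add under 5%.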
If a CPU has a floating-point unit, it can often do 64-bit loads/stores, even if the integer register width is only 32-bit. (Or even wider with SIMD, or load-pair / store-pair instructions.)
Typical real-world cache line sizes are 64 bytes on modern systems, and formerly 32 bytes on earlier CPUs like Pentium III. 64 bytes is the DDR SDRAM burst size, so it's a good choice for the size of off-chip memory accesses. (Recent Intel systems with AVX-512 SIMD can load/store a whole 64-byte (512-bit) cache line with a single instruction, though. SIMD vector width has caught up to cache line size. But integer accesses are still at most 8 bytes wide.)
There's no relationship between cache block size and architecture bitness. You definitely want the block size to be at least as wide as a normal load / store, but it would be possible to build a 64-bit machine with 32-bit cache blocks. That would mean 64-bit loads take two cache accesses to do it, so it would be a really bad idea unless your usual workload consisted of using 64-bit addresses in registers to access scattered 32-bit values, and you wanted to optimize for that without caring about efficiency of anything else.
Most 64-bit ISAs can work with 32 or 64-bit data equally efficiently. Some, notably x86-64, don't even have what you'd call a "word size". There's no one native access size that's most efficient on x86-64, and instructions are an unaligned byte stream, not like ISAs with aligned 32-bit instruction words like RISC-V or AArch64.
So if you knew that the cache block size was 32-bit, it would be a good guess that the register width was at most 32-bit, but could be 8 or 16-bit. (Or 4-bit or possibly even 6-bit or something? With sizes smaller than 32-bit, for historical CPUs it often becomes a question of what one means by bitness: ALU, register, bus, fixed-width instruction? Notice that in earlier parts of the answer, I just talked about register width, not "32-bit CPU".)
If this was a real commercial design instead of a computer science example, an 8-bit machine would be the most likely; a normal 32-bit machine would use larger cache blocks but you could plausibly imagine finer granularity on a machine that could only load 1 byte at a time. (Of course, being an 8-bit machine doesn't imply that restriction; you could have a load-pair instruction, or FP registers that allow 32-bit or 64-bit loads/stores.)

Is there a limit on the number of hugepage entries that can be stored in the TLB

I'm trying to analyze the network performance boost that VMs get when they use hugepages. For this I configured the hypervisor with several (36) 1G hugepages by changing the grub command line and rebooting, and when launching the VMs I made sure the hugepages were being passed on to them. On launching 8 VMs (each with two 1G hugepages) and running network throughput tests between them, I found that the throughput was drastically lower than when running without hugepages. That got me wondering whether it had something to do with the number of hugepages I was using. Is there a limit on the number of 1G hugepages that can be referenced using the TLB, and if so, is it lower than the limit for regular-sized pages? How do I find this information? In this scenario I was using an Ivy Bridge system, and the cpuid command showed something like:
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xff: cache data is in CPUID 4
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xf0: 64 byte prefetching
0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries
Does it mean I can have only 4 1G hugepage mappings in the TLB at any time?
Yes, of course. Having an unbounded upper limit in the number of TLB entries would require an unbounded amount of physical space in the CPU die.
Every TLB in every architecture has an upper limit on the number of entries it can hold.
For the x86 case this number is less than what you probably expected: it is 4.
It was 4 in your Ivy Bridge and it is still 4 in my Kaby Lake, four generations later.
It's worth noting that 4 entries cover 4GiB of RAM (4x1GiB), which seems enough to handle networking if properly used.
Finally, TLBs are core resources, each core has its set of TLBs.
If you disable SMT (e.g. Intel Hyper Threading) or assign both threads on a core to the same VM, the VMs won't be competing for the TLB entries.
However each VM can only have at most 4xC huge page entries cached, where C is the number of cores dedicated to that VM.
The ability of the VM to fully exploit these entries depends on how the host OS, the hypervisor and the guest OS work together, and on the memory layout of the guest application of interest (pages shared across cores have duplicated TLB entries in each core).
It's hard (almost impossible?) to use 1GiB pages transparently, and I'm not sure how the hypervisor and the VM are going to use those pages. I'd say you need specific support for that, but I'm not sure.
As Peter Cordes noted, 1GiB pages use a single-level TLB (and in Skylake, apparently there is also a second level TLB with 16 entries for 1GB pages).
A miss in the 1GiB TLB will result in a page walk so it's very important that all the software involved use page-aware code.

Windows Large Page Support other than 2MB?

I have read that the Intel chips support up to 1 GB virtual memory page sizes. Using VirtualAlloc with MEM_LARGE_PAGES gets you 2MB pages. Is there any way to get a different page size? We are currently using Server 2008 R2, but are planning to upgrade to Server 2012.
Doesn't look like it, the Large Page Support docs provide no mechanism for defining the size of the large pages. You're just required to make allocations that have a size (and alignment if explicitly requested) that are multiples of the minimum large page size.
I suppose it's theoretically possible that Windows could implement multiple large page sizes internally (the API function only tells you the minimum size), but they don't expose it at the API level. In practice, I'd expect diminishing returns for larger and larger pages; the overhead of TLB cache misses just won't matter as much when you're already reducing the TLB usage by several orders of magnitude.
In recent versions of Windows 10 (or 11 and later) it is finally possible to choose 1GB (as opposed to 2MB) pages to satisfy large allocations.
This is done by calling VirtualAlloc2 with specific set of flags (you will need recent SDK for the constants):
    MEM_EXTENDED_PARAMETER extended {};
    extended.Type = MemExtendedParameterAttributeFlags;
    extended.ULong64 = MEM_EXTENDED_PARAMETER_NONPAGED_HUGE;
    VirtualAlloc2 (GetCurrentProcess (), NULL, size,
                   MEM_LARGE_PAGES | MEM_RESERVE | MEM_COMMIT,
                   PAGE_READWRITE, &extended, 1);
If the 1GB page(s) cannot be allocated, the function fails.
It might not be necessary to explicitly request 1GB pages if your software already uses 2MB ones.
Quoting Windows Internals, Part 1, 7th Edition:
On Windows 10 version 1607 x64 and Server 2016 systems, large pages may also be mapped with huge pages, which are 1 GB in size. This is done automatically if the allocation size requested is larger than 1 GB, but it does not have to be a multiple of 1 GB. For example, an allocation of 1040 MB would result in using one huge page (1024 MB) plus 8 “normal” large pages (16 MB divided by 2 MB).
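The arithmetic in that quote can be sketched as follows. This only mirrors the example numbers in the quote; it is not how the Windows memory manager is actually implemented, and split_allocation and PageSplit are names invented for the illustration.

```cpp
#include <cassert>
#include <cstdint>

struct PageSplit {
    std::uint64_t huge_1g;   // number of 1 GB huge pages
    std::uint64_t large_2m;  // number of 2 MB large pages for the remainder
};

// Split a large-page allocation into 1 GB and 2 MB pages, following the quoted
// rule that huge pages are only used when the request is larger than 1 GB.
inline PageSplit split_allocation(std::uint64_t bytes) {
    constexpr std::uint64_t large = 2ULL << 20;  // 2 MB
    constexpr std::uint64_t huge  = 1ULL << 30;  // 1 GB
    assert(bytes % large == 0);  // large-page allocations are multiples of 2 MB
    const std::uint64_t huge_pages = (bytes > huge) ? bytes / huge : 0;
    const std::uint64_t remainder  = bytes - huge_pages * huge;
    return {huge_pages, remainder / large};
}
```

For the 1040 MB example this gives one huge page plus eight large pages, matching the book's numbers.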
Side note:
Unfortunately the flags above only work for VirtualAlloc2, not for creating shared sections (CreateFileMapping2), where a new flag SEC_HUGE_PAGES also exists but always fails with ERROR_INVALID_PARAMETER. But again, given the quote, Windows might be using 1GB pages transparently where appropriate anyway.

How to avoid TLB miss (and high Global Memory Replay Overhead) in CUDA GPUs?

The title might be more specific than my actual problem is, although I believe answering this question would solve a more general problem, which is: how to decrease the effect of high latency (~700 cycle) that comes from random (but coalesced) global memory access in GPUs.
In general if one accesses the global memory with coalesced load (eg. I read 128 consecutive bytes), but with very large distance (256KB-64MB) between coalesced accesses, one gets a high TLB (Translation Lookaside Buffer) miss rate. This high TLB miss rate is due to the limited number (~512) and size (~4KB) of the memory pages used in the TLB lookup table.
I suspect a high TLB miss rate because NVIDIA uses virtual memory, because the profiler reports a high (98%) Global Memory Replay Overhead and low throughput (45GB/s, with a K20c), and because partition camping has not been an issue since Fermi.
Is it possible to avoid high TLB miss rate somehow? Would 3D texture cache help if I'm accessing a (X x Y x Z) cube coalesced along X dimension and with a X*Y "stride" along the Z dimension?
Any comment on this topic is appreciated.
Constraints: 1) global data can not be reordered/transposed; 2) kernel is communication bound.
You can only avoid TLB misses by changing your memory access pattern. A different layout of your data in memory can help with this.
A 3D texture will not improve your situation, as it trades improved spatial locality in two additional dimensions against reduced spatial locality in the third dimension. Thus you would unnecessarily read data of neighbors along the Y axis.
What you can do however is mitigate the impact of the resulting latency on throughput. In order to hide t = 700 cycles of latency at a global memory bandwidth of b = 250GB/s, you need to have memory transactions for b * t = 175 KB of data in flight at any time (taking a cycle as roughly 1 ns), or 12.5 KB for each of the 14 SMX. With a fully loaded memory interface and a high ratio of TLB misses, you will however find that latency gets closer to 2000 cycles, requiring roughly 32 KB of transactions in flight per SMX.
As each word of a memory read transaction in flight requires one register where the value will be stored once it arrives, hiding memory latency has to be balanced against register pressure. Keeping 32 KB of data in flight requires 8192 registers, or 12.5% of the total registers available on an SMX.
(Note that for above rough estimates I have neglected the difference between KiB and KB).
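For reference, the estimate is bandwidth times latency (Little's law). A host-side sketch of the numbers, where the 1 GHz clock is my assumption (it is what makes 700 cycles come out at about 700 ns), 65536 is the number of 32-bit registers per SMX, and the function names are made up:

```cpp
#include <cassert>
#include <cstdint>

// Bytes that must be in flight to keep a memory interface busy:
// bandwidth (bytes/s) * latency (s), with latency given in clock cycles.
inline std::uint64_t bytes_in_flight(std::uint64_t bandwidth_bytes_per_s,
                                     std::uint64_t latency_cycles,
                                     std::uint64_t clock_hz) {
    assert(clock_hz > 0);
    return bandwidth_bytes_per_s * latency_cycles / clock_hz;
}

// One 32-bit register is reserved per 4-byte word still in flight.
inline std::uint64_t registers_for(std::uint64_t bytes) {
    return bytes / 4;
}

constexpr std::uint64_t kBandwidth     = 250000000000ULL;  // 250 GB/s
constexpr std::uint64_t kAssumedClock  = 1000000000ULL;    // 1 GHz (assumption)
constexpr std::uint64_t kSmxCount      = 14;
constexpr std::uint64_t kRegsPerSmx    = 65536;

inline std::uint64_t total_in_flight()   { return bytes_in_flight(kBandwidth, 700, kAssumedClock); }
inline std::uint64_t per_smx_in_flight() { return total_in_flight() / kSmxCount; }
```

At 2000 cycles the same formula gives about 35.7 KB per SMX, which is the "roughly 32 KB" above; 32 KiB in flight is 8192 registers, one eighth of the 65536 available.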

Resources