Windows Large Page Support other than 2MB? - windows

I have read that the Intel chips support up to 1 GB virtual memory page sizes. Using VirtualAlloc with MEM_LARGE_PAGES gets you 2MB pages. Is there any way to get a different page size? We are currently using Server 2008 R2, but are planning to upgrade to Server 2012.

Doesn't look like it; the Large Page Support docs provide no mechanism for choosing the size of the large pages. You're just required to make allocations whose size (and alignment, if explicitly requested) is a multiple of the minimum large page size.
I suppose it's theoretically possible that Windows could implement multiple large page sizes internally (GetLargePageMinimum only tells you the minimum size), but they aren't exposed at the API level. In practice, I'd expect diminishing returns from larger and larger pages; the overhead of TLB misses just won't matter as much once you're already reducing TLB pressure by several orders of magnitude.

In recent versions of Windows 10 (and in Windows 11 and later) it is finally possible to choose 1GB (as opposed to 2MB) pages to satisfy large allocations.
This is done by calling VirtualAlloc2 with a specific set of flags (you will need a recent SDK for the constants):
MEM_EXTENDED_PARAMETER extended {};
extended.Type = MemExtendedParameterAttributeFlags;
extended.ULong64 = MEM_EXTENDED_PARAMETER_NONPAGED_HUGE;
VirtualAlloc2 (GetCurrentProcess (), NULL, size,
               MEM_LARGE_PAGES | MEM_RESERVE | MEM_COMMIT,
               PAGE_READWRITE, &extended, 1);
If the 1GB page(s) cannot be allocated, the function fails.
It might not be necessary to explicitly request 1GB pages if your software already uses 2MB ones.
Quoting Windows Internals, Part 1, 7th Edition:
On Windows 10 version 1607 x64 and Server 2016 systems, large pages may also be mapped with huge pages, which are 1 GB in size. This is done automatically if the allocation size requested is larger than 1 GB, but it does not have to be a multiple of 1 GB. For example, an allocation of 1040 MB would result in using one huge page (1024 MB) plus 8 “normal” large pages (16 MB divided by 2 MB).
Side note:
Unfortunately the flags above only work for VirtualAlloc2, not for creating shared sections (CreateFileMapping2), where a new SEC_HUGE_PAGES flag also exists but always fails with ERROR_INVALID_PARAMETER. But again, given the quote above, Windows might be using 1GB pages transparently where appropriate anyway.

Related

Is there a limit on the number of hugepage entries that can be stored in the TLB

I'm trying to analyze the network performance boost that VMs get when they use hugepages. For this I configured the hypervisor to have several 1G hugepages (36) by changing the grub command line and rebooting, and when launching the VMs I made sure the hugepages were being passed on to them. On launching 8 VMs (each with two 1G hugepages) and running network throughput tests between them, I found that the throughput was drastically lower than when running without hugepages.

That got me wondering whether it had something to do with the number of hugepages I was using. Is there a limit on the number of 1G hugepages that can be referenced using the TLB, and if so, is it lower than the limit for regular sized pages? How do I find this information? In this scenario I was using an Ivy Bridge system; running the cpuid command, I saw something like
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xff: cache data is in CPUID 4
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xf0: 64 byte prefetching
0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries
Does it mean I can have only 4 1G hugepage mappings in the TLB at any time?
Yes, of course. Having an unbounded number of TLB entries would require an unbounded amount of physical space on the CPU die.
Every TLB in every architecture has an upper limit on the number of entries it can hold.
For the x86 case this number is less than what you probably expected: it is 4.
It was 4 in your Ivy Bridge and it is still 4 in my Kaby Lake, four generations later.
It's worth noting that 4 entries cover 4GiB of RAM (4x1GiB), which seems enough to handle networking if used properly.
Finally, TLBs are core resources, each core has its set of TLBs.
If you disable SMT (e.g. Intel Hyper Threading) or assign both threads on a core to the same VM, the VMs won't be competing for the TLB entries.
However each VM can only have at most 4xC huge page entries cached, where C is the number of cores dedicated to that VM.
The ability of the VM to fully exploit these entries depends on how the Host OS, the hypervisor and the guest OS work together, and on the memory layout of the guest application of interest (pages shared across cores have duplicated TLB entries in each core).
It's hard (almost impossible?) to use 1GiB pages transparently; I'm not sure how the hypervisor and the VM are going to use those pages - I'd say you need specific support for that, but I'm not sure.
As Peter Cordes noted, 1GiB pages use a single TLB level (and on Skylake there is apparently also a second-level TLB with 16 entries for 1GB pages).
A miss in the 1GiB TLB will result in a page walk so it's very important that all the software involved use page-aware code.

How do I increase memory limit (contiguous as well as overall) in Matlab r2012b?

I am using Matlab r2012b on win7 32-bit with 4GB RAM.
However, the memory limit on the Matlab process is pretty low. Running the memory command, I get the following output:
Maximum possible array: 385 MB (4.038e+08 bytes) *
Memory available for all arrays: 1281 MB (1.343e+09 bytes) **
Memory used by MATLAB: 421 MB (4.413e+08 bytes)
Physical Memory (RAM): 3496 MB (3.666e+09 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
I need to increase the limit to as much as possible.
System: Windows 7 32 bit
RAM: 4 GB
Matlab: r2012b
For general guidance with memory management in MATLAB, see this MathWorks article. Some specific suggestions follow.
Set the /3GB boot option to increase the user-mode address space available to MATLAB; on Windows 7 this is done with bcdedit /set IncreaseUserVa 3072 rather than by editing boot.ini as on older Windows versions. This is mentioned in this section of the above MathWorks page.
Also use pack to increase the Maximum possible array by compacting the memory. 32-bit MATLAB needs contiguous blocks of free address space for arrays, which is where this first value comes from. The pack command saves all the variables, clears the workspace, and reloads them so that they are contiguous in memory.
For overall memory, try disabling the Java virtual machine (start MATLAB with the -nojvm switch), closing other programs, and stopping unnecessary Windows services. There is no easy answer for this part.

Page boundaries, implementing memory pool

I have decided to reinvent the wheel for a millionth time and write my own memory pool. My only question is about page size boundaries.
Let's say a GetSystemInfo() call tells me that the page size is 4096 bytes. Now, I want to preallocate a memory area of 1MB (could be smaller, or larger), and divide this area into 128 byte blocks. HeapAlloc()/VirtualAlloc() will have an overhead of between 8 and 16 bytes, I guess. Might be more; I've read posts mentioning 60 bytes.
Question is, do I need to pay attention to not to have one of my 128 byte blocks across page boundaries?
Do I simply allocate 1MB in one chunk and divide it into my block size?
Or should I allocate many blocks of, say, 4000 bytes (to take into account HeapAlloc() overhead), and sub-divide this 4000 bytes into 128 byte blocks (4000 / 128 = 31 blocks, 128 bytes each) and not use the remaining bytes at all (4000 - 31x128 = 32 bytes in this example)?
Having a block cross a page boundary isn't a huge deal. It just means that if you try to access that block and it's completely swapped out, you'll get two page faults instead of one. The more important thing to worry about is the alignment of the block.
If you're using your small block to hold a structure that contains native types longer than 1 byte, you'll want to align it, otherwise you face potentially abysmal performance that will outweigh any performance gains you may have made by pooling.
The Windows kernel-mode pool allocator ExAllocatePool describes its behaviour as follows:
If NumberOfBytes is PAGE_SIZE or greater, a page-aligned buffer is allocated. Memory allocations of PAGE_SIZE or less do not cross page boundaries. Memory allocations of less than PAGE_SIZE are not necessarily page-aligned but are aligned to 8-byte boundaries in 32-bit systems and to 16-byte boundaries in 64-bit systems.
That's probably a reasonable model to follow.
I'm generally of the idea that larger is better when it comes to a pool. Within reason, of course, and depending on how you are going to use it. I don't see anything wrong with allocating 1 MB at a time (I've made pools that grow in 100 MB chunks). You want it to be worthwhile to have the pool in the first place. That is, have enough data in the same contiguous region of memory that you can take full advantage of cache locality.
I've found out that if I use _aligned_malloc(), I don't need to worry whether spreading a sub-block across two pages makes any difference or not. An answer by Freddie to another thread (How to Allocate memory from a new virtual page in C?) also helped. Thanks Harry Johnston, I just wanted to use it as a memory pool object.

Fortran array memory management

I am working to optimize a fluid flow and heat transfer analysis program written in Fortran. As I try to run larger and larger mesh simulations, I'm running into memory limitation problems. The mesh, though, is not all that big: only 500,000 cells, small peanuts for a typical CFD code. Even when I request 80 GB of memory for my problem, it's crashing due to insufficient virtual memory.
I have a few guesses at what arrays are hogging up all that memory. One in particular is being allocated to (28801,345600). Correct me if I'm wrong in my calculations, but a double precision value is 8 bytes. So the size of this array would be 28801*345600*8 = 79.6 GB?
Now, I think that most of this array ends up being zeros throughout the calculation so we don't need to store them. I think I can change the solution algorithm to only store the non-zero values to work on in a much smaller array. However, I want to be sure that I'm looking at the right arrays to reduce in size. So first, did I correctly calculate the array size above? And second, is there a way I can have Fortran show array sizes in MB or GB during runtime? In addition to printing out the most memory intensive arrays, I'd be interested in seeing how the memory requirements of the code are changing during runtime.
Memory usage is a quite vaguely defined concept on systems with virtual memory. You can have large amounts of memory allocated (large virtual memory size) but only a small part of it actually being actively used (small resident set size - RSS).
Unix systems provide the getrusage(2) system call that returns information about the amount of system resources in use by the calling thread/process/process children. In particular it provides the maximum value of the RSS reached since the process was started. You can write a simple Fortran-callable helper C function that calls getrusage(2) and returns the value of the ru_maxrss field of the rusage structure.
If you are running on Linux and don't care about portability, then you may just open and read from /proc/self/status. It is a simple text pseudofile that among other things contains several lines with statistics about the process virtual memory usage:
...
VmPeak: 9136 kB
VmSize: 7896 kB
VmLck: 0 kB
VmHWM: 7572 kB
VmRSS: 6316 kB
VmData: 5224 kB
VmStk: 88 kB
VmExe: 572 kB
VmLib: 1708 kB
VmPTE: 20 kB
...
Explanation of the various fields - here. You are mostly interested in VmData, VmRSS, VmHWM and VmSize. You can open /proc/self/status as a regular file with OPEN() and process it entirely in your Fortran code.
See also what memory limitations are set with ulimit -a and ulimit -aH. You may be exceeding the hard virtual memory size limit. If you are submitting jobs through a distributed resource manager (e.g. SGE/OGE, Torque/PBS, LSF, etc.) check that you request enough memory for the job.

In what circumstances can large pages produce a speedup?

Modern x86 CPUs support larger page sizes than the legacy 4K (i.e. 2MB or 4MB), and there are OS facilities (Linux, Windows) for accessing this functionality.
The Microsoft link above states large pages "increase the efficiency of the translation buffer, which can increase performance for frequently accessed memory". Which isn't very helpful in predicting whether large pages will improve any given situation. I'm interested in concrete, preferably quantified, examples of where moving some program logic (or a whole application) to use huge pages has resulted in some performance improvement. Anyone got any success stories?
There's one particular case I know of myself: using huge pages can dramatically reduce the time needed to fork a large process (presumably because the number of page table entries needing copying is reduced by a factor on the order of 1000). I'm interested in whether huge pages can also be a benefit in less exotic scenarios.
The biggest difference in performance will come when you are doing widely spaced random accesses to a large region of memory -- where "large" means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).
To make things more complex, the number of TLB entries for 4kB pages is often larger than the number of entries for 2MB pages, but this varies a lot by processor. There is also a lot of variation in how many "large page" entries are available in the Level 2 TLB.
For example, on an AMD Opteron Family 10h Revision D ("Istanbul") system, cpuid reports:
L1 DTLB: 4kB pages: 48 entries; 2MB pages: 48 entries; 1GB pages: 48 entries
L2 TLB: 4kB pages: 512 entries; 2MB pages: 128 entries; 1GB pages: 16 entries
While on an Intel Xeon 56xx ("Westmere") system, cpuid reports:
L1 DTLB: 4kB pages: 64 entries; 2MB pages: 32 entries
L2 TLB: 4kB pages: 512 entries; 2MB pages: none
Both can map 2MB (512*4kB) using small pages before suffering level 2 TLB misses, while the Westmere system can map 64MB using its 32 2MB TLB entries and the AMD system can map 352MB using the 176 2MB TLB entries in its L1 and L2 TLBs. Either system will get a significant speedup by using large pages for random accesses over memory ranges that are much larger than 2MB and less than 64MB. The AMD system should continue to show good performance using large pages for much larger memory ranges.
What you are trying to avoid in all these cases is the worst case (note 1) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (note 2) work, it requires:
5 trips to memory to load data mapped on a 4kB page,
4 trips to memory to load data mapped on a 2MB page, and
3 trips to memory to load data mapped on a 1GB page.
In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information.
The best description I have seen is in Section 5.3 of AMD's "AMD64 Architecture Programmer’s Manual Volume 2: System Programming" (publication 24593) http://support.amd.com/us/Embedded_TechDocs/24593.pdf
Note 1: The figures above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.
Note 2: Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.
I tried to contrive some code which would maximise thrashing of the TLB with 4k pages in order to examine the gains possible from large pages. The stuff below runs 2.6 times faster (than 4K pages) when 2MByte pages are provided by libhugetlbfs's malloc (Intel i7, 64-bit Debian Lenny); hopefully it's obvious what scoped_timer and random0n do.
volatile char force_result;

const size_t mb=512;
const size_t stride=4096;
std::vector<char> src(mb<<20,0xff);
std::vector<size_t> idx;
for (size_t i=0;i<src.size();i+=stride) idx.push_back(i);
random0n r0n(/*seed=*/23);
std::random_shuffle(idx.begin(),idx.end(),r0n);

{
  scoped_timer t("TLB thrash random",mb/static_cast<float>(stride),"MegaAccess");
  char hash=0;
  for (size_t i=0;i<idx.size();++i)
    hash=(hash^src[idx[i]]);
  force_result=hash;
}
A simpler "straight line" version with just hash=hash^src[i] only gained 16% from large pages, but (wild speculation) Intel's fancy prefetching hardware may be helping the 4K case when accesses are predictable (I suppose I could disable prefetching to investigate whether that's true).
I've seen improvement in some HPC/Grid scenarios - specifically physics packages with very, very large models on machines with lots and lots of RAM. Also the process running the model was the only thing active on the machine. I suspect, though have not measured, that certain DB functions (e.g. bulk imports) would benefit as well.
Personally, I think that unless you have a very well profiled/understood memory access profile and it does a lot of large memory access, it is unlikely that you will see any significant improvement.
This is getting esoteric, but Huge TLB pages make a significant difference on the Intel Xeon Phi (MIC) architecture when doing DMA memory transfers (from Host to Phi via PCIe). This Intel link describes how to enable huge pages. I found increasing DMA transfer sizes beyond 8 MB with normal TLB page size (4K) started to decrease performance, from about 3 GB/s to under 1 GB/s once the transfer size hit 512 MB.
After enabling huge TLB pages (2MB), the data rate continued to increase to over 5 GB/s for DMA transfers of 512 MB.
I get a ~5% speedup on servers with a lot of memory (>=64GB) running big processes.
e.g. for a 16GB java process, that's 4M x 4kB pages but only 4k x 4MB pages.