Allocate more than 2 GB of dma common buffer - memory-management

I am developing a driver for PCI Express in Windows environment.
I use Windows7 and Windows10 and the HW is i7-7700K, RAM: 16GBytes.
There is no problem in using up to 2GBytes buffer allocated so far.
But, more than 2GB is not allocated.
Here is the code snippet that succeeded in allocating 2GB dma common buffer.
DEVICE_DESCRIPTION dd;
RtlZeroMemory(&dd, sizeof(dd));
dd.Version = DEVICE_DESCRIPTION_VERSION;
dd.InterfaceType = InterfaceTypeUndefined;
dd.MaximumLength = 0x200000; // 2MB DMA transfer size at a time
dd.Dma32BitAddresses = FALSE;
dd.Dma64BitAddresses = TRUE;
dd.Master = TRUE;
pdx->AdapterObject = IoGetDmaAdapter(pdx->Pdo, &dd, &nMapRegisters);
pdx->vaCommonBuffer = (*pdx->AdapterObject->DmaOperations->AllocateCommonBuffer)
(pdx->AdapterObject, 0x80000000, &pdx->paCommonBuffer, FALSE);
What is the size limit for the DMA common buffer allocation and why?
If we changing length 0x80000000 (2GB) to 0xC0000000 (3GB) above causes the buffer allocation to fail.
How can I allocate up to 4GB and more than 4GB dma common buffer using AllocateCommonBuffer()?
Thank you very much for your valuable time and comments in advance.
Respectfully,
KJ

I believe that DMAv2 (which is the best Windows 7 supports) cannot do more than 4GB in a single allocation, so there is a hard limit.
But even before you hit the hard limit, you're going to start seeing nondeterministic allocation failures, because AllocateCommonBuffer gives you physically contiguous pages.
As a thought experiment: all it takes is 8 unmovable pages placed at pathologically-inconvenient locations before you can't find 2GB of contiguous pages out of 16GB of RAM. And a real system will have a lot more than 8 unmovable pages, although hopefully they aren't placed at such unfortunate intervals.
Compounding matters, DMAv2 refuses to allocate memory that straddles a 4GB boundary (for historical reasons). So of the 3,670,017 different pages that this allocation could start at, 50% of them are not even considered. At 3GB, 75% of possible allocations are not considered.
We've put a lot of work into the DMA subsystem over the years: DMAv3 is a more powerful API, with fewer weird quirks. Its implementation in Windows 10 and later has some fragmentation-resistance features, and the kernel's memory manager is better at moving pages. That doesn't eliminate the fundamental problem of fragmentation; but it does make it statistically less likely.
Very recent versions of the OS can actually take advantage of the IOMMU (if available and enabled) to further mitigate the problem: you don't need the physical pages to be contiguous anymore, if we can just use the IOMMU to make them seem contiguous to the device. (An MMU is exactly why usermode apps don't need to worry about physical memory fragmentation when they allocate massive buffers via malloc).
At the end of the day, though, you simply can't assume that the system has any 2 contiguous pages of memory for your device, especially if you need your device to operate on a wide variety of systems. I've seen production systems with oodles of RAM routinely fail to allocate 64KB of common buffer, which is "merely" 16 contiguous pages. You will need to either:
Fall back to using many, smaller allocations. They won't be contiguous between them, but you'll be more successful at allocating them.
Fall back to a smaller buffer. For example, in on my home turf of networking, a NIC can use a variety of buffer sizes, and the only visible effect of a smaller buffer might just be degraded throughput.
Ask the user to reboot the device (which is a very blunt way to defragment memory!).
Try to allocate memory once, early in boot, (before memory is fragmented), and never let go of it.
Update the OS, update to DMAv3 API, and ensure you have an IOMMU.

Related

Will Windows be still able to allocate memory when free space in physical memory is very low?

On Windows 32-bit system the application is being developed using Visual Studio:
Lets say lots of other application running on my machine and they have occupied almost all of physical memory and only 1 MB memory is left free. If my application (which has not yet allocated any memory) tries to allocate, say 2 MB, will the call be successful?
My guess: In theory, each Windows application has 2GB of virtual memory available.
So I believe this call should be successful (regardless how much physical memory is available). But I am not sure on this. That's why asking here.
Windows gives a rock-hard guarantee that this will always work. A process can only allocate virtual memory when Windows can commit space in the paging file for the allocation. If necessary, it will grow the paging file to make the space available. If that fails, for example when the paging file grows beyond the preset limit, then the allocation fails as well. Windows doesn't have the equivalent of the Linux "OOM killer", it does not support over-committing that may require an operating system to start randomly killing processes to find RAM.
Do note that the "always works" clause does have a sting. There is no guarantee on how long this will take. In very extreme circumstances the machine can start thrashing where just about every memory access in the running processes causes a page fault. Code execution slows down to a crawl, you can lose control with the mouse pointer frozen when Explorer or the mouse or video driver start thrashing as well. You are well past the point of shopping for RAM when that happens. Windows applies quotas to processes to prevent them from hogging the machine, but if you have enough processes running then that doesn't necessarily avoid the problem.
Of course. It would be lousy design if memory had to be wasted now in order to be used later. Operating systems constantly re-purpose memory to its most advantageous use at any moment. They don't have to waste memory by keeping it free just so that it can be used later.
This is one of the benefits of virtual memory with a page file. Because the memory is virtual, the system can allocate more virtual memory than physical memory. Virtual memory that cannot fit in physical memory, is pushed out to the page file.
So the fact that your system may be using all of the physical memory does not mean that your program will not be able to allocate memory. In the scenario that you describe, your 2MB memory allocation will succeed. If you then access that memory, the virtual memory will be paged in to physical memory and very likely some other pages (maybe in your process, maybe in another process) will be pushed out to the page file.
Well, it will succeed as long as there's some memory for it - apart from physical memory, there's also the page file.
However, once you reach the limit of both RAM and the page file, you're done for and that's when the out of memory situation really starts being fun.
Now, systems like Windows Vista will try to use all of your available RAM, pretty much for caching. That's a good thing, and when there's a request for memory from an application, the cache will be thrown away as needed.
As for virtual memory, you can request much more than you have available, regardless of your RAM or page file size. Only when you commit the memory does it actually need some backing - either RAM or the page file. On 64-bit, you can easily request terabytes of virtual memory - that doesn't mean you'll get it when you try to commit it, though :P
If your application is unable to allocate a physical memory (RAM) block to store information, the operating system takes over and 'pages' or stores sections that are in RAM on disk to free up physical memory so that your program is able to perform the allocation. This is done automatically and is completely invisible to your applications.
So, in your example, on a system that has 1MB RAM free, if your application tries to allocate memory, the operating system will page certain contents of physical memory to disk and free up RAM for your application. Your application will not crash in this case.
This, obviously is much more complicated than that.
There are several ways to configure a page file on Windows (fixed size, variable size and on which disk). If you run out of physical memory, and out of hard drive space (because your page file has grown very large due to excessive 'paging') or reach the limit of your paging file (if it is a static limit) then your applications will fail due out an out-of-memory exception. With today's systems with large local storage however, this is a rare event.
Be sure to read about paging for the full picture. Check out:
http://en.wikipedia.org/wiki/Paging
In certain cases, you will notice that you have sufficient free physical memory. Say 100MB and your program tries to allocate a 10MB block to store a large object but fails. This is caused by physical memory fragmentation. Although the total free memory is 100MB, there is no single contiguous block of 10MB that can be used to store your object. This will result in an exception that needs to be handled in your code. If you allocate large objects in your code you may want to separate the allocation into smaller blocks to facilitate allocation, and then aggregate them back in your code logic. For example, instead of having a single 10m vector, you can declare 10 x 1m vectors in an array and allocate memory for each individual one.

Large memory block allocation and 4K blocks

Consider this quote from Mark Russiniovich's books on Windows internals. This is about large-page allocation mechanism, intended for allocating large non-paged memory blocks in physical memory
http://books.google.com/books?id=CdxMRjJksScC&pg=PA194&lpg=PA194#v=onepage
Attempts to allocate large pages may fail after the operating system
has been running for an extended period, because the physical memory
for each large page must occupy a significant number (see Table 10-1)
of physically contiguous small pages, and this extent of physical
pages must furthermore begin on a large page boundary. (For example,
physical pages 0 through 511 could be used as a large page on an x64
system, as could physical pages 512 through 1,023, but pages 10
through 521 could not.) Free physical memory does become fragmented as
the system runs. This is not a problem for allocations using small
pages but can cause large page allocations to fail.
If I understand this correctly, he's saying that fragmentation produced by scattered 4K pages can prevent successful allocation of large 2M pages in physical memory. But why? Ordinary 4K physical pages are easily relocatable and can also be easily swapped out. In other words, if we have a physical memory region not occupied by other 2M pages, we can always "clean it up": make it available by relocating any interfering 4K pages from that physical memory region to some other location. I.e. from the "naive" point of view, 2M allocations should "always succeed", as long as we have enough free physical RAM.
What is wrong with my logic? What exactly is Mark talking about when he says that physical memory fragmentation caused by 4K pages can prevent successful allocation of large pages?
It actually worked this way in Windows XP. But the cost was too prohibitive and a design change in Vista disabled this approach. Explained well in this blog post, I'll quote the essential part:
In Windows Vista, the memory manager folks recognized that these long delays made very large pages less attractive for applications, so they changed the behavior so requests for very large pages from applications went through the "easy parts" of looking for contiguous physical memory, but gave up before the memory manager went into desperation mode, preferring instead just to fail.
He's talking about a specific problem that exists with allocating and freeing contiguous memory blocks over time, and you're describing a solution. Nothing is wrong with your logic and that's roughly what the .NET Garbage Collector does to reduce memory fragmentation. You're spot on.
If you have 10 seats per row at a baseball game and seats 2, 4, 6, and 8 are taken (fragmented), you will never be able to get 3 seats in that row for you and your friends unless you ask someone to move (compacted).
There's nothing special about the 4k blocks he's describing.

Is contiguous memory easier to get in a 64-bit address space? If so why?

A comment in this blog states:
We know how to make chunked heaps, but there would be some overhead to
using them. We have more requests for faster storage management than
we do for larger heaps in the 32-bit JVM. If you really want large
heaps, switch to the 64-bit JVM. We still need contiguous memory,
but it's much easier to get in a 64-bit address space.
This implication of the above statement is that it is easier to get contiguous memory in a 64-bit address space. Is this true? If so why?
That's very true. A process must allocate memory from the virtual memory address space. Which stores both code and data and whose size is restricted by the addressing capability of the architecture. You can never address more than 2^32 bytes in a 32-bit process, not counting bank-switching tricks. That's 4 gigabytes. The operating system typically takes a big chunk out of that as well, on 32-bit Windows for example that cuts down the addressable VM size to 2 gigabytes.
Ideally, allocations are made so that they fit snugly together. That very rarely works out in practice. Shared libraries or DLLs in particular need to pick a preferred load address and that has to be guessed up front when the library is built.
So in practice, the allocations are made from the holes in between existing ones and the largest possible contiguous allocation you can get is restricted by the size of the largest hole. Usually much smaller than the addressable VM size, on Windows it is typically around 650 megabytes. That tends to go down-hill from there as the available address space is getting fragmented by allocations. Particularly by native code that can't afford to have allocations moved by a compacting garbage collector. If you use Windows then you can get insight in the VM allocations with the SysInternals' VMMap utility.
This problem completely disappears in a 64-bit process. The theoretical addressable virtual memory size is 2^64, an enormous number. So large that current processors don't implement it, they can go up to 2^48. Further restricted by the operating system version you have and its willingness to keep page mapping tables for that much VM. Eight terabytes is a typical limit. By implication, the holes between allocations are huge. Your program will keel over on paging file thrashing before it dies from OOM.
I can't speak for how the JVM is implemented obviously, but from a purely theoretical viewpoint, if you have a significantly larger virtual address space (eg 64-bit as compared with 32-bit) it should be significantly easier to find a large block of contiguous memory which is available for allocation (going to extremes - you've got no chance of finding a contiguous 4GB of free memory in a 32-bit address space, but a significant chance of finding this space in a full 64-bit address space).
It should be noted that whatever the virtual address space size, this is still going to be implemented by allocation of (probably) non-contiguous physical memory pages, particularly if the requested allocation is large - the larger virtual address space just means there are a likely to be a lot more contiguous virtual addresses available for use.

Why is the kernel concerned about issuing PHYSICALLY contiguous pages?

When a process requests physical memory pages from the Linux kernel, the kernel does its best to provide a block of pages that are physically contiguous in memory. I was wondering why it matters that the pages are PHYSICALLY contiguous; after all, the kernel can obscure this fact by simply providing pages that are VIRTUALLY contiguous.
Yet the kernel certainly tries its hardest to provide pages that are PHYSICALLY contiguous, so I'm trying to figure out why physical contiguity matters so much. I did some research &, across a few sources, uncovered the following reasons:
1) makes better use of the cache & achieves lower avg memory access times (GigaQuantum: I don’t understand: how?)
2) you have to fiddle with the kernel page tables in order to map pages that AREN’T physically contiguous (GigaQuantum: I don’t understand this one: isn’t each page mapped separately? What fiddling has to be done?)
3) mapping pages that aren’t physically contiguous leads to greater TLB thrashing (GigaQuantum: I don’t understand: how?)
Per the comments I inserted, I don't really understand these 3 reasons. Nor did any of my research sources adequately explain/justify these 3 reasons. Can anyone explain these in a little more detail?
Thanks! Will help me to better understand the kernel...
The main answer really lies in your second point. Typically, when memory is allocated within the kernel, it isn't mapped at allocation time - instead, the kernel maps as much physical memory as it can up-front, using a simple linear mapping. At allocation time it just carves out some of this memory for the allocation - since the mapping isn't changed, it has to already be contiguous.
The large, linear mapping of physical memory is efficient: both because large pages can be used for it (which take up less space for page table entries and less TLB entries), and because altering the page tables is a slow process (so you want to avoid doing this at allocation/deallocation time).
Allocations that are only logically linear can be requested, using the vmalloc() interface rather than kmalloc().
On 64 bit systems the kernel's mapping can encompass the entireity of physical memory - on 32 bit systems (except those with a small amount of physical memory), only a proportion of physical memory is directly mapped.
Actually the behavior of memory allocation you describe is common for many OS kernels and the main reason is kernel physical pages allocator. Typically, kernel has one physical pages allocator that is used for allocation of pages for both kernel space (including pages for DMA) and user space. In kernel space you need continuos memory, because it's expensive (for in-kernel code) to map pages every time you need them. On x86_64, for example, it's completely worthless because kernel can see the whole address space (on 32bit systems there's 4G limitation of virtual address space, so typically top 1G are dedicated to kernel and bottom 3G to user-space).
Linux kernel uses buddy algorithm for page allocation, so that allocation of bigger chunk takes fewer iterations than allocation of smaller chunk (well, smaller chunks are obtained by splitting bigger chunks). Moreover, using of one allocator for both kernel space and user space allows the kernel to reduce fragmentation. Imagine that you allocate pages for user space by 1 page per iteration. If user space needs N pages, you make N iterations. What happens if kernel wants some continuos memory then? How can it build big enough continuos chunk if you stole 1 page from each big chunk and gave them to user space?
[update]
Actually, kernel allocates continuos blocks of memory for user space not as frequently as you might think. Sure, it allocates them when it builds ELF image of a file, when it creates readahead when user process reads a file, it creates them for IPC operations (pipe, socket buffers) or when user passes MAP_POPULATE flag to mmap syscall. But typically kernel uses "lazy" page loading scheme. It gives continuos space of virtual memory to user-space (when user does malloc first time or does mmap), but it doesn't fill the space with physical pages. It allocates pages only when page fault occurs. The same is true when user process does fork. In this case child process will have "read-only" address space. When child modifies some data, page fault occurs and kernel replaces the page in child address space with a new one (so that parent and child have different pages now). Typically kernel allocates only one page in these cases.
Of course there's a big question of memory fragmentation. Kernel space always needs continuos memory. If kernel would allocate pages for user-space from "random" physical locations, it'd be much more hard to get big chunk of continuos memory in kernel after some time (for example after a week of system uptime). Memory would be too fragmented in this case.
To solve this problem kernel uses "readahead" scheme. When page fault occurs in an address space of some process, kernel allocates and maps more than one page (because there's possibility that process will read/write data from the next page). And of course it uses physically continuos block of memory (if possible) in this case. Just to reduce potential fragmentation.
A couple of that I can think of:
DMA hardware often accesses memory in terms of physical addresses. If you have multiple pages worth of data to transfer from hardware, you're going to need a contiguous chunk of physical memory to do so. Some older DMA controllers even require that memory to be located at low physical addresses.
It allows the OS to leverage large pages. Some memory management units allow you to use a larger page size in your page table entries. This allows you to use fewer page table entries (and TLB slots) to access the same quantity of virtual memory. This reduces the likelihood of a TLB miss. Of course, if you want to allocate a 4MB page, you're going to need 4MB of contiguous physical memory to back it.
Memory-mapped I/O. Some devices could be mapped to I/O ranges that require a contiguous range of memory that spans multiple frames.
Contiguous or Non-Contiguous Memory Allocation request from the kernel depends on your application.
E.g. of Contiguous memory allocation: If you require a DMA operation to be performed then you will be requesting the contiguous memory through kmalloc() call as DMA operation requires a memory which is also physically contiguous , as in DMA you will provide only the starting address of the memory chunk and the other device will read or write from that location.
Some of the operation do not require the contiguous memory so you can request a memory chunk through vmalloc() which gives the pointer to non contagious physical memory.
So it is entirely dependent on the application which is requesting the memory.
Please remember that it is a good practice that if you are requesting the contiguous memory than it should be need based only as kernel is trying best to allocation the memory which is physically contiguous.Well kmalloc() and vmalloc() has their limits also.
Placing things we are going to be reading a lot physically close together takes advantage of spacial locality, things we need are more likely to be cached.
Not sure about this one
I believe this means if pages are not contiguous, the TLB has to do more work to find out where they all are. If they are contigous, we can express all the pages for a processes as PAGES_START + PAGE_OFFSET. If they aren't, we need to store a seperate index for all of the pages of a given processes. Because the TLB has a finite size and we need to access more data, this means we will be swapping in and out a lot more.
kernel does not need physically contiguous pages actually it just needs efficencies ans stabilities.
monolithic kernel tends to have one page table for kernel space shared among processes
and does not want page faults on kernel space that makes kernel designs too complex
so usual implementations on 32 bit architecture is always 3g/1g split for 4g address space
for 1g kernel space, normal mappings of code and data should not generate recursive page faults that is too complex to manage:
you need to find empty page frames, create mapping on mmu, and handle tlb flush for new mappings on every kernel side page fault
kernel is already busy of doing user side page faults
furthermore, 1:1 linear mapping could have much less page table entries because it can utilize bigger size of page unit (>4kb)
less entries leads to less tlb misses.
so buddy allocator on kernel linear address space always provides physically contiguous page frames
even most codes doesn't need contiguous frames
but many device drivers which need contiguous page frames already believe that allocated buffers through general kernel allocator are physically contiguous

Why is Available Physical Memory (dwAvailPhys) > Available Virtual Memory (dwAvailVirtual) in call GlobalMemoryStatus on Windows Vista x64

I am playing with an MSDN sample to do memory stress testing (see: http://msdn.microsoft.com/en-us/magazine/cc163613.aspx) and an extension of that tool that specifically eats physical memory (see http://www.donationcoder.com/Forums/bb/index.php?topic=14895.0;prev_next=next). I am obviously confused though on the differences between Virtual and Physical Memory. I thought each process has 2 GB of virtual memory (although I also read 1.5 GB because of "overhead". My understanding was that some/all/none of this virtual memory could be physical memory, and the amount of physical memory used by a process could change over time (memory could be swapped out to disc, etc.)I further thought that, in general, when you allocate memory, the operating system could use physical memory or virtual memory. From this, I conclude that dwAvailVirtual should always be equal to or greater than dwAvailPhys in the call GlobalMemoryStatus. However, I often (always?) see the opposite. What am I missing.
I apologize in advance if my question is not well formed. I'm still trying to get my head around the whole memory management system in Windows. Tutorials/Explanations/Book recs are most welcome!
Andrew
That was only true in the olden days, back when RAM was expensive. The operating system maps pages of virtual memory to RAM as needed. If there isn't enough RAM to satisfy a program's request, it starts unmapping pages to make room. If such a page contains data instead of code, it gets written to the paging file. Whenever the program accesses that page again, it generates a paging fault, letting the operating system read the page back from disk.
If the machine has little RAM and lots of processes consuming virtual memory pages, that can cause a very unpleasant effect called "thrashing". The operating system is constantly accessing the disk and machine performance slows down to a crawl.
More RAM means less disk access. There's very little reason not to use 3 or 4 GB of RAM on a 32-bit operating system, it's cheap. Even if you do not get to use all 4 GB, not all of it will be addressable due hardware devices taking space on the address bus (video, mostly). But that won't change the size of the virtual memory accessible by user code, it is still 2 Gigabytes.
Windows Internals is a good book.
The amount of virtual memory is limited by size of the address space - which is 4GB per process on a 32-bit system. And you have to subtract from this the size of regions reserved for system use and the amount of VM used already by your process (including all the libraries mapped to its address space).
On the other hand, the total amount of physical memory may be higher than the amount of virtual memory space the system has left free for your process to use (and these days it often is).
This means that if you have more than ~2GB or RAM, you can't use all your physical memory in one process (since there's not enough virtual memory space to map it to), but it can be used by many processes. Note that this limitation is removed in a 64-bit system.
I don't know if this is your issue, but the MSDN page for the GlobalMemoryStatus function contains the following warning:
On computers with more than 4 GB of memory, the GlobalMemoryStatus function can return incorrect information, reporting a value of –1 to indicate an overflow. For this reason, applications should use the GlobalMemoryStatusEx function instead.
Additionally, that page says:
On Intel x86 computers with more than 2 GB and less than 4 GB of memory, the GlobalMemoryStatus function will always return 2 GB in the dwTotalPhys member of the MEMORYSTATUS structure. Similarly, if the total available memory is between 2 and 4 GB, the dwAvailPhys member of the MEMORYSTATUS structure will be rounded down to 2 GB. If the executable is linked using the /LARGEADDRESSAWARE linker option, then the GlobalMemoryStatus function will return the correct amount of physical memory in both members.
Since you're referring to members like dwAvailPhys instead of ullAvailPhys, it sounds like you're using a MEMORYSTATUS structure instead of a MEMORYSTATUSEX structure. I don't know the consequences of that on a 64-bit platform, but on a 32-bit platform that definitely could cause incorrect memory sizes to be reported.

Resources