What pitfalls should I be wary of when memory mapping BIG files?

I have a bunch of big files; each file can be over 100 GB, the total amount of data can be 1 TB, and they are all read-only files (I only do random reads).
My program does small reads in these files on a computer with about 8GB main memory.
In order to increase performance (no seek() and no buffer copying), I thought about using memory mapping and basically memory-mapping the whole 1 TB of data.
Although it sounds crazy at first, since main memory is much smaller than the data on disk, with some insight into how virtual memory works you should see that on 64-bit machines this should not be a problem.
All the pages read from disk to satisfy my reads will be considered "clean" by the OS, as they are never overwritten. This means all these pages can go straight onto the list of pages the OS can reuse, without being written back to disk or swapped out. In other words, the operating system could keep just the most recently used pages in physical memory and perform a disk read only when a page is not already in main memory.
This would mean no swapping and no increase in i/o because of the huge memory mapping.
This is theory; what I'm looking for is anyone who has ever tried or used such an approach for real in production and can share their experience: are there any practical issues with this strategy?

What you are describing is correct. With a 64-bit OS you can map 1TB of address space to a file and let the OS manage reading and writing to the file.
You didn't mention what CPU architecture you are on, but on most of them (including amd64) the CPU maintains a bit in each page table entry recording whether the data in the page has been written to. The OS can indeed use that flag to avoid writing pages that haven't been modified back to disk.
There would be no increase in IO just because the mapping is large. The amount of data you actually access would determine that. Most OSes, including Linux and Windows, have a unified page cache model in which cached blocks use the same physical pages of memory as memory mapped pages. I wouldn't expect the OS to use more memory with memory mapping than with cached IO. You're just getting direct access to the cached pages.
One concern you may have is with flushing modified data to disk. I'm not sure what the policy is on your OS specifically, but the time between modifying a page and when the OS will actually write that data to disk may be a lot longer than you're expecting. Use a flush API to force the data to be written to disk if it's important to have it written by a certain time.
I haven't used file mappings quite that large in the past but I would expect it to work well and at the very least be worth trying.
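For what it's worth, here is a minimal sketch of the approach being discussed, assuming Linux/POSIX; the file path and the offset read at the end are placeholders:

```c
/* Minimal sketch: map a large read-only file and let the kernel's page
 * cache manage residency.  Path and offsets are placeholders; assumes a
 * 64-bit POSIX system. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/data/bigfile.bin";   /* placeholder path */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* PROT_READ + MAP_SHARED: the mapping is backed directly by the page
     * cache and every page stays "clean", so eviction never causes writes. */
    void *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Advisory hint that access is random, so the kernel doesn't waste
     * effort on aggressive read-ahead. */
    madvise(base, st.st_size, MADV_RANDOM);

    /* Small random reads now become plain memory accesses; a page fault
     * triggers a disk read only for pages not already cached. */
    unsigned char byte = ((unsigned char *)base)[12345];
    printf("byte at offset 12345: %u\n", byte);

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```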

Related

Is it possible to "gracefully" use virtual memory in a program whose regular use would consume all physical RAM?

I am intending to write a program to create huge relational networks out of unstructured data - the exact implementation is irrelevant but imagine a GPT-3-style large language model. Training such a model would require potentially 100+ gigabytes of available random access memory as links get reinforced between new and existing nodes in the graph. Only a small portion of the entire model would likely be loaded at any given time, but potentially any region of memory may be accessed randomly.
I do not have a machine with 512 GB of physical RAM. However, I do have one with a 512 GB NVMe SSD that I can dedicate for the purpose. I see two potential options for making this program work without specialized hardware:
I can write my own memory manager that swaps pages between "hot" resident memory and "cold" storage on disk, probably using memory-mapped files or some similar construct. This would require me to route all memory accesses in the modeling program through this custom memory manager, and to write the page cache, the concurrent-access handling, and all the other low-level machinery that comes with it, which would take days and very likely introduce bugs. Performance would also likely be poor. Or,
I can configure the operating system to use the entire SSD as a page file / swap space, and then just have the program reserve as much virtual memory as it needs - the same as any other normal program - relying on the kernel's memory manager, which already does the page mapping, swapping, and caching for me.
The problem I foresee with #2 is making the operating system understand what I am trying to do in a "cooperative" way. Ideally I would like to hint to the OS that I would only like a specific fraction of resident memory and swap the rest, to keep overall system RAM usage below 90% or so. Otherwise the OS will allocate 99% of physical RAM and then start aggressively compacting and cutting down memory from other background programs, which ends up making the whole system unresponsive. Linux apparently just starts sacrificing entire processes if it gets too bad.
Does there exist a kernel command in any language or operating system that would let me tell the OS to chill out and proactively swap user memory to disk? I have looked through VMM functions in kernel32.dll and the Linux paging and swap daemon (kswapd) documentation, but nothing looks like what I need. Perhaps some way to reserve, say, 1Gb of pages and then "donate" them back to the kernel to make sure they get used for processes that aren't my own? Some way to configure memory pressure or limits or make kswapd work more aggressively for just my process?
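For illustration only, here is a hedged sketch of the mmap-based variant of option #1 on Linux; the file path and the 400 GB size are assumptions. The point is just that a MAP_SHARED file-backed mapping lets the kernel's existing paging machinery handle the hot/cold management without a dedicated swap partition:

```c
/* Hedged sketch: back the model's memory with a large file on the NVMe SSD
 * and let the kernel page it.  The path and size are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BACKING_SIZE (400ULL * 1024 * 1024 * 1024)  /* 400 GB, illustrative */

int main(void)
{
    int fd = open("/mnt/nvme/model.bin", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve the on-disk space; ftruncate creates a sparse file. */
    if (ftruncate(fd, BACKING_SIZE) < 0) { perror("ftruncate"); return 1; }

    /* MAP_SHARED means dirty pages are written back to this file rather
     * than to swap, so no dedicated swap partition is required. */
    void *model = mmap(NULL, BACKING_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (model == MAP_FAILED) { perror("mmap"); return 1; }

    /* The modeling code then uses this region like ordinary memory; the
     * kernel keeps the hot working set resident and evicts cold pages
     * under memory pressure. */

    munmap(model, BACKING_SIZE);
    close(fd);
    return 0;
}
```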

Does GetWriteWatch work with Memory-Mapped Files?

I'm working with memory-mapped files (MMF) and very large datasets (depending on the input file), where each file is ~50 GB and around 40 files are open at the same time. Of course this varies; I can also have smaller or larger files, so the system should scale by itself.
The MMF is acting as a backing buffer, so as long as I have enough free memory no paging should occur. The problem is that the Windows memory manager and my application are two autonomous processes. Under good conditions everything works fine, but the memory manager is obviously too slow once I enter low-memory conditions: memory fills up and the system starts to page (which is good), but I keep allocating because I don't get any information about the paging.
In the end I reach a state where the system stalls: the memory manager pages and I keep allocating.
So I came to the point where I need to advise the memory manager, check the current memory conditions, and trigger the paging myself. For that reason I wanted to use GetWriteWatch to inspect the memory regions I can flush.
Interestingly, GetWriteWatch does not work in my situation; it returns -1 without filling the structures. So my question is: does GetWriteWatch work with MMFs?
Does GetWriteWatch work with Memory-Mapped Files?
I don't think so.
GetWriteWatch accepts only memory allocated via the VirtualAlloc function with the MEM_WRITE_WATCH flag.
File mappings are created with the MapViewOfFile* functions, which do not have this flag.
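A small sketch that contrasts the two cases; the sizes are arbitrary and error handling is mostly omitted, but the two GetWriteWatch calls show the documented behaviour (success only on the region allocated with MEM_WRITE_WATCH):

```c
/* Sketch contrasting write-watch memory with a file mapping view.
 * GetWriteWatch is documented to work only on regions allocated with
 * VirtualAlloc(..., MEM_WRITE_WATCH, ...); views from MapViewOfFile
 * have no way to request that flag. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T size = 64 * 1024;

    /* Case 1: VirtualAlloc with MEM_WRITE_WATCH -- GetWriteWatch works. */
    void *watched = VirtualAlloc(NULL, size,
                                 MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                                 PAGE_READWRITE);
    if (!watched) return 1;
    ((char *)watched)[0] = 1;                    /* dirty one page */

    void *addresses[16];
    ULONG_PTR count = 16;
    DWORD granularity;
    UINT rc = GetWriteWatch(0, watched, size, addresses, &count, &granularity);
    printf("VirtualAlloc region: rc=%u, dirty pages=%lu\n",
           rc, (unsigned long)count);            /* expect rc == 0 */

    /* Case 2: a pagefile-backed mapping view -- there is no write-watch
     * flag for MapViewOfFile, so GetWriteWatch fails on this region. */
    HANDLE mapping = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE, 0, (DWORD)size, NULL);
    void *view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, size);
    if (!view) return 1;
    count = 16;
    rc = GetWriteWatch(0, view, size, addresses, &count, &granularity);
    printf("MapViewOfFile view:  rc=%u (non-zero means failure)\n", rc);

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    VirtualFree(watched, 0, MEM_RELEASE);
    return 0;
}
```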

Does mmap directly access the page cache, or a copy of the page cache?

To ask the question another way, can you confirm that when you mmap() a file that you do in fact access the exact physical pages that are already in the page cache?
I ask because I’m doing testing on a 192 core machine with 1TB of RAM, on a 400GB data file that is pre-cached into the page cache prior to the test (by just dropping the cache, then doing md5sum on the file).
Initially, I had all 192 threads each mmap the file separately, on the assumption that they would all get (basically) the same memory region back (or perhaps the same memory region but somehow mapped multiple times). Accordingly, I assumed two threads using two different mappings to the same file would both have direct access to the same pages. (Let’s ignore NUMA for this example, though obviously it’s significant at higher thread counts.)
However, in practice I found performance would get terrible at higher thread counts when each thread separately mmapped the file. When we removed that and instead just did a single mmap that was passed into the thread (such that all threads just directly access the same memory region), then performance improved dramatically.
That’s all great, but I’m trying to figure out why. If in fact mmapping a file just grants direct access to the existing page cache, then I would think that it shouldn’t matter how many times you map it — it should all go to the exact same place.
But given that there was such a performance cost, it seemed to me that in fact each mmap was being independently and redundantly populated (perhaps by copying from the page cache, or perhaps by reading again from disk).
Can you comment on why I was seeing such different performance between shared access to the same memory, versus mmapping the same file?
Thanks, I appreciate your help!
I think I found my answer, and it deals with the page directory. The answer is yes, two mmapped regions of the same file will access the same underlying page cache data. However, each mapping needs to independently map each of the virtual pages to the physical pages -- meaning 2x as many entries in the page directory to access the same RAM.
Basically, each mmap() creates a new range in virtual memory. Every page of that range corresponds to a page of physical memory, and that mapping is stored in a hierarchical page directory -- with one entry per 4KB page. So every mmap() of a large region generates a huge number of entries in the page directory.
My guess is it doesn't actually define them all up front, which is why mmap() is instant to call even for a giant file. But over time it has to establish those entries as faults occur on the mmapped range, so the page tables get filled out gradually. This extra work to populate the entries is probably why threads using different mmaps are slower than threads sharing the same mmap. And I bet the kernel needs to erase all those entries when unmapping the range -- which is why munmap() is so slow.
(There's also the translation lookaside buffer, but that's per-CPU, and so small I don't think that matters much here.)
Anyway, it sounds like re-mapping the same region just adds extra overhead, for what seems to me like no gain.
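To make the faster pattern concrete, here is a hedged sketch of one mmap() shared by all worker threads (pthreads; the file path, thread count, and work split are placeholders):

```c
/* Sketch of the faster pattern: one mmap() of the file, shared by all
 * worker threads, so the page-table entries for the mapping are populated
 * once rather than once per thread. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 8

struct work {
    const unsigned char *base;   /* shared view of the file */
    size_t start, end;           /* half-open byte range for this thread */
    unsigned long sum;           /* per-thread result */
};

static void *worker(void *arg)
{
    struct work *w = arg;
    unsigned long sum = 0;
    for (size_t i = w->start; i < w->end; i++)
        sum += w->base[i];       /* reads go through the shared page tables */
    w->sum = sum;
    return NULL;
}

int main(void)
{
    int fd = open("/data/bigfile.bin", O_RDONLY);      /* placeholder path */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

    /* Map once in the main thread; every worker reuses this pointer. */
    unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    pthread_t tid[NTHREADS];
    struct work w[NTHREADS];
    size_t chunk = (size_t)st.st_size / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        w[i].base = base;
        w[i].start = (size_t)i * chunk;
        w[i].end = (i == NTHREADS - 1) ? (size_t)st.st_size
                                       : w[i].start + chunk;
        pthread_create(&tid[i], NULL, worker, &w[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```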

caches vs paging

So I'm in a computer architecture class, and I guess I'm having a hard time differentiating between caching and pages.
The only explanation I can come up with is that pages are the OS's way of tricking a program into thinking it's doing all its work in a specified region of memory, whereas a cache is the hardware's way of tricking the OS into thinking it's reading from one specified region of memory when it really isn't.
Does the OS direct the hardware that it needs a "new page", or is that taken care of when the OS tries to read an address that is "out of range" of the current cache "page" (for lack of a better term)?
Am I on the right track or am I completely crazy?
Caching and pages are orthogonal concepts.
A cache is a high-speed "memory" that acts to minimise the number of accesses to a large low-speed "memory". In the most general sense, the high-speed "memory" could be your hard disk acting to cache web pages fetched from the web (low-speed "memory"). Of course, in the context of computer architecture, the term "cache" is more likely to refer to physical RAM used to speed up access to slower RAM or disk.
Pages, OTOH, are simply a unit of management for the contents of RAM or disk.
These two concepts come together in implementing virtual memory systems. A process may allocate 500 MB of memory. This may be more than the physical RAM available to give to the process, so the operating system allocates blocks on disk called pages, which will hold the contents of certain logical pages in the process's address-space.
When the process accesses a location in its address-space, and the associated page isn't currently mapped into physical memory, the CPU signals a page fault, and the OS responds by fetching the page from disk while the process is in a suspended state. Once the page is mapped, the process resumes and is able to access that memory location as if it was there all along.
The common view that virtual memory is a way of tricking the process into thinking it has tons of RAM isn't the only way to think about this. You could also think of a process's address-space as being logically stored on disk pages, with the OS-assisted mapping into RAM being just a way to cache the contents of those pages such that the process isn't continually accessing the hard drive. In this sense, caching and paged virtual memory are logically the same thing. Just keep in mind that, while this viewpoint may help to understand the relationship between the two concepts, it isn't entirely accurate, since it is possible to run without virtual memory at all, just physical memory (in fact, most embedded systems run this way).
I think paging is bringing instructions and data from disk or secondary memory into main memory, while caching is bringing instructions and data from main memory closer to the CPU.

Memory mapped files causes low physical memory

I have 2 GB of RAM and am running a memory-intensive application; the system goes into a low-available-physical-memory state and stops responding to user actions, like opening any application or invoking a menu.
How do I trigger or tell the system to swap the memory to pagefile and free physical memory?
I'm using Windows XP.
If I run the same application on a 4 GB RAM machine this does not happen, and the system response is good. Once available physical memory runs low, the system automatically swaps to the pagefile and frees physical memory, and it is nowhere near as bad as on the 2 GB system.
To overcome this problem (on the 2 GB machine) I attempted to use memory-mapped files for the large datasets allocated by the application. In this case the virtual memory of the application (process) is fine, but the system cache is high and I have the same problem as above: physical memory is low.
Even though the memory-mapped file is not counted against the process's virtual memory, the system cache is high. Why?! :(
Any help is appreciated.
Thanks.
If your data access pattern for using the memory mapped file is sequential, you might get slightly better page recycling by specifying the FILE_FLAG_SEQUENTIAL_SCAN flag when opening the underlying file. If your data pattern accesses the mapped file in random order, this won't help.
You should consider decreasing the size of your map view. That's where all the memory is actually consumed and cached. Since it appears that you need to handle files that are larger than available contiguous free physical memory, you can probably do a better job of memory management than the virtual memory page swapper since you know more about how you're using the memory than the virtual memory manager does. If at all possible, try to adjust your design so that you can operate on portions of the large file using a smaller view.
Even if you can't get rid of the need for full random access across the entire range of the underlying file, it might still be beneficial to tear down and recreate the view as needed to move the view to the section of the file that the next operation needs to access. If your data access patterns tend to cluster around parts of the file before moving on, then you won't need to move the view as often. You'll take a hit to tear down and recreate the view object, but since tearing down the view also releases all the cached pages associated with the view, it seems likely you'd see a net gain in performance because the smaller view significantly reduces memory pressure and page swapping system wide. Try setting the size of the view based on a portion of the installed system RAM and move the view around as needed by your file processing. The larger the view, the less you'll need to move it around, but the more RAM it will consume potentially impacting system responsiveness.
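A hedged sketch of the sliding-view idea; the file name and the 64 MB view size are placeholders, and error handling is minimal:

```c
/* Sketch of the sliding-view approach: keep only a modest view of the
 * large file mapped at a time and move it as processing advances.
 * View size and file name are placeholders. */
#include <windows.h>
#include <stdio.h>

#define VIEW_SIZE (64ull * 1024 * 1024)   /* 64 MB view, illustrative */

int main(void)
{
    HANDLE file = CreateFileW(L"D:\\data\\big.dat", GENERIC_READ,
                              FILE_SHARE_READ, NULL, OPEN_EXISTING,
                              FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);

    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) return 1;

    /* Walk the file one view at a time; unmapping each view releases the
     * pages it was holding in the system cache. */
    for (ULONGLONG offset = 0; offset < (ULONGLONG)size.QuadPart;
         offset += VIEW_SIZE) {
        SIZE_T len = (SIZE_T)min(VIEW_SIZE,
                                 (ULONGLONG)size.QuadPart - offset);
        /* View offsets must be multiples of the allocation granularity;
         * a 64 MB step satisfies that. */
        const BYTE *view = MapViewOfFile(mapping, FILE_MAP_READ,
                                         (DWORD)(offset >> 32),
                                         (DWORD)(offset & 0xFFFFFFFF), len);
        if (!view) break;

        /* ... process view[0 .. len) here ... */

        UnmapViewOfFile(view);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```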
As I think you are hinting in your post, the slow response time is probably at least partially due to delays in the system while the OS writes the contents of memory to the pagefile to make room for other processes in physical memory.
The obvious solution (and possibly not practical) is to use less memory in your application. I'll assume that is not an option or at least not a simple option. The alternative is to try to proactively flush data to disk to continually keep available physical memory for other applications to run. You can find the total memory on the machine with GlobalMemoryStatusEx. And GetProcessMemoryInfo will return current information about your own application's memory usage. Since you say you are using a memory mapped file, you may need to account for that in addition. For example, I believe the PageFileUsage information returned from that API will not include information about your own memory mapped file.
If your application is monitoring the usage, you may be able to use FlushViewOfFile to proactively force data to disk from memory. There is also an API (EmptyWorkingSet) that I think attempts to write as many dirty pages to disk as possible, but that seems like it would very likely hurt performance of your own application significantly. Although, it could be useful in a situation where you know your application is going into some kind of idle state.
And, finally, one other API that might be useful is SetProcessWorkingSetSizeEx. You might consider using this API to give a hint on an upper limit for your application's working set size. This might help preserve more memory for other applications.
Edit: This is another obvious statement, but I forgot to mention it earlier. It also may not be practical for you, but it sounds like one of the best things you might do considering that you are running into 32-bit limitations is to build your application as 64-bit and run it on a 64-bit OS (and throw a little bit more memory at the machine).
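Pulling those APIs together, here is a hedged sketch of what the monitoring might look like; the 85% threshold, the view pointer, and the idle flag are placeholders, not a recommendation:

```c
/* Hedged sketch of the monitoring idea: check system-wide memory load and,
 * when it climbs past a threshold, flush the mapped view so its dirty pages
 * can be written out and dropped.  All thresholds and arguments are
 * placeholders; this is not a drop-in solution. */
#include <windows.h>
#include <psapi.h>          /* EmptyWorkingSet; link with psapi.lib */

void relieve_memory_pressure(void *mapped_view, SIZE_T view_size, BOOL idle)
{
    MEMORYSTATUSEX ms;
    ms.dwLength = sizeof(ms);
    if (!GlobalMemoryStatusEx(&ms))
        return;

    if (ms.dwMemoryLoad > 85) {          /* >85% of physical RAM in use */
        /* Queue writes of dirty mapped pages back to the backing file. */
        FlushViewOfFile(mapped_view, view_size);

        /* Only when the application is otherwise idle: trim the working
         * set, writing as many dirty pages to disk as possible.  As noted
         * above, doing this during active processing will likely hurt
         * performance. */
        if (idle)
            EmptyWorkingSet(GetCurrentProcess());
    }
}
```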
Well, it sounds like your program needs more than 2GB of working set.
Modern operating systems are designed to use most of the RAM for something at all times, only keeping a fairly small amount free so that it can be immediately handed out to processes that need more. The rest is used to hold memory pages and cached disk blocks that have been used recently; whatever hasn't been used recently is flushed back to disk to replenish the pool of free pages. In short, there isn't supposed to be much free physical memory.
The principal difference between a normal memory allocation and a memory-mapped file is where the data gets stored when it must be paged out of memory. It doesn't necessarily have any effect on when the memory will be paged out, and it will have little effect on the time it takes to page it out.
The real problem you are seeing is probably not that you have too little free physical memory, but that the paging rate is too high.
My suggestion would be to attempt to reduce the amount of storage needed by your program, and see if you can increase the locality of reference to reduce the amount of paging needed.
