I'm looking for a way to allocate huge pages (2M or 1G) in a kernel module (I'm using kernel version 4.15.0).
In user space, I can mount the hugetlbfs file system, and then allocate huge pages using mmap (see, e.g., https://blog.kevinhu.me/2018/07/01/01-Linux-Hugepages/). Is there a similar way to do this in kernel space?
I'm aware that I could allocate them in user space first, and then pass them to the kernel using get_user_pages, as described in Sequential access to hugepages in kernel driver. However, I'm looking for a more direct way to allocate them, as I only need them in kernel space.
Something similar to
kmalloc(0x200000, GFP_KERNEL | __GFP_COMP)
should work.
As explained in this LWN article:
A compound page is simply a grouping of two or more physically contiguous pages into a unit that can, in many ways, be treated as a single, larger page. They are most commonly used to create huge pages, used within hugetlbfs or the transparent huge pages subsystem, but they show up in other contexts as well. Compound pages can serve as anonymous memory or be used as buffers within the kernel; they cannot, however, appear in the page cache, which is only prepared to deal with singleton pages.
This makes the assumption that huge pages are configured and available.
Related
outI'm working with memory mapped files (MMF) with very large datasets (depending on the input file), where each file has ~50GB and there are around 40 files open at the same time. Of course this depends, I can also have smaller files, but I can also have larger files - so the system should scale itself.
The MMF is acting as a backing buffer, so as long as I have enough free memory there shoud occur no paging. The problem is that the windows memory manager and my application are two autonomous processes. In good conditions everything is working fine, but the memory manager obviously is too slow in conditions where I'm entering low memory conditions, the memory is full and then the system starts to page (which is good), but I'm still allocating memory, because I don't get any information about the paging.
In the end I'm entering a state where the system stalls, the memory manager pages and I'm allocating.
So I came to the point where I need to advice the memory manager, check current memory conditions and invoke the paging myself. For that reason I wanted to use the GetWriteWatch to inspect the memory region I can flush.
Interestingly the GetWriteWatch does not work in my situation, it returns a -1 without filling the structures. So my question is does GetWriteWatch work with MMFs?
Does GetWriteWatch work with Memory-Mapped Files?
I don't think so.
GetWriteWatch accepts memory allocated via VirtualAlloc function using MEM_WRITE_WATCH.
File mapping are mapped using MapViewOfFile* functions that do not have this flag.
To ask the question another way, can you confirm that when you mmap() a file that you do in fact access the exact physical pages that are already in the page cache?
I ask because I’m doing testing on a 192 core machine with 1TB of RAM, on a 400GB data file that is pre-cached into the page cache prior to the test (by just dropping the cache, then doing md5sum on the file).
Initially, I had all 192 threads each mmap the file separately, on the assumption that they would all get (basically) the same memory region back (or perhaps the same memory region but somehow mapped multiple times). Accordingly, I assumed two threads using two different mappings to the same file would both have direct access to the same pages. (Let’s ignore NUMA for this example, though obviously it’s significant at higher thread counts.)
However, in practice I found performance would get terrible at higher thread counts when each thread separately mmapped the file. When we removed that and instead just did a single mmap that was passed into the thread (such that all threads just directly access the same memory region), then performance improved dramatically.
That’s all great, but I’m trying to figure out why. If in fact mmapping a file just grants direct access to the existing page cache, then I would think that it shouldn’t matter how many times you map it — it should all go to the exact same place.
But given that there was such a performance cost, it seemed to me that in fact each mmap was being independently and redundantly populated (perhaps by copying from the page cache, or perhaps by reading again from disk).
Can you comment on why I was seeing such different performance between shared access to the same memory, versus mmapping the same file?
Thanks, I appreciate your help!
I think I found my answer, and it deals with the page directory. The answer is yes, two mmapped regions of the same file will access the same underlying page cache data. However, each mapping needs to independently map each of the virtual pages to the physical pages -- meaning 2x as many entries in the page directory to access the same RAM.
Basically, each mmap() creates a new range in virtual memory. Every page of that range corresponds to a page of physical memory, and that mapping is stored in a hierarchical page directory -- with one entry per 4KB page. So every mmap() of a large region generates a huge number of entries in the page directory.
My guess is it doesn't actually define them all up front, which is why mmap() is instant to call even for a giant file. But over time it probably has to establish those entries as there are faults on the mmapped range, meaning over the course of time it gets filled out. This extra work to populate the page directory is probably why threads using different mmaps are slower than threads sharing the same mmap. And I bet the kernel needs to erase all those entries when unmapping the range -- which is why unmmap() is so slow.
(There's also the translation lookaside buffer, but that's per-CPU, and so small I don't think that matters much here.)
Anyway, it sounds like re-mapping the same region just adds extra overhead, for what seems to me like no gain.
We have an app that requires ~1MB buffers for a hardware device to fill, therefore we wrote a kernel module that allocates buffers using kmalloc(). We did not use dma_alloc_coherent() as we need to manipulative the buffers and therefore wanted them to be cached (we flush the cache when needed). One of the manipulations that is done is the kernel module copies one buffer to another buffer. In timing these copies we see it takes about ~2ms to copy a buffer. The time does not include any cache flushing.
As this seemed slow we wrote a standard userspace test app, that used malloc() to create 1MB buffers and copied them. The userspace copies took about .5ms, which is about the correct time to move this amount of memory on the processor/memory config we are using.
Thinks we tried: To make sure it wasn't a different memcpy() in kernel space and user space we wrote our own NEON optimized copy, but made no difference. Changed the buffer size from 100KB to 10MB and made no difference. All times were over 10 copies, but always very very consistent. Time routine used gettimeofday() in userspace.
Only thing we can think of is that the data cache is setup up different for kmalloc()'ed memory then malloc()'ed memory???
We are working on iMX6 ARM, Linaro kerne.
The kmalloc() memory will be contiguous in physical space. The user space will definitely not (mlock() may result in closer to contiguous). If you have several SDRAM chips, it is possible that your memory controller allow pipelining or multiple issue reads/writes to different chips simultaneously. It may even be faster with multiple banks. vmalloc() will not use contiguous pages.Ref You should be able to write a test to swap kmalloc() with vmalloc(). If something has changed with the newer ARMs and the cache is not VIVT, the difference in physical addresses could cause cache (aliasing?) effects on some processors.
I do not think that the cache are setup differently for kernel memory versus user memory; at least with 2.6.34 variants; but they may come from different pools. Also, for a memcpy() a large cache is not needed; you just need enough to make sure the SDRAM will burst.
Another issue is peripherals. For instance, a large graphics buffer on one chip maybe stealing cycles via DMA. If you can change your machine file or device table to disable as many drivers as possible, this can be eliminated. This combined with the pipelining could account for the type of slow-down observed.
I believe this is a platform issue. If it was strictly Linux, I think that one of the millions of users may have encountered it. However, you haven't given a specific version of Linux. It could be an ARM based issue; so I tagged it as such. I think it is your platform/ARM combination; simply because others would observe this. Can you also provide a specific machine file or device table that your design was based upon and the Linux version.
I'm implementing IPC between two processes on the same machine (Linux x86_64 shmget and friends), and I'm trying to maximize the throughput of the data between the processes: for example I have restricted the two processes to only run on the same CPU, so as to take advantage of hardware caching.
My question is, does it matter where in the virtual address space each process puts the shared object? For example would it be advantageous to map the object to the same location in both processes? Why or why not?
It doesn't matter as long as the OS is concerned. It would have been advantageous to use the same base address in both processes if the TLB cache wasn't flushed between context switches. The Translation Lookaside Buffer (TLB) cache is a small buffer that caches virtual to physical address translations for individual pages in order to reduce the number of expensive memory reads from the process page table. Whenever a context switch occurs, the TLB cache is flushed - you don't want processes to be able to read a small portion of the memory of other processes, just because its page table entries are still cached in the TLB.
Context switch does not occur between processes running on different cores. But then each core has its own TLB cache and its content is completely uncorrelated with the content of the TLB cache of the other core. TLB flush does not occur when switching between threads from the same process. But threads share their whole virtual address space nevertheless.
It only makes sense to attach the shared memory segment at the same virtual address if you pass around absolute pointers to areas inside it. Imagine, for example, a linked list structure in shared memory. The usual practice is to use offsets from the beginning of the block instead of aboslute pointers. But this is slower as it involves additional pointer arithmetic. That's why you might get better performance with absolute pointers, but finding a suitable place in the virtual address space of both processes might not be an easy task (at least not doing it in a portable way), even on platforms with vast VA spaces like x86-64.
I'm not an expert here, but seeing as there are no other answers I will give it a go. I don't think it will really make a difference, because the virutal address does not necessarily correspond to the physical address. Said another way, the underlying physical address the OS maps your virtual address to is not dependent on the virtual address the OS gives you.
Again, I'm not a memory master. Sorry if I am way off here.
The function CreateFileMapping can be used to allocate space in the pagefile (if the first argument is INVALID_HANDLE_VALUE). The allocated space can later be memory mapped into the process virtual address space.
Why would I want to do this instead of using just VirtualAlloc?
It seems that both functions do almost the same thing. Memory allocated by VirtualAlloc may at some point be pushed out to the pagefile. Why should I need an API that specifically requests that my pages be allocated there in the first instance? Why should I care where my private pages live?
Is it just a hint to the OS about my expected memory usage patterns? (Ie, the former is a hint to swap out those pages more aggressively.)
Or is it simply a convenience method when working with very large datasets on 32-bit processes? (Ie, I can use CreateFileMapping to make >4Gb allocations, then memory map smaller chunks of the space as needed. Using the pagefile saves me the work of manually managing my own set of files to "swap" to.)
PS. This question is sparked by an article I read recently: http://blogs.technet.com/markrussinovich/archive/2008/11/17/3155406.aspx
From the CreateFileMappingFunction:
A single file mapping object can be shared by multiple processes.
Can the Virtual memory be shared across multiple processes?
One reason is to share memory among different processes. Different processes by only knowing the name of the mapping object can communicate over page file. This is preferable over creating a real file and doing the communications. Of course there may be other use cases. You can refer to Using a File Mapping for IPC at MSDN for more information.