mmap only needed pages of kernel buffer to user space - linux-kernel

See also this answer: https://stackoverflow.com/a/10770582/1284631
I need something similar, but without having to allocate a buffer: the buffer is large, in theory, but the user space program only needs to access some parts of it, so a limited number of pages.
The question is:
What would be the body of the my_vm_ops.fault() method and what page to return through vmf->page? (it needs to allocate the needed page, but not from a pre-existing buffer)

What you want to do is already possible by calling mmap with MAP_ANONYMOUS.
Alternatively, call mmap on /dev/zero.

Related

Difference between copying user space memory and mapping userspace memory

What is the difference between copying from user space buffer to kernel space buffer and, mapping user space buffer to kernel space buffer and then copying kernel space buffer to another kernel data structure?
What I meant to say is:
The first method is copy_from_user() function.
The second method is say, a user space buffer is mapped to kernel space and the kernel is passed with physical address(say using /proc/self/pagemap), then kernel space calls phys_to_virt() on the passed physical address to get it's corresponding kernel virtual address. Then kernel copies the data from one of its data structures say skb_buff to the kernel virtual address it got from the call to phys_to_virt() call.
Note: phys_to_virt() adds an offset of 0xc0000000 to the passed physical address to get kernel virtual address, right?
The second method describes the functionality in DPDK for KNI module and they say in documentation that it eliminates the overhead of copying from user space to kernel space. Please explain me how.
It really depends on what you're trying to accomplish, but still some differences I can think about?
To begin with, copy_from_user has some built-in security checks that should be considered.
While mapping your data "manually" to kernel space enables you to read from it continuously, and maybe monitor something that the user process is doing to the data in that page, while using the copy_to_user method will require constantly calling it to be aware of changes.
Can you elaborate on what you are trying to do?

kernel and user space sync

I have memory area mapped to user space with do_mmap_pgoff() and remap_pfn_range() and I have the same area mapped to kernel with ioremap().
When I write to this area from user space and then read from kernel space I see that not all bytes was written to memory area.
When I write from user space then read from user and after that read from kernel everything fine. Reading from user space pushing changes made previously.
I understand that cache or buffer exist between kernel and user spaces. I understand that I need to implement some flush-invalidate or buffer dump to memory area.
I tried to make this VMA uncached with pgprot_uncached(), I tried to implement outer cache range flush-invalidate, VMA cache range flush, VMA tlb range flush but it all dont work as I expected. All flush-inval operations just clears memory area but I need to apply changes made from user space. Using uncached memory slows up the process of data transferring.
How to do that synchronization between user and kernel correctly?
I have nearly the same question as you.
I use a shared memory region to pass data between kernel and user space. In kernel, I directly use physical address to access data. In user space, I open /dev/mem and mmap it to read/write.
And problem comes: When I write data to address A from user space, the kernel may not receive the data, and even covers data in A with it's previous value. I think CPU cache may cause this problem.
Here is my solution:
I open /dev/mem like this:
fd = open("/dev/mem", O_RDWR);
NOT this:
fd = open("/dev/mem", O_RDWR | O_SYNC);
And problem solved.

Allocating a buffer of more a page size on stack will corrupt memory?

In Windows, stack is implemented as followed: a specified page is followed committed stack pages. It's protection flag is as guarded. So when thead references an address on the guared page, an memory fault rises which makes memory manager commits the guarded page to the stack and clean the page's guarded flag, then it reserves a new page as guarded.
when I allocate an buffer which size is more than one page(4KB), however, an expected error haven't happen. Why?
Excellent question (+1).
There's a trick, and few people know about it (besides driver writers).
When you allocate large buffer on the stack - the compiler automatically adds so-called stack probes. It's an extra code (implemented in CRT usually), which probes the allocated region, page-by-page, in the needed order.
EDIT:
The function is _chkstk.
The fault doesn't reach your program - it is handled by the operating system. Similar thing happens when your program tries to read memory that happens to be written into the swap file - a trap occurs and the operating system unswaps the page and your program continues.

How can I access memory at known physical address inside a kernel module?

I am trying to work on this problem: A user spaces program keeps polling a buffer to get requests from a kernel module and serves it and then responses to the kernel.
I want to make the solution much faster, so instead of creating a device file and communicating via it, I allocate a memory buffer from the user space and mark it as pinned, so the memory pages never get swapped out. Then the user space invokes a special syscall to tell the kernel about the memory buffer so that the kernel module can get the physical address of that buffer. (because the user space program may be context-switched out and hence the virtual address means nothing if the kernel module accesses the buffer at that time.)
When the module wants to send request, it needs put the request to the buffer via physical address. The question is: How can I access the buffer inside the kernel module via its physical address.
I noticed there is get_user_pages, but don't know how to use it, or maybe there are other better methods?
Thanks.
You are better off doing this the other way around - have the kernel allocate the buffer, then allow the userspace program to map it into its address space using mmap().
Finally I figured out how to handle this problem...
Quite simple but may be not safe.
use phys_to_virt, which calls __va(pa), to get the virtual address in kernel and I can access that buffer. And because the buffer is pinned, that can guarantee the physical location is available.
What's more, I don't need s special syscall to tell the kernel the buffer's info. Instead, a proc file is enough because I just need tell the kernel once.

Can address space be recycled for multiple calls to MapViewOfFileEx without chance of failure?

Consider a complex, memory hungry, multi threaded application running within a 32bit address space on windows XP.
Certain operations require n large buffers of fixed size, where only one buffer needs to be accessed at a time.
The application uses a pattern where some address space the size of one buffer is reserved early and is used to contain the currently needed buffer.
This follows the sequence:
(initial run) VirtualAlloc -> VirtualFree -> MapViewOfFileEx
(buffer changes) UnMapViewOfFile -> MapViewOfFileEx
Here the pointer to the buffer location is provided by the call to VirtualAlloc and then that same location is used on each call to MapViewOfFileEx.
The problem is that windows does not (as far as I know) provide any handshake type operation for passing the memory space between the different users.
Therefore there is a small opportunity (at each -> in my above sequence) where the memory is not locked and another thread can jump in and perform an allocation within the buffer.
The next call to MapViewOfFileEx is broken and the system can no longer guarantee that there will be a big enough space in the address space for a buffer.
Obviously refactoring to use smaller buffers reduces the rate of failures to reallocate space.
Some use of HeapLock has had some success but this still has issues - something still manages to steal some memory from within the address space.
(We tried Calling GetProcessHeaps then using HeapLock to lock all of the heaps)
What I'd like to know is there anyway to lock a specific block of address space that is compatible with MapViewOfFileEx?
Edit: I should add that ultimately this code lives in a library that gets called by an application outside of my control
You could brute force it; suspend every thread in the process that isn't the one performing the mapping, Unmap/Remap, unsuspend the suspended threads. It ain't elegant, but it's the only way I can think of off-hand to provide the kind of mutual exclusion you need.
Have you looked at creating your own private heap via HeapCreate? You could set the heap to your desired buffer size. The only remaining problem is then how to get MapViewOfFileto use your private heap instead of the default heap.
I'd assume that MapViewOfFile internally calls GetProcessHeap to get the default heap and then it requests a contiguous block of memory. You can surround the call to MapViewOfFile with a detour, i.e., you rewire the GetProcessHeap call by overwriting the method in memory effectively inserting a jump to your own code which can return your private heap.
Microsoft has published the Detour Library that I'm not directly familiar with however. I know that detouring is surprisingly common. Security software, virus scanners etc all use such frameworks. It's not pretty, but may work:
HANDLE g_hndPrivateHeap;
HANDLE WINAPI GetProcessHeapImpl() {
return g_hndPrivateHeap;
}
struct SDetourGetProcessHeap { // object for exception safety
SDetourGetProcessHeap() {
// put detour in place
}
~SDetourGetProcessHeap() {
// remove detour again
}
};
void MapFile() {
g_hndPrivateHeap = HeapCreate( ... );
{
SDetourGetProcessHeap d;
MapViewOfFile(...);
}
}
These may also help:
How to replace WinAPI functions calls in the MS VC++ project with my own implementation (name and parameters set are the same)?
How can I hook Windows functions in C/C++?
http://research.microsoft.com/pubs/68568/huntusenixnt99.pdf
Imagine if I came to you with a piece of code like this:
void *foo;
foo = malloc(n);
if (foo)
free(foo);
foo = malloc(n);
Then I came to you and said, help! foo does not have the same address on the second allocation!
I'd be crazy, right?
It seems to me like you've already demonstrated clear knowledge of why this doesn't work. There's a reason that the documention for any API that takes an explicit address to map into lets you know that the address is just a suggestion, and it can't be guaranteed. This also goes for mmap() on POSIX.
I would suggest you write the program in such a way that a change in address doesn't matter. That is, don't store too many pointers to quantities inside the buffer, or if you do, patch them up after reallocation. Similar to the way you'd treat a buffer that you were going to pass into realloc().
Even the documentation for MapViewOfFileEx() explicitly suggests this:
While it is possible to specify an address that is safe now (not used by the operating system), there is no guarantee that the address will remain safe over time. Therefore, it is better to let the operating system choose the address. In this case, you would not store pointers in the memory mapped file, you would store offsets from the base of the file mapping so that the mapping can be used at any address.
Update from your comments
In that case, I suppose you could:
Not map into contiguous blocks. Perhaps you could map in chunks and write some intermediate function to decide which to read from/write to?
Try porting to 64 bit.
As the earlier post suggests, you can suspend every thread in the process while you change the memory mappings. You can use SuspendThread()/ResumeThread() for that. This has the disadvantage that your code has to know about all the other threads and hold thread handles for them.
An alternative is to use the Windows debug API to suspend all threads. If a process has a debugger attached, then every time the process faults, Windows will suspend all of the process's threads until the debugger handles the fault and resumes the process.
Also see this question which is very similar, but phrased differently:
Replacing memory mappings atomically on Windows

Resources