What is the difference between the vma_flags VM_IO and VM_RESERVED? How should they be used?
The comments in the Linux kernel source code are very confusing:
http://lxr.free-electrons.com/source/include/linux/mm.h?v=3.4;a=arm#L104
http://lxr.free-electrons.com/source/include/linux/mm.h?v=3.4;a=arm#L96
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
#define VM_RESERVED 0x00080000 /* Count as reserved_vm like IO */
Thanks
From: http://www.makelinux.net/ldd3/chp-15-sect-1
"VM_IO marks a VMA as being a memory-mapped I/O region. Among other things, the VM_IO flag prevents the region from being included in process core dumps.
VM_RESERVED tells the memory management system not to attempt to swap out this VMA; it should be set in most device mappings."
For a better understanding of how and why these flags evolved, and what the Linux kernel community recommends, read these LWN articles:
http://lwn.net/Articles/161204/
http://lwn.net/Articles/162860/
UPDATE:
The VM_RESERVED flag has since been removed from the kernel. See Linus' patch.
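For illustration, a minimal sketch of a driver mmap handler that sets these flags (MY_PHYS_BASE is a made-up device address; on kernels from around 3.7 onward, VM_DONTEXPAND | VM_DONTDUMP take the place of the removed VM_RESERVED):

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    /* Mark the mapping as memory-mapped I/O and keep it out of core
     * dumps; VM_DONTEXPAND | VM_DONTDUMP replace the old VM_RESERVED */
    vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
    return remap_pfn_range(vma, vma->vm_start,
                           MY_PHYS_BASE >> PAGE_SHIFT, /* hypothetical device address */
                           vma->vm_end - vma->vm_start,
                           vma->vm_page_prot);
}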
I have written a driver whose purpose is to allow a userspace program to pin its pages and get their physical addresses.
Specifically, I do this with a call to get_user_pages_fast in my kernel module.
For reference, the source code to this module can be found here: https://github.com/UofT-HPRC/mpsoc_drivers/tree/master/pinner
Using /dev/mem (and yes, my kernel does allow unsafe /dev/mem accesses) I have confirmed that the physical addresses are correct.
However, I have some external hardware (an AXI DMA in an FPGA, to be precise) that is not working, and it looks like it might be a cache coherency problem. On lines 329-337 of the above-linked code, I do the following (in this code, cmd.usr_buf is a user virtual address):
//Find the VMA containing the user's buffer
struct vm_area_struct *vma = find_vma(current->mm, (unsigned long)cmd.usr_buf);
if (!vma) {
    printk(KERN_ALERT "pinner: unrecognized user virtual address\n");
    return -EINVAL;
}
flush_cache_range(vma, (unsigned long)cmd.usr_buf,
                  (unsigned long)cmd.usr_buf + cmd.usr_buf_sz);
This doesn't appear to help. I have also tried the more general flush_cache_mm function.
Is there a correct way to flush the cache of user pages?
I tried a different API for flushing the cache. Laurent Pinchart gave a talk called "Mastering the DMA and IOMMU APIs", in which he explains that the functions in <asm/cacheflush.h> shouldn't be used. Instead, you can use functions like dma_map_sg and dma_unmap_sg when pinning user memory. I took a quick look at the kernel sources: these functions eventually call architecture-specific assembly routines that perform the actual cache maintenance for the mapped regions.
Also, dma_sync_sg_for_cpu and dma_sync_sg_for_device can be used to force cache flushes if you need to access the memory between DMA transfers.
I rewrote my kernel driver to use these functions, and it works.
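Roughly, the pattern looks like this (a sketch under a function name of my own; the get_user_pages_fast call shown uses the older form whose third argument is a write flag, which newer kernels replace with FOLL_* gup_flags, and error cleanup is abbreviated):

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

static int pinner_dma_map(struct device *dev, unsigned long uaddr,
                          size_t len, struct page **pages, int n_pages,
                          struct sg_table *sgt)
{
    int pinned, nents;

    /* Pin the user pages so they cannot be swapped out or migrated */
    pinned = get_user_pages_fast(uaddr, n_pages, 1, pages);
    if (pinned < n_pages)
        return -EFAULT; /* releasing the partial pin is omitted here */

    /* Build a scatterlist covering the pinned pages */
    if (sg_alloc_table_from_pages(sgt, pages, pinned,
                                  offset_in_page(uaddr), len, GFP_KERNEL))
        return -ENOMEM;

    /* dma_map_sg performs the cache maintenance that the raw
     * flush_cache_* calls did not reliably provide */
    nents = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
    if (!nents)
        return -EIO;

    /* Program the device using sg_dma_address()/sg_dma_len(); between
     * transfers, bracket CPU accesses with dma_sync_sg_for_cpu() and
     * dma_sync_sg_for_device(); tear down with dma_unmap_sg() */
    return 0;
}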
While browsing the code of a device driver in Linux, I found the flag PF_MEMALLOC being set on a thread (process). I found the definition of this flag in a header file, which only says "Allocating memory":
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
So my question is: what exactly is the use of this flag when it is set on a process/thread, as in current->flags |= PF_MEMALLOC;?
This flag is used within the kernel to mark a thread that is currently executing within the memory-allocation path, and which is therefore allowed to recursively allocate any memory it requires, ignoring watermarks and without being forced to write out dirty pages.
This is to ensure that if the code that is attempting to free pages in order to satisfy an original allocation request itself has to allocate a small amount of memory to proceed, that code won't then recursively try to free pages.
Most drivers should not require this flag.
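For code that does legitimately need it, the usual idiom saves and restores the flag rather than clearing it unconditionally (a minimal sketch; newer kernels wrap exactly this pattern in memalloc_noreclaim_save()/memalloc_noreclaim_restore()):

unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;

current->flags |= PF_MEMALLOC;
/* ... perform the allocation that must not recurse into reclaim ... */
current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;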
I was reading section 'Part Id' of the following document (I'm not sure how relevant this document is to kernel 2.6.35, for instance); specifically, it says:
..the DMA address of the memory must be within the dma_mask of the device..
and it recommends passing certain flags, such as GFP_DMA, to kmalloc so that the memory is guaranteed to fall within the DMA mask provided.
However, if the memory is allocated from a cache pool created by kmem_cache_create, and allocated with kmem_cache_alloc(.. GFP_ATOMIC), does that fail to meet the requirements outlined in DMA-API.txt?
On the other hand, LDD talks about the __GFP_DMA flag with regard to legacy ISA devices, so I'm not sure it is applicable to PCI/PCIe devices.
This is an x86 64-bit platform, if it matters:
pci_set_dma_mask(dev, 0xffffffffffffffffULL);
pci_set_consistent_dma_mask(dev, 0xffffffffffffffffULL);
I would appreciate some explanation of this.
Regarding GFP_* flags for DMA:
On x86:
ISA - when using kmalloc(), you need to bitwise-OR GFP_DMA with GFP_KERNEL (or GFP_ATOMIC), because GFP_DMA guarantees:
(1) physical addresses are consecutive when get_free_page returns more than one page, and
(2) only addresses lower than MAX_DMA_ADDRESS are returned. MAX_DMA_ADDRESS is 16 MB on the PC because of ISA constraints.
PCI - you don't need GFP_DMA, because there is no MAX_DMA_ADDRESS limit.
The device's dma_mask is checked when you call dma_map_* or dma_alloc_coherent.
dma_alloc_coherent ensures the memory allocated is usable with dma_map_*, and gives other benefits too. (The implementation may choose to ignore flags that affect the location of the returned memory, like GFP_DMA.)
You can refer to http://coweb.cc.gatech.edu/sysHackfest/uploads/58/DMA_howto.1.txt
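As an illustration, a sketch of the more modern idiom (dma_set_mask_and_coherent appeared in later kernels than the question's 2.6.35; pdev and size here stand in for the question's device and buffer size):

#include <linux/dma-mapping.h>

void *buf;
dma_addr_t dma_handle;

/* Prefer DMA_BIT_MASK() over hand-written 0xff... constants */
if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)) &&
    dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
    return -EIO; /* no usable DMA addressing */

/* The allocator honors the device's mask; no GFP_DMA needed for PCI */
buf = dma_alloc_coherent(&pdev->dev, size, &dma_handle, GFP_KERNEL);
if (!buf)
    return -ENOMEM;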
I'm intrigued by the DISCARDABLE flag in the section flags in PE files, specifically in the context of Windows drivers (in this case NDIS). I noticed that the INIT section was marked as RWX in a driver I'm reviewing, which seems odd - good security practice says you should adopt a W^X policy.
The dump of the section is as follows:
Name Virtual Size Virtual Addr Raw Size Raw Addr Reloc Addr LineNums RelocCount LineNumCount Characteristics
INIT 00000B7E 0000E000 00000C00 0000B200 00000000 00000000 0000 0000 E2000020
The characteristics map to:
IMAGE_SCN_MEM_EXECUTE
IMAGE_SCN_MEM_READ
IMAGE_SCN_MEM_WRITE
IMAGE_SCN_MEM_DISCARDABLE
IMAGE_SCN_CNT_CODE
The INIT section seems to contain the driver entry, which implies that it might be used to ensure that the driver entry function resides in nonpaged memory, whereas the rest of the code is allowed to be paged. I'm not entirely sure, though. I can see no evidence in the driver code to say that the developers explicitly set the page flags, or forced the driver entry into a separate section, so it looks like the compiler did it automatically. I also manually flipped the writeable flag in the driver binary to test it out, and it works fine without writing enabled, so that implies that having it RWX is unnecessary.
So, my questions are:
What is the INIT section used for in the context of a Windows driver and why is it marked discardable?
How are discardable sections treated in the Windows kernel? I have some idea of how ReactOS handles them but that's still fuzzy and not massively helpful.
Why would the compiler move the driver entry to an INIT section?
Why would the compiler mark the section as RWX, when RX is sufficient and RWX may constitute a security issue?
References I've looked at so far:
What happens when you mark a section as DISCARDABLE? - The Old New Thing
Windows Executable Files - x86 Disassembly Book
Pageable and Discardable Code in a Protocol Driver - MSDN
EDIT, 2022: I forgot to update this, but a while after I posted this question I passed it on to Microsoft and it did turn out to be a bug in the MSVC linker. They were mistakenly marking the discard section that contained DriverEntry as RWX. The issue was fixed in VS2015.
What is the INIT section used for in the context of a Windows...
It is normally used for the DriverEntry() function.
How are discardable sections treated in the Windows kernel?
It allows the page(s) that contain the DriverEntry() function code to be discarded. They are no longer needed after the driver is initialized.
Why would the compiler move the driver entry to an INIT section?
An NDIS driver normally contains
#pragma NDIS_INIT_FUNCTION(DriverEntry)
Which is a macro in the WDK's inc/ddk/ndis.h header file:
#define NDIS_INIT_FUNCTION(_F) alloc_text(INIT,_F)
#pragma alloc_text is one of the ways to move a function into a particular section, as sketched below. Another common way is to bracket the DriverEntry function with #pragma code_seg("INIT") and #pragma code_seg().
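For illustration, the alloc_text pattern looks roughly like this (a sketch; the declaration must precede the pragma):

DRIVER_INITIALIZE DriverEntry; /* declaration required before the pragma */
#pragma alloc_text(INIT, DriverEntry)

NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    UNREFERENCED_PARAMETER(DriverObject);
    UNREFERENCED_PARAMETER(RegistryPath);
    /* one-time initialization only; these pages are discarded
     * once the driver has finished loading */
    return STATUS_SUCCESS;
}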
Why would the compiler mark the section as RWX
That requires an archeological dig. Many drivers were started a long time ago and are likely to still use ~VS6, back when life was still uncomplicated and programmers wore white hats. Or perhaps the programmer used #pragma section, yet another way to name sections, which permits setting the attributes directly. A modern toolchain certainly won't do this; you get RX from #pragma alloc_text. There is very little point in fretting about it, given that DriverEntry() lives for a very short time, and any malware code that runs with ring0 privileges can do a lot more practical damage.
I passed this information on to Microsoft and it did turn out to be a bug in the MSVC linker. They were mistakenly marking the discard section that contained DriverEntry as RWX. This issue was fixed in Visual Studio 2015.
I wrote about the issue in more detail here.
The 3rd parameter of VirtualProtect can take flags such as the following:
PAGE_EXECUTE
PAGE_NOACCESS
PAGE_READWRITE
PAGE_READONLY
...
At first I thought VirtualProtect might achieve this by using the PTE's flags. But when I read the structure of the PTE, I could not find a field that records this function's 3rd parameter.
The PTE structure is as follows (sorry, I cannot post images since I don't have 10 reputation; you can find it on Google).
I want to find where Windows records the protection flag of a virtual memory page. Is it not in the PTE?
After reading some material, I noticed that when a PTE is invalid, the meaning of its fields changes, and there are then 5 bits for the protection flags.
The available protection flags are a superset of what an Intel processor supports. Keep in mind that Windows was written to run on a variety of processors; it once supported MIPS, Itanium, Alpha, and PowerPC as well. A mere footnote today: AMD/Intel won by a landslide, with ARM popular on mobile devices.
An Intel processor has pretty limited support for the page protection attributes. A page table entry has:
bit 1 for (R/W), a 1 allows write access, a 0 only allows read access
bit 2 for (U/S), user/supervisory, not relevant to user mode code
bit 63 for (XD), eXecute Disabled. A late addition to AMD cores, originally marketed as "Enhanced Virus Protection", adopted by Intel. All processors you'll find today support it.
So the kernel maps the protection flags like this:
PAGE_NOACCESS: the page simply won't be mapped to RAM
PAGE_READONLY: R/W = 0, XD = 1
PAGE_READWRITE: R/W = 1, XD = 1
PAGE_EXECUTE: R/W = 0, XD = 0
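A small user-mode experiment shows the mapping in action (a sketch):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD old;
    /* PAGE_READWRITE: R/W = 1, XD = 1 per the mapping above */
    void *p = VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE,
                           PAGE_READWRITE);
    if (!p)
        return 1;

    /* Drop to PAGE_READONLY: the kernel clears R/W in the PTE */
    VirtualProtect(p, 4096, PAGE_READONLY, &old);
    printf("previous protection: 0x%lx\n", old); /* 0x4 = PAGE_READWRITE */

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}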