Linux `alloc_pages_node` not incrementing `_refcount` for all allocated pages

Linux `alloc_pages_node` not incrementing `_refcount` for all allocated pages - linux-kernel

When allocating physically-contiguous memory with alloc_pages_node in Linux v6.0, the _refcount in struct page for all of the allocated pages is not incremented. Only the first page of the allocation has its _refcount correctly incremented.
Is this correct/intended behavior?
Is this function only intended to be used in particular use cases/in a particular way such that the incorrect _refcount is accounted for?
Context: alloc_pages* are a series of functions in the kernel intended for allocating a physically contiguous set of pages Documentation. These functions return a pointer to the struct page corresponding to the first page of the allocated region.
I am using this function during early boot (in fact while setting up the stacks for the init process and for kthreadd).
By this point, the buddy-allocator is functional and usable.
Similar APIs (ignoring the need for physical contiguity) such as vmalloc increment the _refcount for all allocated pages.
This is the code I am running. The output is also listed below.
Code
order = get_order(nr_pages << PAGE_SIZE);
p = alloc_pages_node(node, gfp_mask, order);
if (!p)
return;
for(i = 0; i < nr_pages; i++, p++)
printk("_refcount = %d", p->_refcount);
Output
_refcount = 1
_refcount = 0
_refcount = 0
...
Arguments
gfp_mask is (THREADINFO_GFP & ~__GFP_ACCOUNT) | __GFP_NOWARN | __GFP_HIGHMEM.
The first part THREADINFO_GFP & ~__GFP_ACCOUNT of this is sent by alloc_thread_stack_node
__vmalloc_area_node adds __GFP_NOWARN | __GFP_HIGHMEM
order = get_order(nr_pages << PAGE_SIZE) = 2 since nr_pages is 4.

Is this correct/intended behavior?
Yes, this is normal. Page allocations of order higher than 0 are effectively considered as a single "high-order" page(1) by the the buddy allocator, so functions such as alloc_pages() and __free_pages(), which operate on both order-0 and high-order pages, only care about the reference count of the first page.
Upon allocation (alloc_pages), only the first struct page of the group gets its refcount initialized. Upon deallocation (__free_pages), the refcount of the first page is decremented and tested: if it reaches zero, the whole group of pages gets actually freed(2). When this happens, a sanity check is also performed on every single page to ensure that the reference count is zero.
If you intend to allocate multiple pages at once, but then manage them separately, you will need to split them using split_page(), which effectively "enables" reference counting for every single struct page and initializes its refcount to 1. You can then use __free_pages(p, 0) (or __free_page()) on each page separately.(3)
Similar APIs (ignoring the need for physical contiguity) such as vmalloc increment the _refcount for all allocated pages.
Whether to allocate single order-0 pages or do a higher-order allocation is a choice that depends on the semantics of the specific memory allocation API. Problem is, these semantics can often change based on the actual API usage in kernel code(4). Indeed as of now vmalloc() splits the high-order page obtained from alloc_pages() using split_page(), but this was only a recent change done because some of its callers were relying on the allocated pages to be independent (e.g., doing their own reference counting).
(1) Not to be confused with compound pages, although their refcounting is performed in the same way, i.e. only the first page (PageHead()) is refcounted.
(2) It is actually a little bit more complex than that, all pages except the first are freed regardless of the refcount of the first, to avoid memory leaks in rare situations, see this relevant commit. The refcount sanity check on all the freed pages is done anyway.
(3) Note that allocating high-order pages and then splitting them into order-0 pages is generally not a good idea, as you can guess from the comment on top of split_pages(): "Note: this is probably too low level an operation for use in drivers. Please consult with lkml before using this in your driver." - This is because high-order allocations are harder to satisfy than order-0 allocations, and breaking high-order page blocks only makes it even harder.
(4) Welcome to the magic world of kernel APIs I guess. Much like Hogwarts' staircases, they like to change.

Related

What is in the PTE address field for an anonymously zero-fill-on-demand mapped page?

When a program calls mmap to allocate an anonymous page, also known as a demand-zero page, what appears in the address field of the corresponding page table entry (PTE)? I am assuming that the kernel does not create a zero-initialized page in physical memory (and enter that physical page's page number into the PTE) until the requesting process actually touches the page — hence the term demand-zero. Since it would not be a disk address, and would not be 0 (which is for unallocated pages), what value would appear there? As a different but related question, how does the kernel "know" that this page is to be handled as a demand-zero page, i.e., that the fault handler should find a physical page and initialize it with 0 rather than copy a page from disk?

I am assuming that the kernel does not create a zero-initialized page in physical memory
Indeed, this is usually the case. Unless special cases, like for example if MAP_POPULATE is specified to explicitly request the page to be initialized (also called "pre-fauting").
what appears in the address field of the corresponding page table entry (PTE)?
Right after mmap you don't even have a PTE allocated for the page (or in general, you don't have any entry at any page table level). For what the CPU is concerned, the page doesn't even exist. If you were to walk the page table you would just get to a point (at an arbitrary level) where the corresponding entry is marked as "not present".
Since it would not be a disk address, and would not be 0 (which is for unallocated pages), what value would appear there?
For what the CPU is concerned, the page is unallocated. At the first page fault, two things can happen:
For a read page fault, the PTE is updated to point to the zero page: this is a special page that is always entirely zeroed-out and is pointed to by the PTEs of any anonymous (demand-zero) page in the system that has not been modified yet.
For a write page fault, an actual physical page will be allocated and the corresponding PTE updated to point to its physical address.
Quoting directly from the documentation:
The anonymous memory or anonymous mappings represent memory that is not backed by a filesystem. Such mappings are implicitly created for program’s stack and heap or by explicit calls to mmap(2) system call. Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. The read accesses will result in creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.
how does the kernel "know" that this page is to be handled as a demand-zero page, i.e., that the fault handler should find a physical page and initialize it with 0 rather than copy a page from disk?
When a page fault occurs, the kernel page fault handler (architecture-dependent) determines to which VMA the page belongs to, and retrieves the corresponding struct vm_area_struct (which was created earlier either by the kernel itself or by a mmap syscall). This structure is then passed on to architecture-independent code (do_fault()) along with the needed fault information (struct vm_fault).
The vm_area_struct then contains all the remaining necessary information to handle the fault (for example the ->vm_file field which is != NULL in case of a file-backed mapping). The field ->vm_ops points to a struct vm_operations_struct which defines a set of function pointers to call in different occasions. In particular anonymous VMAs have ->vm_ops == NULL.
For other kind of pages, ->fault() is the function used when handling a page fault. This function knows what to check and how to actually handle the fault.
B & O also describe the VMA, but do not explain how the kernel could use the VMA to distinguish between, say, an unallocated page and an allocated page to be created and zero-initialized.
Simple, just check vma->vm_ops == NULL and in such case you know that the page is a demand-zero anon page. Then on a page fault act as needed (read fault -> update PTE to point to global zero page, write fault -> allocate a page and update PTE).

Checking for valid user memory in kernel mode with copy_to_user

So, I tried using this:
copy_to_user(p, q, 0)
I want to copy from q to p and if it doesn't work, then I want to know if p points to an invalid address.
copy_to_user returns the number of bytes that weren't copied successfully but in this case, there are 0 bytes and I can't know for sure if p points to an invalid address.
Is there another way to check if p points to a valid user memory?

Yes. You need to check passing size value manually each time before calling copy_to_user(). If it's 0 or not in valid range -- you shouldn't call copy_to_user() at all. This way you can rely on copy_to_user() return value.

the method copy_to_user defined at /usr/src/linux-3.0.6-gentoo/include/asm-generic/uaccess.h
static inline long copy_to_user(void __user *to,
const void *from, unsigned long n)
{
might_fault();
if (access_ok(VERIFY_WRITE, to, n))
return __copy_to_user(to, from, n);
else
return n;
}
the method access_ok checks the accessibility of to(user memory). So you can use the method access_ok to check memory is valid or not(to is not NULL / it's in user space)?
Argument VERIFY_READ or VERIFY_WRITE. VERIFY_READ: identifies whether memory region is readable, VERIFY_WRITE: identifies whether the memory region is readable as well as writable.
source of method access_ok

And what do you consider 'valid user memory'? What do you need this for?
Let's say we only care about the target buffer residing in userspace range (for archs with joint address spaces). From this alone we see that testing the address without the size is pointless - what if the address is the last byte of userspace? Appropriate /range/ check is done by access_ok.
Second part is whether there is a page there or a read/write can be performed without servicing a page fault. Is this of any concern for you? If you read copy_from/whatever you will see it performs the read/write and only catches the fault. There is definitely KPI to check whether the target page can be written to without a fault, but you would need to hold locks (mmap_sem and likely more) over your check and whatever you are going to do next, which is likely not what you wanted to do.
So far it seems you are trying

Atomic operations in ARM

I've been working on an embedded OS for ARM, However there are a few things i didn't understand about the architecture even after referring to ARMARM and linux source.
Atomic operations.
ARM ARM says that Load and Store instructions are atomic and it's execution is guaranteed to be complete before interrupt handler executes. Verified by looking at
arch/arm/include/asm/atomic.h :
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when i want to manipulate this value atomically using the cpu instructions (atomic_inc, atomic_dec, atomic_cmpxchg etc..) which use LDREX and STREX for ARMv7 (my target).
ARMARM doesn't say anything about interrupts being blocked in this section so i assume an interrupt can occur in between the LDREX and STREX. The thing it does mention is about locking the memory bus which i guess is only helpful for MP systems where there can be more CPUs trying to access same location at same time. But for UP (and possibly MP), If a timer interrupt (or IPI for SMP) fires in this small window of LDREX and STREX, Exception handler executes possibly changes cpu context and returns to the new task, however the shocking part comes in now, it executes 'CLREX' and hence removing any exclusive lock held by previous thread. So how better is using LDREX and STREX than LDR and STR for atomicity on a UP system ?
I did read something about an Exclusive lock monitor, so I've a possible theory that when the thread resumes and executes the STREX, the os monitor causes this call to fail which can be detected and the loop can be re-executed using the new value in the process (branch back to LDREX), Am i right here ?

The idea behind the load-linked/store-exclusive paradigm is that if if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value required some significant computation, one should rewrite the loop as:
do
{
old_value = *dest;
new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint_32 old_value)
{
do
{
if (__LDREXW(dest) != old_value) return 1; // Failure
} while(__STREXW(new_value, dest);
return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and __STREXW]
Addendum
An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively a references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe/interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 255), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)
void do_update(void)
{
// The foo_pool_alloc() method should return a slot number in the lower bits and
// some sort of counter value in the upper bits so that once some particular
// uint32_t value is returned, that same value will not be returned again unless
// there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
// the possibility that while one task is performing its update, a second task
// changes the thing to a new one and releases the old one, and a third task gets
// given the newly-freed item and changes the thing to that, such that from the
// point of view of the first task, the thing never changed.)
uint32_t new_thing = foo_pool_alloc();
uint32_t old_thing;
do
{
// Capture old reference
old_thing = foo_current_thing;
// Compute new thing based on old one
update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
&foo_pool[old_thing & FOO_POOL_SIZE_MASK);
} while(CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).

Okay, got the answer from their website.
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.

Partial unmap of Win32 memory-mapped file

I have some code (which I cannot change) that I need to get working in a native Win32 environment. This code calls mmap() and munmap(), so I have created those functions using CreateFileMapping(), MapViewOfFile(), etc., to accomplish the same thing. Initially this works fine, and the code is able to access files as expected. Unfortunately the code goes on to munmap() selected parts of the file that it no longer needs.
x = mmap(0, size, PROT_READ, MAP_SHARED, fd, 0);
...
munmap(x, hdr_size);
munmap(x + foo, bar);
...
Unfortunately, when you pass a pointer into the middle of the mapped range to UnmapViewOfFile() it destroys the entire mapping. Even worse, I can't see how I would be able to detect that this is a partial un-map request and just ignore it.
I have tried calling VirtualFree() on the range but, unsurprisingly, this produces ERROR_INVALID_PARAMETER.
I'm beginning to think that I will have to use static/global variables to track all the open memory mappings so that I can detect and ignore partial unmappings, but I hope you have a better idea...
edit:
Since I wasn't explicit enough above: the docs for UnMapViewOfFile do not accurately reflect the behavior of that function.
Un-mapping the whole view and remapping pieces is not a good solution because you can only suggest a base address for a new mapping, you can't really control it. The semantics of munmap() don't allow for a change to the base address of the still-mapped portion.
What I really need is a way to find the base address and size of a already-mapped memory area.
edit2: Now that I restate the problem that way, it looks like the VirtualQuery() function will suffice.

It is quite explicit in the MSDN Library docs for UnmapViewOfFile:
lpBaseAddress A pointer to the
base address of the mapped view of a
file that is to be unmapped. This
value must be identical to the value
returned by a previous call to the
MapViewOfFile or MapViewOfFileEx
function.
You changing the mapping by unmapping the old one and creating a new one. Unmapping bits and pieces isn't well supported, nor would it have any useful side-effects from a memory management point of view. You don't want to risk getting the address space fragmented.
You'll have to do this differently.

You could keep track each mapping and how many pages of it are still allocated by the client and only free the mapping when that counter reaches zero. The middle sections would still be mapped, but it wouldn't matter since the client wouldn't be accessing that memory anyway.
Create a global dictionary of memory mappings through this interface. When a mapping request comes through, record the address, size and number of pages that are in the range. When a unmap request is made, find out which mapping owns that address and decrease the page count by the number of pages that are being freed. When that count reaches zero, really unmap the view.

Can address space be recycled for multiple calls to MapViewOfFileEx without chance of failure?

Consider a complex, memory hungry, multi threaded application running within a 32bit address space on windows XP.
Certain operations require n large buffers of fixed size, where only one buffer needs to be accessed at a time.
The application uses a pattern where some address space the size of one buffer is reserved early and is used to contain the currently needed buffer.
This follows the sequence:
(initial run) VirtualAlloc -> VirtualFree -> MapViewOfFileEx
(buffer changes) UnMapViewOfFile -> MapViewOfFileEx
Here the pointer to the buffer location is provided by the call to VirtualAlloc and then that same location is used on each call to MapViewOfFileEx.
The problem is that windows does not (as far as I know) provide any handshake type operation for passing the memory space between the different users.
Therefore there is a small opportunity (at each -> in my above sequence) where the memory is not locked and another thread can jump in and perform an allocation within the buffer.
The next call to MapViewOfFileEx is broken and the system can no longer guarantee that there will be a big enough space in the address space for a buffer.
Obviously refactoring to use smaller buffers reduces the rate of failures to reallocate space.
Some use of HeapLock has had some success but this still has issues - something still manages to steal some memory from within the address space.
(We tried Calling GetProcessHeaps then using HeapLock to lock all of the heaps)
What I'd like to know is there anyway to lock a specific block of address space that is compatible with MapViewOfFileEx?
Edit: I should add that ultimately this code lives in a library that gets called by an application outside of my control

You could brute force it; suspend every thread in the process that isn't the one performing the mapping, Unmap/Remap, unsuspend the suspended threads. It ain't elegant, but it's the only way I can think of off-hand to provide the kind of mutual exclusion you need.

Have you looked at creating your own private heap via HeapCreate? You could set the heap to your desired buffer size. The only remaining problem is then how to get MapViewOfFileto use your private heap instead of the default heap.
I'd assume that MapViewOfFile internally calls GetProcessHeap to get the default heap and then it requests a contiguous block of memory. You can surround the call to MapViewOfFile with a detour, i.e., you rewire the GetProcessHeap call by overwriting the method in memory effectively inserting a jump to your own code which can return your private heap.
Microsoft has published the Detour Library that I'm not directly familiar with however. I know that detouring is surprisingly common. Security software, virus scanners etc all use such frameworks. It's not pretty, but may work:
HANDLE g_hndPrivateHeap;
HANDLE WINAPI GetProcessHeapImpl() {
return g_hndPrivateHeap;
}
struct SDetourGetProcessHeap { // object for exception safety
SDetourGetProcessHeap() {
// put detour in place
}
~SDetourGetProcessHeap() {
// remove detour again
}
};
void MapFile() {
g_hndPrivateHeap = HeapCreate( ... );
{
SDetourGetProcessHeap d;
MapViewOfFile(...);
}
}
These may also help:
How to replace WinAPI functions calls in the MS VC++ project with my own implementation (name and parameters set are the same)?
How can I hook Windows functions in C/C++?
http://research.microsoft.com/pubs/68568/huntusenixnt99.pdf

Imagine if I came to you with a piece of code like this:
void *foo;
foo = malloc(n);
if (foo)
free(foo);
foo = malloc(n);
Then I came to you and said, help! foo does not have the same address on the second allocation!
I'd be crazy, right?
It seems to me like you've already demonstrated clear knowledge of why this doesn't work. There's a reason that the documention for any API that takes an explicit address to map into lets you know that the address is just a suggestion, and it can't be guaranteed. This also goes for mmap() on POSIX.
I would suggest you write the program in such a way that a change in address doesn't matter. That is, don't store too many pointers to quantities inside the buffer, or if you do, patch them up after reallocation. Similar to the way you'd treat a buffer that you were going to pass into realloc().
Even the documentation for MapViewOfFileEx() explicitly suggests this:
While it is possible to specify an address that is safe now (not used by the operating system), there is no guarantee that the address will remain safe over time. Therefore, it is better to let the operating system choose the address. In this case, you would not store pointers in the memory mapped file, you would store offsets from the base of the file mapping so that the mapping can be used at any address.
Update from your comments
In that case, I suppose you could:
Not map into contiguous blocks. Perhaps you could map in chunks and write some intermediate function to decide which to read from/write to?
Try porting to 64 bit.

As the earlier post suggests, you can suspend every thread in the process while you change the memory mappings. You can use SuspendThread()/ResumeThread() for that. This has the disadvantage that your code has to know about all the other threads and hold thread handles for them.
An alternative is to use the Windows debug API to suspend all threads. If a process has a debugger attached, then every time the process faults, Windows will suspend all of the process's threads until the debugger handles the fault and resumes the process.
Also see this question which is very similar, but phrased differently:
Replacing memory mappings atomically on Windows

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio