Why do ioremap allocated areas have such large alignment? - linux-kernel

I have code that fails when calling ioremap() for a 4M region. Trying to debug the reason, I found out that ioremap will try to allocate contiguous addresses with a very large alignment (depending on the size of the area you want to map). The code that computes this alignment is in the __get_vm_area_node() function (mm/vmalloc.c) and looks like this:
if (flags & VM_IOREMAP) {
    int bit = fls(size);

    if (bit > IOREMAP_MAX_ORDER)
        bit = IOREMAP_MAX_ORDER;
    else if (bit < PAGE_SHIFT)
        bit = PAGE_SHIFT;

    align = 1ul << bit;
}
On ARM, IOREMAP_MAX_ORDER is defined as 23. This means that, in my case, ioremap needs not only 4M of contiguous addresses in the vmalloc area, but the range also has to be aligned to 4M.
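For anyone who wants to see what this computes, here is a minimal userspace sketch of the same logic; IOREMAP_MAX_ORDER = 23 and PAGE_SHIFT = 12 are assumptions taken from the question (ARM with 4K pages), and fls_compat() stands in for the kernel's fls():

#include <stdio.h>

#define IOREMAP_MAX_ORDER 23  /* assumed, per the question */
#define PAGE_SHIFT        12  /* assumed, 4K pages */

/* 1-based index of the highest set bit, like the kernel's fls(). */
static int fls_compat(unsigned long x)
{
    int bit = 0;
    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

int main(void)
{
    unsigned long sizes[] = { 4096, 1ul << 20, 4ul << 20 };
    for (int i = 0; i < 3; i++) {
        int bit = fls_compat(sizes[i]);
        if (bit > IOREMAP_MAX_ORDER)
            bit = IOREMAP_MAX_ORDER;
        else if (bit < PAGE_SHIFT)
            bit = PAGE_SHIFT;
        printf("size 0x%lx -> align 0x%lx\n", sizes[i], 1ul << bit);
    }
    return 0;
}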
I wasn't able to find any information on why this alignment is needed. I even tried using git blame to find the commit that introduced it, but the code seems to be older than the git history, so I couldn't find anything.

Related

Failing to Import Cuda memory into Vulkan

I'm trying to use the VK_EXT_external_memory_host extension https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_external_memory_host.html. I'm not sure what the difference is between vk::ExternalMemoryHandleTypeFlagBits::eHostAllocationEXT and eHostMappedForeignMemoryEXT but I've been failing to get either to work. (I'm using VulkanHpp).
void* data_ptr = getTorchDataPtr();
uint32_t MEMORY_TYPE_INDEX;
auto EXTERNAL_MEMORY_TYPE = vk::ExternalMemoryHandleTypeFlagBits::eHostAllocationEXT;
// or vk::ExternalMemoryHandleTypeFlagBits::eHostMappedForeignMemoryEXT;

vk::MemoryAllocateInfo memoryAllocateInfo(SIZE_BYTES, MEMORY_TYPE_INDEX);
vk::ImportMemoryHostPointerInfoEXT importMemoryHostPointerInfoEXT(
    EXTERNAL_MEMORY_TYPE,
    data_ptr);
memoryAllocateInfo.pNext = &importMemoryHostPointerInfoEXT;

vk::raii::DeviceMemory deviceMemory(device, memoryAllocateInfo);
I'm getting Result::eErrorOutOfDeviceMemory when the DeviceMemory constructor calls vkAllocateMemory if EXTERNAL_MEMORY_TYPE = eHostAllocationEXT, and all zeros in the memory if EXTERNAL_MEMORY_TYPE = eHostMappedForeignMemoryEXT (I've checked that the py/libtorch tensor I'm importing is non-zero, and that my code successfully copies and reads back a different buffer).
All values of MEMORY_TYPE_INDEX produce the same behaviour (except when MEMORY_TYPE_INDEX overflows).
The set bits of the bitmask returned by getMemoryHostPointerPropertiesEXT are supposed to give the valid values for MEMORY_TYPE_INDEX:
auto pointerProperties = device.getMemoryHostPointerPropertiesEXT(
    EXTERNAL_MEMORY_TYPE,
    data_ptr);
std::cout << "memoryTypeBits " << std::bitset<32>(pointerProperties.memoryTypeBits) << std::endl;
But if EXTERNAL_MEMORY_TYPE = eHostMappedForeignMemoryEXT, then vkGetMemoryHostPointerPropertiesEXT returns Result::eErrorInitializationFailed, and if EXTERNAL_MEMORY_TYPE = eHostAllocationEXT, then the 8th and 9th bits are set. But this is the same regardless of whether data_ptr is a CUDA pointer (0x7ffecf400000) or a CPU pointer (0x2be7c80), so I suspect something has gone wrong.
I'm also unable to enable the extension VK_KHR_external_memory_capabilities, which is required by VK_KHR_external_memory, which in turn is required by the extension we are using, VK_EXT_external_memory_host. I'm using Vulkan version 1.2.162.0.
The eErrorOutOfDeviceMemory is strange, as we are not supposed to be allocating any memory; I'd be glad if someone could speculate about this.
I believe that host memory is CPU memory, thus:
vk::ExternalMemoryHandleTypeFlagBits::eHostAllocationEXT won't work because the pointer is to device (GPU) memory.
vk::ExternalMemoryHandleTypeFlagBits::eHostMappedForeignMemoryEXT won't work because the memory is not mapped by the host (CPU).
Is there any way to import local device memory into Vulkan? Does it have to be host mapped?
Probably not: https://stackoverflow.com/a/54801938/11998382.
I think the best option, for me, is to map some Vulkan memory and copy the PyTorch CPU tensor across. The same data gets uploaded to the GPU twice, but this doesn't really matter, I suppose.
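A minimal sketch of that fallback, written against the plain Vulkan C API rather than VulkanHpp; the device, an allocation made from a HOST_VISIBLE | HOST_COHERENT memory type, its size, and the CPU-side tensor pointer are all assumed to already exist, and upload_tensor is a hypothetical name:

#include <vulkan/vulkan.h>
#include <string.h>

/* Copy a CPU-side tensor into a mapped, host-visible Vulkan allocation.
 * `memory` must come from a HOST_VISIBLE | HOST_COHERENT memory type. */
VkResult upload_tensor(VkDevice device, VkDeviceMemory memory,
                       VkDeviceSize size_bytes, const void *cpu_ptr)
{
    void *mapped = NULL;
    VkResult res = vkMapMemory(device, memory, 0, size_bytes, 0, &mapped);
    if (res != VK_SUCCESS)
        return res;

    memcpy(mapped, cpu_ptr, (size_t)size_bytes);

    /* HOST_COHERENT means no vkFlushMappedMemoryRanges() is needed here. */
    vkUnmapMemory(device, memory);
    return VK_SUCCESS;
}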

mapping device memory and kernel allocated memory into the same vma

I'm working on a driver where ranges of device memory are mapped into user space (via an IOCTL) for the application to write to. It works:
vma->vm_flags |= VM_DONTCOPY;
vma->vm_flags |= VM_DONTEXPAND;
down_write(&current->mm->mmap_sem);
ret = vm_iomap_memory(vma, from, sz_required);
up_write(&current->mm->mmap_sem);
where from is a physical address obtained from pci_resource_start() with some offset added to it.
The application also needs to read from the device, so I increase the size of the region mmapped by the application by PAGE_SIZE, allocate a page with dma_alloc_coherent(), and try to insert it at the end of the vma, but that returns EBUSY. What am I doing wrong? I should be able to stitch together multiple physical ranges into a single vma, both real memory and device memory, or is that not supported?
In the new code a page is allocated like this; dma_addr is passed to the device so it knows where to write:
dma = dma_alloc_coherent(&device, PAGE_SIZE, &dma_addr, GFP_KERNEL);
memset(dma, 0xfe, PAGE_SIZE);
set_memory_wb((unsigned long)dma, 1);
And the mapping code is changed to:
vma->vm_flags |= VM_DONTCOPY;
vma->vm_flags |= VM_DONTEXPAND;
vma->vm_flags |= VM_MIXEDMAP;
down_write(&current->mm->mmap_sem);
ret = vm_iomap_memory(vma, from, sz_required);
up_write(&current->mm->mmap_sem);
down_write(&current->mm->mmap_sem);
ret = vm_insert_page(vma, vma->vm_end - PAGE_SIZE, virt_to_page(dma));
up_write(&current->mm->mmap_sem);
The kernel is 4.15 on x86_64.
Got it working by following the "hack" in Map multiple kernel buffer into contiguous userspace buffer?
Before vm_iomap_memory() I decrement vma->vm_end by PAGE_SIZE and restore the old value afterwards. Also, I switched from dma_alloc_coherent() to alloc_page() followed by dma_map_page(). The sequence is sketched below.
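A hedged sketch of that sequence, reusing the names from the question (from, sz_required) and assuming `page` came from alloc_page(); mmap_sem is held for write around both calls, as in the original code:

/* Shrink the vma so vm_iomap_memory() leaves the last page slot free,
 * then restore vm_end and insert the kernel page into that slot. */
unsigned long saved_end = vma->vm_end;
int ret;

vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_MIXEDMAP;

vma->vm_end -= PAGE_SIZE;                 /* hide the last page slot */
ret = vm_iomap_memory(vma, from, sz_required - PAGE_SIZE);
vma->vm_end = saved_end;                  /* restore the real end */

if (!ret)
    ret = vm_insert_page(vma, vma->vm_end - PAGE_SIZE, page);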
Not a solution I'm satisfied with, though. There has to be a better way, perhaps a fault handler in vm_ops? Although that seems counter-productive considering I know exactly what I will be mapping and where.
It appears to work on both x86_64 and aarch64.

Explicitly set starting stack pointer with linker script

I'd like to create a program with a special section at the end of virtual memory, so I wanted to write a linker script something like this:
/* ... */
.section_x 0xffff0000 : {
    _start_section_x = .;
    . = . + 0xffff;
    _end_section_x = .;
}
The problem is that gcc/ld/glibc seem to place the stack at this location by default for a 32-bit application, even if it overlaps a known section. The above code zeroes out the stack, causing an exception. Is there any way to tell the linker to use another memory location for the stack? (As well, I'd like to ensure the heap doesn't span this section of virtual memory...)
I hate answers that presume, or even ask, whether the question is wrong, but: if you need a 64K segment, why can't you just allocate one at startup, as sketched below?
Why could you possibly need a fixed address within your process address space? I've been doing a lot of different kinds of coding for almost 30 years, and I haven't seen the need for a fixed address since the advent of protected memory.
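A minimal sketch of the allocate-at-startup suggestion, assuming Linux and MAP_FIXED_NOREPLACE (kernel 4.17+); the address is the one from the question, and the kernel may still refuse to map that high in a 32-bit process:

#define _GNU_SOURCE   /* for MAP_FIXED_NOREPLACE */
#include <sys/mman.h>
#include <stdio.h>

#define SECTION_X_ADDR ((void *)0xffff0000u)
#define SECTION_X_SIZE (64 * 1024)

int main(void)
{
    /* MAP_FIXED_NOREPLACE fails instead of silently clobbering
     * whatever (e.g. the stack) already occupies the range. */
    void *p = mmap(SECTION_X_ADDR, SECTION_X_SIZE,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* p == SECTION_X_ADDR on success; use it as "section_x". */
    return 0;
}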

How can I force MacOS to release MADV_FREE'd pages?

My program has a custom allocator which gets memory from the OS using mmap(MAP_ANON | MAP_PRIVATE). When it no longer needs memory, the allocator calls either munmap or madvise(MADV_FREE). MADV_FREE keeps the mapping around, but tells the OS that it can throw away the physical pages associated with the mapping.
Calling MADV_FREE on pages you're going to need again eventually is much faster than calling munmap and later calling mmap again.
This almost works perfectly for me. The only problem is that, on MacOS, MADV_FREE is very lazy about getting rid of the pages I've asked it to free. In fact, it only gets rid of them when there's memory pressure from another application. Until it gets rid of the pages I've freed, MacOS reports that my program is still using that memory; in the Activity Monitor, its "Real Memory" column doesn't reflect the freed memory.
This makes it difficult for me to measure how much memory my program is actually using. (This difficulty in measuring RSS is keeping us from landing the custom allocator on 10.5.)
I could allocate a whole bunch of memory to force the OS to free up these pages, but in addition to taking a long time, that could have other side-effects, such as causing parts of my program to be paged out to disk.
On a lark, I tried the purge command, but that has no effect.
How can I force MacOS to clean out these MADV_FREE'd pages? Or, how can I ask MacOS how many MADV_FREE'd pages my process has in memory?
Here's a test program, if it helps. The Activity Monitor's "Real Memory" column shows 512MB after the program goes to sleep. On my Linux box, top shows 256MB of RSS, as desired.
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE (512 * 1024 * 1024)

// We use MADV_FREE on Mac and MADV_DONTNEED on Linux.
#ifndef MADV_FREE
#define MADV_FREE MADV_DONTNEED
#endif

int main()
{
    char *x = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);

    // Touch each page we mmap'ed so it gets a physical page.
    int i;
    for (i = 0; i < SIZE; i += 1024) {
        x[i] = i;
    }

    madvise(x, SIZE / 2, MADV_FREE);

    fprintf(stderr, "Sleeping. Now check my RSS. Hopefully it's %dMB.\n",
            SIZE / (2 * 1024 * 1024));
    sleep(1024);
    return 0;
}
One suggested workaround is to cycle the pages' protection with mprotect:

mprotect(addr, length, PROT_NONE);
mprotect(addr, length, PROT_READ | PROT_WRITE);

Note that, as you say, madvise is lazier, and that is probably better for performance (just in case anyone is tempted to use this for performance rather than measurement).
Use MADV_FREE_REUSABLE on macOS. According to Apple's magazine_malloc implementation:
On OS X we use MADV_FREE_REUSABLE, which signals the kernel to remove the given pages from the memory statistics for our process. However, on returning that memory to use we have to signal that it has been reused.
https://opensource.apple.com/source/libmalloc/libmalloc-53.1.1/src/magazine_malloc.c.auto.html
Chromium, for example, also uses it:
MADV_FREE_REUSABLE is similar to MADV_FREE, but also marks the pages with the reusable bit, which allows both Activity Monitor and memory-infra to correctly track the pages.
https://github.com/chromium/chromium/blob/master/base/memory/discardable_shared_memory.cc#L377
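A hedged sketch of what using those flags looks like (macOS-specific; decommit/recommit are hypothetical names, and signaling reuse with MADV_FREE_REUSE follows Apple's libmalloc source cited above):

#include <stddef.h>
#include <sys/mman.h>

/* Drop the pages from this process's memory accounting. */
void decommit(void *p, size_t len)
{
    madvise(p, len, MADV_FREE_REUSABLE);
}

/* Signal that the pages are about to be reused. */
void recommit(void *p, size_t len)
{
    madvise(p, len, MADV_FREE_REUSE);
}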
I've looked and looked, and I don't think this is possible. :\
We're solving the problem by adding code to the allocator which explicitly decommits MADV_FREE'd pages when we ask it to.

How to align stack at 32 byte boundary in GCC?

I'm using MinGW64 build based on GCC 4.6.1 for Windows 64bit target. I'm playing around with the new Intel's AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx.
But I started running into segmentation faults when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack on 64-bit Windows has only 16-byte alignment.
How can I change the GCC's stack alignment to 32 bytes?
I have tried using -mstackrealign, but to no avail, since it realigns only to 16 bytes. I couldn't make __attribute__((force_align_arg_pointer)) work either; it aligns to 16 bytes anyway. I haven't been able to find any other compiler options that would address this. Any help is greatly appreciated.
EDIT:
I tried using -mpreferred-stack-boundary=5, but GCC says that 5 is not supported for this target. I'm out of ideas.
I have been exploring the issue, filed a GCC bug report, and found out that this is a MinGW64-related problem. See GCC bug #49001. Apparently, GCC doesn't support 32-byte stack alignment on Windows, which effectively prevents the use of 256-bit AVX instructions.
I investigated a couple of ways to deal with this issue. The simplest and bluntest solution is to replace the aligned memory accesses VMOVAPS/PD/DQA with their unaligned alternatives (VMOVUPS etc.). So I learned Python last night (very nice tool, by the way) and pulled off the following script, which does the job on an input assembler file produced by GCC:
import re
import fileinput
import sys

# fix aligned stack access
# replace aligned vmov* by unaligned vmov* with 32-byte aligned operands
# see Intel's AVX programming guide, page 39
vmova = re.compile(r"\s*?vmov(\w+).*?((\(%r.*?%ymm)|(%ymm.*?\(%r))")
aligndict = {"aps": "ups", "apd": "upd", "dqa": "dqu"}

for line in fileinput.FileInput(sys.argv[1:], inplace=1):
    m = vmova.match(line)
    if m and m.group(1) in aligndict:
        s = m.group(1)
        print line.replace("vmov" + s, "vmov" + aligndict[s]),
    else:
        print line,
This approach is pretty safe and foolproof. Though I observed a performance penalty on rare occasions: when the stack is unaligned, a memory access can cross a cache-line boundary. Fortunately, the code performs as fast as with aligned accesses most of the time. My recommendation: inline functions in critical loops!
I also attempted to fix the stack allocation in every function prolog using another Python script, trying to align it always at a 32-byte boundary. This seems to work for some code, but not for others. I have to rely on the good will of GCC that it will allocate aligned local variables (with respect to the stack pointer), which it usually does. This is not always the case, especially when there is serious register spilling due to the need to save all ymm registers before a function call (all ymm registers are callee-save). I can post the script if there's interest.
The best solution would be to fix GCC MinGW64 build. Unfortunately, I have no knowledge of its internal workings, just started using it last week.
You can get the effect you want by:

1. Declaring your variables not as variables, but as fields in a struct
2. Declaring an array that is larger than the struct by an appropriate amount of padding
3. Doing pointer/address arithmetic to find a 32-byte aligned address inside the array
4. Casting that address to a pointer to your struct
5. Finally, using the data members of your struct
You can use the same technique when malloc() does not align stuff on the heap appropriately.
E.g.
void foo() {
    struct I_wish_these_were_32B_aligned {
        vec32B foo;
        char bar[32];
    }; // note - no variable definition, just the struct declaration.
    unsigned char a[sizeof(I_wish_these_were_32B_aligned) + 32];
    unsigned char* a_aligned_to_32B = align_to_32B(a);
    I_wish_these_were_32B_aligned* s = (I_wish_these_were_32B_aligned*)a_aligned_to_32B;
    s->foo = ...
}
where
unsigned char* align_to_32B(unsigned char* a) {
    uint64_t u = (uint64_t)a;
    uint64_t mask_aligned32B = (1 << 5) - 1;
    if ((u & mask_aligned32B) == 0) return (unsigned char*)u;
    return (unsigned char*)((u | mask_aligned32B) + 1);
}
I just ran into the same issue of segmentation faults when using AVX inside my functions, and it was also due to stack misalignment. Given that this is a compiler issue (and the options that could help are not available on Windows), I worked around the stack usage by:
Using static variables (see this issue). Given that they are not stored on the stack, you can force their alignment with __attribute__((aligned(32))) in the declaration. For example: static __m256i r __attribute__((aligned(32))).
Inlining the functions/methods that receive/return AVX data. You can force GCC to inline a function/method by adding inline and __attribute__((always_inline)) to its prototype/declaration. Inlining increases the size of your program, but it also prevents the function from using the stack (and hence avoids the stack-alignment issue). Example: inline __m256i myAvxFunction(void) __attribute__((always_inline));.
Be aware that the use of static variables is not thread-safe, as mentioned in the reference. If you are writing a multi-threaded application, you may have to add some protection for your critical paths. Both workarounds are combined in the sketch below.
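A minimal sketch combining the two workarounds, compiled with -mavx; add_avx and use are hypothetical names, and __m256 with _mm256_add_ps is used so only plain AVX is required:

#include <immintrin.h>

/* Static, so it lives in the data segment rather than on the
 * 16-byte-aligned stack; the attribute guarantees 32-byte alignment. */
static __m256 r __attribute__((aligned(32)));

/* Force-inlined: no call frame, so no __m256 spills to the stack. */
static inline __m256 add_avx(__m256 a, __m256 b) __attribute__((always_inline));

static inline __m256 add_avx(__m256 a, __m256 b)
{
    return _mm256_add_ps(a, b);
}

void use(void)
{
    r = add_avx(_mm256_set1_ps(1.0f), _mm256_set1_ps(2.0f));
}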
